Multimodal Representation Learning by Alternating Unimodal Adaptation
Xiaohui Zhang, Jaehong Yoon, Mohit Bansal, Huaxiu Yao
Published 2023 in Computer Vision and Pattern Recognition

ABSTRACT

Multimodal learning, which integrates data from diverse sensory modes, plays a pivotal role in artificial intelligence. However, existing multimodal learning methods often struggle in settings where some modalities appear more dominant than others during training, resulting in suboptimal performance. To address this challenge, we propose MLA (Multimodal Learning with Alternating Unimodal Adaptation). MLA reframes the conventional joint multimodal learning process as an alternating unimodal learning process, thereby minimizing interference between modalities. Simultaneously, it captures cross-modal interactions through a shared head, which undergoes continuous optimization across different modalities. This optimization process is controlled by a gradient modification mechanism that prevents the shared head from losing previously acquired information. During the inference phase, MLA utilizes a test-time uncertainty-based model fusion mechanism to integrate multimodal information. Extensive experiments on five diverse datasets, encompassing scenarios with both complete and missing modalities, demonstrate the superiority of MLA over competing prior approaches. Our code is available at https://github.com/Cecile-hi/MLA.
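The abstract describes the method only at a high level; a minimal NumPy sketch of its two core ideas, alternating unimodal gradient steps through a single shared head and entropy-based confidence weighting at test time, might look as follows. All variable names and the toy data are hypothetical, and the paper's gradient-modification mechanism that guards the shared head against forgetting is omitted here for brevity.

```python
import numpy as np

rng = np.random.default_rng(0)

def softmax(z):
    e = np.exp(z - z.max(axis=-1, keepdims=True))
    return e / e.sum(axis=-1, keepdims=True)

def entropy(p):
    return -(p * np.log(p + 1e-12)).sum(axis=-1)

# Toy setup: two "modalities" with 4-dim features, 3 classes.
n, d, c = 64, 4, 3
X = {"audio": rng.normal(size=(n, d)), "video": rng.normal(size=(n, d))}
y = rng.integers(0, c, size=n)
Y = np.eye(c)[y]  # one-hot targets

# Per-modality linear encoders and one shared classification head.
enc = {m: rng.normal(scale=0.1, size=(d, d)) for m in X}
head = rng.normal(scale=0.1, size=(d, c))

lr = 0.1
for step in range(100):
    for m in X:  # alternate: one unimodal gradient step per modality
        h = X[m] @ enc[m]
        p = softmax(h @ head)
        g = (p - Y) / n                   # cross-entropy gradient w.r.t. logits
        enc[m] -= lr * X[m].T @ (g @ head.T)
        head -= lr * h.T @ g              # shared head is updated by every modality in turn

# Inference: fuse modalities, weighting each by inverse-entropy confidence.
probs = {m: softmax((X[m] @ enc[m]) @ head) for m in X}
w = {m: np.exp(-entropy(p)) for m, p in probs.items()}
total = sum(w.values())
fused = sum((w[m] / total)[:, None] * probs[m] for m in X)
```

In this sketch a modality whose prediction is nearly uniform (high entropy) contributes little to the fused distribution, which is one simple way to realize the uncertainty-based fusion the abstract mentions.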
PUBLICATION RECORD
- Publication year: 2023
- Venue: Computer Vision and Pattern Recognition
- Publication date: 2023-11-17
- Fields of study: Computer Science
- Source metadata: Semantic Scholar
REFERENCES
- 54 references
CITED BY
- 90 citing papers