Mitigating Modality Imbalance in Multi-modal Learning via Multi-objective Optimization

Heshan Fernando, Parikshit Ram, Yi Zhou, Soham Dan, Horst Samulowitz, Nathalie Baracaldo, Tianyi Chen

Published 2025 in arXiv.org

ABSTRACT

Multi-modal learning (MML) aims to integrate information from multiple modalities, which is expected to yield better performance than single-modality learning. However, recent studies have shown that MML can underperform even single-modality approaches because of imbalanced learning across modalities. Existing methods alleviate this imbalance with various heuristics, which often require computationally intensive subroutines. In this paper, we reformulate the MML problem as a multi-objective optimization (MOO) problem that overcomes the imbalanced learning issue among modalities, and we propose a gradient-based algorithm to solve the modified MML problem. We provide convergence guarantees for the proposed method, and empirical evaluations on popular MML benchmarks show that it improves on existing balanced MML and MOO baselines, with up to ~20x reduction in subroutine computation time. Our code is available at https://github.com/heshandevaka/MIMO.
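
To make the idea of a gradient-based MOO update concrete, here is a minimal sketch of the classical min-norm (MGDA-style) rule for two objectives. It is not the algorithm proposed in this paper, whose details the abstract does not give. The two quadratic "modality" losses, the function names (`min_norm_weight`, `common_direction`, `run_toy`), the starting point, and the step size are all hypothetical choices made for illustration.

```python
from typing import List, Sequence


def dot(u: Sequence[float], v: Sequence[float]) -> float:
    """Inner product of two equal-length vectors."""
    return sum(a * b for a, b in zip(u, v))


def min_norm_weight(g1: Sequence[float], g2: Sequence[float]) -> float:
    """Return alpha in [0, 1] minimizing ||alpha*g1 + (1-alpha)*g2||^2.

    Closed form for two objectives: alpha = (g2 - g1).g2 / ||g1 - g2||^2,
    clipped to the unit interval.
    """
    diff = [a - b for a, b in zip(g1, g2)]
    denom = dot(diff, diff)
    if denom == 0.0:
        # Identical gradients: any weighting gives the same direction.
        return 0.5
    alpha = dot([b - a for a, b in zip(g1, g2)], g2) / denom
    return min(1.0, max(0.0, alpha))


def common_direction(g1: Sequence[float], g2: Sequence[float]) -> List[float]:
    """Min-norm convex combination of the two per-objective gradients."""
    alpha = min_norm_weight(g1, g2)
    return [alpha * a + (1.0 - alpha) * b for a, b in zip(g1, g2)]


def run_toy(steps: int = 200, lr: float = 0.1) -> List[float]:
    """Descend on two toy losses f_i(x) = 0.5 * ||x - c_i||^2 (assumed setup)."""
    c1, c2 = [1.0, 0.0], [0.0, 1.0]  # hypothetical per-objective minimizers
    x = [3.0, 3.0]                   # hypothetical starting point
    for _ in range(steps):
        g1 = [xi - ci for xi, ci in zip(x, c1)]  # gradient of f_1 at x
        g2 = [xi - ci for xi, ci in zip(x, c2)]  # gradient of f_2 at x
        d = common_direction(g1, g2)
        x = [xi - lr * di for xi, di in zip(x, d)]
    return x


if __name__ == "__main__":
    print(run_toy())
```

On this toy problem the iterates approach the point (0.5, 0.5), which lies on the segment between the two minimizers and is therefore Pareto stationary: no direction decreases both losses at once. The min-norm rule is one of several ways to combine per-objective gradients; it is shown here only because its two-objective case has a closed form.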

PUBLICATION RECORD

Publication year: 2025
Venue: arXiv.org
Publication date: 2025-11-10
Fields of study: Computer Science
Identifiers: DOI 10.48550/arXiv.2511.06686; arXiv 2511.06686
Source metadata: Semantic Scholar

REFERENCES

MMPareto: Boosting Multimodal Learning with Innocent Unimodal Assistance [2024, influential reference]
STAR: A Benchmark for Situated Reasoning in Real-World Videos [2024, cited by this paper]
Towards Unified Multi-Modal Personalization: Large Vision-Language Models for Generative Recommendation and Beyond [2024, cited by this paper]
Analyzing and Mitigating Object Hallucination in Large Vision-Language Models [2023, cited by this paper]
A Theory of Multimodal Learning [2023, cited by this paper]
Boosting Multi-modal Model Performance with Adaptive Gradient Modulation [2023, influential reference]
Provable Dynamic Fusion for Low-Quality Multimodal Data [2023, cited by this paper]
Three-Way Trade-Off in Multi-Objective Learning: Optimization, Generalization and Conflict-Avoidance [2023, cited by this paper]
Parts of Speech-Grounded Subspaces in Vision-Language Models [2023, cited by this paper]
On Uni-Modal Feature Learning in Supervised Multi-Modal Learning [2023, cited by this paper]
GPT-4 Technical Report [2023, cited by this paper]
A Theory of Unimodal Bias in Multimodal Learning [2023, influential reference]
Multimodal Representation Learning by Alternating Unimodal Adaptation [2023, cited by this paper]
On penalty-based bilevel gradient descent method [2023, cited by this paper]
Characterizing and overcoming the greedy nature of learning in multi-modal deep neural networks [2022, cited by this paper]
Mitigating Gradient Bias in Multi-objective Learning: A Provably Convergent Stochastic Approach [2022, influential reference]
TAG: Boosting Text-VQA via Text-aware Visual Question-answer Generation [2022, cited by this paper]
Mitigating Modality Collapse in Multimodal VAEs via Impartial Optimization [2022, cited by this paper]
Modality-specific Learning Rates for Effective Multimodal Additive Late-fusion [2022, cited by this paper]
A Generalist Agent [2022, cited by this paper]
SHAPE: An Unified Approach to Evaluate the Contribution and Cooperation of Individual Modalities [2022, cited by this paper]
Trusted Multi-View Classification With Dynamic Evidential Fusion [2022, cited by this paper]
Are Multimodal Transformers Robust to Missing Modality? [2022, cited by this paper]
Balanced Multimodal Learning via On-the-fly Gradient Modulation [2022, influential reference]
Modality Competition: What Makes Joint Training of Multi-modal Network Fail in Deep Learning? (Provably) [2022, influential reference]
On the Convergence of Stochastic Multi-Objective Gradient Manipulation and Beyond [2022, cited by this paper]
Conflict-Averse Gradient Descent for Multi-task Learning [2021, cited by this paper]
Uncertainty-Aware Multi-View Representation Learning [2021, cited by this paper]
Adversarial Reweighting for Partial Domain Adaptation [2021, cited by this paper]
Towards Understanding Ensemble, Knowledge Distillation and Self-Distillation in Deep Learning [2020, cited by this paper]
Agnostic Learning with Multiple Objectives [2020, cited by this paper]
Vggsound: A Large-Scale Audio-Visual Dataset [2020, cited by this paper]
Neural Machine Translation with Universal Visual Representation [2020, cited by this paper]
Loss landscapes and optimization in over-parameterized non-linear systems and neural networks [2020, cited by this paper]
Gradient Surgery for Multi-Task Learning [2020, cited by this paper]
ALFWorld: Aligning Text and Embodied Environments for Interactive Learning [2020, cited by this paper]
UR-FUNNY: A Multimodal Language Dataset for Understanding Humor [2019, cited by this paper]
What Makes Training Multi-Modal Classification Networks Hard? [2019, cited by this paper]
Modality-Specific Learning Rate Control for Multimodal Classification [2019, cited by this paper]
The stochastic multi-gradient algorithm for multi-objective optimization and its application to supervised machine learning [2019, cited by this paper]
Dynamic Task Prioritization for Multitask Learning [2018, cited by this paper]
Audio-Visual Event Localization in Unconstrained Videos [2018, cited by this paper]
Multimodal Language Analysis in the Wild: CMU-MOSEI Dataset and Interpretable Dynamic Fusion Graph [2018, influential reference]
Complexity of gradient descent for multiobjective optimization [2018, cited by this paper]
Multi-Task Learning as Multi-Objective Optimization [2018, cited by this paper]
CentralNet: a Multilayer Approach for Multimodal Fusion [2018, cited by this paper]
Visualizing the Loss Landscape of Neural Nets [2017, cited by this paper]
The Kinetics Human Action Video Dataset [2017, cited by this paper]
Multimodal Learning and Reasoning for Visual Question Answering [2017, cited by this paper]
Look, Listen and Learn [2017, cited by this paper]
Multi-task Learning Using Uncertainty to Weigh Losses for Scene Geometry and Semantics [2017, cited by this paper]
GradNorm: Gradient Normalization for Adaptive Loss Balancing in Deep Multitask Networks [2017, cited by this paper]
VQA: Visual Question Answering [2015, cited by this paper]
CREMA-D: Crowd-Sourced Emotional Multimodal Actors Dataset [2014, influential reference]
Multiple-gradient descent algorithm (MGDA) for multiobjective optimization [2012, cited by this paper]
Multi-modal Learning [2010, cited by this paper]
Smooth minimization of non-smooth functions [2005, cited by this paper]
On the Relationship of the Tchebycheff Norm and the Efficient Frontier of Multiple-Criteria Objectives [1976, cited by this paper]

CITED BY

When Gradient Optimization Is Not Enough: Dispersive and Anchoring Geometric Regularizer for Multimodal Learning [2026, cites this paper]