Abstract
Current mainstream approaches to addressing multimodal imbalance focus primarily
on architectural modifications and optimization-based strategies, often overlooking
a quantitative analysis of the degree of imbalance between modalities. To address
this gap, our work introduces a novel method for the quantitative analysis of
multimodal imbalance, which in turn informs the design of a sample-level
adaptive loss function. We begin by defining the "Modality Gap" as the
difference between the Softmax scores that different modalities (e.g., audio and
visual) assign to the ground-truth class. Analysis of the Modality Gap
distribution reveals that it is well modeled by a bimodal Gaussian Mixture
Model (GMM), whose two components are found to correspond to
"modality-balanced" and "modality-imbalanced" samples, respectively. We then
apply Bayes' theorem to compute each sample's posterior probability of
belonging to the two distributions. Informed by this quantitative
analysis, we design a novel adaptive loss function with three objectives: (1)
to minimize the overall Modality Gap; (2) to encourage the imbalanced sample
distribution to shift towards the balanced one; and (3) to apply greater
penalty weights to imbalanced samples. Training follows a two-stage strategy
consisting of a warm-up phase followed by an adaptive training phase.
Experimental results demonstrate that our approach achieves
state-of-the-art (SOTA) performance on the public CREMA-D and AVE datasets,
attaining accuracies of $80.65\%$ and $70.90\%$, respectively, which validates
the effectiveness of the proposed methodology.
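
As a concrete illustration of the analysis pipeline the abstract describes, the sketch below computes a per-sample Modality Gap, fits a two-component GMM to it, and reads off the Bayes-rule posteriors (scikit-learn's `predict_proba` is exactly that posterior). The shapes, variable names, and synthetic logits are illustrative assumptions, not the authors' released implementation.

```python
# Minimal sketch of the quantitative analysis described in the abstract,
# under assumed names and shapes.
import numpy as np
import torch
import torch.nn.functional as F
from sklearn.mixture import GaussianMixture

def modality_gap(audio_logits, visual_logits, labels):
    """Per-sample gap: audio Softmax score minus visual Softmax score,
    both taken at the ground-truth class."""
    p_a = F.softmax(audio_logits, dim=1)
    p_v = F.softmax(visual_logits, dim=1)
    idx = torch.arange(labels.size(0))
    return (p_a[idx, labels] - p_v[idx, labels]).numpy()

# Stand-in logits for demonstration; in practice these would come from the
# audio and visual branches of the multimodal network.
torch.manual_seed(0)
n_samples, n_classes = 512, 6  # CREMA-D has 6 emotion classes
audio_logits = torch.randn(n_samples, n_classes)
visual_logits = torch.randn(n_samples, n_classes)
labels = torch.randint(0, n_classes, (n_samples,))

gaps = modality_gap(audio_logits, visual_logits, labels).reshape(-1, 1)

# Fit the bimodal Gaussian Mixture Model described in the abstract.
gmm = GaussianMixture(n_components=2, random_state=0).fit(gaps)

# predict_proba applies Bayes' theorem: each sample's posterior probability
# of belonging to the "balanced" vs. "imbalanced" component.
posteriors = gmm.predict_proba(gaps)

# One plausible reading: the component whose mean lies farther from zero is
# the "imbalanced" one; its posterior column gives a per-sample weight.
imbalanced = int(np.argmax(np.abs(gmm.means_.ravel())))
p_imbalanced = posteriors[:, imbalanced]
```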
Authors (3)
Zhaocheng Liu
Zhiwen Yu
Xiaoqing Liu
Submitted
October 20, 2025
Key Contributions
This paper introduces a method for quantitatively analyzing multimodal imbalance by defining the 'Modality Gap' and modeling its distribution with a bimodal Gaussian Mixture Model. The resulting per-sample posteriors inform the design of a sample-level adaptive loss function that counteracts imbalance between data modalities and thereby improves multimodal learning performance.
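
The three objectives of the adaptive loss lend themselves to a short sketch. The form below (cross-entropy on summed logits, an absolute-gap penalty, and a posterior-based weight with hyperparameter `alpha`) is an assumption in the spirit of the abstract, not the authors' actual loss; under the described two-stage strategy it would apply only after the warm-up phase, once the GMM fit is stable.

```python
# Hedged sketch of a sample-level adaptive loss reflecting the paper's three
# objectives. Fusion by summing logits, the |gap| penalty, and `alpha` are
# illustrative assumptions.
import torch
import torch.nn.functional as F

def adaptive_loss(audio_logits, visual_logits, labels, p_imbalanced, alpha=1.0):
    """p_imbalanced: tensor of shape (N,), the GMM posterior probability that
    each sample belongs to the "imbalanced" component (e.g. obtained by
    converting the posteriors computed earlier with torch.from_numpy)."""
    idx = torch.arange(labels.size(0))
    s_a = F.softmax(audio_logits, dim=1)[idx, labels]   # audio score, true class
    s_v = F.softmax(visual_logits, dim=1)[idx, labels]  # visual score, true class

    # Objectives (1) and (2): shrink the Modality Gap so that imbalanced
    # samples drift towards the balanced distribution.
    gap_penalty = (s_a - s_v).abs()

    # Objective (3): up-weight samples the GMM deems imbalanced.
    weights = 1.0 + alpha * p_imbalanced

    ce = F.cross_entropy(audio_logits + visual_logits, labels, reduction="none")
    return (weights * (ce + gap_penalty)).mean()
```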
Business Value
Improved performance in multimodal AI systems can lead to more accurate and robust applications in areas like video analysis, speech recognition combined with visual cues, and human-computer interaction, where data from different sources may have varying levels of reliability or importance.