
Quantifying Multimodal Imbalance: A GMM-Guided Adaptive Loss for Audio-Visual Learning

Abstract

Current mainstream approaches to addressing multimodal imbalance focus primarily on architectural modifications and optimization-based strategies, often overlooking a quantitative analysis of the degree of imbalance between modalities. To address this gap, our work introduces a novel method for the quantitative analysis of multimodal imbalance, which in turn informs the design of a sample-level adaptive loss function. We begin by defining the "Modality Gap" as the difference between the Softmax scores of different modalities (e.g., audio and visual) for the ground-truth class prediction. Analysis of the Modality Gap distribution reveals that it can be effectively modeled by a bimodal Gaussian Mixture Model (GMM), whose two components correspond to "modality-balanced" and "modality-imbalanced" samples, respectively. We then apply Bayes' theorem to compute the posterior probability of each sample belonging to these two distributions. Informed by this quantitative analysis, we design a novel adaptive loss function with three objectives: (1) to minimize the overall Modality Gap; (2) to encourage the imbalanced sample distribution to shift towards the balanced one; and (3) to apply greater penalty weights to imbalanced samples. We employ a two-stage training strategy consisting of a warm-up phase followed by an adaptive training phase. Experimental results demonstrate that our approach achieves state-of-the-art (SOTA) performance on the public CREMA-D and AVE datasets, attaining accuracies of $80.65\%$ and $70.90\%$, respectively, validating the effectiveness of the proposed methodology.
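
The analysis pipeline described in the abstract can be sketched roughly as follows. This is a hedged illustration, not the authors' implementation: it assumes per-sample logits from separate audio and visual classifier heads (the names `audio_logits` and `visual_logits` are placeholders) and uses scikit-learn's `GaussianMixture` for the bimodal fit and the Bayesian posterior step.

```python
# Minimal sketch (not the paper's code) of computing the "Modality Gap" and
# modeling it with a two-component GMM, assuming per-sample logits from
# separately trained audio and visual classifier heads.
import torch
import torch.nn.functional as F
from sklearn.mixture import GaussianMixture

def modality_gap(audio_logits, visual_logits, labels):
    """Gap = Softmax score of the ground-truth class under the audio head
    minus the same score under the visual head, computed per sample."""
    p_audio = F.softmax(audio_logits, dim=1)
    p_visual = F.softmax(visual_logits, dim=1)
    idx = torch.arange(labels.shape[0])
    return (p_audio[idx, labels] - p_visual[idx, labels]).cpu().numpy()

# Toy example: random logits stand in for real model outputs.
audio_logits = torch.randn(512, 10)
visual_logits = torch.randn(512, 10)
labels = torch.randint(0, 10, (512,))
gap = modality_gap(audio_logits, visual_logits, labels)

# Fit a bimodal GMM to the gap distribution; the two components are
# interpreted as "modality-balanced" vs. "modality-imbalanced" samples.
gmm = GaussianMixture(n_components=2, random_state=0).fit(gap.reshape(-1, 1))

# predict_proba applies Bayes' theorem under the fitted mixture: it returns
# each sample's posterior probability of belonging to either component.
posteriors = gmm.predict_proba(gap.reshape(-1, 1))
```

These posteriors are what a sample-level adaptive loss can consume, as sketched after the Key Contributions section below.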
Authors (3)
Zhaocheng Liu
Zhiwen Yu
Xiaoqing Liu
Submitted
October 20, 2025
arXiv Category
cs.LG
arXiv PDF

Key Contributions

This paper introduces a method for quantitatively analyzing multimodal imbalance by defining the 'Modality Gap' and modeling its distribution with a two-component Gaussian Mixture Model. The resulting per-sample posterior probabilities inform a sample-level adaptive loss function and a two-stage training strategy, improving multimodal learning systems by explicitly addressing imbalance between modalities.
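
As a rough illustration of how such posteriors could drive a sample-level adaptive loss, the sketch below up-weights the cross-entropy of samples the GMM assigns to the imbalanced component and adds a penalty on the Modality Gap itself. The specific weighting form and the `lam` hyperparameter are assumptions for illustration, not the paper's exact formulation.

```python
# Illustrative sample-level adaptive loss (an assumed general shape, not the
# paper's exact objective). It (1) penalizes the Modality Gap and
# (2) weights each sample's cross-entropy by its posterior probability of
# being modality-imbalanced.
import torch
import torch.nn.functional as F

def adaptive_loss(audio_logits, visual_logits, labels, p_imbalanced, lam=1.0):
    idx = torch.arange(labels.shape[0])
    gap = (F.softmax(audio_logits, dim=1)[idx, labels]
           - F.softmax(visual_logits, dim=1)[idx, labels])

    # Per-sample cross-entropy for both modality heads.
    ce = (F.cross_entropy(audio_logits, labels, reduction="none")
          + F.cross_entropy(visual_logits, labels, reduction="none"))

    # Up-weight likely-imbalanced samples (p_imbalanced in [0, 1]) and
    # penalize the magnitude of the Modality Gap.
    weights = 1.0 + p_imbalanced
    return (weights * ce).mean() + lam * gap.abs().mean()
```

Consistent with the two-stage strategy mentioned in the abstract, one would train with a plain cross-entropy objective during a warm-up phase before switching to an adaptive objective of this kind.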

Business Value

Improved performance in multimodal AI systems can lead to more accurate and robust applications in areas like video analysis, speech recognition combined with visual cues, and human-computer interaction, where data from different sources may have varying levels of reliability or importance.