Research Paper | Audience: ML researchers, AI engineers working with multimodal data, developers of AI systems for complex environments

Modality-Aware SAM: Sharpness-Aware-Minimization Driven Gradient Modulation for Harmonized Multimodal Learning

Abstract

In multimodal learning, dominant modalities often overshadow others, limiting generalization. We propose Modality-Aware Sharpness-Aware Minimization (M-SAM), a model-agnostic framework that applies to many modalities and supports both early and late fusion. In every iteration, M-SAM optimizes learning in three steps. First, it identifies the dominant modality based on each modality's contribution to accuracy, measured with Shapley values. Second, it decomposes the loss landscape; in other words, it modulates the loss to prioritize the robustness of the model in favor of the dominant modality. Third, M-SAM updates the weights by backpropagating the modulated gradients. This ensures robust learning for the dominant modality while enhancing contributions from the others, allowing the model to explore and exploit complementary features that strengthen overall performance. Extensive experiments on four diverse datasets show that M-SAM outperforms the latest state-of-the-art optimization and gradient manipulation methods and significantly balances and improves multimodal learning.
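The first step, identifying the dominant modality from Shapley contributions, can be illustrated with a minimal sketch. It computes exact Shapley values over modality subsets. The modality names, the accuracy table `ACC`, and the function `shapley_values` are hypothetical and chosen only for illustration; the paper's actual estimator and value function may differ.

```python
from itertools import combinations
from math import factorial


def shapley_values(modalities, value):
    """Exact Shapley value of each modality for a coalition value function."""
    n = len(modalities)
    phi = {}
    for m in modalities:
        others = [x for x in modalities if x != m]
        total = 0.0
        for k in range(n):
            # Standard Shapley weight for coalitions of size k that exclude m.
            weight = factorial(k) * factorial(n - k - 1) / factorial(n)
            for subset in combinations(others, k):
                s = frozenset(subset)
                # Marginal gain in the value (here: accuracy) from adding m.
                total += weight * (value(s | {m}) - value(s))
        phi[m] = total
    return phi


# Hypothetical validation accuracies for each subset of available modalities.
ACC = {
    frozenset(): 0.10,
    frozenset({"audio"}): 0.60,
    frozenset({"video"}): 0.40,
    frozenset({"audio", "video"}): 0.70,
}

phi = shapley_values(["audio", "video"], ACC.__getitem__)
dominant = max(phi, key=phi.get)  # modality with the largest contribution
```

With these made-up accuracies, the two contributions sum to the gain of the full model over the empty baseline, which is the efficiency property that makes Shapley values a natural way to split credit between modalities.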

Authors (4)

Hossein R. Nowdeh

Jie Ji

Xiaolong Ma

Fatemeh Afghah

Submitted

October 28, 2025

arXiv Category

cs.CV

Key Contributions

Introduces Modality-Aware Sharpness-Aware Minimization (M-SAM), a model-agnostic framework that harmonizes multimodal learning by identifying dominant modalities and modulating gradients to improve robustness. It ensures better learning for underrepresented modalities, enhancing overall generalization.
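One way to picture "modulating gradients toward robustness of the dominant modality" is a SAM-style two-pass update in which the sharpness perturbation is applied only to the dominant modality's parameters. The sketch below is an assumption about how such a step could look, not the paper's update rule. The toy quadratic loss, the `CURVATURE` table, and `modality_aware_sam_step` are all hypothetical.

```python
import math

# Hypothetical toy loss: 0.5 * sum_m c_m * ||w_m||^2, one curvature per modality.
CURVATURE = {"audio": 4.0, "video": 1.0}


def loss_grad(params):
    """Gradient of the toy loss with respect to each modality's weights."""
    return {m: [CURVATURE[m] * w for w in ws] for m, ws in params.items()}


def modality_aware_sam_step(params, dominant, lr=0.1, rho=0.05):
    """One SAM-like step that perturbs only the dominant modality (assumption)."""
    g = loss_grad(params)
    norm = math.sqrt(sum(x * x for x in g[dominant])) or 1.0
    # Ascent step of radius rho along the dominant modality's gradient only.
    perturbed = {m: list(ws) for m, ws in params.items()}
    perturbed[dominant] = [
        w + rho * x / norm for w, x in zip(params[dominant], g[dominant])
    ]
    # Gradient at the perturbed point, applied to the original weights.
    g_sam = loss_grad(perturbed)
    return {
        m: [w - lr * x for w, x in zip(ws, g_sam[m])]
        for m, ws in params.items()
    }


params = {"audio": [1.0, 0.0], "video": [0.0, 2.0]}
new_params = modality_aware_sam_step(params, dominant="audio")
```

In this toy setting the non-dominant modality receives a plain gradient step, while the dominant one is updated with the gradient taken at a nearby worst-case point, which is the flat-minimum preference that SAM-type methods aim for.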

Business Value

Enables the development of more reliable and versatile AI systems that can effectively process and integrate information from various sources (e.g., text, images, audio), leading to richer applications.

Paper Metadata

Innovation Type

Algorithmic Framework

Deployment Feasibility

High, as it's a model-agnostic framework that can be applied to existing multimodal architectures.

Limitations Addressed

Dominant modalities overshadowing others in multimodal learning, leading to limited generalization.

Performance Gains

Reported to outperform the latest state-of-the-art optimization and gradient manipulation methods on four diverse datasets, with improved generalization and robustness in multimodal learning.

Technical Tags

Multimodal Learning, Sharpness-Aware Minimization (SAM), Gradient Modulation, Dominant Modality, Robustness, Deep Learning, Model Agnostic, Fusion Strategies, Shapley Values, Generalization

Research Topics

Harmonizing Multimodal Learning, Addressing Modality Dominance, Improving Model Robustness in Multimodal Settings, Gradient Modulation Techniques, Model-Agnostic Frameworks for Multimodality

Methods & Architectures

Modality-Aware Sharpness-Aware Minimization (M-SAM), Shapley Value for Modality Identification, Loss Landscape Decomposition, Modulated Gradient Updates, Early Fusion, Late Fusion, Model-Agnostic Framework

Applications & Tasks

Multimodal AI, Computer Vision, Natural Language Processing, Robotics, Healthcare; Modality Imbalance, Poor Generalization in Multimodal Models, Improving Robustness, Harmonizing Learning Across Modalities; Multimodal Classification, Multimodal Regression, Any task requiring integration of multiple data types

Datasets & Benchmarks

Datasets

Four diverse datasets

Generalization Performance, Robustness

Related Fields

Machine Learning, Deep Learning, Computer Vision, Natural Language Processing, Artificial Intelligence

Keywords

Multimodal Learning, Sharpness-Aware Minimization, SAM, Gradient Modulation, Dominant Modality, Robustness, Deep Learning, Model Agnostic, Fusion, Generalization, Shapley Values, AI Harmonization

Academic Context

#Harmonizing Multimodal Learning, #Addressing Modality Dominance, #Improving Model Robustness in Multimodal Settings, #Gradient Modulation Techniques, #Model-Agnostic Frameworks for Multimodality

Commercial Potential

Potential Products

More robust multimodal AI assistants; Advanced systems for content analysis (e.g., video with audio and text); Improved diagnostic tools combining medical images and reports

Target Industries

Technology, Healthcare, Media, Automotive, Robotics

Use Case Examples

A system that understands a video by combining visual cues, spoken dialogue, and on-screen text
A medical diagnosis tool that integrates patient history, imaging, and lab results

Competitive Edge

Offers a novel SAM-based approach to specifically tackle modality dominance in multimodal learning, providing a generalizable and model-agnostic solution.

Resource Requirements

Compute Needs

Moderate to high, depending on the complexity of the multimodal models and datasets.

Data Requirements

Multimodal datasets (e.g., image-text pairs, video-audio-text).

Deployment Constraints

Requires careful tuning of SAM parameters and modality identification.
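For orientation on what "SAM parameters" usually means, the standard (single-modality) SAM objective and its first-order update are written out below. The main quantity to tune is the neighborhood radius \rho, alongside the learning rate \eta. This is the generic SAM formulation, given here as background; how M-SAM modifies it per modality is not specified in this summary.

```latex
\min_{w} \; \max_{\|\epsilon\|_2 \le \rho} L(w + \epsilon),
\qquad
\hat{\epsilon}(w) = \rho \, \frac{\nabla_w L(w)}{\|\nabla_w L(w)\|_2},
\qquad
w \leftarrow w - \eta \, \nabla_w L(w) \Big|_{w + \hat{\epsilon}(w)}
```

A larger \rho searches a wider neighborhood for sharp directions but can slow or destabilize training, which is why the radius typically needs per-dataset tuning.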

Scalability

As a model-agnostic framework, its scalability depends on the underlying multimodal models it's applied to.

Production Readiness

Maturity Level

Research/Framework

Time to Market

Medium (for integration into existing systems)
