
Merge then Realign: Simple and Effective Modality-Incremental Continual Learning for Multimodal LLMs

📄 Abstract

Recent advances in Multimodal Large Language Models (MLLMs) have enhanced their versatility as they integrate a growing number of modalities. Considering the heavy cost of training MLLMs, it is efficient to reuse the existing ones and extend them to more modalities through Modality-incremental Continual Learning (MCL). The exploration of MCL is in its early stages. In this work, we dive into the causes of performance degradation in MCL. We uncover that it suffers not only from forgetting as in traditional continual learning, but also from misalignment between the modality-agnostic and modality-specific components. To this end, we propose an elegantly simple MCL paradigm called "MErge then ReAlign" (MERA) to address both forgetting and misalignment. MERA avoids introducing heavy model budgets or modifying model architectures, hence is easy to deploy and highly reusable in the MLLM community. Extensive experiments demonstrate the impressive performance of MERA, holding an average of 99.84% Backward Relative Gain when extending to four modalities, achieving nearly lossless MCL performance. Our findings underscore the misalignment issue in MCL. More broadly, our work showcases how to adjust different components of MLLMs during continual learning.
Authors (5)
Dingkun Zhang
Shuhan Qi
Xinyu Xiao
Kehai Chen
Xuan Wang
Submitted
March 8, 2025
arXiv Category
cs.LG
arXiv PDF

Key Contributions

Proposes MERA (Merge then ReAlign), a simple and effective paradigm for Modality-Incremental Continual Learning (MCL) in MLLMs that addresses both forgetting and misalignment without heavy model budgets or architecture modifications. It aims to make MLLM extension more efficient and reusable.
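This summary does not include the paper's implementation details. As an illustration only, the "merge" step in weight-merging continual-learning paradigms is commonly realized as a convex combination of checkpoints trained on different tasks or modalities. The sketch below shows that generic operation; the function name, signature, and the use of simple floats in place of tensors are all hypothetical, and this is not the authors' actual MERA procedure:

```python
# Hypothetical sketch of checkpoint merging via parameter averaging.
# NOT the paper's exact method -- a generic stand-in for a "merge" step.

def merge_state_dicts(sd_a, sd_b, alpha=0.5):
    """Convex combination of two parameter dicts with matching keys.

    alpha weights the first checkpoint; (1 - alpha) weights the second.
    """
    assert sd_a.keys() == sd_b.keys(), "checkpoints must share an architecture"
    return {k: alpha * sd_a[k] + (1 - alpha) * sd_b[k] for k in sd_a}

# Toy usage with plain floats standing in for weight tensors:
merged = merge_state_dicts({"w": 2.0, "b": 0.0}, {"w": 4.0, "b": 1.0})
# merged == {"w": 3.0, "b": 0.5}
```

In a real MLLM setting the values would be framework tensors (e.g., PyTorch), and a subsequent "realign" stage would fine-tune or recalibrate the modality-agnostic components against the merged modality-specific ones.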

Business Value

Reduces the significant cost and time associated with retraining large multimodal models when new data modalities become available, enabling faster adaptation and broader application.