Abstract
Recent advances in Multimodal Large Language Models (MLLMs) have enhanced
their versatility as they integrate a growing number of modalities. Considering
the heavy cost of training MLLMs, it is more efficient to reuse existing
models and extend them to new modalities through Modality-incremental
Continual Learning (MCL). The exploration of MCL is still in its early
stages. In this work, we dive into the causes of performance degradation in
MCL. We uncover that it
suffers not only from forgetting as in traditional continual learning, but also
from misalignment between the modality-agnostic and modality-specific
components. To this end, we propose an elegantly simple MCL paradigm called
"MErge then ReAlign" (MERA) to address both forgetting and misalignment. MERA
introduces no additional model budget and requires no architectural
modifications, so it is easy to deploy and highly reusable in the MLLM
community. Extensive experiments demonstrate the strong performance of MERA,
which achieves an average Backward Relative Gain of 99.84% when extending to
four modalities, yielding nearly lossless MCL performance. Our findings
underscore the misalignment issue
in MCL. More broadly, our work showcases how to adjust different components of
MLLMs during continual learning.
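The abstract does not detail how the "merge" step is implemented. As a purely illustrative aid, a minimal sketch follows, assuming the merge resembles parameter interpolation between the previous MLLM checkpoint and the one fine-tuned on the new modality (a common weight-merging scheme), with a separate realignment pass left as a placeholder. All function names, the `alpha` coefficient, and the toy checkpoints are hypothetical and not taken from the paper.

```python
# Hypothetical sketch of a "merge" step, assuming simple parameter
# interpolation between two checkpoints of identical structure.
# This is NOT the paper's specification of MERA, only an illustration.

def merge_weights(old, new, alpha=0.5):
    """Interpolate two state dicts entry-by-entry.

    alpha weights the old (previously trained) parameters; (1 - alpha)
    weights the parameters fine-tuned on the new modality.
    """
    return {k: alpha * old[k] + (1 - alpha) * new[k] for k in old}

# Toy "checkpoints": plain floats stand in for parameter tensors.
old_ckpt = {"proj.weight": 1.0, "proj.bias": 0.0}
new_ckpt = {"proj.weight": 3.0, "proj.bias": 2.0}

merged = merge_weights(old_ckpt, new_ckpt)
print(merged)  # {'proj.weight': 2.0, 'proj.bias': 1.0}

# A realignment pass (re-tuning the modality-agnostic / modality-specific
# interface on a small calibration set) would follow here; its details
# are specific to the paper and omitted.
```

In this toy example, `alpha=0.5` reduces to plain averaging; a real merge would operate on model tensors and likely tune `alpha` per component.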
Authors (5)
Dingkun Zhang
Shuhan Qi
Xinyu Xiao
Kehai Chen
Xuan Wang
Key Contributions
Proposes MERA (Merge then ReAlign), a simple and effective paradigm for Modality-Incremental Continual Learning (MCL) in MLLMs that addresses both forgetting and misalignment without heavy model budgets or architecture modifications. It aims to make MLLM extension more efficient and reusable.
Business Value
Reduces the significant cost and time associated with retraining large multimodal models when new data modalities become available, enabling faster adaptation and broader application.