📄 Abstract
Multimodal large language models (MLLMs) are rapidly advancing, yet their
reasoning ability often lags behind that of strong text-only counterparts.
Existing methods to bridge this gap rely on supervised fine-tuning over
large-scale multimodal reasoning data or reinforcement learning, both of which
are resource-intensive. A promising alternative is model merging, which
interpolates parameters between reasoning-enhanced LLMs and multimodal
variants. However, our analysis shows that naive merging is not always a "free
lunch": its effectiveness varies drastically across model families, with some
(e.g., LLaVA, Idefics) benefiting while others (e.g., Qwen) suffer performance
degradation. To address this, we propose Directional Reasoning Injection for
Fine-Tuning (DRIFT) MLLMs, a lightweight method that transfers reasoning
knowledge in the gradient space, without destabilizing multimodal alignment.
DRIFT precomputes a reasoning prior as the parameter-space difference between
reasoning and multimodal variants, then uses it to bias gradients during
multimodal fine-tuning. This approach preserves the simplicity of standard
supervised fine-tuning pipelines while enabling efficient reasoning transfer.
Extensive experiments on multimodal reasoning benchmarks, including MathVista
and MathVerse, demonstrate that DRIFT consistently improves reasoning
performance over naive merging and supervised fine-tuning, while matching or
surpassing training-heavy methods at a fraction of the cost.
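The core mechanism described above can be sketched in a few lines: precompute a reasoning prior as the parameter-space difference between the reasoning-enhanced and multimodal variants, then add a scaled copy of that prior to each gradient during fine-tuning. The sketch below is an illustrative toy, not the authors' implementation; the scaling coefficient `lam` and function names are assumptions.

```python
# Hedged sketch of DRIFT-style gradient-space reasoning injection.
# `lam` (prior strength) and the helper names are illustrative assumptions.
import numpy as np

def reasoning_prior(theta_reason, theta_mm):
    """Parameter-space difference, precomputed once before fine-tuning."""
    return {k: theta_reason[k] - theta_mm[k] for k in theta_mm}

def drift_step(theta, grads, prior, lr=0.1, lam=0.05):
    """One SGD step whose gradient is biased along the reasoning prior.

    Since the update is theta <- theta - lr * g, subtracting lam * prior
    from g pushes the parameters toward the reasoning variant.
    """
    return {k: theta[k] - lr * (grads[k] - lam * prior[k]) for k in theta}

# Toy 1-D example: multimodal weights at 0.0, reasoning weights at 1.0.
theta_mm = {"w": np.array([0.0])}
theta_reason = {"w": np.array([1.0])}
prior = reasoning_prior(theta_reason, theta_mm)

grads = {"w": np.array([0.2])}  # gradient from the multimodal SFT loss
plain = {k: theta_mm[k] - 0.1 * grads[k] for k in theta_mm}
biased = drift_step(theta_mm, grads, prior)
# The biased step lands closer to the reasoning weights than plain SGD.
```

Because the prior is fixed and precomputed, this adds only one elementwise add per parameter tensor per step on top of a standard supervised fine-tuning loop, which is the source of the method's claimed efficiency.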
Authors (10)
Chao Huang
Zeliang Zhang
Jiang Liu
Ximeng Sun
Jialian Wu
Xiaodong Yu
Submitted
October 16, 2025
Key Contributions
DRIFT (Directional Reasoning Injection for Fine-Tuning) MLLMs is a lightweight method that transfers reasoning knowledge in the gradient space without destabilizing multimodal alignment. It addresses the variability of naive model merging by precomputing a reasoning prior as the parameter-space difference between reasoning-enhanced LLMs and multimodal variants.
Business Value
Enables the development of more capable and versatile multimodal AI systems that can perform complex reasoning tasks, leading to improved performance in applications like visual question answering and multimodal understanding.