
Directional Reasoning Injection for Fine-Tuning MLLMs

📄 Abstract

Multimodal large language models (MLLMs) are rapidly advancing, yet their reasoning ability often lags behind that of strong text-only counterparts. Existing methods to bridge this gap rely on supervised fine-tuning over large-scale multimodal reasoning data or reinforcement learning, both of which are resource-intensive. A promising alternative is model merging, which interpolates parameters between reasoning-enhanced LLMs and multimodal variants. However, our analysis shows that naive merging is not always a "free lunch": its effectiveness varies drastically across model families, with some (e.g., LLaVA, Idefics) benefiting while others (e.g., Qwen) suffer performance degradation. To address this, we propose Directional Reasoning Injection for Fine-Tuning (DRIFT) MLLMs, a lightweight method that transfers reasoning knowledge in the gradient space, without destabilizing multimodal alignment. DRIFT precomputes a reasoning prior as the parameter-space difference between reasoning and multimodal variants, then uses it to bias gradients during multimodal fine-tuning. This approach preserves the simplicity of standard supervised fine-tuning pipelines while enabling efficient reasoning transfer. Extensive experiments on multimodal reasoning benchmarks, including MathVista and MathVerse, demonstrate that DRIFT consistently improves reasoning performance over naive merging and supervised fine-tuning, while matching or surpassing training-heavy methods at a fraction of the cost.
Authors (10)
Chao Huang
Zeliang Zhang
Jiang Liu
Ximeng Sun
Jialian Wu
Xiaodong Yu
+4 more
Submitted
October 16, 2025
arXiv Category
cs.CV

Key Contributions

DRIFT (Directional Reasoning Injection for Fine-Tuning) is a lightweight method for MLLMs that transfers reasoning knowledge in the gradient space without destabilizing multimodal alignment. It addresses the variability of naive model merging by precomputing a reasoning prior, defined as the parameter-space difference between a reasoning-enhanced LLM and its multimodal variant, and using that prior to bias gradients during multimodal fine-tuning.
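The two-step idea above (precompute a parameter-space difference, then use it to bias fine-tuning gradients) can be sketched as follows. This is a minimal illustration, not the paper's implementation: the mixing coefficient `lam`, the plain SGD update, and the dictionary-of-arrays parameter representation are all assumptions for clarity; the paper's exact biasing rule may differ.

```python
import numpy as np

def reasoning_prior(theta_reasoning, theta_multimodal):
    """Precompute the reasoning prior: the parameter-space difference
    between the reasoning-enhanced and multimodal model variants."""
    return {k: theta_reasoning[k] - theta_multimodal[k] for k in theta_multimodal}

def drift_step(theta, grad, prior, lr=1e-2, lam=0.1):
    """One fine-tuning step with the gradient biased toward the
    reasoning direction. `lam` is a hypothetical mixing coefficient."""
    updated = {}
    for k, g in grad.items():
        # Subtracting lam * prior from the gradient makes gradient
        # descent also move parameters toward the reasoning variant.
        biased_grad = g - lam * prior[k]
        updated[k] = theta[k] - lr * biased_grad
    return updated

# Toy usage: two-parameter "models".
theta_mm = {"w": np.array([0.0, 0.0])}       # multimodal variant
theta_rs = {"w": np.array([1.0, 1.0])}       # reasoning variant
prior = reasoning_prior(theta_rs, theta_mm)  # precomputed once
new_theta = drift_step(theta_mm, {"w": np.zeros(2)}, prior, lr=0.1, lam=0.5)
```

With a zero task gradient, the update moves the multimodal parameters strictly toward the reasoning variant, which is the intended effect of the bias; in practice the task gradient and the reasoning direction are combined at every step.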

Business Value

Enables the development of more capable and versatile multimodal AI systems that can perform complex reasoning tasks, leading to improved performance in applications like visual question answering and multimodal understanding.