📄 Abstract
Multimodal large language models (MLLMs) are rapidly advancing, yet their
reasoning ability often lags behind that of strong text-only counterparts.
Existing methods to bridge this gap rely on supervised fine-tuning over
large-scale multimodal reasoning data or reinforcement learning, both of which
are resource-intensive. A promising alternative is model merging, which
interpolates parameters between reasoning-enhanced LLMs and multimodal
variants. However, our analysis shows that naive merging is not always a "free
lunch": its effectiveness varies drastically across model families, with some
(e.g., LLaVA, Idefics) benefiting while others (e.g., Qwen) suffer performance
degradation. To address this, we propose Directional Reasoning Injection for
Fine-Tuning (DRIFT) MLLMs, a lightweight method that transfers reasoning
knowledge in the gradient space, without destabilizing multimodal alignment.
DRIFT precomputes a reasoning prior as the parameter-space difference between
reasoning and multimodal variants, then uses it to bias gradients during
multimodal fine-tuning. This approach preserves the simplicity of standard
supervised fine-tuning pipelines while enabling efficient reasoning transfer.
Extensive experiments on multimodal reasoning benchmarks, including MathVista
and MathVerse, demonstrate that DRIFT consistently improves reasoning
performance over naive merging and supervised fine-tuning, while matching or
surpassing training-heavy methods at a fraction of the cost.
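The core mechanism described above can be sketched in a few lines: precompute a reasoning prior as the parameter-space difference between the reasoning-enhanced and multimodal variants, then add a scaled copy of that prior to each gradient during fine-tuning. The sketch below is an illustrative toy, not the authors' implementation; the scaling coefficient `lam` and function names are assumptions.

```python
# Hedged sketch of DRIFT-style gradient-space reasoning injection.
# `lam` (prior strength) and the helper names are illustrative assumptions.
import numpy as np

def reasoning_prior(theta_reason, theta_mm):
    """Parameter-space difference, precomputed once before fine-tuning."""
    return {k: theta_reason[k] - theta_mm[k] for k in theta_mm}

def drift_step(theta, grads, prior, lr=0.1, lam=0.05):
    """One SGD step whose gradient is biased along the reasoning prior.

    Since the update is theta <- theta - lr * g, subtracting lam * prior
    from g pushes the parameters toward the reasoning variant.
    """
    return {k: theta[k] - lr * (grads[k] - lam * prior[k]) for k in theta}

# Toy 1-D example: multimodal weights at 0.0, reasoning weights at 1.0.
theta_mm = {"w": np.array([0.0])}
theta_reason = {"w": np.array([1.0])}
prior = reasoning_prior(theta_reason, theta_mm)

grads = {"w": np.array([0.2])}  # gradient from the multimodal SFT loss
plain = {k: theta_mm[k] - 0.1 * grads[k] for k in theta_mm}
biased = drift_step(theta_mm, grads, prior)
# The biased step lands closer to the reasoning weights than plain SGD.
```

Because the prior is fixed and precomputed, this adds only one elementwise add per parameter tensor per step on top of a standard supervised fine-tuning loop, which is the source of the method's claimed efficiency.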
Authors (10)
Chao Huang
Zeliang Zhang
Jiang Liu
Ximeng Sun
Jialian Wu
Xiaodong Yu
Submitted
October 16, 2025
Key Contributions
DRIFT (Directional Reasoning Injection for Fine-Tuning) MLLMs is a lightweight method that transfers reasoning knowledge in the gradient space without destabilizing multimodal alignment. It addresses the variability of naive model merging by precomputing a reasoning prior as the parameter-space difference between reasoning-enhanced LLMs and multimodal variants.
Business Value
Enables the development of more capable and versatile multimodal AI systems that can perform complex reasoning tasks, leading to improved performance in applications like visual question answering and multimodal understanding.