VOLD: Reasoning Transfer from LLMs to Vision-Language Models via On-Policy Distillation

Abstract

Training vision-language models (VLMs) for complex reasoning remains a challenging task, in part because high-quality image-text reasoning data is scarce. Conversely, text-based reasoning resources are abundant and scalable, but it is still an open question how to leverage them for VLM reasoning. To address this problem, we propose VOLD, a framework to transfer reasoning capabilities from text-only teacher models to VLM student models. To this end, VOLD combines reinforcement learning via Group Relative Policy Optimization (GRPO) with on-policy distillation, which allows the student's reasoning traces to be guided by the teacher model, resulting in a significant gain over using GRPO alone. We further show that a cold-start alignment is essential for effective transfer during the online training phase in this scenario, and that without sufficient distributional alignment between teacher and student, on-policy distillation fails to provide meaningful guidance. We evaluate VOLD across diverse benchmarks including MMMU-Pro, MathVision, MathVista, and LogicVista, showing that VOLD significantly outperforms the baseline model and improves over the state of the art. Our ablation shows the importance of a cold-start alignment via SFT for on-policy distillation with a text-only teacher.
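To make the GRPO component mentioned in the abstract more concrete, the sketch below shows the group-relative advantage that gives GRPO its name: several responses are sampled for the same prompt, and each reward is normalized against the mean and standard deviation of its own group. This is a minimal illustration of the general technique, not VOLD's implementation; the function name, the `eps` constant, and the choice of population standard deviation are assumptions made for the example.

```python
from statistics import mean, pstdev


def group_relative_advantages(rewards, eps=1e-6):
    """Normalize the rewards of one group of responses sampled for the same prompt.

    Assumption: population standard deviation and a small eps for numerical
    stability; actual GRPO implementations differ in these details.
    """
    mu = mean(rewards)
    sigma = pstdev(rewards)
    # Responses better than the group average get a positive advantage,
    # worse ones a negative advantage.
    return [(r - mu) / (sigma + eps) for r in rewards]


# Hypothetical group of four sampled answers with binary correctness rewards.
rewards = [1.0, 0.0, 0.0, 1.0]
advantages = group_relative_advantages(rewards)
```

When every response in a group receives the same reward, all advantages are zero and the group contributes no policy-gradient signal, which is one reason an additional per-token signal such as distillation can help.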

Authors (3)

Walid Bousselham

Hilde Kuehne

Cordelia Schmid

Submitted

October 27, 2025

arXiv Category

cs.CV

Key Contributions

Proposes VOLD, a framework for transferring reasoning capabilities from text-only LLMs to VLMs using on-policy distillation combined with GRPO. It highlights the importance of cold-start and distributional alignment between teacher and student models for effective reasoning transfer, addressing data scarcity issues.
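On-policy distillation can be illustrated as follows: the student generates its own reasoning trace, the teacher scores the same tokens, and a per-token divergence between the two next-token distributions serves as the training signal. The sketch below is hedged throughout. This summary does not state which divergence VOLD uses or how it is weighted against the GRPO term, so the reverse KL, the function names, and the toy logits are all assumptions.

```python
import math


def log_softmax(logits):
    """Numerically stable log-softmax over one vocabulary-sized list of logits."""
    m = max(logits)
    log_z = m + math.log(sum(math.exp(x - m) for x in logits))
    return [x - log_z for x in logits]


def reverse_kl(student_logits, teacher_logits):
    """KL(student || teacher) at a single token position.

    Assumption: reverse KL is used here only because it is a common choice
    for on-policy distillation; the source does not specify the divergence.
    """
    s = log_softmax(student_logits)
    t = log_softmax(teacher_logits)
    return sum(math.exp(si) * (si - ti) for si, ti in zip(s, t))


def distillation_loss(student_steps, teacher_steps):
    """Average the per-token divergence over a trace sampled from the student."""
    per_token = [reverse_kl(s, t) for s, t in zip(student_steps, teacher_steps)]
    return sum(per_token) / len(per_token)


# Hypothetical two-token trace over a two-word vocabulary.
student_steps = [[2.0, 0.0], [1.0, 1.0]]
teacher_steps = [[0.0, 2.0], [1.0, 1.0]]
loss = distillation_loss(student_steps, teacher_steps)
```

Because the trace is sampled from the student, the teacher is only useful if it assigns sensible probabilities to student-style text, which is consistent with the summary's point that distributional alignment is needed before distillation provides meaningful guidance.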

Business Value

Enables the development of more capable VLMs with advanced reasoning abilities by leveraging existing large-scale text reasoning datasets, potentially leading to more intelligent AI assistants and analysis tools.

Paper Metadata

Innovation Type

Algorithmic

Deployment Feasibility

Moderate. Requires careful setup of teacher-student training pipelines and alignment strategies.

Limitations Addressed

Scarcity of high-quality image-text reasoning data, Difficulty in leveraging abundant text-based reasoning resources for VLMs, Ineffective reasoning transfer without proper alignment

Technical Tags

vision-language models, reasoning transfer, LLM distillation, on-policy distillation, reinforcement learning, GRPO, cold-start alignment, distributional alignment, data scarcity, teacher-student learning

Research Topics

Multimodal AI, Knowledge Transfer, Reasoning in AI, Reinforcement Learning, Large Language Models

Methods & Architectures

VOLD framework, On-policy distillation, Group Relative Policy Optimization (GRPO), Teacher-student learning, Cold-start alignment, Distributional alignment, Vision-Language Model (VLM), Large Language Model (LLM)

Applications & Tasks

AI Research, Multimodal AI Development, Reasoning, Data Scarcity, Knowledge Transfer, Complex Reasoning in VLMs, Transferring Text-based Reasoning to Vision-Language Tasks

Related Fields

Machine Learning, Deep Learning, Computer Vision, Natural Language Processing, Reinforcement Learning

Keywords

vision-language models, reasoning, LLM distillation, reinforcement learning, GRPO, knowledge transfer, data scarcity, alignment, teacher-student, multimodal AI, VLM, LLM

Academic Context

#Multimodal AI, #Knowledge Transfer, #Reasoning in AI, #Reinforcement Learning, #Large Language Models

Commercial Potential

Potential Products

More sophisticated multimodal question-answering systems, AI agents capable of complex visual reasoning, Tools for generating richer image descriptions

Target Industries

Technology, Media, Education

Use Case Examples

Enabling VLMs to solve complex visual reasoning problems previously only solvable by text LLMs, Improving the ability of VLMs to explain visual content, Developing AI assistants that can reason across text and images

Competitive Edge

Provides a novel method for transferring reasoning capabilities from powerful text LLMs to VLMs, addressing the data scarcity challenge and potentially achieving higher reasoning performance than VLMs trained from scratch.

Market Opportunity

Large (AI market, specifically multimodal AI)

Revenue Models

Licensing of models, API access

Resource Requirements

Compute Needs

High (for training large VLMs and LLMs)

Data Requirements

Requires text-based reasoning datasets for the teacher model and image-text data for the VLM student.

Deployment Constraints

Computational cost of training, Complexity of managing teacher and student models, Ensuring effective alignment

Scalability

Scalability depends on the efficiency of the distillation process and the underlying VLM architecture.

Regulatory Considerations

Low

Production Readiness

Maturity Level

Research

Time to Market

2-4 years

Patent Potential

Moderate (novel distillation and alignment techniques)
