
VOLD: Reasoning Transfer from LLMs to Vision-Language Models via On-Policy Distillation

📄 Abstract

Training vision-language models (VLMs) for complex reasoning remains challenging, in part due to the scarcity of high-quality image-text reasoning data. Text-based reasoning resources, by contrast, are abundant and scalable, but how to leverage them for VLM reasoning remains an open question. To address this problem, we propose VOLD, a framework for transferring reasoning capabilities from text-only teacher models to VLM student models. VOLD combines reinforcement learning via Group Relative Policy Optimization (GRPO) with on-policy distillation, which allows the student's reasoning traces to be guided by the teacher model and yields a significant gain over using GRPO alone. We further show that a cold-start alignment is essential for effective transfer during the online training phase: without sufficient distributional alignment between teacher and student, on-policy distillation fails to provide meaningful guidance. We evaluate VOLD across diverse benchmarks, including MMMU-Pro, MathVision, MathVista, and LogicVista, showing that it significantly outperforms the baseline model and improves over the state of the art. Our ablations show the importance of cold-start alignment via SFT for on-policy distillation with a text-only teacher.
Authors (3)
Walid Bousselham
Hilde Kuehne
Cordelia Schmid
Submitted
October 27, 2025
arXiv Category
cs.CV
arXiv PDF

Key Contributions

Proposes VOLD, a framework for transferring reasoning capabilities from text-only LLMs to VLMs by combining on-policy distillation with GRPO. The work highlights the importance of cold-start SFT and of distributional alignment between teacher and student models for effective reasoning transfer, addressing the scarcity of image-text reasoning data.
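To make the on-policy distillation idea concrete, below is a minimal sketch of its core term: the student samples its own reasoning tokens, and at each sampled position the teacher's next-token distribution supervises the student via a reverse KL penalty, which can then be combined with a GRPO reward objective. This is an illustrative toy implementation over raw logit arrays, not the paper's actual code; the function names and the weighting scheme are assumptions.

```python
import numpy as np

def softmax(logits):
    """Numerically stable softmax over the last axis."""
    z = logits - logits.max(axis=-1, keepdims=True)
    e = np.exp(z)
    return e / e.sum(axis=-1, keepdims=True)

def on_policy_distill_loss(student_logits, teacher_logits):
    """Reverse KL, KL(student || teacher), averaged over sampled positions.

    Because the positions come from the student's own rollout (on-policy),
    the student is penalized wherever it puts probability mass the
    text-only teacher would not. Both inputs have shape (T, vocab).
    """
    p = softmax(student_logits)   # student next-token distributions
    q = softmax(teacher_logits)   # teacher next-token distributions
    return float(np.mean(np.sum(p * (np.log(p) - np.log(q)), axis=-1)))

# Toy example: 3 sampled positions, vocabulary of 5 tokens.
rng = np.random.default_rng(0)
student = rng.normal(size=(3, 5))
teacher = rng.normal(size=(3, 5))

distill = on_policy_distill_loss(student, teacher)

# Hypothetical combined objective: GRPO policy loss plus a weighted
# distillation term (lambda is an assumed hyperparameter, not from the paper).
grpo_loss, lam = 0.0, 0.5
total_loss = grpo_loss + lam * distill
```

The reverse-KL direction matters here: it is evaluated on trajectories the student actually generates, which is what distinguishes on-policy distillation from standard (forward-KL, teacher-sampled) distillation. The cold-start SFT phase the abstract describes would bring the student's distribution close enough to the teacher's for this penalty to carry a useful signal.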

Business Value

Enables the development of more capable VLMs with advanced reasoning abilities by leveraging existing large-scale text reasoning datasets, potentially leading to more intelligent AI assistants and analysis tools.