📄 Abstract
Training vision-language models (VLMs) for complex reasoning remains a
challenging task, in part due to the scarcity of high-quality image-text
reasoning data. Conversely, text-based reasoning resources are abundant and
scalable, but how to leverage them for VLM reasoning is still an open question. To
address this problem, we propose VOLD, a framework to transfer reasoning
capabilities from text-only teacher models to VLM student models. To this end,
VOLD combines reinforcement learning via Group Relative Policy Optimization
(GRPO) with on-policy distillation, which allows the student reasoning traces
to be guided by the teacher model, resulting in a significant gain over using
GRPO alone. We further show that a cold-start alignment is essential for an
effective transfer during the online training phase in this scenario and that
without sufficient distributional alignment between teacher and student,
on-policy distillation fails to provide meaningful guidance. We evaluate VOLD
across diverse benchmarks including MMMU-Pro, MathVision, MathVista, and
LogicVista, showing that VOLD significantly outperforms the baseline model and
improves over the state of the art. Our ablation shows the
importance of a cold-start alignment via SFT for on-policy distillation with a
text-only teacher.
Authors (3)
Walid Bousselham
Hilde Kuehne
Cordelia Schmid
Submitted
October 27, 2025
Key Contributions
Proposes VOLD, a framework for transferring reasoning capabilities from text-only LLMs to VLMs using on-policy distillation combined with GRPO. It highlights the importance of cold-start and distributional alignment between teacher and student models for effective reasoning transfer, addressing data scarcity issues.
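To make the combination of GRPO and on-policy distillation more concrete, here is a minimal sketch of how such an objective might be assembled. It is not the paper's implementation: the function names, the use of a reverse KL term, the sequence-level policy-gradient surrogate (without GRPO's clipping), and the weight `beta` are all assumptions chosen for illustration.

```python
import math

def group_relative_advantages(rewards, eps=1e-8):
    """GRPO-style advantages: standardize rewards within one group of
    responses sampled for the same prompt."""
    mean = sum(rewards) / len(rewards)
    var = sum((r - mean) ** 2 for r in rewards) / len(rewards)
    std = math.sqrt(var)
    return [(r - mean) / (std + eps) for r in rewards]

def reverse_kl(student_probs, teacher_probs, eps=1e-12):
    """KL(student || teacher) over one next-token distribution.
    Assumed here as the distillation signal on student-sampled tokens."""
    return sum(
        p * (math.log(p + eps) - math.log(q + eps))
        for p, q in zip(student_probs, teacher_probs)
    )

def combined_loss(seq_logprobs, rewards, token_kls, beta=0.1):
    """Hypothetical combined objective for one group of sampled responses.

    seq_logprobs: student log-probability of each sampled response
    rewards:      scalar reward of each sampled response
    token_kls:    per-response lists of per-token reverse KL values
    beta:         assumed weight of the distillation term
    """
    advantages = group_relative_advantages(rewards)
    # Simplified policy-gradient surrogate (no ratio clipping).
    pg_loss = -sum(a * lp for a, lp in zip(advantages, seq_logprobs)) / len(rewards)
    # Mean per-token KL, averaged over the responses in the group.
    kd_loss = sum(sum(k) / len(k) for k in token_kls) / len(token_kls)
    return pg_loss + beta * kd_loss

if __name__ == "__main__":
    loss = combined_loss(
        seq_logprobs=[-1.0, -2.0],
        rewards=[1.0, 0.0],
        token_kls=[[0.0, 0.2], [0.1, 0.1]],
    )
    print(round(loss, 4))
```

In this sketch, both terms are computed on responses sampled from the student, which is what makes the distillation on-policy. The KL term is only informative when the teacher assigns reasonable probability to the student's tokens, which is one way to read the abstract's point about distributional alignment.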
Business Value
Enables the development of more capable VLMs with advanced reasoning abilities by leveraging existing large-scale text reasoning datasets, potentially leading to more intelligent AI assistants and analysis tools.