Abstract
While Reinforcement Learning with Verifiable Rewards (RLVR) has enhanced the
reasoning of large language models (LLMs), its efficacy for lightweight
multimodal large language models (MLLMs) with fewer than seven billion
parameters remains underexplored. This paper investigates the role of long
Chain-of-Thought (long CoT) data in enhancing the reasoning abilities of such
MLLMs. Our findings demonstrate that Supervised Fine-Tuning (SFT) with long CoT
data significantly improves MLLM reasoning. Furthermore, we observe that after
this initial SFT phase, MLLMs can achieve additional performance gains through
a subsequent RL stage. We conclude that an SFT stage with long CoT data is a
critical prerequisite for developing the reasoning capabilities of lightweight
MLLMs.