Abstract
Visual reasoning abilities play a crucial role in understanding complex
multimodal data, advancing both domain-specific applications and artificial
general intelligence (AGI). Existing methods enhance Vision-Language Models
(VLMs) through Chain-of-Thought (CoT) supervised fine-tuning using meticulously
annotated data. However, this approach may lead to overfitting and cognitive
rigidity, limiting the model's generalization ability under domain shifts and
reducing real-world applicability. To overcome these limitations, we propose
Reason-RFT, a two-stage reinforcement fine-tuning framework for visual
reasoning. First, Supervised Fine-Tuning (SFT) with curated CoT data activates
the reasoning potential of VLMs. This is followed by reinforcement learning
based on Group Relative Policy Optimization (GRPO), which generates multiple
reasoning-response pairs to enhance adaptability to domain shifts. To evaluate
Reason-RFT, we reconstructed a comprehensive dataset covering visual counting,
structural perception, and spatial transformation, serving as a benchmark for
systematic assessment across three key dimensions. Experimental results
highlight three advantages: (1) performance enhancement, with Reason-RFT
achieving state-of-the-art results and outperforming both open-source and
proprietary models; (2) generalization superiority, maintaining robust
performance under domain shifts across various tasks; and (3) data efficiency,
excelling in few-shot learning scenarios and surpassing full-dataset SFT
baselines. Reason-RFT introduces a novel training paradigm for visual reasoning
and marks a significant step forward in multimodal research. Project website:
https://tanhuajie.github.io/ReasonRFT
Key Contributions
Reason-RFT is a novel two-stage reinforcement fine-tuning framework that enhances visual reasoning in VLMs. It combines SFT with RL (GRPO) to improve adaptability to domain shifts and reduce overfitting, addressing limitations of purely supervised CoT methods and enabling more robust and generalizable visual reasoning.
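To make the GRPO stage more concrete, the sketch below shows the group-relative advantage computation that gives GRPO its name: several responses are sampled for the same prompt, each is scored, and every reward is normalized against the mean and standard deviation of its own group. This is only an illustration of the general GRPO idea, not the Reason-RFT implementation. The function name, the example reward values, the use of the population standard deviation, and the eps constant are all assumptions made for this example.

```python
from statistics import mean, pstdev


def group_relative_advantages(rewards, eps=1e-8):
    """Normalize each reward against its own sampling group.

    rewards: scores for several responses sampled for the SAME prompt.
    eps:     small constant (assumed) that avoids division by zero when
             every response in the group receives the same reward.
    """
    mu = mean(rewards)
    sigma = pstdev(rewards)  # population std; an assumption of this sketch
    return [(r - mu) / (sigma + eps) for r in rewards]


# Hypothetical rewards for four responses sampled for one visual question:
# 1.0 = correct final answer, 0.0 = incorrect.
rewards = [1.0, 0.0, 0.0, 1.0]
advantages = group_relative_advantages(rewards)
```

Responses that score above their group's average receive a positive advantage and are reinforced, while those below average are discouraged. Because the baseline comes from the group itself, no separate learned value model is needed.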
Business Value
Reason-RFT improves the reliability and robustness of AI systems that must understand and reason about visual information in diverse, changing environments. This is critical for applications such as autonomous driving, medical diagnosis, and advanced robotics.