Reason-RFT: Reinforcement Fine-Tuning for Visual Reasoning of Vision Language Models

📄 Abstract

Visual reasoning abilities play a crucial role in understanding complex multimodal data, advancing both domain-specific applications and artificial general intelligence (AGI). Existing methods enhance Vision-Language Models (VLMs) through Chain-of-Thought (CoT) supervised fine-tuning using meticulously annotated data. However, this approach may lead to overfitting and cognitive rigidity, limiting the model's generalization ability under domain shifts and reducing real-world applicability. To overcome these limitations, we propose Reason-RFT, a two-stage reinforcement fine-tuning framework for visual reasoning. First, Supervised Fine-Tuning (SFT) with curated CoT data activates the reasoning potential of VLMs. This is followed by reinforcement learning based on Group Relative Policy Optimization (GRPO), which generates multiple reasoning-response pairs to enhance adaptability to domain shifts. To evaluate Reason-RFT, we reconstructed a comprehensive dataset covering visual counting, structural perception, and spatial transformation, serving as a benchmark for systematic assessment across three key dimensions. Experimental results highlight three advantages: (1) performance enhancement, with Reason-RFT achieving state-of-the-art results and outperforming both open-source and proprietary models; (2) generalization superiority, maintaining robust performance under domain shifts across various tasks; and (3) data efficiency, excelling in few-shot learning scenarios and surpassing full-dataset SFT baselines. Reason-RFT introduces a novel training paradigm for visual reasoning and marks a significant step forward in multimodal research. Project website: https://tanhuajie.github.io/ReasonRFT
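The GRPO stage described above scores each sampled reasoning-response pair relative to its siblings from the same prompt, rather than against a learned value baseline. Below is a minimal sketch of that group-relative advantage computation, following the standard GRPO formulation from the RL literature; the function name and the 0/1 answer-correctness reward are illustrative assumptions, not taken from the paper's code.

```python
import torch

def group_relative_advantages(rewards: torch.Tensor, eps: float = 1e-6) -> torch.Tensor:
    """Standardize a group's rewards so each response is scored relative
    to the other responses sampled for the SAME prompt (GRPO-style).

    rewards: shape (G,), one scalar reward per sampled reasoning-response pair.
    Returns: shape (G,), the group-normalized advantages.
    """
    return (rewards - rewards.mean()) / (rewards.std() + eps)

# Illustrative: 4 responses sampled for one visual-reasoning prompt,
# rewarded 1.0 for a correct final answer and 0.0 otherwise (assumed reward).
rewards = torch.tensor([1.0, 0.0, 0.0, 1.0])
print(group_relative_advantages(rewards))
# Correct responses get positive advantages and incorrect ones negative,
# which pushes the policy toward answers that beat their own group.
```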

Key Contributions

Reason-RFT is a two-stage reinforcement fine-tuning framework that enhances visual reasoning in VLMs: a supervised fine-tuning (SFT) stage on curated Chain-of-Thought data first activates the model's reasoning ability, after which reinforcement learning with Group Relative Policy Optimization (GRPO) refines it. This combination reduces the overfitting and cognitive rigidity of purely supervised CoT training and yields more robust, generalizable reasoning under domain shifts.
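For background, the standard GRPO objective (in its simplified sequence-level form from the RL literature) clips a likelihood ratio against the group-normalized advantage and regularizes toward a reference policy. The abstract does not spell out the exact objective Reason-RFT optimizes, so take this as context rather than the paper's definition:

```latex
% Sequence-level GRPO objective; background from the GRPO literature,
% not quoted from the Reason-RFT paper.
\[
\mathcal{J}_{\mathrm{GRPO}}(\theta)
= \mathbb{E}\left[\frac{1}{G}\sum_{i=1}^{G}
  \min\!\Big(\rho_i(\theta)\,\hat{A}_i,\;
  \mathrm{clip}\big(\rho_i(\theta),\,1-\epsilon,\,1+\epsilon\big)\,\hat{A}_i\Big)\right]
  - \beta\,\mathbb{D}_{\mathrm{KL}}\big(\pi_\theta \,\|\, \pi_{\mathrm{ref}}\big),
\qquad
\hat{A}_i = \frac{r_i - \mathrm{mean}(\{r_j\}_{j=1}^{G})}{\mathrm{std}(\{r_j\}_{j=1}^{G})}
\]
```

Here $G$ responses are sampled per prompt, $\rho_i(\theta)$ is the ratio of the current to the old policy's likelihood of response $i$, $r_i$ is its scalar reward, and $\beta$ weights the KL penalty toward the reference policy $\pi_{\mathrm{ref}}$.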

Business Value

Improves the reliability and robustness of AI systems that need to understand and reason about visual information in diverse and changing environments. This is critical for applications like autonomous driving, medical diagnosis, and advanced robotics.