Reason-RFT: Reinforcement Fine-Tuning for Visual Reasoning of Vision Language Models

Abstract

Visual reasoning abilities play a crucial role in understanding complex multimodal data, advancing both domain-specific applications and artificial general intelligence (AGI). Existing methods enhance Vision-Language Models (VLMs) through Chain-of-Thought (CoT) supervised fine-tuning using meticulously annotated data. However, this approach may lead to overfitting and cognitive rigidity, limiting the model's generalization ability under domain shifts and reducing real-world applicability. To overcome these limitations, we propose Reason-RFT, a two-stage reinforcement fine-tuning framework for visual reasoning. First, Supervised Fine-Tuning (SFT) with curated CoT data activates the reasoning potential of VLMs. This is followed by reinforcement learning based on Group Relative Policy Optimization (GRPO), which generates multiple reasoning-response pairs to enhance adaptability to domain shifts. To evaluate Reason-RFT, we reconstructed a comprehensive dataset covering visual counting, structural perception, and spatial transformation, serving as a benchmark for systematic assessment across three key dimensions. Experimental results highlight three advantages: (1) performance enhancement, with Reason-RFT achieving state-of-the-art results and outperforming both open-source and proprietary models; (2) generalization superiority, maintaining robust performance under domain shifts across various tasks; and (3) data efficiency, excelling in few-shot learning scenarios and surpassing full-dataset SFT baselines. Reason-RFT introduces a novel training paradigm for visual reasoning and marks a significant step forward in multimodal research. Project website: https://tanhuajie.github.io/ReasonRFT

Key Contributions

Reason-RFT is a novel two-stage reinforcement fine-tuning framework that enhances visual reasoning in VLMs. It combines SFT with RL (GRPO) to improve adaptability to domain shifts and reduce overfitting, addressing limitations of purely supervised CoT methods and enabling more robust and generalizable visual reasoning.
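The group-relative idea behind GRPO can be illustrated with a small sketch. This is not the authors' implementation: it only shows the commonly described step in which several responses are sampled for the same prompt and each response's reward is normalized against the mean and standard deviation of its own group. The function name, the example rewards, the epsilon term, and the choice of population standard deviation are all assumptions made for illustration.

```python
from statistics import mean, pstdev


def group_relative_advantages(rewards, eps=1e-8):
    """Normalize each reward against its own sampling group.

    Hypothetical helper: `rewards` holds one scalar reward per response
    sampled for the same prompt. `eps` is an assumed small constant that
    avoids division by zero when all rewards in the group are equal.
    """
    mu = mean(rewards)
    sigma = pstdev(rewards)  # population std; an assumption, not from the paper
    return [(r - mu) / (sigma + eps) for r in rewards]


# Hypothetical rewards for four responses sampled for one visual question
# (1.0 = correct final answer, 0.0 = incorrect).
rewards = [1.0, 0.0, 0.0, 1.0]
advantages = group_relative_advantages(rewards)
```

Under these assumptions, responses that score above their group's average receive a positive advantage and those below receive a negative one, so no separate value model is needed to provide a baseline.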

Business Value

Improves the reliability and robustness of AI systems that need to understand and reason about visual information in diverse and changing environments. This is critical for applications like autonomous driving, medical diagnosis, and advanced robotics.

Paper Metadata

Innovation Type

Training Framework

Deployment Feasibility

Moderate. Requires significant computational resources for RL training and careful curation of training data.

Limitations Addressed

Overfitting and cognitive rigidity from supervised fine-tuning
Limited generalization ability under domain shifts
Lack of adaptability in existing VLM reasoning methods

Performance Gains

Enhanced adaptability to domain shifts through RL.

Technical Tags

visual reasoning, vision-language models, reinforcement learning, chain-of-thought, supervised fine-tuning, domain adaptation, generalization, Group Relative Policy Optimization (GRPO), multimodal AI, cognitive rigidity

Research Topics

Artificial Intelligence, Natural Language Processing, Computer Vision, Machine Learning, Reasoning

Methods & Architectures

Two-stage reinforcement fine-tuning, Supervised Fine-Tuning (SFT), Reinforcement Learning (RL) with Group Relative Policy Optimization (GRPO), Curated Chain-of-Thought (CoT) data, Vision-Language Models (VLMs)

Applications & Tasks

Multimodal AI, Robotics, Autonomous Systems, Information Retrieval, Visual Reasoning Enhancement, Domain Generalization, Overfitting Mitigation
Improving VLM visual reasoning capabilities
Enhancing adaptability to domain shifts
Reducing cognitive rigidity in reasoning processes

Datasets & Benchmarks

Datasets

Comprehensive reconstructed dataset covering visual counting, structural perception, and spatial transformation

Adaptability to domain shifts, Generalization performance

Related Fields

AI Safety, Cognitive Science, Robotics

Keywords

visual reasoning, vision-language models, reinforcement learning, chain-of-thought, supervised fine-tuning, domain adaptation, generalization, multimodal AI, GRPO, deep learning, AI reasoning

Academic Context

#Artificial Intelligence, #Natural Language Processing, #Computer Vision, #Machine Learning, #Reasoning

Technology Stack

Frameworks & Libraries

GRPO (Group Relative Policy Optimization, an RL algorithm; no specific library is named)

Commercial Potential

Potential Products

More robust multimodal AI assistants
Advanced visual question answering systems
AI for autonomous systems requiring visual understanding

Target Industries

Automotive, Healthcare, Robotics, Security, E-commerce

Use Case Examples

Enabling autonomous vehicles to reason about complex traffic scenarios
AI systems that can diagnose medical conditions from images and reports
Robots that can understand and interact with their environment visually

Competitive Edge

Offers a more adaptive and generalizable approach to visual reasoning compared to static supervised methods by incorporating reinforcement learning to handle domain shifts and reduce overfitting.

Resource Requirements

Compute Needs

High (for RL training)

Data Requirements

Curated Chain-of-Thought data, potentially diverse datasets for domain shift evaluation.

Deployment Constraints

Complexity of the RL training process, potential for unexpected behaviors.

Scalability

Scalability depends on the efficiency of the RL algorithm and the VLM architecture.
