Research Paper. Audience: VLM Researchers, AI Alignment Researchers, Multimodal AI Developers, Robotics Engineers

Sherlock: Self-Correcting Reasoning in Vision-Language Models

📄 Abstract

Reasoning Vision-Language Models (VLMs) have shown promising performance on complex multimodal tasks. However, they still face significant challenges: they are highly sensitive to reasoning errors, require large volumes of annotated data or accurate verifiers, and struggle to generalize beyond specific domains. To address these limitations, we explore self-correction as a strategy to enhance reasoning VLMs. We first conduct an in-depth analysis of reasoning VLMs' self-correction abilities and identify key gaps. Based on our findings, we introduce Sherlock, a self-correction and self-improvement training framework. Sherlock introduces a trajectory-level self-correction objective, a preference data construction method based on visual perturbation, and a dynamic $\beta$ for preference tuning. Once the model acquires self-correction capabilities using only 20k randomly sampled annotated data, it continues to self-improve without external supervision. Built on the Llama3.2-Vision-11B model, Sherlock achieves remarkable results across eight benchmarks, reaching an average accuracy of 64.1 with direct generation and 65.4 after self-correction. It outperforms LLaVA-CoT (63.2), Mulberry (63.9), and LlamaV-o1 (63.4) while using less than 20% of the annotated data.
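The abstract mentions preference tuning with a dynamic $\beta$ but does not say how $\beta$ is chosen. As a rough illustration only, the sketch below computes the standard DPO-style loss for a single preference pair and accepts $\beta$ per example. The `dynamic_beta` schedule, its `perturbation_strength` input, and the `beta_min`/`beta_max` defaults are hypothetical placeholders, not Sherlock's actual rule.

```python
import math


def dpo_loss(logp_chosen, logp_rejected, ref_logp_chosen, ref_logp_rejected, beta):
    """DPO-style loss for one preference pair: -log(sigmoid(beta * margin)).

    The margin is how much more the policy prefers the chosen response over
    the rejected one, relative to a frozen reference model.
    """
    margin = (logp_chosen - ref_logp_chosen) - (logp_rejected - ref_logp_rejected)
    x = beta * margin
    # Numerically stable softplus(-x), which equals -log(sigmoid(x)).
    return max(-x, 0.0) + math.log1p(math.exp(-abs(x)))


def dynamic_beta(perturbation_strength, beta_min=0.05, beta_max=0.5):
    """HYPOTHETICAL schedule: interpolate beta from a score clipped to [0, 1].

    Assumption for illustration only; the summary does not describe how
    Sherlock sets its per-example beta.
    """
    s = min(max(perturbation_strength, 0.0), 1.0)
    return beta_min + (beta_max - beta_min) * s


if __name__ == "__main__":
    beta = dynamic_beta(0.8)
    # Policy favors the chosen response more than the reference does,
    # so the loss falls below the zero-margin value log(2).
    print(dpo_loss(-1.0, -3.0, -2.0, -2.0, beta))
```

Passing $\beta$ as an argument rather than a global constant is the only structural change a per-example schedule needs; everything else is the usual preference-tuning loss.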

Authors (2)

Yi Ding

Ruqi Zhang

Submitted

May 28, 2025

arXiv Category

cs.CV

Key Contributions

Introduces Sherlock, a self-correction and self-improvement training framework for reasoning Vision-Language Models (VLMs). It combines a trajectory-level self-correction objective, a preference data construction method based on visual perturbation, and a dynamic $\beta$ for preference tuning. After acquiring self-correction ability from 20k randomly sampled annotated examples, the model continues to improve without external supervision.
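One way to picture preference data construction by visual perturbation is sketched below: the same question is answered once from the clean image and once from a degraded copy, and the two answers form a preference pair. This is an assumption-laden illustration, not the paper's procedure. The Gaussian pixel noise, the choice of the clean-image answer as "chosen", and the `generate` callable (a stand-in for a VLM) are all hypothetical.

```python
import random


def perturb_image(pixels, noise_std, seed=0):
    """Add Gaussian noise to a grayscale image given as rows of 0-255 ints.

    Gaussian noise is an assumed perturbation; the summary does not say
    which visual perturbation Sherlock applies.
    """
    rng = random.Random(seed)
    return [
        [min(255, max(0, round(p + rng.gauss(0.0, noise_std)))) for p in row]
        for row in pixels
    ]


def build_preference_pair(generate, pixels, question, noise_std=50.0):
    """Build one preference record from a clean and a perturbed view.

    `generate(pixels, question) -> str` is a hypothetical model call.
    """
    chosen = generate(pixels, question)  # response grounded in the clean image
    rejected = generate(perturb_image(pixels, noise_std), question)  # degraded input
    return {"prompt": question, "chosen": chosen, "rejected": rejected}


if __name__ == "__main__":
    # Toy stand-in for a VLM: "describes" the image by its mean brightness.
    def toy_model(pixels, question):
        flat = [p for row in pixels for p in row]
        return f"mean brightness {sum(flat) / len(flat):.1f}"

    image = [[10, 20, 30], [40, 50, 60]]
    print(build_preference_pair(toy_model, image, "How bright is the image?"))
```

Because both responses come from the model itself, a pair like this needs no human label, which is consistent with the summary's claim that training continues without external supervision.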

Business Value

Leads to more reliable and adaptable multimodal AI systems, capable of performing complex reasoning tasks in diverse environments, reducing development costs associated with data annotation.

Paper Metadata

Innovation Type

Algorithmic/Training Methodology

Deployment Feasibility

High, as it focuses on improving existing VLM architectures and training paradigms.

Limitations Addressed

High sensitivity to reasoning errors, need for large annotated datasets or accurate verifiers, and poor generalization beyond specific domains in current reasoning VLMs.

Performance Gains

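Average accuracy across eight benchmarks of 64.1 with direct generation and 65.4 after self-correction, compared with LLaVA-CoT (63.2), Mulberry (63.9), and LlamaV-o1 (63.4), while using less than 20% of the annotated data.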
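To put the size of the gains in concrete terms, the snippet below computes Sherlock's margin over each baseline from the average accuracies quoted in the abstract. The numbers come from the abstract; the dictionary layout and function name are just for illustration.

```python
# Average accuracy over eight benchmarks, as reported in the abstract.
BASELINES = {"LLaVA-CoT": 63.2, "Mulberry": 63.9, "LlamaV-o1": 63.4}
SHERLOCK = {"direct": 64.1, "self_corrected": 65.4}


def margins(score, baselines=BASELINES):
    """Return the accuracy margin (in points) of `score` over each baseline."""
    return {name: round(score - base, 1) for name, base in baselines.items()}


if __name__ == "__main__":
    for mode, score in SHERLOCK.items():
        print(mode, margins(score))
    # Gain attributable to the self-correction pass itself.
    print("self-correction gain:", round(SHERLOCK["self_corrected"] - SHERLOCK["direct"], 1))
```

With direct generation the margins are 0.9, 0.2, and 0.7 points; after self-correction they widen to 2.2, 1.5, and 2.0 points, and the self-correction pass itself adds 1.3 points.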

Technical Tags

Vision-Language Models (VLMs), Reasoning, Self-Correction, Self-Improvement, Multimodal AI, Preference Tuning, Trajectory Optimization, Generalization

Research Topics

Multimodal Reasoning, AI Alignment, Model Improvement, Generalization in AI, Vision-Language Understanding

Methods & Architectures

Self-correction framework, Self-improvement training, Trajectory-level self-correction objective, Preference data construction (visual perturbation), Dynamic beta for preference tuning, Vision-Language Models (VLMs), Llama3.2-Vision (base model)

Applications & Tasks

Multimodal AI, Robotics, Human-Computer Interaction
Reasoning errors in VLMs, Sensitivity to errors, Need for large annotated datasets, Domain generalization challenges
Improving reasoning accuracy in VLMs, Enhancing generalization capabilities, Reducing reliance on large annotated datasets

Related Fields

Computer Vision, Natural Language Processing, Machine Learning, AI Alignment, Robotics

Keywords

Vision-Language Models, VLMs, Reasoning, Self-Correction, Self-Improvement, Multimodal AI, Preference Tuning, Generalization, AI Alignment, LLMs, Deep Learning, Trajectory Optimization

Academic Context

#Multimodal Reasoning, #AI Alignment, #Model Improvement, #Generalization in AI, #Vision-Language Understanding

Technology Stack

Frameworks & Libraries

Llama3.2-Vision (base model)

Commercial Potential

Potential Products

More robust and generalizable multimodal assistants; AI systems for complex visual question answering and instruction following

Target Industries

Robotics, Autonomous Systems, Content Creation, Customer Service

Use Case Examples

Robots that can understand and execute complex visual instructions; AI assistants that can reason about images and text for creative tasks; improved visual question answering systems

Competitive Edge

Addresses critical limitations in current reasoning VLMs by introducing a novel self-correction and self-improvement mechanism, enabling better generalization and reduced data dependency.

Market Opportunity

Rapidly growing market for multimodal AI and advanced reasoning systems.

Revenue Models

Licensing of improved VLMs; development of specialized multimodal AI solutions.

Resource Requirements

Compute Needs

High, requires significant computational resources for training large VLMs.

Data Requirements

Requires 20k randomly sampled annotated examples for initial training, then relies on self-generated preference data for further improvement.

Deployment Constraints

The effectiveness of self-correction depends on the quality of the self-correction objective and the ability to generate meaningful preference data.

Scalability

Scales with the underlying VLM architecture and the effectiveness of the self-correction mechanism.

Regulatory Considerations

High, especially for applications in safety-critical domains like robotics or autonomous driving.

Production Readiness

Maturity Level

Research

Time to Market

2-4 years

Patent Potential

Moderate, for the self-correction and self-improvement training methodology.
