Abstract
Chain-of-thought (CoT) reasoning is critical for improving the
interpretability and reliability of Large Vision-Language Models (LVLMs).
However, existing training algorithms such as SFT, PPO, and GRPO may
generalize poorly to unseen reasoning tasks and rely heavily on a biased
reward model. To address this challenge, we reformulate reasoning in LVLMs as
posterior inference and propose a scalable training algorithm based on
amortized variational inference. Leveraging diversity-seeking reinforcement
learning, we introduce a novel sparse reward function that provides
token-level learning signals, encouraging diverse, high-likelihood latent
CoTs while overcoming the limitations of deterministic sampling and avoiding
reward hacking. Additionally, we implement a Bayesian inference-scaling
strategy that replaces costly Best-of-N and Beam Search with marginal-likelihood
scoring to efficiently rank optimal rationales and answers. We empirically
demonstrate that the proposed method improves state-of-the-art LVLMs on seven
reasoning benchmarks in terms of effectiveness, generalization, and
interpretability.
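
The abstract does not spell out the training objective, but one common diversity-seeking choice that turns a sparse, sequence-level reward into token-level learning signals is a GFlowNet-style trajectory-balance loss. The sketch below is a minimal illustration under that assumption, not the paper's confirmed method; `log_q_tokens`, `log_reward`, and `log_Z` are hypothetical names, with the reward taken to be a frozen base model's joint likelihood log p(z, y* | x) of a rationale and the ground-truth answer.

```python
import torch

def trajectory_balance_loss(
    log_q_tokens: torch.Tensor,  # (T,) per-token log-probs of a sampled rationale z under q(z | x)
    log_reward: torch.Tensor,    # scalar log R(z), e.g. a frozen model's log p(z, y* | x)
    log_Z: torch.Tensor,         # learned scalar estimate of the log partition function
) -> torch.Tensor:
    """GFlowNet-style trajectory balance: drives q(z | x) toward R(z) / Z.

    The reward is sparse (one scalar per complete rationale), but the squared
    residual backpropagates through every token's log-prob, yielding
    token-level learning signals without a learned reward model.
    """
    log_q = log_q_tokens.sum()  # sequence log-probability of the rationale
    return (log_Z + log_q - log_reward) ** 2
```

Because the optimum of this objective matches sampling probability to reward rather than maximizing it, the sampler spreads probability mass across all high-likelihood rationales instead of collapsing to a single mode, which is what makes the approach diversity-seeking.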
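At inference time, the marginal-likelihood ranking described above can be approximated by importance sampling over rationales drawn from the amortized posterior. Below is a minimal sketch of that idea, not the paper's implementation; `sample_rationale` and `log_joint` are hypothetical callables standing in for the trained sampler q(z | x) and a frozen model's joint log-likelihood.

```python
import math
from typing import Callable, List, Tuple

def rank_answers_by_marginal_likelihood(
    x: str,
    candidate_answers: List[str],
    sample_rationale: Callable[[str], Tuple[str, float]],  # returns (z, log q(z | x))
    log_joint: Callable[[str, str, str], float],           # returns log p(z, y | x)
    num_samples: int = 8,
) -> List[Tuple[str, float]]:
    """Rank candidate answers by an importance-sampling estimate of log p(y | x).

    Unlike Best-of-N, which keeps only the single highest-scoring rationale,
    this marginalizes over all sampled rationales, so an answer supported by
    many plausible chains of thought outranks one backed by a single lucky sample.
    """
    samples = [sample_rationale(x) for _ in range(num_samples)]
    ranked = []
    for y in candidate_answers:
        # Importance weights log p(z, y | x) - log q(z | x), combined via log-mean-exp.
        log_w = [log_joint(x, z, y) - log_qz for z, log_qz in samples]
        m = max(log_w)
        log_p_y = m + math.log(sum(math.exp(w - m) for w in log_w) / num_samples)
        ranked.append((y, log_p_y))
    return sorted(ranked, key=lambda t: t[1], reverse=True)
```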
Authors (8)
Guohao Sun
Hang Hua
Jian Wang
Jiebo Luo
Sohail Dianat
Majid Rabbani
+2 more
Submitted
October 27, 2025
Key Contributions
This paper proposes latent chain-of-thought (latent CoT) reasoning for Large Vision-Language Models (LVLMs), reformulating reasoning as posterior inference and introducing a scalable training algorithm based on amortized variational inference. The method combines diversity-seeking RL with a sparse reward function to encourage diverse latent CoTs, and a Bayesian inference-scaling strategy that ranks rationales and answers by marginal likelihood, improving generalization and avoiding reward hacking.
Business Value
Enhances the trustworthiness and capability of AI systems that interpret visual information and reason about it, leading to more reliable applications in areas like autonomous driving, medical image analysis, and content moderation.