📄 Abstract
Recent breakthroughs in reasoning language models have significantly advanced
text-based reasoning. In contrast, Multi-modal Large Language Models
(MLLMs) still lag behind, hindered by their outdated internal LLMs. Upgrading
these LLMs is often prohibitively expensive, as it requires retraining the full
vision-language alignment. To address this issue, we introduce
Perception-Reasoning Decoupling, which modularizes the MLLM's reasoning
component and makes it easily replaceable. This approach redefines the MLLM's
role to convert multi-modal inputs into detailed textual outputs that can be
processed by any powerful, external, text-only LLM reasoner. To align the
MLLM's perceptual output with the final reasoning task, we propose a novel
reinforcement learning algorithm called Visual Perception Optimization (VPO).
VPO rewards the MLLM based on the correctness of answers generated by the
external reasoner, encouraging faithful, query-relevant captions. Together,
this decoupling pipeline and VPO form our Reasoning-Aligned PerceptIon
Decoupling (RAPID) approach. Empirical results show that RAPID achieves
significant performance gains on multi-modal reasoning benchmarks. Crucially,
RAPID enables a novel inference-time scaling paradigm: Once trained with VPO,
the MLLM can be paired with any state-of-the-art LLM reasoner for consistent
performance improvement without retraining.
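The decoupling described above can be illustrated with a minimal sketch. All model calls below are stubs, and the function names (`caption_image`, `reason_over_text`, `vpo_reward`) are illustrative assumptions, not from the paper's actual implementation; a real system would invoke a trained MLLM and a strong external LLM reasoner.

```python
# Sketch of the perception-reasoning decoupled pipeline with a VPO-style reward.
# The MLLM only perceives (image -> caption); an external text-only LLM reasons.

def caption_image(image, query):
    """Stub MLLM: converts a multi-modal input into a query-relevant caption."""
    return f"A photo showing {image['content']}. Query context: {query}"

def reason_over_text(caption, query):
    """Stub external text-only LLM reasoner that sees only the caption."""
    # A real system would prompt a state-of-the-art LLM here.
    return "3" if "three apples" in caption else "unknown"

def vpo_reward(answer, gold):
    """VPO-style scalar reward: 1.0 if the reasoner's final answer is correct."""
    return 1.0 if answer == gold else 0.0

image = {"content": "three apples on a table"}
query = "How many apples are there?"

caption = caption_image(image, query)          # perception step (MLLM)
answer = reason_over_text(caption, query)      # reasoning step (external LLM)
reward = vpo_reward(answer, gold="3")          # reward signal for RL training
print(answer, reward)                          # prints: 3 1.0
```

Because the reward depends only on the external reasoner's final answer, the MLLM is optimized to emit captions that carry exactly the visual evidence the reasoner needs, which is what lets the reasoner be swapped at inference time without retraining.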
Authors (8)
Yunhao Gou
Kai Chen
Zhili Liu
Lanqing Hong
Xin Jin
Zhenguo Li
+2 more
Key Contributions
Introduces Perception-Reasoning Decoupling, a modular approach that separates perception from reasoning in MLLMs, allowing the use of powerful external text-only LLMs. Also proposes Visual Perception Optimization (VPO), an RL algorithm that rewards the MLLM based on the correctness of answers from the external reasoner, ensuring faithful, query-relevant captions.
Business Value
Allows companies to leverage the latest advancements in LLMs for multimodal tasks without the prohibitive cost of full MLLM retraining, making advanced AI more accessible and adaptable.