
Reasoning-Aligned Perception Decoupling for Scalable Multi-modal Reasoning

Abstract

Recent breakthroughs in reasoning language models have significantly advanced text-based reasoning. Multi-modal Large Language Models (MLLMs), however, still lag behind, hindered by their outdated internal LLMs. Upgrading these internal LLMs is often prohibitively expensive, as it requires complete vision-language alignment retraining. To address this issue, we introduce Perception-Reasoning Decoupling, which modularizes the MLLM's reasoning component and makes it easily replaceable. This approach redefines the MLLM's role: it converts multi-modal inputs into detailed textual outputs that can be processed by any powerful, external, text-only LLM reasoner. To align the MLLM's perceptual output with the final reasoning task, we propose a novel reinforcement learning algorithm called Visual Perception Optimization (VPO). VPO rewards the MLLM according to the correctness of the answers the external reasoner produces from its captions, encouraging faithful and query-relevant descriptions. Together, this decoupling pipeline and VPO form our Reasoning-Aligned PerceptIon Decoupling (RAPID) approach. Empirical results show that RAPID achieves significant performance gains on multi-modal reasoning benchmarks. Crucially, RAPID enables a novel inference-time scaling paradigm: once trained with VPO, the MLLM can be paired with any state-of-the-art LLM reasoner for consistent performance improvement without retraining.
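The decoupled pipeline the abstract describes can be sketched as a two-stage call: perception (MLLM produces a query-relevant caption) feeding reasoning (a swappable text-only LLM). This is a minimal illustrative sketch with stub functions; the function names and string outputs are assumptions, not the paper's implementation.

```python
def mllm_caption(image, query):
    """Perception stage (stub): an MLLM converts the multi-modal input
    into a detailed, query-relevant textual description."""
    return f"Detailed caption of {image} relevant to: {query}"

def llm_reason(caption, query):
    """Reasoning stage (stub): a text-only LLM answers the query from
    the caption alone, never seeing the image."""
    return f"Answer derived from [{caption}] for [{query}]"

def rapid_inference(image, query, reasoner=llm_reason):
    # The reasoner is a plain argument: swapping in a stronger text-only
    # LLM requires no retraining of the perception model, which is the
    # inference-time scaling property the abstract claims.
    caption = mllm_caption(image, query)
    return reasoner(caption, query)
```

Because the reasoner only ever sees text, any stronger model with the same text-in/text-out interface can be dropped in after VPO training.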
Authors (8)
Yunhao Gou
Kai Chen
Zhili Liu
Lanqing Hong
Xin Jin
Zhenguo Li
+2 more
Submitted
June 5, 2025
arXiv Category
cs.CV

Key Contributions

Introduces Perception-Reasoning Decoupling, a modular approach that separates perception from reasoning in MLLMs, allowing powerful external text-only LLMs to serve as the reasoner. Also proposes Visual Perception Optimization (VPO), an RL algorithm that rewards the MLLM based on the correctness of the external reasoner's answers, yielding faithful and query-relevant captions.
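The VPO reward described above can be sketched as follows: the MLLM's caption is scored by whether an external reasoner, reading only that caption, answers the query correctly. The function names, the toy reasoner, and the binary exact-match reward are illustrative assumptions; the paper's exact objective may differ.

```python
def vpo_reward(caption, query, gold_answer, reasoner):
    """Return 1.0 if the external reasoner answers correctly from the
    caption alone, else 0.0 (assumed binary correctness reward)."""
    predicted = reasoner(caption, query)
    return 1.0 if predicted.strip() == gold_answer.strip() else 0.0

def toy_reasoner(caption, query):
    # Toy stand-in: answers correctly only if the caption mentions
    # the queried fact, mimicking a text-only reasoner's dependence
    # on caption faithfulness.
    return "blue" if "blue" in caption else "unknown"

faithful = vpo_reward("The car is blue.", "What color is the car?",
                      "blue", toy_reasoner)      # reward 1.0
unfaithful = vpo_reward("A car is parked.", "What color is the car?",
                        "blue", toy_reasoner)    # reward 0.0
```

A caption that omits the queried fact earns no reward even if it is otherwise accurate, which is how this signal pushes the MLLM toward query-relevant rather than merely generic descriptions.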

Business Value

Allows companies to leverage the latest advancements in LLMs for multimodal tasks without the prohibitive cost of full MLLM retraining, making advanced AI more accessible and adaptable.