📄 Abstract
Recent breakthroughs in reasoning language models have significantly advanced
text-based reasoning. In contrast, Multi-modal Large Language Models
(MLLMs) still lag behind, hindered by their outdated internal LLMs. Upgrading
these LLMs is often prohibitively expensive, as it requires retraining the full
vision-language alignment. To address this issue, we introduce
Perception-Reasoning Decoupling, which modularizes the MLLM's reasoning
component and makes it easily replaceable. This approach redefines the MLLM's
role to convert multi-modal inputs into detailed textual outputs that can be
processed by any powerful, external, text-only LLM reasoner. To align the
MLLM's perceptual output with the final reasoning task, we propose a novel
reinforcement learning algorithm called Visual Perception Optimization (VPO).
VPO rewards the MLLM based on the correctness of answers generated by the
external reasoner, encouraging faithful, query-relevant captions. Together,
this decoupling pipeline and VPO form our Reasoning-Aligned PerceptIon
Decoupling (RAPID) approach. Empirical results show that RAPID achieves
significant performance gains on multi-modal reasoning benchmarks. Crucially,
RAPID enables a novel inference-time scaling paradigm: Once trained with VPO,
the MLLM can be paired with any state-of-the-art LLM reasoner for consistent
performance improvement without retraining.
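The decoupling described above can be illustrated with a minimal sketch. All model calls below are stubs, and the function names (`caption_image`, `reason_over_text`, `vpo_reward`) are illustrative assumptions, not from the paper's actual implementation; a real system would invoke a trained MLLM and a strong external LLM reasoner.

```python
# Sketch of the perception-reasoning decoupled pipeline with a VPO-style reward.
# The MLLM only perceives (image -> caption); an external text-only LLM reasons.

def caption_image(image, query):
    """Stub MLLM: converts a multi-modal input into a query-relevant caption."""
    return f"A photo showing {image['content']}. Query context: {query}"

def reason_over_text(caption, query):
    """Stub external text-only LLM reasoner that sees only the caption."""
    # A real system would prompt a state-of-the-art LLM here.
    return "3" if "three apples" in caption else "unknown"

def vpo_reward(answer, gold):
    """VPO-style scalar reward: 1.0 if the reasoner's final answer is correct."""
    return 1.0 if answer == gold else 0.0

image = {"content": "three apples on a table"}
query = "How many apples are there?"

caption = caption_image(image, query)          # perception step (MLLM)
answer = reason_over_text(caption, query)      # reasoning step (external LLM)
reward = vpo_reward(answer, gold="3")          # reward signal for RL training
print(answer, reward)                          # prints: 3 1.0
```

Because the reward depends only on the external reasoner's final answer, the MLLM is optimized to emit captions that carry exactly the visual evidence the reasoner needs, which is what lets the reasoner be swapped at inference time without retraining.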
Authors (8)
Yunhao Gou
Kai Chen
Zhili Liu
Lanqing Hong
Xin Jin
Zhenguo Li
+2 more
Key Contributions
Introduces Perception-Reasoning Decoupling, a modular approach that separates perception from reasoning in MLLMs, allowing the use of powerful external text-only LLMs. Also proposes Visual Perception Optimization (VPO), an RL algorithm that rewards the MLLM based on the correctness of answers from the external reasoner, ensuring faithful, query-relevant captions.
Business Value
Allows companies to leverage the latest advancements in LLMs for multimodal tasks without the prohibitive cost of full MLLM retraining, making advanced AI more accessible and adaptable.