Redirecting to original paper in 30 seconds...
Click below to go immediately or wait for automatic redirect
📄 Abstract
Abstract: Enhancing the multimodal reasoning capabilities of Multimodal Large Language
Models (MLLMs) is a challenging task that has attracted increasing attention in
the community. Recently, several studies have applied Reinforcement Learning
with Verifiable Rewards (RLVR) to the multimodal domain in order to enhance the
reasoning abilities of MLLMs. However, these works largely overlook the
enhancement of multimodal perception capabilities in MLLMs, which serve as a
core prerequisite and foundational component of complex multimodal reasoning.
Through McNemar's test, we find that existing RLVR method fails to effectively
enhance the multimodal perception capabilities of MLLMs, thereby limiting their
further improvement in multimodal reasoning. To address this limitation, we
propose Perception-R1, which introduces a novel visual perception reward that
explicitly encourages MLLMs to perceive the visual content accurately, thereby
can effectively incentivizing both their multimodal perception and reasoning
capabilities. Specifically, we first collect textual visual annotations from
the CoT trajectories of multimodal problems, which will serve as visual
references for reward assignment. During RLVR training, we employ a judging LLM
to assess the consistency between the visual annotations and the responses
generated by MLLM, and assign the visual perception reward based on these
consistency judgments. Extensive experiments on several multimodal reasoning
benchmarks demonstrate the effectiveness of our Perception-R1, which achieves
state-of-the-art performance on most benchmarks using only 1,442 training data.