Abstract
Chain-of-thought reasoning has significantly improved the performance of
Large Language Models (LLMs) across various domains. However, this reasoning
process has been confined exclusively to textual space, limiting its
effectiveness in visually intensive tasks. To address this limitation, we
introduce the concept of reasoning in pixel space. Within this novel
framework, Vision-Language Models (VLMs) are equipped with a suite of visual
reasoning operations, such as zoom-in and select-frame. These operations enable
VLMs to directly inspect, interrogate, and infer from visual evidence, thereby
enhancing reasoning fidelity for visual tasks. Cultivating such pixel-space
reasoning capabilities in VLMs presents notable challenges, including the
model's initially imbalanced competence and its reluctance to adopt the newly
introduced pixel-space operations. We address these challenges through a
two-phase training approach. The first phase employs instruction tuning on
synthesized reasoning traces to familiarize the model with the novel visual
operations. Following this, a reinforcement learning (RL) phase leverages a
curiosity-driven reward scheme to balance exploration between pixel-space
reasoning and textual reasoning. With these visual operations, VLMs can
interact with complex visual inputs, such as information-rich images or videos,
to proactively gather the necessary information. We demonstrate that this approach
significantly improves VLM performance across diverse visual reasoning
benchmarks. Our 7B model, Pixel Reasoner, achieves 84% on V* Bench, 74% on
TallyQA-Complex, and 84% on InfographicsVQA, marking the highest accuracy
achieved by any open-source model to date. These results highlight the
importance of pixel-space reasoning and the effectiveness of our framework.
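
To make the visual-operation interface concrete, the sketch below shows how zoom-in and select-frame could be exposed to a VLM as callable tools. This is a minimal illustration: the function names, signatures, and the action format in the final comment are assumptions for exposition, not the paper's actual API.

# Illustrative sketch of a pixel-space operation toolkit.
# Operation names follow the abstract (zoom-in, select-frame);
# the signatures and dispatch convention are assumptions.
from dataclasses import dataclass
from typing import List

from PIL import Image


@dataclass
class BBox:
    left: int
    top: int
    right: int
    bottom: int


def zoom_in(image: Image.Image, box: BBox, upscale: int = 4) -> Image.Image:
    """Crop a region of interest and upsample it so that fine details
    become legible to the VLM's vision encoder."""
    crop = image.crop((box.left, box.top, box.right, box.bottom))
    return crop.resize((crop.width * upscale, crop.height * upscale),
                       Image.LANCZOS)


def select_frame(video: List[Image.Image], indices: List[int]) -> List[Image.Image]:
    """Pick out specific frames of a video for closer inspection."""
    return [video[i] for i in indices if 0 <= i < len(video)]

# In a reasoning loop, the model would emit a structured action such as
# {"op": "zoom_in", "box": [40, 80, 220, 260]}; the harness executes it and
# appends the returned image to the context as fresh visual evidence.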
Authors (5)
Haozhe Wang
Alex Su
Weiming Ren
Fangzhen Lin
Wenhu Chen
Key Contributions
Introduces 'Pixel Reasoner', a framework that enables Vision-Language Models (VLMs) to reason directly in pixel space through visual operations such as zoom-in and select-frame. These capabilities are cultivated through a two-phase training approach, instruction tuning on synthesized reasoning traces followed by curiosity-driven reinforcement learning, enhancing reasoning fidelity for visual tasks.
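
The curiosity-driven reward can be pictured as a task reward plus a bonus that pays out while pixel-space operations remain under-used, so the policy keeps exploring them instead of collapsing back to purely textual reasoning. The sketch below is an assumed shaping: the coefficient alpha, the target rate, and the exact functional form are illustrative, not the paper's formulation.

def shaped_reward(correct: bool,
                  used_pixel_ops: bool,
                  pixel_op_rate: float,
                  target_rate: float = 0.5,
                  alpha: float = 0.1) -> float:
    """Task reward plus an assumed curiosity bonus. pixel_op_rate is the
    rolling fraction of recent rollouts that invoked a visual operation;
    the bonus vanishes once that rate reaches target_rate, balancing
    exploration of pixel-space reasoning against textual reasoning."""
    r_task = 1.0 if correct else 0.0
    r_curiosity = (alpha * max(0.0, target_rate - pixel_op_rate)
                   if used_pixel_ops else 0.0)
    return r_task + r_curiosity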
Business Value
Enables more sophisticated visual understanding and reasoning in AI systems, leading to advancements in areas like autonomous driving, medical image analysis, and content moderation.