Abstract
The limited capacity for fine-grained visual perception presents a critical
bottleneck for Vision-Language Models (VLMs) in real-world applications.
Addressing this is challenging due to the scarcity of high-quality data and the
limitations of existing methods: supervised fine-tuning (SFT) often compromises
general capabilities, while reinforcement fine-tuning (RFT) prioritizes textual
reasoning over visual perception. To bridge this gap, we propose a novel
two-stage task that structures visual perception learning as a coarse-to-fine
progressive process. Based on this task formulation, we develop ViPER, a
self-bootstrapping framework specifically designed to enable iterative
evolution through self-critiquing and self-prediction. By synergistically
integrating image-level and instance-level reconstruction with a two-stage
reinforcement learning strategy, ViPER establishes a closed-loop training
paradigm, where internally synthesized data directly fuel the enhancement of
perceptual ability. Applied to the Qwen2.5-VL family, ViPER produces the
Qwen-Viper series. With an average gain of 1.7% on seven comprehensive
benchmarks spanning various tasks and up to 6.0% on fine-grained perception,
Qwen-Viper consistently demonstrates superior performance across different
vision-language scenarios while maintaining generalizability. Beyond enabling
self-improvement in perceptual capabilities, ViPER provides concrete evidence
for the reciprocal relationship between generation and understanding, a
breakthrough toward developing more autonomous and capable VLMs.
Authors (11)
Juntian Zhang
Song Jin
Chuanqi Cheng
Yuhan Liu
Yankai Lin
Xun Zhang
+5 more
Submitted
October 28, 2025
Key Contributions
ViPER proposes a novel two-stage task for structuring visual perception learning as a coarse-to-fine process and introduces a self-bootstrapping framework that enables iterative evolution through self-critiquing and self-prediction. It uses a closed-loop training paradigm integrating image-level and instance-level reconstruction with a two-stage RL strategy to enhance perceptual abilities.
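To make the closed-loop idea concrete, the sketch below outlines one plausible shape of such a self-bootstrapping round: the model describes an image (self-prediction), scores its own output (self-critiquing), and the internally synthesized, self-scored samples drive a reinforcement update, first at the coarse image level and then at the fine instance level. All class and function names here are illustrative assumptions for exposition, not the authors' actual implementation or API.

```python
"""Minimal sketch of a ViPER-style closed-loop self-bootstrapping loop.
Every name below (Sample, self_predict, self_critique, rl_update, viper_round)
is a hypothetical stand-in, not the paper's real code."""

from dataclasses import dataclass
from typing import Dict, List
import random


@dataclass
class Sample:
    image_id: str          # stand-in for an actual image tensor
    caption: str           # model-generated description (self-prediction)
    critique_score: float  # self-assigned quality score (self-critiquing)


def self_predict(model_state: Dict, image_id: str, level: str) -> str:
    """Hypothetical self-prediction step: the VLM reconstructs the image in text.
    'level' distinguishes coarse image-level from fine instance-level
    reconstruction, mirroring the two-stage, coarse-to-fine formulation."""
    return f"{level}-level description of {image_id} (iteration {model_state['iter']})"


def self_critique(model_state: Dict, caption: str) -> float:
    """Hypothetical self-critiquing step: the same VLM scores its own output.
    A random number stands in for the learned reward signal."""
    return random.random()


def rl_update(model_state: Dict, batch: List[Sample]) -> None:
    """Placeholder for a reinforcement fine-tuning update (e.g. a policy-gradient
    step) driven by the self-assigned rewards."""
    model_state["iter"] += 1


def viper_round(model_state: Dict, image_ids: List[str], level: str) -> None:
    """One closed-loop round: synthesize data, critique it, then train on the
    internally generated samples."""
    batch = []
    for image_id in image_ids:
        caption = self_predict(model_state, image_id, level)
        score = self_critique(model_state, caption)
        batch.append(Sample(image_id, caption, score))
    # Internally synthesized data directly fuels the RL update.
    rl_update(model_state, batch)


if __name__ == "__main__":
    state = {"iter": 0}
    images = [f"img_{i}" for i in range(4)]
    viper_round(state, images, level="image")     # Stage 1: coarse, image-level
    viper_round(state, images, level="instance")  # Stage 2: fine, instance-level
    print(f"completed {state['iter']} RL updates")
```

The key design point the sketch tries to capture is that no external labels enter the loop: the same model produces the data, judges it, and is then updated on it, which is what lets perceptual ability improve iteratively.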
Business Value
Leads to more capable and versatile VLMs that can better understand and interact with the visual world, enhancing applications like AI assistants, autonomous systems, and visual search.