Abstract
Multimodal Large Language Models (MLLMs) have demonstrated impressive
capabilities in vision-language understanding. Recently, with the integration
of test-time scaling techniques, these models have also shown strong potential
in visual reasoning. However, most existing reasoning approaches remain
text-level in nature: MLLMs are prompted to explore various combinations of
textual tokens via their underlying language model, while the visual input
remains fixed throughout the reasoning process. This paradigm limits the
model's ability to fully exploit rich visual information, particularly when
dealing with images containing numerous fine-grained elements. In such cases,
vision-level reasoning becomes crucial: models must dynamically zoom into
specific regions of the image to gather the detailed visual cues necessary for
accurate decision-making. In this paper, we propose Zoom Eye, a training-free,
model-agnostic tree search algorithm tailored for vision-level reasoning. Zoom
Eye treats an image as a hierarchical tree structure, where each child node
represents a zoomed-in sub-region of its parent, and the root corresponds to
the full image. The algorithm enables MLLMs to simulate human-like zooming
behavior by navigating from root to leaf nodes in search of task-relevant
visual evidence. We evaluate Zoom Eye on a series of high-resolution benchmarks,
and the results demonstrate that it consistently improves the performance of
multiple MLLMs by a large margin (e.g., InternVL2.5-8B gains 15.71% and
17.69% on HR-Bench) and enables small 3-8B MLLMs to outperform strong
large models such as GPT-4o. Code: https://github.com/om-ai-lab/ZoomEye
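
To make the hierarchical view concrete, below is a minimal, self-contained sketch of the image-as-tree idea described in the abstract: the root node covers the full image, each child node is a zoomed-in sub-region (here, a simple quadrant split), and a greedy root-to-leaf descent follows whichever child a relevance scorer rates highest. This is an illustrative sketch, not the paper's implementation: the quadrant splitting scheme, the `relevance` callback (standing in for an MLLM-based scoring step), the threshold, and the depth limit are all assumptions made for the example; see the repository linked above for the actual algorithm.

```python
# Minimal sketch (not the authors' implementation) of the image-as-tree idea:
# the root covers the full image, each child node is a zoomed-in sub-region
# (here a simple quadrant split), and a greedy search descends from root toward
# a leaf, following whichever child a relevance scorer rates highest.

from dataclasses import dataclass
from typing import Callable, List, Tuple

Box = Tuple[int, int, int, int]  # (left, top, right, bottom) in pixels


@dataclass
class ImageNode:
    box: Box
    depth: int = 0

    def children(self) -> List["ImageNode"]:
        """Split this region into four equal quadrants (one possible zoom scheme)."""
        l, t, r, b = self.box
        mx, my = (l + r) // 2, (t + b) // 2
        return [
            ImageNode((l, t, mx, my), self.depth + 1),   # top-left
            ImageNode((mx, t, r, my), self.depth + 1),   # top-right
            ImageNode((l, my, mx, b), self.depth + 1),   # bottom-left
            ImageNode((mx, my, r, b), self.depth + 1),   # bottom-right
        ]


def zoom_search(
    root: ImageNode,
    relevance: Callable[[Box], float],  # stands in for an MLLM-based scorer in [0, 1]
    threshold: float = 0.5,
    max_depth: int = 3,
) -> Box:
    """Greedily zoom into the most relevant child until the score drops below
    the threshold or the maximum depth is reached; return the final region."""
    node = root
    while node.depth < max_depth:
        best = max(node.children(), key=lambda n: relevance(n.box))
        if relevance(best.box) < threshold:
            break  # no child looks clearly relevant; stop zooming here
        node = best
    return node.box


if __name__ == "__main__":
    # Toy scorer standing in for the MLLM: favors regions near the top-right corner.
    def toy_relevance(box: Box) -> float:
        l, t, r, b = box
        cx, cy = (l + r) / 2, (t + b) / 2
        return (cx / 4096) * (1 - cy / 4096)

    full_image = ImageNode((0, 0, 4096, 4096))
    print(zoom_search(full_image, toy_relevance))  # -> (3584, 0, 4096, 512)
```

In the actual method, the scoring and stopping decisions come from the MLLM itself (which is what makes the approach training-free and model-agnostic), and the tree search is more elaborate than this greedy descent; the sketch only illustrates the root-to-leaf zooming structure.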