
VAR: Visual Attention Reasoning via Structured Search and Backtracking

Abstract

Multimodal Large Language Models (MLLMs), despite their advances, are hindered by a strong tendency to hallucinate and a heavy reliance on brittle, linear reasoning processes, which leads to failures on complex tasks. To address these limitations, we introduce Visual Attention Reasoning (VAR), a novel framework that recasts grounded reasoning as a structured search over a reasoning trajectory space. VAR decomposes the reasoning process into two key stages: traceable evidence grounding and search-based chain-of-thought (CoT) generation, which incorporates a backtracking mechanism for self-correction. The search is guided by a multi-faceted reward function with semantic and geometric self-verification components, which penalize outputs that are not faithfully grounded in the visual input. We provide a theoretical analysis of our search strategy, validating its capability to find the correct solution with high probability. Experimental results show that our 7B model, VAR-7B, sets a new state of the art on a comprehensive suite of hallucination and safety benchmarks, significantly outperforming existing open-source models and demonstrating competitive performance against leading proprietary systems.
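
To make the idea of search-based reasoning with backtracking more concrete, here is a minimal Python sketch. It is not the authors' algorithm: the depth-first strategy, the reward threshold, and every name (`backtracking_search`, `propose`, `reward`, `is_complete`, the toy `GRAPH` and `STEP_SCORE` tables) are assumptions made for illustration. It only shows how a reward signal can prune a weakly grounded step and force the search to back up and try another branch.

```python
from typing import Callable, List, Optional, Sequence


def backtracking_search(
    propose: Callable[[List[str]], Sequence[str]],
    reward: Callable[[List[str]], float],
    is_complete: Callable[[List[str]], bool],
    threshold: float,
    max_depth: int,
    trajectory: Optional[List[str]] = None,
) -> Optional[List[str]]:
    """Depth-first search over reasoning steps; returns a full trajectory or None."""
    trajectory = [] if trajectory is None else trajectory
    if is_complete(trajectory):
        return list(trajectory)
    if len(trajectory) >= max_depth:
        return None
    # Score each candidate next step and try the best-scoring ones first.
    scored = sorted(
        ((reward(trajectory + [step]), step) for step in propose(trajectory)),
        key=lambda pair: pair[0],
        reverse=True,
    )
    for score, step in scored:
        if score < threshold:
            break  # all remaining candidates score even lower
        trajectory.append(step)
        result = backtracking_search(
            propose, reward, is_complete, threshold, max_depth, trajectory
        )
        if result is not None:
            return result
        trajectory.pop()  # backtrack: undo this step and try the next candidate
    return None


# Hypothetical toy problem: the best-looking first step leads to a dead end.
GRAPH = {
    (): ["locate cat", "locate dog"],
    ("locate cat",): ["cat is on sofa"],
    ("locate dog",): ["dog is on mat", "dog is flying"],
    ("locate dog", "dog is on mat"): ["answer: mat"],
}
STEP_SCORE = {
    "locate cat": 0.9,
    "locate dog": 0.8,
    "cat is on sofa": 0.2,   # weakly grounded, below the threshold
    "dog is on mat": 0.9,
    "dog is flying": 0.1,    # stands in for a hallucinated step
    "answer: mat": 0.95,
}

result = backtracking_search(
    propose=lambda traj: GRAPH.get(tuple(traj), []),
    reward=lambda traj: STEP_SCORE[traj[-1]],
    is_complete=lambda traj: bool(traj) and traj[-1].startswith("answer:"),
    threshold=0.5,
    max_depth=5,
)
print(result)
```

In this toy run the search first commits to "locate cat", finds no continuation above the threshold, undoes that step, and then succeeds through "locate dog". A linear, single-pass reasoner would have been stuck on the first branch.
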
Authors (8)
Wei Cai
Jian Zhao
Yuchen Yuan
Tianle Zhang
Ming Zhu
Haichuan Tang
+2 more
Submitted
October 21, 2025
arXiv Category
cs.AI

Key Contributions

Introduces Visual Attention Reasoning (VAR), a novel framework for grounded reasoning in MLLMs that recasts the process as structured search with backtracking. VAR decomposes reasoning into evidence grounding and CoT generation, guided by a multi-faceted reward function for self-correction, significantly reducing hallucination and improving performance on complex visual tasks.
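
As a rough illustration of what a reward with semantic and geometric components could look like, the Python sketch below combines a text-agreement score with a bounding-box overlap score. The summary does not specify the paper's actual reward terms, so everything here is an assumption: token-level Jaccard overlap stands in for semantic verification, intersection-over-union (IoU) stands in for geometric verification, and the names `token_overlap`, `iou`, `grounding_reward`, `w_sem`, and `w_geo` are hypothetical.

```python
from typing import Tuple

Box = Tuple[float, float, float, float]  # (x1, y1, x2, y2), with x1 < x2 and y1 < y2


def iou(a: Box, b: Box) -> float:
    """Intersection-over-union of two axis-aligned boxes."""
    ix1, iy1 = max(a[0], b[0]), max(a[1], b[1])
    ix2, iy2 = min(a[2], b[2]), min(a[3], b[3])
    inter = max(0.0, ix2 - ix1) * max(0.0, iy2 - iy1)
    area_a = (a[2] - a[0]) * (a[3] - a[1])
    area_b = (b[2] - b[0]) * (b[3] - b[1])
    union = area_a + area_b - inter
    return inter / union if union > 0 else 0.0


def token_overlap(claim: str, evidence: str) -> float:
    """Jaccard overlap of lowercase tokens; a crude stand-in for semantic agreement."""
    a, b = set(claim.lower().split()), set(evidence.lower().split())
    return len(a & b) / len(a | b) if (a | b) else 0.0


def grounding_reward(
    claim: str,
    evidence_text: str,
    claim_box: Box,
    evidence_box: Box,
    w_sem: float = 0.5,  # assumed weight for the semantic term
    w_geo: float = 0.5,  # assumed weight for the geometric term
) -> float:
    """Weighted sum of the two checks; low values flag poorly grounded claims."""
    semantic = token_overlap(claim, evidence_text)
    geometric = iou(claim_box, evidence_box)
    return w_sem * semantic + w_geo * geometric


# A claim that matches the evidence in both wording and location scores 1.0.
grounded = grounding_reward(
    "a dog on a mat", "a dog on a mat", (0, 0, 2, 2), (0, 0, 2, 2)
)
# A claim that shares no tokens and no image region with the evidence scores 0.0.
ungrounded = grounding_reward(
    "flying elephant", "a dog on a mat", (5, 5, 6, 6), (0, 0, 2, 2)
)
print(grounded, ungrounded)
```

A score of this kind could serve as the signal that decides which reasoning steps a search keeps and which it abandons. A real system would presumably use learned or model-based scorers rather than token overlap.
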

Business Value

Enhances the reliability and trustworthiness of AI systems that interpret visual information, crucial for applications like autonomous driving, medical diagnosis, and content moderation. Reduces errors and improves user trust.