
PhysVLM-AVR: Active Visual Reasoning for Multimodal Large Language Models in Physical Environments

Abstract

Visual reasoning in multimodal large language models (MLLMs) has primarily been studied in static, fully observable settings, limiting their effectiveness in real-world environments where information is often incomplete due to occlusion or limited field of view. Humans, in contrast, actively explore and interact with their environment (moving, examining, and manipulating objects) to gather information through a closed-loop process integrating perception, reasoning, and action. Inspired by this human capability, we introduce the Active Visual Reasoning (AVR) task, extending visual reasoning to partially observable, interactive environments. AVR requires agents to: (1) actively acquire information via sequential physical actions, (2) integrate observations across multiple steps for coherent reasoning, and (3) dynamically adjust decisions based on evolving visual feedback. To rigorously evaluate AVR, we introduce CLEVR-AVR, a simulation benchmark featuring multi-round interactive environments designed to assess both reasoning correctness and information-gathering efficiency. We present AVR-152k, a large-scale dataset that offers rich Chain-of-Thought (CoT) annotations detailing iterative reasoning for uncertainty identification, action-conditioned information gain prediction, and information-maximizing action selection, crucial for training agents in a higher-order Markov Decision Process. Building on this, we develop PhysVLM-AVR, an MLLM achieving state-of-the-art performance on CLEVR-AVR, embodied reasoning (OpenEQA, RoboVQA), and passive visual reasoning (GeoMath, Geometry30K). Our analysis also reveals that current embodied MLLMs, despite detecting information incompleteness, struggle to actively acquire and integrate new information through interaction, highlighting a fundamental gap in active reasoning capabilities.
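The abstract names three reasoning steps (uncertainty identification, action-conditioned information gain prediction, and information-maximizing action selection) without giving formulas. As a rough illustration only, the sketch below frames them as a toy Bayesian problem: uncertainty is the entropy of a belief over hypotheses, and the chosen action is the one with the highest expected entropy reduction. This is not the paper's method or code; the hypotheses, actions, observation names, and likelihood table are all invented for the example.

```python
import math

def entropy(belief):
    """Shannon entropy (in bits) of a belief {hypothesis: probability}."""
    return -sum(p * math.log2(p) for p in belief.values() if p > 0)

def posterior(belief, likelihood, action, obs):
    """Bayes update of the belief after taking `action` and seeing `obs`.

    likelihood[action][hypothesis][obs] = P(obs | hypothesis, action).
    """
    unnorm = {h: p * likelihood[action][h].get(obs, 0.0) for h, p in belief.items()}
    z = sum(unnorm.values())
    if z == 0:  # observation impossible under every hypothesis: keep the prior
        return dict(belief)
    return {h: v / z for h, v in unnorm.items()}

def expected_information_gain(belief, likelihood, action):
    """Expected reduction in entropy from taking `action`."""
    prior_entropy = entropy(belief)
    observations = {o for h in belief for o in likelihood[action][h]}
    gain = 0.0
    for obs in observations:
        p_obs = sum(belief[h] * likelihood[action][h].get(obs, 0.0) for h in belief)
        if p_obs > 0:
            post = posterior(belief, likelihood, action, obs)
            gain += p_obs * (prior_entropy - entropy(post))
    return gain

def select_action(belief, likelihood):
    """Pick the action with the largest expected information gain."""
    return max(likelihood, key=lambda a: expected_information_gain(belief, likelihood, a))

# Hypothetical scene: an occluded object is either red or blue.
belief = {"red": 0.5, "blue": 0.5}
likelihood = {
    # Staying put never reveals the object.
    "stay": {"red": {"occluded": 1.0}, "blue": {"occluded": 1.0}},
    # Moving right (an invented action) reveals the color with certainty.
    "move_right": {"red": {"sees_red": 1.0}, "blue": {"sees_blue": 1.0}},
}

best = select_action(belief, likelihood)
updated = posterior(belief, likelihood, best, "sees_red")
```

In this toy scene, staying put yields no information while moving right resolves the full bit of uncertainty, so the selection step prefers the move. An MLLM-based agent would have to estimate such gains from images and text rather than read them from a table.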
Authors (8)
Weijie Zhou
Xuantang Xiong
Yi Peng
Manli Tao
Chaoyang Zhao
Honghui Dong
+2 more
Submitted: October 24, 2025
arXiv Category: cs.CV

Key Contributions

Introduces the Active Visual Reasoning (AVR) task and the CLEVR-AVR benchmark for evaluating MLLMs in partially observable, interactive environments. Unlike static settings, AVR requires agents to explore actively, integrate observations across multiple steps, and adapt their decisions to visual feedback, as humans do when they move around and manipulate objects to gather information.

Business Value

Supports the development of AI agents for real-world applications such as robotics, where environments are dynamic and information is often incomplete because of occlusion or a limited field of view. Agents that can actively gather the information they lack should be more robust in such settings.