Abstract
Visual reasoning in multimodal large language models (MLLMs) has primarily
been studied in static, fully observable settings, limiting their effectiveness
in real-world environments where information is often incomplete due to
occlusion or limited field of view. Humans, in contrast, actively explore and
interact with their environment (moving, examining, and manipulating objects) to
gather information through a closed-loop process integrating perception,
reasoning, and action. Inspired by this human capability, we introduce the
Active Visual Reasoning (AVR) task, extending visual reasoning to partially
observable, interactive environments. AVR requires agents to: (1) actively
acquire information via sequential physical actions, (2) integrate observations
across multiple steps for coherent reasoning, and (3) dynamically adjust
decisions based on evolving visual feedback. To rigorously evaluate AVR, we
introduce CLEVR-AVR, a simulation benchmark featuring multi-round interactive
environments designed to assess both reasoning correctness and
information-gathering efficiency. We present AVR-152k, a large-scale dataset
with rich Chain-of-Thought (CoT) annotations that detail iterative reasoning
for uncertainty identification, action-conditioned information-gain prediction,
and information-maximizing action selection, capabilities crucial for training
agents in a higher-order Markov Decision Process. Building on this, we develop
PhysVLM-AVR, an MLLM achieving state-of-the-art performance on CLEVR-AVR,
embodied reasoning (OpenEQA, RoboVQA), and passive visual reasoning (GeoMath,
Geometry30K). Our analysis also reveals that current embodied MLLMs, despite
detecting information incompleteness, struggle to actively acquire and
integrate new information through interaction, highlighting a fundamental gap
in active reasoning capabilities.
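The abstract mentions action-conditioned information-gain prediction and information-maximizing action selection but does not spell out the scoring rule. The sketch below is a minimal, illustrative reading of that idea: score each candidate physical action by its expected reduction in uncertainty over the answer, then pick the best one. The names (`obs_model`, `update_belief`, `select_action`) and the toy occlusion scenario are assumptions for illustration, not the paper's implementation.

```python
import math

def entropy(dist):
    """Shannon entropy (bits) of a probability distribution over candidate answers."""
    return -sum(p * math.log2(p) for p in dist.values() if p > 0)

def update_belief(belief, action, obs, obs_model):
    """Bayes update: posterior over candidate answers after seeing `obs` from `action`."""
    unnorm = {h: p * obs_model(action, h).get(obs, 0.0) for h, p in belief.items()}
    z = sum(unnorm.values())
    return {h: p / z for h, p in unnorm.items()} if z > 0 else belief

def expected_information_gain(belief, action, obs_model):
    """Expected drop in answer entropy if the agent executes `action`."""
    prior_entropy = entropy(belief)
    # Marginal probability of each possible observation under the current belief.
    obs_probs = {}
    for hypothesis, p_h in belief.items():
        for obs, p_o in obs_model(action, hypothesis).items():
            obs_probs[obs] = obs_probs.get(obs, 0.0) + p_h * p_o
    expected_posterior_entropy = sum(
        p_obs * entropy(update_belief(belief, action, obs, obs_model))
        for obs, p_obs in obs_probs.items()
    )
    return prior_entropy - expected_posterior_entropy

def select_action(belief, candidate_actions, obs_model):
    """Information-maximizing action selection over a set of physical actions."""
    return max(candidate_actions,
               key=lambda a: expected_information_gain(belief, a, obs_model))

# Toy usage: is the occluded object a cube or a sphere? Moving to view B reveals it,
# while staying put yields an uninformative observation.
belief = {"cube": 0.5, "sphere": 0.5}
def obs_model(action, hypothesis):
    if action == "move_to_view_B":
        return {hypothesis: 1.0}   # fully revealing viewpoint
    return {"occluded": 1.0}       # uninformative observation
print(select_action(belief, ["stay", "move_to_view_B"], obs_model))  # -> "move_to_view_B"
```

In the toy scenario, "stay" yields zero expected gain while "move_to_view_B" yields one bit, so the greedy criterion selects the revealing action, which mirrors the closed-loop perceive-reason-act behavior the task is designed to test.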
Authors (8)
Weijie Zhou
Xuantang Xiong
Yi Peng
Manli Tao
Chaoyang Zhao
Honghui Dong
+2 more
Submitted
October 24, 2025
Key Contributions
Introduces the Active Visual Reasoning (AVR) task and CLEVR-AVR benchmark to evaluate MLLMs in partially observable, interactive environments. This moves beyond static settings by requiring agents to actively explore, integrate information sequentially, and adapt decisions based on feedback, mimicking human interaction.
Business Value
Enables the development of more capable and adaptable AI agents for real-world applications such as robotics, where environments are dynamic and information is often incomplete and must be actively gathered through interaction.