Abstract
Multimodal large language models (MLLMs) have shown strong capabilities
across a broad range of benchmarks. However, most existing evaluations focus on
passive inference, where models perform step-by-step reasoning under complete
information. This setup is misaligned with real-world use, where seeing is not
enough. This raises a fundamental question: Can MLLMs actively acquire missing
evidence under incomplete information? To bridge this gap, we require MLLMs
to actively acquire missing evidence and iteratively refine their decisions under
incomplete information by selecting a target image from a candidate pool
without task-specific priors. To support systematic study, we propose
GuessBench, a benchmark with both perception-oriented and knowledge-oriented
images for evaluating active reasoning in MLLMs. We evaluate 20 leading MLLMs
and find that their performance on active reasoning lags far behind that in passive
settings, indicating substantial room for improvement. Further analysis
identifies fine-grained perception and timely decision-making as key
challenges. Ablation studies show that perceptual enhancements benefit smaller
models, whereas thinking-oriented methods provide consistent gains across model
sizes. These results suggest promising directions for future research on
multimodal active reasoning.
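To make the active-reasoning setup concrete, the sketch below shows one plausible turn loop for the task the abstract describes: a guesser model asks questions about a hidden target image and prunes the candidate pool until it commits to a guess. This is an illustrative assumption about the interaction protocol, not the authors' code; the names (Image, Oracle, GuesserModel, active_guess_loop) and the attribute-based questions are hypothetical stand-ins for whatever interface GuessBench actually exposes.

```python
import random
from dataclasses import dataclass


@dataclass(frozen=True)
class Image:
    image_id: int
    attributes: frozenset  # e.g. frozenset({"outdoor", "animal"})


class Oracle:
    """Answers yes/no attribute questions about the hidden target image."""

    def __init__(self, target: Image):
        self.target = target

    def answer(self, attribute: str) -> bool:
        return attribute in self.target.attributes


class GuesserModel:
    """Toy guesser: asks an unasked attribute, guesses once one candidate remains."""

    def act(self, candidates, history):
        if len(candidates) == 1:
            return {"type": "guess", "image_id": candidates[0].image_id}
        asked = {attr for attr, _ in history}
        attrs = set().union(*(c.attributes for c in candidates)) - asked
        if not attrs:  # nothing informative left to ask; commit to a guess
            return {"type": "guess", "image_id": candidates[0].image_id}
        return {"type": "ask", "attribute": random.choice(sorted(attrs))}


def active_guess_loop(model, candidates, target, max_turns=10):
    """Run the interaction until the model guesses or the turn budget runs out."""
    oracle, history = Oracle(target), []
    for turn in range(1, max_turns + 1):
        action = model.act(candidates, history)
        if action["type"] == "guess":
            return action["image_id"] == target.image_id, turn
        ans = oracle.answer(action["attribute"])
        history.append((action["attribute"], ans))
        # Active refinement: keep only candidates consistent with the new evidence.
        candidates = [c for c in candidates
                      if (action["attribute"] in c.attributes) == ans]
    return False, max_turns


if __name__ == "__main__":
    pool = [Image(0, frozenset({"outdoor", "animal"})),
            Image(1, frozenset({"outdoor", "vehicle"})),
            Image(2, frozenset({"indoor", "animal"}))]
    correct, turns = active_guess_loop(GuesserModel(), pool, pool[1])
    print(f"correct={correct} after {turns} turn(s)")
```

In an actual evaluation, GuesserModel.act would be backed by an MLLM conditioned on the candidate images and the dialogue history, and the oracle would answer with respect to the true target; the toy attribute filtering here only illustrates the "acquire evidence, then refine" loop.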
Authors (6)
Hongcheng Liu
Pingjie Wang
Yuhao Wang
Siqu Ou
Yanfeng Wang
Yu Wang
Submitted
October 17, 2025
Key Contributions
This paper investigates the limits of active reasoning in Multimodal LLMs (MLLMs) by proposing GuessBench, a benchmark designed to evaluate their ability to actively acquire missing evidence under incomplete information. Evaluations of 20 MLLMs show that active reasoning performance lags significantly behind passive inference, indicating substantial room for improvement.
Business Value
Drives the development of more capable and adaptable AI systems that can function effectively in dynamic, real-world environments by actively seeking information.