Abstract
Multimodal large language models (MLLMs) have shown strong capabilities
across a broad range of benchmarks. However, most existing evaluations focus on
passive inference, where models perform step-by-step reasoning under complete
information. This setup is misaligned with real-world use, where seeing is not
enough. This raises a fundamental question: Can MLLMs actively acquire missing
evidence under incomplete information? To bridge this gap, we require MLLMs
to actively acquire missing evidence and iteratively refine their decisions under
incomplete information by selecting a target image from a candidate pool
without task-specific priors. To support systematic study, we propose
GuessBench, a benchmark with both perception-oriented and knowledge-oriented
images for evaluating active reasoning in MLLMs. We evaluate 20 leading MLLMs
and find that their performance on active reasoning lags far behind that in passive
settings, indicating substantial room for improvement. Further analysis
identifies fine-grained perception and timely decision-making as key
challenges. Ablation studies show that perceptual enhancements benefit smaller
models, whereas thinking-oriented methods provide consistent gains across model
sizes. These results suggest promising directions for future research on
multimodal active reasoning.
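To make the active-reasoning setup concrete, the sketch below shows one plausible turn loop for the task the abstract describes: a guesser model asks questions about a hidden target image and prunes the candidate pool until it commits to a guess. This is an illustrative assumption about the interaction protocol, not the authors' code; the names (Image, Oracle, GuesserModel, active_guess_loop) and the attribute-based questions are hypothetical stand-ins for whatever interface GuessBench actually exposes.

```python
import random
from dataclasses import dataclass


@dataclass(frozen=True)
class Image:
    image_id: int
    attributes: frozenset  # e.g. frozenset({"outdoor", "animal"})


class Oracle:
    """Answers yes/no attribute questions about the hidden target image."""

    def __init__(self, target: Image):
        self.target = target

    def answer(self, attribute: str) -> bool:
        return attribute in self.target.attributes


class GuesserModel:
    """Toy guesser: asks an unasked attribute, guesses once one candidate remains."""

    def act(self, candidates, history):
        if len(candidates) == 1:
            return {"type": "guess", "image_id": candidates[0].image_id}
        asked = {attr for attr, _ in history}
        attrs = set().union(*(c.attributes for c in candidates)) - asked
        if not attrs:  # nothing informative left to ask; commit to a guess
            return {"type": "guess", "image_id": candidates[0].image_id}
        return {"type": "ask", "attribute": random.choice(sorted(attrs))}


def active_guess_loop(model, candidates, target, max_turns=10):
    """Run the interaction until the model guesses or the turn budget runs out."""
    oracle, history = Oracle(target), []
    for turn in range(1, max_turns + 1):
        action = model.act(candidates, history)
        if action["type"] == "guess":
            return action["image_id"] == target.image_id, turn
        ans = oracle.answer(action["attribute"])
        history.append((action["attribute"], ans))
        # Active refinement: keep only candidates consistent with the new evidence.
        candidates = [c for c in candidates
                      if (action["attribute"] in c.attributes) == ans]
    return False, max_turns


if __name__ == "__main__":
    pool = [Image(0, frozenset({"outdoor", "animal"})),
            Image(1, frozenset({"outdoor", "vehicle"})),
            Image(2, frozenset({"indoor", "animal"}))]
    correct, turns = active_guess_loop(GuesserModel(), pool, pool[1])
    print(f"correct={correct} after {turns} turn(s)")
```

In an actual evaluation, GuesserModel.act would be backed by an MLLM conditioned on the candidate images and the dialogue history, and the oracle would answer with respect to the true target; the toy attribute filtering here only illustrates the "acquire evidence, then refine" loop.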
Authors (6)
Hongcheng Liu
Pingjie Wang
Yuhao Wang
Siqu Ou
Yanfeng Wang
Yu Wang
Submitted
October 17, 2025
Key Contributions
This paper investigates the limits of active reasoning in Multimodal LLMs (MLLMs) by proposing GuessBench, a benchmark designed to evaluate their ability to actively acquire missing evidence under incomplete information. Evaluations of 20 MLLMs show that active reasoning performance lags significantly behind passive inference, indicating substantial room for improvement.
Business Value
Drives the development of more capable and adaptable AI systems that can function effectively in dynamic, real-world environments by actively seeking information.