📄 Abstract
Spatial reasoning ability is crucial for Vision Language Models (VLMs) to
support real-world applications in diverse domains including robotics,
augmented reality, and autonomous navigation. Unfortunately, existing
benchmarks are inadequate for assessing spatial reasoning ability, especially
intrinsic-dynamic spatial reasoning, which is a fundamental aspect of
human spatial cognition. In this paper, we propose a unified benchmark,
Spatial-DISE, based on a cognitively grounded taxonomy that
categorizes tasks into four fundamental quadrants:
Intrinsic-Static, Intrinsic-Dynamic,
Extrinsic-Static, and Extrinsic-Dynamic spatial reasoning. Moreover,
to address data scarcity, we develop a scalable and automated
pipeline to generate diverse and verifiable spatial reasoning questions,
resulting in a new Spatial-DISE dataset that includes Spatial-DISE
Bench (559 evaluation VQA pairs) and Spatial-DISE-12K (12K+ training VQA
pairs). Our comprehensive evaluation across 28 state-of-the-art VLMs reveals
that current VLMs show a large and consistent gap to human competence,
especially on multi-step, multi-view spatial reasoning. Spatial-DISE offers a
robust framework, a valuable dataset, and a clear direction for future research
toward human-like spatial intelligence. Benchmark, dataset, and code will be
publicly released.
Authors (8)
Xinmiao Huang
Qisong He
Zhenglin Huang
Boxuan Wang
Zhuoyun Li
Guangliang Cheng
+2 more
Submitted
October 15, 2025
Key Contributions
Introduces Spatial-DISE, a unified benchmark for evaluating spatial reasoning in Vision Language Models (VLMs), based on a cognitively grounded taxonomy of four quadrants (Intrinsic-Static, Intrinsic-Dynamic, Extrinsic-Static, Extrinsic-Dynamic). It also presents a scalable, automated pipeline for generating diverse and verifiable spatial reasoning questions, which yields Spatial-DISE Bench (559 evaluation VQA pairs) and Spatial-DISE-12K (12K+ training VQA pairs).
Business Value
Provides essential tools for developing and validating AI systems that require sophisticated spatial understanding, critical for applications like autonomous navigation and AR/VR.