arxiv_cv Research Paper AI researchers developing VLMs, Robotics engineers, AR/VR developers, Cognitive scientists

Spatial-DISE: A Unified Benchmark for Evaluating Spatial Reasoning in Vision-Language Models

📄 Abstract

Spatial reasoning ability is crucial for Vision-Language Models (VLMs) to support real-world applications in diverse domains, including robotics, augmented reality, and autonomous navigation. Unfortunately, existing benchmarks are inadequate for assessing spatial reasoning ability, especially intrinsic-dynamic spatial reasoning, which is a fundamental aspect of human spatial cognition. In this paper, we propose a unified benchmark, Spatial-DISE, based on a cognitively grounded taxonomy that categorizes tasks into four fundamental quadrants: Intrinsic-Static, Intrinsic-Dynamic, Extrinsic-Static, and Extrinsic-Dynamic spatial reasoning. Moreover, to address the issue of data scarcity, we develop a scalable and automated pipeline to generate diverse and verifiable spatial reasoning questions, resulting in a new Spatial-DISE dataset that includes Spatial-DISE Bench (559 evaluation VQA pairs) and Spatial-DISE-12K (12K+ training VQA pairs). Our comprehensive evaluation across 28 state-of-the-art VLMs reveals that current VLMs show a large and consistent gap to human competence, especially on multi-step, multi-view spatial reasoning. Spatial-DISE offers a robust framework, a valuable dataset, and a clear direction for future research toward human-like spatial intelligence. Benchmark, dataset, and code will be publicly released.
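
The four quadrants named in the abstract can be read as the product of two binary axes. The Python sketch below is only an illustration of that structure: the names `Reference`, `Motion`, `quadrant_of`, and `ALL_QUADRANTS` are hypothetical and are not taken from the paper or its code release.

```python
from enum import Enum


class Reference(Enum):
    """Hypothetical name for the intrinsic/extrinsic axis of the taxonomy."""
    INTRINSIC = "Intrinsic"
    EXTRINSIC = "Extrinsic"


class Motion(Enum):
    """Hypothetical name for the static/dynamic axis of the taxonomy."""
    STATIC = "Static"
    DYNAMIC = "Dynamic"


def quadrant_of(reference: Reference, motion: Motion) -> str:
    """Build a quadrant label such as 'Intrinsic-Dynamic' from the two axes."""
    return f"{reference.value}-{motion.value}"


# Enumerate all four quadrants in the order the abstract lists them.
ALL_QUADRANTS = [quadrant_of(r, m) for r in Reference for m in Motion]

if __name__ == "__main__":
    for name in ALL_QUADRANTS:
        print(name)
```

Labelling each question with one of these four strings is one simple way to group results per quadrant; how the released dataset actually encodes the category is not stated here.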

Authors (8)

Xinmiao Huang

Qisong He

Zhenglin Huang

Boxuan Wang

Zhuoyun Li

Guangliang Cheng

+2 more

Submitted

October 15, 2025

arXiv Category

cs.CV

Key Contributions

Introduces Spatial-DISE, a unified benchmark for evaluating spatial reasoning in Vision-Language Models (VLMs), based on a cognitively grounded taxonomy (Intrinsic-Static, Intrinsic-Dynamic, Extrinsic-Static, Extrinsic-Dynamic). It also presents a scalable pipeline for generating diverse spatial reasoning questions and a new dataset (Spatial-DISE-12K).

Business Value

Provides essential tools for developing and validating AI systems that require sophisticated spatial understanding, critical for applications like autonomous navigation and AR/VR.

Paper Metadata

Innovation Type

Benchmark and Dataset

Deployment Feasibility

N/A (Benchmark)

Limitations Addressed

Inadequacy of existing benchmarks for assessing spatial reasoning, Lack of focus on intrinsic-dynamic spatial reasoning, Data scarcity for spatial reasoning tasks, Need for a unified framework to categorize spatial reasoning abilities

Technical Tags

Spatial Reasoning, Vision-Language Models (VLMs), Benchmark, Cognitive Taxonomy, Intrinsic-Static, Intrinsic-Dynamic, Extrinsic-Static, Extrinsic-Dynamic, Dataset Generation, VQA

Research Topics

Spatial Reasoning, Vision-Language Models, Benchmark Design, Cognitive Science, AI Evaluation

Methods & Architectures

Benchmark construction, Cognitively grounded taxonomy, Automated dataset generation pipeline, Visual Question Answering (VQA), Vision-Language Models (VLMs)

Applications & Tasks

Robotics, Augmented Reality (AR), Autonomous Navigation, Human-Computer Interaction, Virtual Environments

Evaluating spatial reasoning capabilities, Assessing intrinsic and extrinsic spatial understanding, Benchmarking dynamic spatial reasoning, Evaluating VLMs on a comprehensive set of spatial reasoning tasks, Providing a standardized benchmark for spatial cognition in AI

Datasets & Benchmarks

Datasets

Spatial-DISE Bench, Spatial-DISE-12K

Benchmarks

Spatial-DISE Bench (559 VQA pairs)

Evaluation metric: Accuracy (for VQA)
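
As a rough illustration of accuracy as a VQA metric, the Python sketch below computes the fraction of correct answers per category and overall. It assumes each answer is a short string (for example a multiple-choice letter) compared by case-insensitive exact match; that answer format, the function name `accuracy_by_category`, and the toy records are all assumptions made for illustration, not details from the paper.

```python
from collections import defaultdict


def accuracy_by_category(records):
    """Compute accuracy per category and overall.

    records: iterable of (category, predicted, gold) string tuples.
    Returns a dict mapping each category to its fraction correct,
    plus an "overall" entry across all records.
    """
    correct = defaultdict(int)
    total = defaultdict(int)
    for category, predicted, gold in records:
        total[category] += 1
        # Assumed scoring rule: exact match after trimming and upper-casing.
        if predicted.strip().upper() == gold.strip().upper():
            correct[category] += 1
    scores = {c: correct[c] / total[c] for c in total}
    n = sum(total.values())
    scores["overall"] = sum(correct.values()) / n if n else 0.0
    return scores


# Toy records, made up for illustration (not drawn from the benchmark).
toy = [
    ("Intrinsic-Static", "A", "A"),
    ("Intrinsic-Dynamic", "b", "C"),
    ("Intrinsic-Dynamic", "c ", "C"),
    ("Extrinsic-Static", "D", "D"),
]

if __name__ == "__main__":
    print(accuracy_by_category(toy))
```

Reporting the per-category values next to the overall number makes it easier to see whether a model's errors concentrate in one quadrant rather than being spread evenly.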

Related Fields

Artificial Intelligence, Computer Vision, Natural Language Processing, Cognitive Science, Robotics

Keywords

Spatial Reasoning, Vision-Language Models, Benchmark, Evaluation, VQA, Cognitive Science, Robotics, AR, Autonomous Navigation, Dataset, Intrinsic, Extrinsic, Static, Dynamic, AI Evaluation

Academic Context

#Spatial Reasoning, #Vision-Language Models, #Benchmark Design, #Cognitive Science, #AI Evaluation

Commercial Potential

Potential Products

Standardized evaluation suites for spatial reasoning, Datasets for training spatial reasoning models

Target Industries

Robotics, Autonomous Vehicles, Augmented Reality, Virtual Reality, Gaming, Logistics

Use Case Examples

Testing a robot's ability to navigate based on spatial descriptions, Evaluating AR systems' understanding of object placement and relationships, Benchmarking AI models for complex scene understanding

Competitive Edge

Offers a more comprehensive and cognitively grounded benchmark than existing ones, specifically addressing dynamic and intrinsic spatial reasoning.

Market Opportunity

Significant market interest in robust AI reasoning capabilities.

Revenue Models

N/A (Benchmark)

Resource Requirements

Compute Needs

N/A (Benchmark)

Data Requirements

Diverse images and corresponding spatial reasoning questions.

Deployment Constraints

N/A (Benchmark)

Scalability

The automated pipeline suggests good scalability for dataset generation.

Production Readiness

Maturity Level

Research (Benchmark)

Time to Market

N/A (Benchmark)

Patent Potential

Low (benchmarks are typically not patented)
