
MindJourney: Test-Time Scaling with World Models for Spatial Reasoning


Abstract

Spatial reasoning in 3D space is central to human cognition and indispensable for embodied tasks such as navigation and manipulation. However, state-of-the-art vision-language models (VLMs) frequently struggle with tasks as simple as anticipating how a scene will look after an egocentric motion: they perceive 2D images but lack an internal model of 3D dynamics. We therefore propose MindJourney, a test-time scaling framework that grants a VLM this missing capability by coupling it to a controllable world model based on video diffusion. The VLM iteratively sketches a concise camera trajectory, while the world model synthesizes the corresponding view at each step. The VLM then reasons over the multi-view evidence gathered during this interactive exploration. Without any fine-tuning, MindJourney achieves an average performance boost of over 7.7% on the representative spatial reasoning benchmark SAT, showing that pairing VLMs with world models for test-time scaling offers a simple, plug-and-play route to robust 3D reasoning. Our method also improves upon test-time inference VLMs trained through reinforcement learning, which demonstrates the potential of using world models for test-time scaling.

Authors (8)

Yuncong Yang

Jiageng Liu

Zheyuan Zhang

Siyuan Zhou

Reuben Tan

Jianwei Yang

+2 more

Submitted

July 16, 2025

arXiv Category

cs.CV


Key Contributions

MindJourney is a test-time scaling framework that equips VLMs with spatial reasoning capabilities by coupling them with a video diffusion-based world model. The VLM explores a scene iteratively: it sketches a camera trajectory, and the world model synthesizes the corresponding view at each step. The VLM then reasons over this multi-view evidence without any fine-tuning, which yields an average performance boost of over 7.7% on the SAT spatial reasoning benchmark.
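The explore-then-reason loop described above can be sketched roughly as follows. This is a minimal illustration under stated assumptions, not the authors' implementation: the action names, the `View` type, and the `toy_*` functions are all hypothetical stand-ins. A real system would replace `toy_world_model` with a video diffusion model and `toy_propose` / `toy_answer` with calls to a VLM.

```python
from dataclasses import dataclass
from typing import Callable, List, Tuple

# Hypothetical action vocabulary; the source does not list the actual camera moves.
ACTIONS = ("forward", "turn_left", "turn_right")
# Unit step on a grid for each heading (degrees), used only by the toy world model.
_STEP = {0: (1, 0), 90: (0, 1), 180: (-1, 0), 270: (0, -1)}


@dataclass(frozen=True)
class View:
    """Stand-in for an image: we only track the camera pose (x, y, heading)."""
    pose: Tuple[int, int, int]


def toy_world_model(view: View, action: str) -> View:
    """Placeholder for the world model: returns the view after one camera move."""
    x, y, heading = view.pose
    if action == "turn_left":
        return View((x, y, (heading + 90) % 360))
    if action == "turn_right":
        return View((x, y, (heading - 90) % 360))
    if action == "forward":
        dx, dy = _STEP[heading]
        return View((x + dx, y + dy, heading))
    raise ValueError(f"unknown action: {action}")


def toy_propose(question: str, views: List[View]) -> str:
    """Placeholder for the VLM choosing the next camera move (here: round-robin)."""
    return ACTIONS[(len(views) - 1) % len(ACTIONS)]


def toy_answer(question: str, views: List[View]) -> str:
    """Placeholder for the VLM answering from all gathered views."""
    return f"answer to {question!r} based on {len(views)} views"


def explore_and_answer(
    question: str,
    start: View,
    propose: Callable[[str, List[View]], str],
    world_model: Callable[[View, str], View],
    answer: Callable[[str, List[View]], str],
    max_steps: int = 3,
) -> Tuple[str, List[View]]:
    """Alternate between proposing a move and imagining its view, then answer."""
    views = [start]
    for _ in range(max_steps):
        action = propose(question, views)
        views.append(world_model(views[-1], action))
    return answer(question, views), views


answer, views = explore_and_answer(
    "Which way is the door?", View((0, 0, 0)),
    toy_propose, toy_world_model, toy_answer,
)
print(answer)
for v in views:
    print(v.pose)
```

Both models are passed in as plain callables, which mirrors the plug-and-play claim: neither component is retrained, and either could be swapped without changing the loop.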

Business Value

Enhances the ability of AI agents (robots, virtual assistants) to understand and interact with 3D environments, which is crucial for tasks like autonomous driving, robotic manipulation, and immersive virtual experiences. This could lead to more capable and safer AI systems in complex physical or simulated spaces.

Paper Metadata

Innovation Type

Framework/Algorithmic

Deployment Feasibility

Moderate. Requires integrating a VLM with a separate world model (video diffusion), which can be computationally intensive. However, it operates at test time, so it avoids retraining costs.

Limitations Addressed

State-of-the-art VLMs struggle with spatial reasoning in 3D

VLMs lack an internal model of 3D dynamics

Difficulty in predicting scene appearance after egocentric motion

Performance Gains

Average performance boost of over 7.7% on the SAT benchmark

Technical Tags

spatial reasoning, embodied AI, vision-language models (VLMs), world models, test-time scaling, video diffusion, 3D dynamics, egocentric motion, camera trajectory, multi-view synthesis

Research Topics

Embodied AI, Spatial Reasoning, Vision-Language Models, Robotics, AI for 3D Environments

Methods & Architectures

MindJourney framework, Test-time scaling, Coupling VLM with a controllable world model (video diffusion), Iterative camera trajectory sketching, Multi-view synthesis

Vision-Language Models (VLMs), Video Diffusion Models, World Models

Applications & Tasks

Robotics, Autonomous Navigation, Virtual Reality, 3D Scene Understanding

Struggles of VLMs with spatial reasoning in 3D, Lack of internal models of 3D dynamics in VLMs, Predicting scene appearance after egocentric motion, Bridging the gap between 2D perception and 3D understanding

Spatial reasoning, Navigation, 3D scene understanding, Predicting future states in 3D environments

Datasets & Benchmarks

Datasets

SAT

Benchmarks

SAT: average performance boost of over 7.7%

Related Fields

Robotics, Computer Vision, Artificial Intelligence, Reinforcement Learning, 3D Graphics

Keywords

spatial reasoning, embodied AI, vision-language model, world model, test-time adaptation, video diffusion, 3D understanding, robotics, navigation, scene reconstruction, generative models, autonomous systems

Academic Context

#Embodied AI, #Spatial Reasoning, #Vision-Language Models, #Robotics, #AI for 3D Environments

Commercial Potential

Potential Products

More capable autonomous robots

Advanced virtual agents for simulations

Tools for 3D scene understanding and reconstruction

Target Industries

Robotics, Automotive (Autonomous Driving), Gaming, Virtual Reality, Logistics

Use Case Examples

A robot navigating a complex warehouse by predicting how its view changes with movement

An AI assistant understanding spatial relationships in a 3D model

Improving the realism of virtual environments by enabling agents to reason about 3D space

Competitive Edge

Offers a novel approach to enhancing VLM spatial reasoning by integrating world models at test time, providing an average performance boost of over 7.7% on SAT without requiring model fine-tuning.

Resource Requirements

Compute Needs

High, due to the use of large VLMs and video diffusion models for synthesis.

Data Requirements

Requires datasets suitable for spatial reasoning tasks and potentially large video datasets for training the world model.

Deployment Constraints

Computational cost at inference time. Integration complexity between VLM and world model.

Scalability

Scalability depends on the efficiency of the VLM and the video diffusion model. Test-time operation might limit real-time applications requiring very low latency.
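To make the latency concern concrete, a back-of-the-envelope estimate can help. The sketch below assumes a strictly sequential loop with one VLM call and one world-model call per exploration step, followed by one final VLM call to answer. That cost model and every timing passed to it are illustrative assumptions, not measurements from the paper.

```python
def exploration_latency(
    n_steps: int,
    vlm_step_s: float,
    world_model_step_s: float,
    vlm_answer_s: float,
) -> float:
    """Estimate end-to-end seconds for a sequential explore-then-answer loop.

    Assumed cost model (hypothetical): each step costs one VLM call plus one
    world-model call, and a single final VLM call produces the answer.
    """
    if n_steps < 0:
        raise ValueError("n_steps must be non-negative")
    return n_steps * (vlm_step_s + world_model_step_s) + vlm_answer_s


# Placeholder timings chosen only to show how the estimate scales with steps.
for steps in (0, 3, 6):
    print(steps, exploration_latency(steps, 1.0, 4.0, 2.0))
```

Under this model the cost grows linearly with the number of exploration steps, so the per-step world-model time dominates whenever synthesis is much slower than a VLM call.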
