MindJourney: Test-Time Scaling with World Models for Spatial Reasoning

Abstract

Spatial reasoning in 3D space is central to human cognition and indispensable for embodied tasks such as navigation and manipulation. However, state-of-the-art vision-language models (VLMs) frequently struggle with tasks as simple as anticipating how a scene will look after an egocentric motion: they perceive 2D images but lack an internal model of 3D dynamics. We therefore propose MindJourney, a test-time scaling framework that grants a VLM this missing capability by coupling it to a controllable world model based on video diffusion. The VLM iteratively sketches a concise camera trajectory, while the world model synthesizes the corresponding view at each step. The VLM then reasons over the multi-view evidence gathered during this interactive exploration. Without any fine-tuning, MindJourney achieves an average performance boost of over 7.7% on the representative spatial reasoning benchmark SAT, showing that pairing VLMs with world models for test-time scaling offers a simple, plug-and-play route to robust 3D reasoning. Our method also improves upon test-time inference VLMs trained through reinforcement learning, which demonstrates the potential of using world models for test-time scaling.
Authors (8)
Yuncong Yang
Jiageng Liu
Zheyuan Zhang
Siyuan Zhou
Reuben Tan
Jianwei Yang
+2 more
Submitted
July 16, 2025
arXiv Category
cs.CV

Key Contributions

MindJourney is a test-time scaling framework that equips VLMs with spatial reasoning capabilities by coupling them with a video-diffusion-based world model. The VLM explores a scene iteratively: it sketches a short camera trajectory, and the world model synthesizes the view at each step. The VLM then reasons over the resulting multi-view evidence. This requires no fine-tuning and yields an average gain of over 7.7% on the SAT spatial reasoning benchmark.
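The explore-then-reason loop described above can be sketched roughly as follows. This is a minimal illustration under stated assumptions, not the authors' implementation: the action vocabulary, the function names (`explore_and_answer`, `propose`, `imagine`, `answer`), and the string placeholders standing in for images are all hypothetical. In a real system `propose` and `answer` would call a VLM, and `imagine` would call a video-diffusion world model.

```python
from dataclasses import dataclass, field
from typing import Callable, List, Sequence, Tuple

# Hypothetical egocentric camera actions (an assumption, not taken from the paper).
ACTIONS = ("forward", "turn_left", "turn_right")


@dataclass
class Exploration:
    """Evidence gathered while exploring: views plus the path that produced each."""
    views: List[str] = field(default_factory=list)
    trajectories: List[Tuple[str, ...]] = field(default_factory=list)


def explore_and_answer(
    question: str,
    start_view: str,
    propose: Callable[[str, Sequence[str]], str],  # stands in for the VLM choosing a motion
    imagine: Callable[[str, str], str],            # stands in for the world model
    answer: Callable[[str, Sequence[str]], str],   # stands in for the VLM answering
    steps: int = 3,
) -> Tuple[str, Exploration]:
    """Alternate between proposing a camera motion and synthesizing the next view,
    then answer the question from all collected views."""
    evidence = Exploration(views=[start_view], trajectories=[()])
    view, path = start_view, ()
    for _ in range(steps):
        action = propose(question, evidence.views)  # next step of the trajectory
        view = imagine(view, action)                # synthesized view after the motion
        path = path + (action,)
        evidence.views.append(view)
        evidence.trajectories.append(path)
    return answer(question, evidence.views), evidence


# Toy stand-ins so the sketch runs without any model; strings play the role of images.
def stub_propose(question: str, views: Sequence[str]) -> str:
    return ACTIONS[len(views) % len(ACTIONS)]


def stub_imagine(view: str, action: str) -> str:
    return f"{view}>{action}"


def stub_answer(question: str, views: Sequence[str]) -> str:
    return f"answered from {len(views)} views"


if __name__ == "__main__":
    result, evidence = explore_and_answer(
        "Which object is to my left after turning around?",
        "v0", stub_propose, stub_imagine, stub_answer, steps=3,
    )
    print(result)
    print(evidence.trajectories[-1])
```

Because the two models are passed in as plain callables, neither has to be retrained to take part in the loop, which mirrors the plug-and-play, no-fine-tuning framing of the summary. The number of exploration steps is the test-time compute knob in this sketch.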

Business Value

Enhances the ability of AI agents (robots, virtual assistants) to understand and interact with 3D environments, which is crucial for tasks such as autonomous driving, robotic manipulation, and immersive virtual experiences. This could lead to more capable and safer AI systems in complex physical or simulated spaces.