
FutureSightDrive: Thinking Visually with Spatio-Temporal CoT for Autonomous Driving

Abstract

Vision-Language-Action (VLA) models are increasingly used for end-to-end driving due to their world knowledge and reasoning ability. Most prior work, however, inserts textual chains of thought (CoT) as intermediate steps tailored to the current scene. Such symbolic compression can blur spatio-temporal relations and discard fine visual cues, creating a cross-modal gap between perception and planning. We propose FSDrive, a visual spatio-temporal CoT framework that enables VLAs to think in images. The model first acts as a world model, generating a unified future frame that overlays coarse but physically plausible priors (future lane dividers and 3D boxes) on the predicted future image. This unified frame serves as the visual CoT, capturing both spatial structure and temporal evolution. The same VLA then functions as an inverse-dynamics model, planning trajectories from the current observations and the visual CoT. To equip VLAs with image generation while preserving understanding, we introduce a unified pre-training paradigm that expands the vocabulary to include visual tokens and jointly optimizes VQA (for semantics) and future-frame prediction (for dynamics). A progressive easy-to-hard scheme first predicts lane/box priors to enforce physical constraints, then completes full future frames for fine details. On nuScenes and NAVSIM, FSDrive improves trajectory accuracy and reduces collisions under both ST-P3 and UniAD metrics, and attains competitive FID for future-frame generation despite using lightweight autoregression. It also advances scene understanding on DriveLM. Together, these results indicate that visual CoT narrows the cross-modal gap and yields safer, more anticipatory planning. Code is available at https://github.com/MIV-XJTU/FSDrive.
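
The unified pre-training paradigm rests on one mechanical step: the VLA's token vocabulary is expanded with discrete visual tokens so a single autoregressive model can emit both text (VQA answers) and image tokens (future frames). Below is a minimal sketch of that step, assuming a Hugging Face-style causal LM and a VQ-VAE codebook; the backbone, codebook size, and token names are illustrative assumptions, not details from the paper.

```python
# Minimal sketch: expanding a causal LM's vocabulary with discrete visual
# tokens from a VQ codebook, so one autoregressive model can emit both text
# and image tokens. Backbone and codebook size are illustrative assumptions.
from transformers import AutoModelForCausalLM, AutoTokenizer

tokenizer = AutoTokenizer.from_pretrained("gpt2")   # stand-in for the VLA backbone
model = AutoModelForCausalLM.from_pretrained("gpt2")

VQ_CODEBOOK_SIZE = 8192                             # assumed VQ-VAE codebook size
visual_tokens = [f"<img_{i}>" for i in range(VQ_CODEBOOK_SIZE)]
tokenizer.add_tokens(visual_tokens)                 # expand the vocabulary
model.resize_token_embeddings(len(tokenizer))       # grow input/output embeddings

# Joint pre-training then mixes two objectives under the same next-token loss:
#   * VQA pairs preserve semantic understanding,
#   * future-frame token sequences teach scene dynamics.
# The progressive curriculum would first target lane/box prior tokens
# (coarse physical structure), then full future-frame tokens (fine detail).
```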
Authors (8)
Shuang Zeng
Xinyuan Chang
Mengwei Xie
Xinran Liu
Yifan Bai
Zheng Pan
+2 more
Submitted
May 23, 2025
arXiv Category
cs.CV

Key Contributions

Proposes FSDrive, a visual spatio-temporal CoT framework for autonomous-driving VLAs. The model generates a unified future frame (the visual CoT) that overlays coarse lane and 3D-box priors on a predicted future image, letting the VLA reason visually about spatial structure and temporal evolution before planning trajectories.
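
A hedged sketch of the two-pass inference this contribution describes: the same autoregressive model is queried twice, first as a world model to emit visual-CoT tokens, then as an inverse-dynamics planner conditioned on them. The `MockVLA` interface and waypoint format below are stand-ins for illustration, not FSDrive's released API.

```python
# Sketch of FSDrive-style two-pass inference. MockVLA stands in for a unified
# autoregressive VLA over mixed text/visual tokens; the real model, prompts,
# and output formats will differ.

class MockVLA:
    """Toy stand-in: one model, two roles (world model and planner)."""

    def generate(self, prompt_tokens: list[str], task: str) -> list[str]:
        # A real VLA would autoregressively decode here; we return dummies.
        if task == "world_model":
            # Visual-CoT tokens: a future frame with lane/box priors overlaid.
            return [f"<img_{i}>" for i in range(4)]
        # Planner output: future waypoints (x, y) in meters (assumed format).
        return ["(1.2, 0.1)", "(2.4, 0.3)", "(3.6, 0.6)"]


def plan(vla: MockVLA, camera_tokens: list[str]) -> list[str]:
    # Pass 1 (world model): predict the unified future frame that serves
    # as the visual chain-of-thought.
    visual_cot = vla.generate(camera_tokens, task="world_model")
    # Pass 2 (inverse dynamics): plan a trajectory from the current
    # observation plus the generated visual CoT.
    return vla.generate(camera_tokens + visual_cot, task="planner")


if __name__ == "__main__":
    print(plan(MockVLA(), camera_tokens=["<img_0>", "<img_1>"]))
```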

Business Value

Improves the safety and reliability of autonomous driving systems by strengthening their predictive and planning capabilities, enabling more anticipatory, human-like driving behavior.