Abstract
Vision-Language-Action (VLA) models are increasingly used for end-to-end
driving due to their world knowledge and reasoning ability. Most prior work,
however, inserts textual chains-of-thought (CoT) as intermediate steps tailored
to the current scene. Such symbolic compressions can blur spatio-temporal
relations and discard fine visual cues, creating a cross-modal gap between
perception and planning. We propose FSDrive, a visual spatio-temporal CoT
framework that enables VLAs to think in images. The model first acts as a world
model to generate a unified future frame that overlays coarse but
physically plausible priors (future lane dividers and 3D boxes) on the predicted
future image. This unified frame serves as the visual CoT, capturing both
spatial structure and temporal evolution. The same VLA then functions as an
inverse-dynamics model, planning trajectories from current observations and the
visual CoT. To equip VLAs with image generation while preserving understanding,
we introduce a unified pre-training paradigm that expands the vocabulary to
include visual tokens and jointly optimizes VQA (for semantics) and
future-frame prediction (for dynamics). A progressive easy-to-hard scheme first
predicts lane/box priors to enforce physical constraints, then completes full
future frames for fine details. On nuScenes and NAVSIM, FSDrive improves
trajectory accuracy and reduces collisions under both ST-P3 and UniAD metrics,
and attains competitive FID for future-frame generation despite using
lightweight autoregression. It also advances scene understanding on DriveLM.
Together, these results indicate that visual CoT narrows the cross-modal gap
and yields safer, more anticipatory planning. Code is available at
https://github.com/MIV-XJTU/FSDrive.
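To make the two-pass pipeline concrete, here is a minimal sketch of the inference flow the abstract describes: the same VLA is first queried as a world model to produce the visual CoT, then as an inverse-dynamics model to plan from it. All names below (ToyVLA, generate_visual_cot, plan_trajectory) are hypothetical stand-ins for illustration, not the authors' actual API; see the repository above for the real implementation.

```python
# Sketch of FSDrive-style two-pass inference (illustrative stubs only).
import numpy as np

class ToyVLA:
    """Stand-in for a single VLA checkpoint used in two roles."""

    def generate_visual_cot(self, frames: np.ndarray) -> np.ndarray:
        # Role 1 (world model): autoregress visual tokens into a unified
        # future frame that overlays coarse priors (lane dividers, 3D boxes)
        # on the predicted image. Stubbed here by echoing the last frame.
        return frames[-1].copy()

    def plan_trajectory(self, frames: np.ndarray, visual_cot: np.ndarray,
                        horizon: int = 6) -> np.ndarray:
        # Role 2 (inverse dynamics): condition on current observations plus
        # the visual CoT and decode future (x, y) waypoints. Stubbed with a
        # straight-line rollout purely for illustration.
        return np.stack([[0.0, 2.0 * (t + 1)] for t in range(horizon)])

# Usage: one model, two sequential calls -- think in images, then plan.
vla = ToyVLA()
obs = np.zeros((4, 256, 448, 3), dtype=np.float32)  # past camera frames
cot_frame = vla.generate_visual_cot(obs)            # visual spatio-temporal CoT
traj = vla.plan_trajectory(obs, cot_frame)          # waypoints in the ego frame
print(traj.shape)  # (6, 2)
```

The key design point is that no second network is introduced: the visual CoT is just an intermediate generation from the same model, consumed as extra conditioning in the planning pass.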
Authors (8)
Shuang Zeng
Xinyuan Chang
Mengwei Xie
Xinran Liu
Yifan Bai
Zheng Pan
+2 more
Key Contributions
Proposes FSDrive, a visual spatio-temporal CoT framework for autonomous driving VLAs. It generates a unified future frame (visual CoT) that overlays coarse priors onto a predicted future image, enabling the VLA to reason visually about spatial structure and temporal evolution for trajectory planning.
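The unified pre-training recipe pairs this with vocabulary expansion and an easy-to-hard curriculum. The sketch below shows one plausible shape of that setup; the vocabulary sizes, step counts, and helper names are assumptions for illustration, not the paper's actual configuration.

```python
# Hedged sketch of the pre-training recipe: append visual (VQ) codes to the
# text vocabulary, then schedule targets from coarse priors to full frames.
TEXT_VOCAB = 32_000          # base language tokens (assumed size)
VISUAL_CODES = 8_192         # assumed VQ codebook size appended to the vocab
UNIFIED_VOCAB = TEXT_VOCAB + VISUAL_CODES

def visual_token_id(code: int) -> int:
    """Map a VQ codebook index into the expanded unified vocabulary."""
    assert 0 <= code < VISUAL_CODES
    return TEXT_VOCAB + code

def curriculum_targets(step: int, switch_step: int = 10_000) -> list[str]:
    """Easy-to-hard schedule: sparse physical priors first, full frames later.
    VQA supervision runs throughout to preserve semantic understanding."""
    if step < switch_step:
        return ["vqa", "lane_and_box_priors"]   # coarse, physics-constrained
    return ["vqa", "full_future_frame"]         # fine-grained image details

print(visual_token_id(0))          # 32000: first visual code after text vocab
print(curriculum_targets(500))     # ['vqa', 'lane_and_box_priors']
print(curriculum_targets(20_000))  # ['vqa', 'full_future_frame']
```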
Business Value
Enhances the safety and reliability of autonomous driving systems by strengthening their prediction and planning capabilities, yielding safer, more anticipatory driving behavior.