Abstract
Vision-Language-Action (VLA) models are increasingly used for end-to-end
driving due to their world knowledge and reasoning ability. Most prior work,
however, inserts textual chains-of-thought (CoT) as intermediate steps tailored
to the current scene. Such symbolic compressions can blur spatio-temporal
relations and discard fine visual cues, creating a cross-modal gap between
perception and planning. We propose FSDrive, a visual spatio-temporal CoT
framework that enables VLAs to think in images. The model first acts as a world
model to generate a unified future frame that overlays coarse but
physically plausible priors (future lane dividers and 3D boxes) on the predicted
future image. This unified frame serves as the visual CoT, capturing both
spatial structure and temporal evolution. The same VLA then functions as an
inverse-dynamics model, planning trajectories from current observations and the
visual CoT. To equip VLAs with image generation while preserving understanding,
we introduce a unified pre-training paradigm that expands the vocabulary to
include visual tokens and jointly optimizes VQA (for semantics) and
future-frame prediction (for dynamics). A progressive easy-to-hard scheme first
predicts lane/box priors to enforce physical constraints, then completes full
future frames for fine details. On nuScenes and NAVSIM, FSDrive improves
trajectory accuracy and reduces collisions under both ST-P3 and UniAD metrics,
and attains competitive FID for future-frame generation despite using
lightweight autoregression. It also advances scene understanding on DriveLM.
Together, these results indicate that visual CoT narrows the cross-modal gap
and yields safer, more anticipatory planning. Code is available at
https://github.com/MIV-XJTU/FSDrive.
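To make the two-pass pipeline concrete, here is a minimal sketch of the inference flow the abstract describes: the same VLA is first queried as a world model to produce the visual CoT, then as an inverse-dynamics model to plan from it. All names below (ToyVLA, generate_visual_cot, plan_trajectory) are hypothetical stand-ins for illustration, not the authors' actual API; see the repository above for the real implementation.

```python
# Sketch of FSDrive-style two-pass inference (illustrative stubs only).
import numpy as np

class ToyVLA:
    """Stand-in for a single VLA checkpoint used in two roles."""

    def generate_visual_cot(self, frames: np.ndarray) -> np.ndarray:
        # Role 1 (world model): autoregress visual tokens into a unified
        # future frame that overlays coarse priors (lane dividers, 3D boxes)
        # on the predicted image. Stubbed here by echoing the last frame.
        return frames[-1].copy()

    def plan_trajectory(self, frames: np.ndarray, visual_cot: np.ndarray,
                        horizon: int = 6) -> np.ndarray:
        # Role 2 (inverse dynamics): condition on current observations plus
        # the visual CoT and decode future (x, y) waypoints. Stubbed with a
        # straight-line rollout purely for illustration.
        return np.stack([[0.0, 2.0 * (t + 1)] for t in range(horizon)])

# Usage: one model, two sequential calls -- think in images, then plan.
vla = ToyVLA()
obs = np.zeros((4, 256, 448, 3), dtype=np.float32)  # past camera frames
cot_frame = vla.generate_visual_cot(obs)            # visual spatio-temporal CoT
traj = vla.plan_trajectory(obs, cot_frame)          # waypoints in the ego frame
print(traj.shape)  # (6, 2)
```

The key design point is that no second network is introduced: the visual CoT is just an intermediate generation from the same model, consumed as extra conditioning in the planning pass.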
Authors (8)
Shuang Zeng
Xinyuan Chang
Mengwei Xie
Xinran Liu
Yifan Bai
Zheng Pan
+2 more
Key Contributions
Proposes FSDrive, a visual spatio-temporal CoT framework for autonomous driving VLAs. It generates a unified future frame (visual CoT) that overlays coarse priors onto a predicted future image, enabling the VLA to reason visually about spatial structure and temporal evolution for trajectory planning.
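The unified pre-training recipe pairs this with vocabulary expansion and an easy-to-hard curriculum. The sketch below shows one plausible shape of that setup; the vocabulary sizes, step counts, and helper names are assumptions for illustration, not the paper's actual configuration.

```python
# Hedged sketch of the pre-training recipe: append visual (VQ) codes to the
# text vocabulary, then schedule targets from coarse priors to full frames.
TEXT_VOCAB = 32_000          # base language tokens (assumed size)
VISUAL_CODES = 8_192         # assumed VQ codebook size appended to the vocab
UNIFIED_VOCAB = TEXT_VOCAB + VISUAL_CODES

def visual_token_id(code: int) -> int:
    """Map a VQ codebook index into the expanded unified vocabulary."""
    assert 0 <= code < VISUAL_CODES
    return TEXT_VOCAB + code

def curriculum_targets(step: int, switch_step: int = 10_000) -> list[str]:
    """Easy-to-hard schedule: sparse physical priors first, full frames later.
    VQA supervision runs throughout to preserve semantic understanding."""
    if step < switch_step:
        return ["vqa", "lane_and_box_priors"]   # coarse, physics-constrained
    return ["vqa", "full_future_frame"]         # fine-grained image details

print(visual_token_id(0))          # 32000: first visual code after text vocab
print(curriculum_targets(500))     # ['vqa', 'lane_and_box_priors']
print(curriculum_targets(20_000))  # ['vqa', 'full_future_frame']
```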
Business Value
Enhances the safety and reliability of autonomous driving systems by strengthening their prediction and planning capabilities, yielding safer, more anticipatory driving behavior.