📄 Abstract
Recent advancements in Vision-Language-Action (VLA) models have shown promise
for end-to-end autonomous driving by leveraging world knowledge and reasoning
capabilities. However, current VLA models often struggle with physically
infeasible action outputs, complex model structures, or unnecessarily long
reasoning. In this paper, we propose AutoVLA, a novel VLA model that unifies
reasoning and action generation within a single autoregressive generation model
for end-to-end autonomous driving. AutoVLA performs semantic reasoning and
trajectory planning directly from raw visual inputs and language instructions.
We tokenize continuous trajectories into discrete, feasible actions, enabling
direct integration into the language model. For training, we employ supervised
fine-tuning to equip the model with dual thinking modes: fast thinking
(trajectory-only) and slow thinking (enhanced with chain-of-thought reasoning).
To further enhance planning performance and efficiency, we introduce a
reinforcement fine-tuning method based on Group Relative Policy Optimization
(GRPO), reducing unnecessary reasoning in straightforward scenarios. Extensive
experiments across real-world and simulated datasets and benchmarks, including
nuPlan, nuScenes, Waymo, and CARLA, demonstrate the competitive performance of
AutoVLA in both open-loop and closed-loop settings. Qualitative results
showcase the adaptive reasoning and accurate planning capabilities of AutoVLA
in diverse scenarios.
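The abstract states that continuous trajectories are tokenized into discrete, feasible actions so they can be generated by the language model. The following is a minimal sketch of that idea only, not the paper's tokenizer: the codebook values, the `CODEBOOK`, `tokenize`, and `detokenize` names, and the nearest-primitive matching rule are all assumptions made for illustration.

```python
import math

# Hypothetical codebook (an assumption, not taken from the paper): each token is
# a motion primitive (forward distance in metres, heading change in radians)
# applied over one planning step.
CODEBOOK = [(d, h) for d in (0.0, 1.0, 2.0, 4.0) for h in (-0.2, 0.0, 0.2)]


def step(x, y, yaw, token_id):
    """Apply one motion primitive to a pose and return the new pose."""
    dist, dyaw = CODEBOOK[token_id]
    yaw = yaw + dyaw
    return x + dist * math.cos(yaw), y + dist * math.sin(yaw), yaw


def tokenize(waypoints):
    """Greedily map (x, y) waypoints to token ids, starting at pose (0, 0, 0)."""
    x, y, yaw = 0.0, 0.0, 0.0
    tokens = []
    for tx, ty in waypoints:
        best_id, best_err = 0, float("inf")
        for token_id in range(len(CODEBOOK)):
            nx, ny, _ = step(x, y, yaw, token_id)
            err = math.hypot(nx - tx, ny - ty)
            if err < best_err:
                best_id, best_err = token_id, err
        # Continue from the reconstructed pose so the token sequence stays
        # consistent with what the decoder will produce.
        x, y, yaw = step(x, y, yaw, best_id)
        tokens.append(best_id)
    return tokens


def detokenize(tokens):
    """Roll the motion primitives forward to recover (x, y) waypoints."""
    x, y, yaw = 0.0, 0.0, 0.0
    waypoints = []
    for token_id in tokens:
        x, y, yaw = step(x, y, yaw, token_id)
        waypoints.append((x, y))
    return waypoints
```

Because every token decodes to a primitive from a fixed set, any generated token sequence rolls out to a trajectory built only from those primitives, which is one way to read "feasible" in this context. For example, `tokenize([(1.0, 0.0), (3.0, 0.0), (7.0, 0.0)])` selects the straight-ahead primitives of length 1, 2, and 4 metres.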
Authors (7)
Zewei Zhou
Tianhui Cai
Seth Z. Zhao
Yun Zhang
Zhiyu Huang
Bolei Zhou
+1 more
Key Contributions
This paper proposes AutoVLA, a novel Vision-Language-Action (VLA) model for end-to-end autonomous driving that unifies reasoning and action generation within a single autoregressive model. It addresses limitations of existing VLA models by enabling direct semantic reasoning and trajectory planning from visual inputs and language, incorporating adaptive reasoning modes (fast/slow) and reinforcement fine-tuning.
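The reinforcement fine-tuning mentioned above is based on GRPO, whose core step scores each sampled output relative to the other samples drawn for the same input. The sketch below shows that group-relative normalization in its commonly described form. The `shaped_reward` function, its `length_penalty` value, and the use of population standard deviation are assumptions for illustration; the paper's actual reward is not specified in this summary.

```python
import statistics


def shaped_reward(planning_score, num_reasoning_tokens, length_penalty=0.001):
    """Hypothetical reward: planning quality minus a cost per reasoning token."""
    return planning_score - length_penalty * num_reasoning_tokens


def group_relative_advantages(rewards, eps=1e-6):
    """Normalize rewards within one group of samples for the same scenario."""
    mean = statistics.fmean(rewards)
    std = statistics.pstdev(rewards)  # population std; an assumption here
    return [(r - mean) / (std + eps) for r in rewards]


# Two samples with equal planning quality: one answers directly, the other
# spends 200 tokens on reasoning first.
rewards = [shaped_reward(0.9, 0), shaped_reward(0.9, 200)]
advantages = group_relative_advantages(rewards)
```

Under these assumptions the direct answer receives a positive advantage and the long-reasoning answer a negative one, which illustrates how a length-aware reward could discourage unnecessary reasoning when it does not improve the plan.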
Business Value
Advances the development of safer and more intelligent autonomous driving systems, potentially reducing accidents, improving traffic flow, and enabling new mobility services.