
AutoVLA: A Vision-Language-Action Model for End-to-End Autonomous Driving with Adaptive Reasoning and Reinforcement Fine-Tuning

Abstract

Recent advancements in Vision-Language-Action (VLA) models have shown promise for end-to-end autonomous driving by leveraging world knowledge and reasoning capabilities. However, current VLA models often struggle with physically infeasible action outputs, complex model structures, or unnecessarily long reasoning. In this paper, we propose AutoVLA, a novel VLA model that unifies reasoning and action generation within a single autoregressive generation model for end-to-end autonomous driving. AutoVLA performs semantic reasoning and trajectory planning directly from raw visual inputs and language instructions. We tokenize continuous trajectories into discrete, feasible actions, enabling direct integration into the language model. For training, we employ supervised fine-tuning to equip the model with dual thinking modes: fast thinking (trajectory-only) and slow thinking (enhanced with chain-of-thought reasoning). To further enhance planning performance and efficiency, we introduce a reinforcement fine-tuning method based on Group Relative Policy Optimization (GRPO), reducing unnecessary reasoning in straightforward scenarios. Extensive experiments across real-world and simulated datasets and benchmarks, including nuPlan, nuScenes, Waymo, and CARLA, demonstrate the competitive performance of AutoVLA in both open-loop and closed-loop settings. Qualitative results showcase the adaptive reasoning and accurate planning capabilities of AutoVLA in diverse scenarios.
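
The trajectory tokenization step the abstract describes can be made concrete with a short sketch. The snippet below is a minimal, hypothetical illustration, assuming uniform per-step displacement bins in ego coordinates; the class name `ActionTokenizer`, the bin layout, and the x/y interleaving are assumptions for illustration, not the paper's actual action vocabulary or code.

```python
import numpy as np

class ActionTokenizer:
    """Hypothetical sketch: discretize a continuous trajectory into tokens.

    Bin edges and the (dx, dy) parameterization are illustrative
    assumptions, not AutoVLA's actual vocabulary.
    """

    def __init__(self, num_bins: int = 64, max_step: float = 3.0):
        # Uniform bins over per-step displacement in ego coordinates;
        # clipping to [-max_step, max_step] keeps actions feasible.
        self.bins = np.linspace(-max_step, max_step, num_bins + 1)
        self.num_bins = num_bins

    def encode(self, trajectory: np.ndarray) -> list[int]:
        """Map a (T, 2) waypoint trajectory to a flat token sequence."""
        # Per-step displacements, with an implicit origin at (0, 0).
        deltas = np.diff(trajectory, axis=0, prepend=trajectory[:1] * 0)
        ids = np.digitize(np.clip(deltas, self.bins[0], self.bins[-1]),
                          self.bins) - 1
        ids = np.clip(ids, 0, self.num_bins - 1)
        # Fuse the x and y bin indices into one vocabulary entry per step.
        return (ids[:, 0] * self.num_bins + ids[:, 1]).tolist()

    def decode(self, tokens: list[int]) -> np.ndarray:
        """Invert encode() back to waypoints, up to quantization error."""
        centers = (self.bins[:-1] + self.bins[1:]) / 2
        ids = np.array(tokens)
        deltas = np.stack([centers[ids // self.num_bins],
                           centers[ids % self.num_bins]], axis=1)
        return np.cumsum(deltas, axis=0)
```

Because every token maps back to a bounded displacement, any sequence the language model emits decodes to a physically plausible path, which is the point of folding actions into the token vocabulary.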
Authors (7): Zewei Zhou, Tianhui Cai, Seth Z. Zhao, Yun Zhang, Zhiyu Huang, Bolei Zhou, +1 more
Submitted: June 16, 2025
arXiv Category: cs.CV

Key Contributions

This paper proposes AutoVLA, a novel Vision-Language-Action (VLA) model for end-to-end autonomous driving that unifies reasoning and action generation within a single autoregressive model. It addresses the limitations of existing VLA models (physically infeasible outputs, complex model structures, and unnecessarily long reasoning) by planning trajectories directly from raw visual inputs and language instructions, switching adaptively between fast (trajectory-only) and slow (chain-of-thought) reasoning modes, and applying GRPO-based reinforcement fine-tuning.
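
The reinforcement fine-tuning stage centers on GRPO's group-relative advantage. The sketch below is a minimal, hypothetical illustration of that computation, assuming a scalar task reward per sampled rollout and a chain-of-thought length penalty standing in for "reducing unnecessary reasoning"; the function name `grpo_advantages`, the reward shaping, and the penalty weight are assumptions, not the paper's implementation.

```python
import numpy as np

def grpo_advantages(rewards: np.ndarray, reasoning_lens: np.ndarray,
                    len_penalty: float = 0.01) -> np.ndarray:
    """Normalized advantages for a group of rollouts sampled for one scene.

    rewards: task reward per rollout (e.g., a trajectory quality score);
    reasoning_lens: chain-of-thought token count per rollout.
    Both the reward terms and len_penalty are illustrative assumptions.
    """
    # Penalize long reasoning so trajectory-only ("fast thinking") outputs
    # win in scenarios where extra reasoning does not improve the reward.
    shaped = rewards - len_penalty * reasoning_lens
    # GRPO baselines each rollout against its own group's statistics
    # instead of a learned value function.
    return (shaped - shaped.mean()) / (shaped.std() + 1e-8)

# Usage: four rollouts for one scene; the short, high-reward rollouts get
# positive advantages, pushing the policy away from verbose reasoning.
adv = grpo_advantages(np.array([0.9, 0.85, 0.4, 0.9]),
                      np.array([0, 120, 300, 40]))
```

Normalizing within the group avoids training a separate critic, which keeps the fine-tuning loop simple for an autoregressive planner like the one described here.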

Business Value

Advances the development of safer and more intelligent autonomous driving systems, potentially reducing accidents, improving traffic flow, and enabling new mobility services.