📄 Abstract
Safe and feasible trajectory planning is critical for real-world autonomous
driving systems. However, existing learning-based planners rely heavily on
expert demonstrations, which not only lack explicit safety awareness but also
risk inheriting undesirable behaviors, such as speeding, from suboptimal human
driving data. Inspired by the success of large language models, we propose
Plan-R1, a two-stage trajectory planning framework that decouples principle
alignment from behavior learning. In the first stage, a general trajectory
predictor is pre-trained on expert data to capture diverse, human-like driving
behaviors. In the second stage, the model is fine-tuned with rule-based rewards
using Group Relative Policy Optimization (GRPO), explicitly aligning ego
planning with principles such as safety, comfort, and traffic rule compliance.
This two-stage paradigm retains human-like behaviors while enhancing safety
awareness and discarding undesirable patterns from demonstrations. Furthermore,
we identify a key limitation of directly applying GRPO to planning: group-wise
normalization erases cross-group scale differences, causing rare, high-variance
safety-violation groups to receive advantages similar to those of abundant,
low-variance safe groups, thereby suppressing optimization for safety-critical objectives.
To address this, we propose Variance-Decoupled GRPO (VD-GRPO), which replaces
normalization with centering and fixed scaling to preserve absolute reward
magnitudes, ensuring that safety-critical objectives remain dominant throughout
training. Experiments on the nuPlan benchmark demonstrate that Plan-R1
significantly improves planning safety and feasibility, achieving
state-of-the-art performance, particularly in realistic reactive settings. Our
code is available at https://github.com/XiaolongTang23/Plan-R1.
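The following is a minimal sketch, not taken from the authors' code, of the contrast the abstract describes between GRPO's group-wise normalization and VD-GRPO's centering with fixed scaling. The group sizes, reward values, and the fixed `scale` parameter are illustrative assumptions.

```python
import numpy as np

def grpo_advantages(rewards: np.ndarray, eps: float = 1e-8) -> np.ndarray:
    """Standard GRPO: center and normalize rewards within one group of sampled trajectories."""
    return (rewards - rewards.mean()) / (rewards.std() + eps)

def vd_grpo_advantages(rewards: np.ndarray, scale: float = 1.0) -> np.ndarray:
    """VD-GRPO as described in the abstract: center within the group but divide by a
    fixed scale, so absolute reward magnitudes are preserved across groups."""
    return (rewards - rewards.mean()) / scale

# A rare safety-violation group (large reward spread) vs. an abundant safe group
# (small spread around a high reward); the numbers are made up for illustration.
unsafe_group = np.array([0.0, 0.0, 1.0, 0.2])
safe_group   = np.array([0.90, 0.95, 0.92, 0.93])

# Group-wise normalization rescales both groups to unit spread, so the
# safety-critical group no longer stands out in the advantage signal.
print(grpo_advantages(unsafe_group))
print(grpo_advantages(safe_group))

# Centering with a fixed scale keeps the unsafe group's larger magnitudes,
# so safety-critical objectives retain more weight during optimization.
print(vd_grpo_advantages(unsafe_group))
print(vd_grpo_advantages(safe_group))
```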