Plan-R1: Safe and Feasible Trajectory Planning as Language Modeling
The paper "Plan-R1: Safe and Feasible Trajectory Planning as LLMing," introduces a novel framework for trajectory planning in autonomous driving. This framework, designated as Plan-R1, seeks to enhance the safety and feasibility of autonomous driving systems by harnessing the predictive capabilities of LLMing, a method that has demonstrated considerable success in other domains.
Technical Approach
Plan-R1 is a two-stage trajectory planning framework built around sequential prediction. In the first stage, an autoregressive model is pre-trained on expert demonstration data to predict future motions from past and current observations. In the second stage, Group Relative Policy Optimization (GRPO), a reinforcement learning technique, aligns the model's predictions with explicit planning principles, including safety, comfort, and adherence to traffic rules. The key insight behind Plan-R1 is that trajectory planning can be formulated as language modeling: sequences of actions (or motions) are predicted token by token and then refined against defined rewards.
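To make the "planning as language modeling" idea concrete, the sketch below shows one plausible way to discretize continuous trajectories into motion tokens. The bin edges, resolution, and joint (dx, dy) token scheme are illustrative assumptions, not the paper's exact tokenization.

```python
import numpy as np

# Illustration only: discretize per-step (dx, dy) displacements into a small
# vocabulary of motion tokens. Bin edges and vocabulary size are assumptions;
# the paper's actual tokenization may differ.
BINS = np.linspace(-3.0, 3.0, 31)   # 31 bin edges per axis (~0.2 m resolution)


def trajectory_to_tokens(xy: np.ndarray) -> np.ndarray:
    """Map a (T+1, 2) waypoint sequence to T discrete motion-token ids."""
    deltas = np.diff(xy, axis=0)                         # per-step displacement
    ix = np.clip(np.digitize(deltas[:, 0], BINS), 0, len(BINS))
    iy = np.clip(np.digitize(deltas[:, 1], BINS), 0, len(BINS))
    return ix * (len(BINS) + 1) + iy                     # joint (dx, dy) token id


# Once trajectories are token sequences, planning reduces to next-token
# prediction, exactly as in language modeling.
```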
Methodology
The methodology within Plan-R1 is detailed as follows:
- Autoregressive Pre-training: This phase learns from sequences of expert driving data. The paper discretizes trajectories into motion tokens, turning the planning task into a sequence prediction problem akin to natural language processing. An attention-based model architecture captures temporal dynamics and agent interactions (a minimal pre-training sketch follows this list).
- Rule-based Reinforcement Learning: The fine-tuning stage is distinguished by its reinforcement learning component. Here, Plan-R1 applies interpretable, rule-based reward functions so that predicted trajectories satisfy core principles such as collision avoidance and speed-limit compliance. GRPO enables efficient policy optimization without extensive preference data, setting Plan-R1 apart from methods that rely on reinforcement learning from human feedback (RLHF); a GRPO sketch also follows this list.
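As a rough illustration of the pre-training stage, the sketch below trains a small causal Transformer with a standard next-token cross-entropy loss on motion-token sequences. The vocabulary size, model dimensions, and the omission of explicit scene-context encoding are simplifying assumptions rather than the paper's architecture.

```python
import torch
import torch.nn as nn

# Minimal sketch of autoregressive pre-training on motion tokens.
# VOCAB, D_MODEL, and N_LAYERS are illustrative choices.
VOCAB, D_MODEL, N_LAYERS = 1024, 256, 4


class MotionDecoder(nn.Module):
    def __init__(self) -> None:
        super().__init__()
        self.embed = nn.Embedding(VOCAB, D_MODEL)
        layer = nn.TransformerEncoderLayer(D_MODEL, nhead=8, batch_first=True)
        self.backbone = nn.TransformerEncoder(layer, N_LAYERS)
        self.head = nn.Linear(D_MODEL, VOCAB)

    def forward(self, tokens: torch.Tensor) -> torch.Tensor:
        # Causal mask so each position only attends to earlier motion tokens.
        causal = nn.Transformer.generate_square_subsequent_mask(tokens.size(1))
        h = self.backbone(self.embed(tokens), mask=causal)
        return self.head(h)                               # (B, T, VOCAB) logits


def pretrain_loss(model: MotionDecoder, tokens: torch.Tensor) -> torch.Tensor:
    """Next-token cross-entropy on expert motion-token sequences."""
    logits = model(tokens[:, :-1])
    return nn.functional.cross_entropy(
        logits.reshape(-1, VOCAB), tokens[:, 1:].reshape(-1)
    )
```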
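The next sketch illustrates the GRPO fine-tuning idea under stated assumptions: a group of candidate trajectories is sampled per scene, each is scored with interpretable rule-based rewards, advantages are normalized within the group (no learned value function), and a PPO-style clipped surrogate updates the policy. The reward terms, their weights, and the `scene.*` helper methods are hypothetical stand-ins, not the paper's exact reward design.

```python
import torch

# GRPO sketch: rule-based rewards + group-relative advantages + clipped update.


def rule_based_reward(traj, scene) -> float:
    """Hypothetical reward mixing safety, rule compliance, and comfort terms."""
    r = 0.0
    r -= 10.0 * scene.count_collisions(traj)            # hard safety penalty
    r -= 1.0 * scene.speed_limit_violation_time(traj)   # traffic-rule compliance
    r -= 0.1 * scene.jerk(traj)                         # comfort term
    return r


def group_relative_advantages(rewards: torch.Tensor) -> torch.Tensor:
    """Normalize each sampled trajectory's reward against its own group."""
    return (rewards - rewards.mean()) / (rewards.std() + 1e-6)


def grpo_loss(logp_new: torch.Tensor, logp_old: torch.Tensor,
              advantages: torch.Tensor, clip_eps: float = 0.2) -> torch.Tensor:
    """PPO-style clipped surrogate driven by group-relative advantages."""
    ratio = torch.exp(logp_new - logp_old)
    clipped = torch.clamp(ratio, 1 - clip_eps, 1 + clip_eps)
    return -torch.min(ratio * advantages, clipped * advantages).mean()
```

Because advantages are computed relative to the sampled group, no critic network or human preference data is needed, which is the property that distinguishes GRPO-based fine-tuning from RLHF-style pipelines.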
Experimental Results
Experimental validation on the nuPlan benchmark shows that Plan-R1 addresses major safety concerns of imitation-learning-based methods, such as the tendency to replicate suboptimal human driving behaviors like speeding. The results show a significant improvement in planning metrics, most notably in the reactive closed-loop simulation setting, where the planner outperforms existing planners in terms of safety and feasibility.
Implications and Future Directions
This paper advances automated trajectory planning by decoupling behavioral imitation from safety-principle alignment, a strategy that could carry over to other autonomous systems. Plan-R1 offers a methodologically sound approach to reconciling learned driving behaviors with safety-critical operating requirements. Future research may explore integrating the two-stage framework with more complex scenarios, as well as incorporating real-time environmental feedback into the learning loop. Additionally, advancing reinforcement learning strategies such as GRPO and exploring their applicability beyond autonomous driving could shape how learning-based methods are deployed across other domains of AI.
In summary, Plan-R1 presents an innovative approach to safety in autonomous vehicles through sequential modeling and reinforcement learning, promising safer and more reliable trajectory planning.