- The paper presents RAD, a method for training end-to-end autonomous driving policies by combining reinforcement learning and imitation learning in a photorealistic 3DGS-based environment.
- The training employs a three-stage paradigm encompassing perception pre-training, imitation learning for planning, and reinforced post-training within a closed-loop 3DGS simulation.
- Evaluation demonstrates that the RAD approach significantly reduces collisions, achieving a 3x lower collision rate compared to imitation learning-only methods while maintaining competitive trajectory fidelity.
The paper presents a highly technical framework for end-to-end autonomous driving by integrating reinforcement learning (RL) with imitation learning (IL) in a photorealistic, closed-loop environment generated via 3D Gaussian Splatting (3DGS). The proposed method addresses key deficiencies of IL-only solutions, including causal confusion and the open-loop gap, by enabling the policy to explore a broader state space and explicitly learn from safety-critical events. The approach is structured around a three-stage training paradigm:
Training Paradigm
- Perception Pre-Training:
A BEV encoder transforms multi-view images into a bird's-eye-view feature map, from which dedicated map and agent heads extract instance-level scene tokens. Ground-truth supervision from annotated map elements and agent dynamics ensures that the learned tokens capture high-level semantic and geometric information.
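To make the perception stage concrete, here is a minimal PyTorch-style sketch of how a BEV feature map could be decoded into instance-level map and agent tokens. The module names, query counts, and the use of cross-attention heads are illustrative assumptions rather than the paper's exact architecture.

```python
# Minimal sketch of the perception stage: a BEV feature map is decoded by a
# map head and an agent head into instance-level "scene tokens" that the
# later planning stages consume. All names and sizes are assumptions.
import torch
import torch.nn as nn

class PerceptionSketch(nn.Module):
    def __init__(self, dim=256, num_map_queries=100, num_agent_queries=50):
        super().__init__()
        # Stand-in for the multi-view image -> BEV encoder; a 1x1 conv keeps
        # the sketch runnable end to end.
        self.bev_encoder = nn.Conv2d(3, dim, kernel_size=1)
        # Learnable queries pull map / agent instances out of the BEV map.
        self.map_queries = nn.Parameter(torch.randn(num_map_queries, dim))
        self.agent_queries = nn.Parameter(torch.randn(num_agent_queries, dim))
        self.map_head = nn.MultiheadAttention(dim, 8, batch_first=True)
        self.agent_head = nn.MultiheadAttention(dim, 8, batch_first=True)

    def forward(self, bev_input):
        # bev_input: (B, 3, H, W) placeholder for rasterized multi-view features.
        bev = self.bev_encoder(bev_input)                 # (B, C, H, W)
        tokens = bev.flatten(2).transpose(1, 2)           # (B, H*W, C)
        b = tokens.size(0)
        map_q = self.map_queries.unsqueeze(0).expand(b, -1, -1)
        agent_q = self.agent_queries.unsqueeze(0).expand(b, -1, -1)
        map_tokens, _ = self.map_head(map_q, tokens, tokens)       # map instances
        agent_tokens, _ = self.agent_head(agent_q, tokens, tokens) # agent instances
        return map_tokens, agent_tokens  # supervised with GT map / agent labels
```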
- Planning Pre-Training via Imitation Learning:
The planning head, built upon a cascaded Transformer decoder, takes scene tokens and outputs a probabilistic action distribution. By leveraging large-scale expert driving demonstrations, the model is initialized to mimic human driving behavior. The action space is discretized into decoupled lateral and longitudinal components, each defined over a short 0.5-second horizon, to reduce dimensionality and improve convergence. For example, the lateral displacement $a_x$ is discretized into 61 uniformly spaced choices between a minimum deviation of −0.75 m and a maximum of 0.75 m, while the longitudinal displacement $a_y$ spans 0 to 15 m.
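A small sketch of this decoupled, discretized action space is shown below. The lateral range and bin count follow the text; reusing 61 bins for the longitudinal axis is an assumption made only for illustration.

```python
# Sketch of the decoupled, discretized action space over a 0.5 s horizon.
import numpy as np

N_LAT, LAT_MIN, LAT_MAX = 61, -0.75, 0.75   # lateral displacement a_x [m]
N_LON, LON_MIN, LON_MAX = 61, 0.0, 15.0     # longitudinal displacement a_y [m]
                                            # (longitudinal bin count assumed)
lat_bins = np.linspace(LAT_MIN, LAT_MAX, N_LAT)
lon_bins = np.linspace(LON_MIN, LON_MAX, N_LON)

def discretize(displacement, bins):
    """Map a continuous displacement to the index of the nearest bin."""
    return int(np.argmin(np.abs(bins - displacement)))

# Example: an expert step of +0.10 m lateral and 4.2 m longitudinal over the
# 0.5 s horizon becomes a pair of class labels for the decoupled policy heads.
ax_idx = discretize(0.10, lat_bins)
ay_idx = discretize(4.2, lon_bins)
print(ax_idx, ay_idx, lat_bins[ax_idx], lon_bins[ay_idx])
```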
- Reinforced Post-Training with Hybrid RL-IL:
The IL-initialized policy is fine-tuned in closed-loop 3DGS environments, where RL rollouts expose it to safety-critical states while IL continues to act as a regularizer. The RL signal is driven by four reward sources:
- Dynamic Collision Reward: Penalizes collisions with moving obstacles.
- Static Collision Reward: Penalizes intersections with static obstacles.
- Positional Deviation Reward: Penalizes Euclidean deviation from the expert trajectory beyond a threshold $d_{\max}$.
- Heading Deviation Reward: Penalizes angular misalignment beyond a threshold $\psi_{\max}$.
The reward function is formulated as

$$R = \{\, r_{\mathrm{dc}},\ r_{\mathrm{sc}},\ r_{\mathrm{pd}},\ r_{\mathrm{hd}} \,\},$$

where each reward is triggered under its corresponding safety or deviation condition, and any such event leads to immediate episode termination to prevent further exposure to noisy sensory data.
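The following sketch illustrates how the four reward triggers and the early-termination rule could be wired together. The threshold values and the unit penalty magnitudes are illustrative assumptions, not the paper's numbers.

```python
# Hedged sketch of the four reward triggers and the early-termination rule.
D_MAX = 2.0      # positional deviation threshold d_max [m] (assumed value)
PSI_MAX = 0.5    # heading deviation threshold psi_max [rad] (assumed value)

def step_reward(dynamic_collision, static_collision,
                pos_deviation, heading_deviation):
    """Return (r_dc, r_sc, r_pd, r_hd, done) for one closed-loop step."""
    r_dc = -1.0 if dynamic_collision else 0.0
    r_sc = -1.0 if static_collision else 0.0
    r_pd = -1.0 if pos_deviation > D_MAX else 0.0
    r_hd = -1.0 if heading_deviation > PSI_MAX else 0.0
    # Any triggered event ends the episode so the policy is not trained on
    # the degraded renderings produced after leaving the reconstructed corridor.
    done = any(r < 0.0 for r in (r_dc, r_sc, r_pd, r_hd))
    return r_dc, r_sc, r_pd, r_hd, done
```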
Action Representation and Policy Optimization
- The policy outputs separate probability distributions for the lateral and longitudinal actions, each obtained as a softmax over an MLP applied to the Transformer decoder's features combined with navigation and ego-state embeddings (sketched below). Specifically, the lateral distribution is given by

$$\pi(a_x \mid s) = \mathrm{softmax}\!\big(\mathrm{MLP}\big(\phi(E_{\mathrm{plan}}, E_{\mathrm{scene}}) + E_{\mathrm{navi}} + E_{\mathrm{state}}\big)\big),$$

with an analogous formulation for $\pi(a_y \mid s)$.
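A minimal sketch of such decoupled policy heads is given below, assuming a single planning query that attends to the scene tokens (playing the role of $\phi$) before being fused with navigation and ego-state embeddings; the dimensions and fusion details are assumptions.

```python
# Minimal sketch of the decoupled policy heads in the formula above.
import torch
import torch.nn as nn

class DecoupledPolicyHead(nn.Module):
    def __init__(self, dim=256, n_lat=61, n_lon=61):
        super().__init__()
        # phi: planning query attends to scene tokens (assumed fusion form).
        self.phi = nn.MultiheadAttention(dim, 8, batch_first=True)
        self.lat_mlp = nn.Sequential(nn.Linear(dim, dim), nn.ReLU(), nn.Linear(dim, n_lat))
        self.lon_mlp = nn.Sequential(nn.Linear(dim, dim), nn.ReLU(), nn.Linear(dim, n_lon))

    def forward(self, e_plan, e_scene, e_navi, e_state):
        # e_plan: (B, 1, D) planning query; e_scene: (B, N, D) scene tokens;
        # e_navi, e_state: (B, D) navigation and ego-state embeddings.
        fused, _ = self.phi(e_plan, e_scene, e_scene)        # (B, 1, D)
        feat = fused.squeeze(1) + e_navi + e_state           # (B, D)
        pi_lat = torch.softmax(self.lat_mlp(feat), dim=-1)   # pi(a_x | s)
        pi_lon = torch.softmax(self.lon_mlp(feat), dim=-1)   # pi(a_y | s)
        return pi_lat, pi_lon
```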
- Values are estimated separately as $V_x(s)$ and $V_y(s)$ to match the decoupled reward structure. Generalized Advantage Estimation (GAE) is employed separately for the lateral and longitudinal advantages:

$$\delta_t^{x} = r_t^{x} + \gamma V_x(s_{t+1}) - V_x(s_t), \qquad \hat{A}_t^{x} = \sum_{l \ge 0} (\gamma \lambda)^{l}\, \delta_{t+l}^{x},$$

with a similar formulation for the longitudinal component. These estimates are used in a clipped objective within the Proximal Policy Optimization (PPO) framework, where separate clipping thresholds $\epsilon_x$ and $\epsilon_y$ maintain update stability (see the sketch below).
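Below is a compact sketch of the decoupled advantage estimation and the clipped update for one action dimension; the longitudinal branch would repeat it with its own rewards, value head, and threshold $\epsilon_y$. Hyperparameter values are illustrative.

```python
# Sketch of decoupled GAE and the PPO clipped surrogate (lateral branch).
import torch

def gae(rewards, values, gamma=0.99, lam=0.95):
    """GAE for one trajectory. rewards: (T,), values: (T+1,) with a
    bootstrap value appended for the final state."""
    T = rewards.shape[0]
    adv = torch.zeros(T)
    running = 0.0
    for t in reversed(range(T)):
        delta = rewards[t] + gamma * values[t + 1] - values[t]
        running = delta + gamma * lam * running
        adv[t] = running
    return adv

def ppo_clip_loss(logp_new, logp_old, adv, eps=0.1):
    """Clipped PPO surrogate for one action dimension (eps_x or eps_y)."""
    ratio = torch.exp(logp_new - logp_old)
    unclipped = ratio * adv
    clipped = torch.clamp(ratio, 1.0 - eps, 1.0 + eps) * adv
    return -torch.minimum(unclipped, clipped).mean()
```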
- The training further benefits from auxiliary objectives that provide dense gradient signals. These objectives modulate the predicted action probabilities by comparing the current policy with the previous one (e.g., increasing the probability mass on deceleration actions when a dynamic collision is imminent), guiding the policy toward safe behavior.
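As one possible reading of this mechanism, the sketch below adds a dense loss that discourages the new policy from reducing the probability mass on deceleration bins relative to the rollout policy whenever the simulator flags an imminent dynamic collision. Both the gating signal and the loss form are assumptions, not the paper's published formulation.

```python
# Heavily hedged illustration of one possible auxiliary objective.
import torch

def decel_auxiliary_loss(pi_lon_new, pi_lon_old, decel_mask, collision_imminent):
    """pi_lon_*: (B, n_lon) longitudinal distributions; decel_mask: (n_lon,)
    boolean selecting bins that slow the vehicle; collision_imminent: (B,)
    boolean flag assumed to come from the simulator."""
    m = decel_mask.float()
    p_new = (pi_lon_new * m).sum(dim=-1)                 # mass on slowing down
    p_old = (pi_lon_old * m).sum(dim=-1).detach()        # previous policy's mass
    # Penalize shrinking that mass (ratio < 1), only on flagged samples.
    ratio = p_new / p_old.clamp_min(1e-6)
    loss = -torch.log(ratio.clamp_min(1e-6)) * collision_imminent.float()
    return loss.mean()
```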
Evaluation and Ablation Studies
- The policy is evaluated in a closed-loop benchmark composed of hundreds of previously unseen 3DGS environments. Metrics include the following (a brief computation sketch follows the list):
- Collision Ratio (CR): Sum of dynamic (DCR) and static collision ratios (SCR).
- Deviation Ratios: Positional (PDR) and heading (HDR) deviations from the expert trajectory.
- Average Deviation Distance (ADD): Mean closest distance to the expert trajectory in safe segments.
- Smoothness Metrics: Longitudinal and lateral jerks.
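For reference, here is a small sketch of how two of these metrics, ADD and jerk, could be computed from logged trajectories; the array shapes and the 0.1 s sampling period are assumptions.

```python
# Sketch of ADD (mean closest distance to the expert trajectory over
# collision-free segments) and jerk (third derivative of position).
import numpy as np

def average_deviation_distance(ego_xy, expert_xy):
    """ego_xy: (T, 2) rollout positions; expert_xy: (M, 2) expert positions."""
    d = np.linalg.norm(ego_xy[:, None, :] - expert_xy[None, :, :], axis=-1)
    return d.min(axis=1).mean()     # closest expert point per rollout step

def mean_abs_jerk(positions, dt=0.1):
    """Mean absolute jerk along one axis (positions: (T,) samples, dt in s)."""
    velocity = np.gradient(positions, dt)
    acceleration = np.gradient(velocity, dt)
    return np.abs(np.gradient(acceleration, dt)).mean()
```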
- Quantitatively, the hybrid RAD approach achieves a collision rate 3× lower than that of IL-only training, with competitive ADD values and smoother trajectories (lower jerk). Ablation studies demonstrate:
- The optimal RL:IL mixing ratio (4:1) for balancing safety and trajectory fidelity.
- The necessity of including all reward components, particularly the dynamic collision reward, to significantly reduce collision incidents.
- The efficacy of combining auxiliary objectives with PPO-based updates.
Technical Contributions and Insights
- The use of 3DGS allows the construction of photorealistic digital replicas, overcoming limitations of game-engine-based simulations by providing high-fidelity sensor simulation and enabling safe, large-scale closed-loop training.
- The decoupled action space—discretizing lateral and longitudinal controls over short horizons—reduces exploration complexity and yields more efficient policy learning.
- By integrating IL as a regularization term within RL optimization, the method effectively bridges the gap between expert behavior and robust, safety-oriented decision-making in out-of-distribution scenarios.
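One hedged way to realize this coupling in code is to add a behavior-cloning term to the PPO surrogate, as sketched below. Since the paper mixes dedicated RL and IL workers (the 4:1 ratio in the ablation) rather than necessarily summing a single loss, this should be read as an illustrative simplification.

```python
# Illustrative combination of the RL surrogate with an IL (behavior-cloning)
# regularizer on expert data; lam_il is an assumed weighting coefficient.
import torch
import torch.nn.functional as F

def hybrid_loss(ppo_loss, pi_lat, pi_lon, expert_lat_idx, expert_lon_idx, lam_il=1.0):
    """ppo_loss: scalar RL surrogate; pi_lat/pi_lon: (B, n_bins) policy
    distributions on expert states; expert_*_idx: (B,) expert action bins."""
    il_loss = F.nll_loss(torch.log(pi_lat.clamp_min(1e-8)), expert_lat_idx) \
            + F.nll_loss(torch.log(pi_lon.clamp_min(1e-8)), expert_lon_idx)
    return ppo_loss + lam_il * il_loss
```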
Overall, the work provides a comprehensive system that effectively synergizes RL and IL for end-to-end autonomous driving using a novel 3DGS-based environment, achieving strong performance and notable improvements in safety-critical metrics without compromising on trajectory alignment with human driving behavior.