- The paper presents RAD, a method for training end-to-end autonomous driving policies by combining reinforcement learning and imitation learning in a photorealistic 3DGS-based environment.
- The training employs a three-stage paradigm encompassing perception pre-training, imitation learning for planning, and reinforced post-training within a closed-loop 3DGS simulation.
- Evaluation demonstrates that the RAD approach significantly reduces collisions, achieving a 3x lower collision rate compared to imitation learning-only methods while maintaining competitive trajectory fidelity.
The paper presents a highly technical framework for end-to-end autonomous driving by integrating reinforcement learning (RL) with imitation learning (IL) in a photorealistic, closed-loop environment generated via 3D Gaussian Splatting (3DGS). The proposed method addresses key deficiencies of IL-only solutions, including causal confusion and the open-loop gap, by enabling the policy to explore a broader state space and explicitly learn from safety-critical events. The approach is structured around a three-stage training paradigm:
Training Paradigm
- Perception Pre-Training:
A BEV encoder transforms multi-view images into a bird's-eye-view feature map, from which dedicated map and agent heads extract instance-level scene tokens. Ground-truth supervision from annotated map elements and agent dynamics ensures that the learned tokens capture high-level semantic and geometric information.
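To make the perception stage concrete, here is a minimal PyTorch-style sketch of how a BEV feature map could be decoded into instance-level map and agent tokens. The module names, query counts, and the use of cross-attention heads are illustrative assumptions rather than the paper's exact architecture.

```python
# Minimal sketch of the perception stage: a BEV feature map is decoded by a
# map head and an agent head into instance-level "scene tokens" that the
# later planning stages consume. All names and sizes are assumptions.
import torch
import torch.nn as nn

class PerceptionSketch(nn.Module):
    def __init__(self, dim=256, num_map_queries=100, num_agent_queries=50):
        super().__init__()
        # Stand-in for the multi-view image -> BEV encoder; a 1x1 conv keeps
        # the sketch runnable end to end.
        self.bev_encoder = nn.Conv2d(3, dim, kernel_size=1)
        # Learnable queries pull map / agent instances out of the BEV map.
        self.map_queries = nn.Parameter(torch.randn(num_map_queries, dim))
        self.agent_queries = nn.Parameter(torch.randn(num_agent_queries, dim))
        self.map_head = nn.MultiheadAttention(dim, 8, batch_first=True)
        self.agent_head = nn.MultiheadAttention(dim, 8, batch_first=True)

    def forward(self, bev_input):
        # bev_input: (B, 3, H, W) placeholder for rasterized multi-view features.
        bev = self.bev_encoder(bev_input)                 # (B, C, H, W)
        tokens = bev.flatten(2).transpose(1, 2)           # (B, H*W, C)
        b = tokens.size(0)
        map_q = self.map_queries.unsqueeze(0).expand(b, -1, -1)
        agent_q = self.agent_queries.unsqueeze(0).expand(b, -1, -1)
        map_tokens, _ = self.map_head(map_q, tokens, tokens)       # map instances
        agent_tokens, _ = self.agent_head(agent_q, tokens, tokens) # agent instances
        return map_tokens, agent_tokens  # supervised with GT map / agent labels
```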
- Planning Pre-Training via Imitation Learning:
The planning head, built upon a cascaded Transformer decoder, takes scene tokens and outputs a probabilistic action distribution. By leveraging large-scale expert driving demonstrations, the model is initialized to mimic human driving behavior. The action space is discretized into decoupled lateral and longitudinal components, each defined over a short 0.5-second horizon, to reduce dimensionality and improve convergence. For example, the lateral displacement $a_x$ is discretized into 61 uniformly spaced choices between a minimum deviation of −0.75 m and a maximum of 0.75 m, while the longitudinal displacement $a_y$ spans 0 to 15 m.
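A small sketch of this decoupled, discretized action space is shown below. The lateral range and bin count follow the text; reusing 61 bins for the longitudinal axis is an assumption made only for illustration.

```python
# Sketch of the decoupled, discretized action space over a 0.5 s horizon.
import numpy as np

N_LAT, LAT_MIN, LAT_MAX = 61, -0.75, 0.75   # lateral displacement a_x [m]
N_LON, LON_MIN, LON_MAX = 61, 0.0, 15.0     # longitudinal displacement a_y [m]
                                            # (longitudinal bin count assumed)
lat_bins = np.linspace(LAT_MIN, LAT_MAX, N_LAT)
lon_bins = np.linspace(LON_MIN, LON_MAX, N_LON)

def discretize(displacement, bins):
    """Map a continuous displacement to the index of the nearest bin."""
    return int(np.argmin(np.abs(bins - displacement)))

# Example: an expert step of +0.10 m lateral and 4.2 m longitudinal over the
# 0.5 s horizon becomes a pair of class labels for the decoupled policy heads.
ax_idx = discretize(0.10, lat_bins)
ay_idx = discretize(4.2, lon_bins)
print(ax_idx, ay_idx, lat_bins[ax_idx], lon_bins[ay_idx])
```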
- Reinforced Post-Training with Hybrid RL-IL:
The IL-initialized policy is fine-tuned in closed-loop 3DGS environments, where RL rollouts expose it to safety-critical states while IL continues to act as a regularizer. The RL signal is driven by four reward sources:
- Dynamic Collision Reward: Penalizes collisions with moving obstacles.
- Static Collision Reward: Penalizes intersections with static obstacles.
- Positional Deviation Reward: Penalizes Euclidean deviation from the expert trajectory beyond a threshold $d_{\max}$.
- Heading Deviation Reward: Penalizes angular misalignment beyond a threshold $\psi_{\max}$.
The reward function is formulated as

$$R = \{\, r_{\mathrm{dc}},\ r_{\mathrm{sc}},\ r_{\mathrm{pd}},\ r_{\mathrm{hd}} \,\},$$

where each reward is triggered under its corresponding safety or deviation condition, and any such event leads to immediate episode termination to prevent further exposure to noisy sensory data.
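The following sketch illustrates how the four reward triggers and the early-termination rule could be wired together. The threshold values and the unit penalty magnitudes are illustrative assumptions, not the paper's numbers.

```python
# Hedged sketch of the four reward triggers and the early-termination rule.
D_MAX = 2.0      # positional deviation threshold d_max [m] (assumed value)
PSI_MAX = 0.5    # heading deviation threshold psi_max [rad] (assumed value)

def step_reward(dynamic_collision, static_collision,
                pos_deviation, heading_deviation):
    """Return (r_dc, r_sc, r_pd, r_hd, done) for one closed-loop step."""
    r_dc = -1.0 if dynamic_collision else 0.0
    r_sc = -1.0 if static_collision else 0.0
    r_pd = -1.0 if pos_deviation > D_MAX else 0.0
    r_hd = -1.0 if heading_deviation > PSI_MAX else 0.0
    # Any triggered event ends the episode so the policy is not trained on
    # the degraded renderings produced after leaving the reconstructed corridor.
    done = any(r < 0.0 for r in (r_dc, r_sc, r_pd, r_hd))
    return r_dc, r_sc, r_pd, r_hd, done
```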
Action Representation and Policy Optimization
- The policy outputs separate probability distributions for the lateral and longitudinal actions, each obtained as a softmax over an MLP applied to the Transformer decoder's features combined with navigation and ego-state embeddings (sketched below). Specifically, the lateral distribution is given by

$$\pi(a_x \mid s) = \mathrm{softmax}\!\big(\mathrm{MLP}\big(\phi(E_{\mathrm{plan}}, E_{\mathrm{scene}}) + E_{\mathrm{navi}} + E_{\mathrm{state}}\big)\big),$$

with an analogous formulation for $\pi(a_y \mid s)$.
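A minimal sketch of such decoupled policy heads is given below, assuming a single planning query that attends to the scene tokens (playing the role of $\phi$) before being fused with navigation and ego-state embeddings; the dimensions and fusion details are assumptions.

```python
# Minimal sketch of the decoupled policy heads in the formula above.
import torch
import torch.nn as nn

class DecoupledPolicyHead(nn.Module):
    def __init__(self, dim=256, n_lat=61, n_lon=61):
        super().__init__()
        # phi: planning query attends to scene tokens (assumed fusion form).
        self.phi = nn.MultiheadAttention(dim, 8, batch_first=True)
        self.lat_mlp = nn.Sequential(nn.Linear(dim, dim), nn.ReLU(), nn.Linear(dim, n_lat))
        self.lon_mlp = nn.Sequential(nn.Linear(dim, dim), nn.ReLU(), nn.Linear(dim, n_lon))

    def forward(self, e_plan, e_scene, e_navi, e_state):
        # e_plan: (B, 1, D) planning query; e_scene: (B, N, D) scene tokens;
        # e_navi, e_state: (B, D) navigation and ego-state embeddings.
        fused, _ = self.phi(e_plan, e_scene, e_scene)        # (B, 1, D)
        feat = fused.squeeze(1) + e_navi + e_state           # (B, D)
        pi_lat = torch.softmax(self.lat_mlp(feat), dim=-1)   # pi(a_x | s)
        pi_lon = torch.softmax(self.lon_mlp(feat), dim=-1)   # pi(a_y | s)
        return pi_lat, pi_lon
```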
- Values are estimated separately as $V_x(s)$ and $V_y(s)$ to match the decoupled reward structure. Generalized Advantage Estimation (GAE) is employed separately for the lateral and longitudinal advantages:

$$\delta_t^{x} = r_t^{x} + \gamma V_x(s_{t+1}) - V_x(s_t), \qquad \hat{A}_t^{x} = \sum_{l \ge 0} (\gamma \lambda)^{l}\, \delta_{t+l}^{x},$$

with a similar formulation for the longitudinal component. These estimates are used in a clipped objective within the Proximal Policy Optimization (PPO) framework, where separate clipping thresholds $\epsilon_x$ and $\epsilon_y$ maintain update stability (see the sketch below).
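Below is a compact sketch of the decoupled advantage estimation and the clipped update for one action dimension; the longitudinal branch would repeat it with its own rewards, value head, and threshold $\epsilon_y$. Hyperparameter values are illustrative.

```python
# Sketch of decoupled GAE and the PPO clipped surrogate (lateral branch).
import torch

def gae(rewards, values, gamma=0.99, lam=0.95):
    """GAE for one trajectory. rewards: (T,), values: (T+1,) with a
    bootstrap value appended for the final state."""
    T = rewards.shape[0]
    adv = torch.zeros(T)
    running = 0.0
    for t in reversed(range(T)):
        delta = rewards[t] + gamma * values[t + 1] - values[t]
        running = delta + gamma * lam * running
        adv[t] = running
    return adv

def ppo_clip_loss(logp_new, logp_old, adv, eps=0.1):
    """Clipped PPO surrogate for one action dimension (eps_x or eps_y)."""
    ratio = torch.exp(logp_new - logp_old)
    unclipped = ratio * adv
    clipped = torch.clamp(ratio, 1.0 - eps, 1.0 + eps) * adv
    return -torch.minimum(unclipped, clipped).mean()
```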
- The training further benefits from auxiliary objectives that provide dense gradient signals. These objectives modulate the predicted action probabilities by comparing the current policy with the previous one (e.g., increasing the probability mass on deceleration actions when a dynamic collision is imminent), guiding the policy toward safe behavior.
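As one possible reading of this mechanism, the sketch below adds a dense loss that discourages the new policy from reducing the probability mass on deceleration bins relative to the rollout policy whenever the simulator flags an imminent dynamic collision. Both the gating signal and the loss form are assumptions, not the paper's published formulation.

```python
# Heavily hedged illustration of one possible auxiliary objective.
import torch

def decel_auxiliary_loss(pi_lon_new, pi_lon_old, decel_mask, collision_imminent):
    """pi_lon_*: (B, n_lon) longitudinal distributions; decel_mask: (n_lon,)
    boolean selecting bins that slow the vehicle; collision_imminent: (B,)
    boolean flag assumed to come from the simulator."""
    m = decel_mask.float()
    p_new = (pi_lon_new * m).sum(dim=-1)                 # mass on slowing down
    p_old = (pi_lon_old * m).sum(dim=-1).detach()        # previous policy's mass
    # Penalize shrinking that mass (ratio < 1), only on flagged samples.
    ratio = p_new / p_old.clamp_min(1e-6)
    loss = -torch.log(ratio.clamp_min(1e-6)) * collision_imminent.float()
    return loss.mean()
```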
Evaluation and Ablation Studies
- The policy is evaluated in a closed-loop benchmark composed of hundreds of previously unseen 3DGS environments. Metrics include the following (a brief computation sketch follows the list):
- Collision Ratio (CR): Sum of dynamic (DCR) and static collision ratios (SCR).
- Deviation Ratios: Positional (PDR) and heading (HDR) deviations from the expert trajectory.
- Average Deviation Distance (ADD): Mean closest distance to the expert trajectory in safe segments.
- Smoothness Metrics: Longitudinal and lateral jerks.
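For reference, here is a small sketch of how two of these metrics, ADD and jerk, could be computed from logged trajectories; the array shapes and the 0.1 s sampling period are assumptions.

```python
# Sketch of ADD (mean closest distance to the expert trajectory over
# collision-free segments) and jerk (third derivative of position).
import numpy as np

def average_deviation_distance(ego_xy, expert_xy):
    """ego_xy: (T, 2) rollout positions; expert_xy: (M, 2) expert positions."""
    d = np.linalg.norm(ego_xy[:, None, :] - expert_xy[None, :, :], axis=-1)
    return d.min(axis=1).mean()     # closest expert point per rollout step

def mean_abs_jerk(positions, dt=0.1):
    """Mean absolute jerk along one axis (positions: (T,) samples, dt in s)."""
    velocity = np.gradient(positions, dt)
    acceleration = np.gradient(velocity, dt)
    return np.abs(np.gradient(acceleration, dt)).mean()
```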
- Quantitatively, the hybrid RAD approach achieves a collision rate 3× lower than that of IL-only training, with competitive ADD values and smoother trajectories (lower jerk). Ablation studies demonstrate:
- The optimal RL:IL mixing ratio (4:1) for balancing safety and trajectory fidelity.
- The necessity of including all reward components, particularly the dynamic collision reward, to significantly reduce collision incidents.
- The efficacy of combining auxiliary objectives with PPO-based updates.
Technical Contributions and Insights
- The use of 3DGS allows the construction of photorealistic digital replicas, overcoming limitations of game-engine-based simulations by providing high-fidelity sensor simulation and enabling safe, large-scale closed-loop training.
- The decoupled action space—discretizing lateral and longitudinal controls over short horizons—reduces exploration complexity and yields more efficient policy learning.
- By integrating IL as a regularization term within RL optimization, the method effectively bridges the gap between expert behavior and robust, safety-oriented decision-making in out-of-distribution scenarios.
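One hedged way to realize this coupling in code is to add a behavior-cloning term to the PPO surrogate, as sketched below. Since the paper mixes dedicated RL and IL workers (the 4:1 ratio in the ablation) rather than necessarily summing a single loss, this should be read as an illustrative simplification.

```python
# Illustrative combination of the RL surrogate with an IL (behavior-cloning)
# regularizer on expert data; lam_il is an assumed weighting coefficient.
import torch
import torch.nn.functional as F

def hybrid_loss(ppo_loss, pi_lat, pi_lon, expert_lat_idx, expert_lon_idx, lam_il=1.0):
    """ppo_loss: scalar RL surrogate; pi_lat/pi_lon: (B, n_bins) policy
    distributions on expert states; expert_*_idx: (B,) expert action bins."""
    il_loss = F.nll_loss(torch.log(pi_lat.clamp_min(1e-8)), expert_lat_idx) \
            + F.nll_loss(torch.log(pi_lon.clamp_min(1e-8)), expert_lon_idx)
    return ppo_loss + lam_il * il_loss
```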
Overall, the work provides a comprehensive system that effectively synergizes RL and IL for end-to-end autonomous driving using a novel 3DGS-based environment, achieving strong performance and notable improvements in safety-critical metrics without compromising on trajectory alignment with human driving behavior.