End-to-End Autonomous Driving (E2E-AD)
- End-to-end autonomous driving (E2E-AD) is a data-driven approach that maps diverse, high-dimensional sensory inputs directly to vehicle control outputs, eliminating modular perception pipelines.
- It employs large driving models that integrate multi-modal sensor fusion, specialized expert routing, and transformer-based architectures to handle rare and complex driving scenarios.
- Innovations in training regimes, reinforcement learning, and continuous OTA updates enable scalable deployment and improved safety metrics in commercial autonomous driving systems.
End-to-End Autonomous Driving (E2E-AD) is defined as the learning of a single, fully differentiable function mapping raw, high-dimensional sensory inputs—including multi-view cameras, LiDAR, radar, audio, maps, and vehicle state—directly to low-dimensional driving actions or trajectories. This paradigm eliminates hand-engineered perception, prediction, and planning modules, replacing them with a data-driven monolithic architecture in which all features are jointly optimized for the target driving objective (e.g., safe, comfortable, and law-compliant vehicle control). Recent architectural and algorithmic advances have enabled E2E-AD systems to outperform traditional modular pipelines on the long-tail distribution of rare, complex driving scenarios, especially as large-scale fleet deployment, reinforcement fine-tuning, and scalable transformer-based architectures become commercially viable (Nebot et al., 17 Mar 2026).
1. System Architecture: Large Driving Models and Core Building Blocks
Modern E2E-AD relies on Large Driving Models (LDMs) that ingest heterogeneous sensor modalities and output either low-level control (steering, acceleration) or planned ego-vehicle trajectories. The canonical high-level architecture is:
- Sensor-fusion frontend: Projects raw sensor data—multi-view cameras, LiDAR, radar, audio, maps, and vehicle state—into a joint tensor representation.
- Backbone encoder: A combination of convolutional neural networks (CNNs) and transformer layers extracts spatio-temporal features from the fused tensor.
- Context router / Mixture-of-Models (MoM): A router selects specialized network experts (e.g., for intersection handling, merging) based on the current driving context. This specialization, as realized in DriveMoE and Tesla FSD V14, improves scalability and rare-event handling (Yang et al., 22 May 2025, Nebot et al., 17 Mar 2026).
- Control head: Maps expert features to vehicle actions or future trajectory waypoints.
- Safety and monitoring module: Employs deep ensembles or Monte-Carlo dropout to estimate predictive uncertainty, triggering a fallback maneuver or driver alert if confidence in the planned actions falls below a threshold.
Example implementations include Tesla FSD V12–V14 (vision/auditory, transformer backbones), Rivian’s Unified Intelligence (multi-modal, “data flywheel” learning), and NVIDIA Cosmos/Alpamayo (world-model–based, policy ensemble fallback) (Nebot et al., 17 Mar 2026).
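The building blocks above can be sketched end-to-end in a deliberately simplified form. Everything in this sketch (`ToyLDM`, scalar "experts", mean-pooled context features) is an illustrative assumption for exposition, not the architecture of Tesla FSD or any other cited system:

```python
import math
import random

def softmax(xs):
    m = max(xs)
    exps = [math.exp(x - m) for x in xs]
    s = sum(exps)
    return [e / s for e in exps]

class ToyLDM:
    """Toy stand-in for a Large Driving Model: fusion -> routing -> control head."""

    def __init__(self, n_experts=3, seed=0):
        rng = random.Random(seed)
        # Each "expert" is reduced to a scalar gain for illustration.
        self.experts = [rng.uniform(0.5, 1.5) for _ in range(n_experts)]

    def fuse(self, camera, lidar, state):
        # Sensor-fusion frontend: concatenate per-modality feature vectors.
        return camera + lidar + state

    def route(self, features):
        # Context router: soft weights over experts from a pooled context feature.
        context = sum(features) / len(features)
        return softmax([g * context for g in self.experts])

    def forward(self, camera, lidar, state):
        feats = self.fuse(camera, lidar, state)
        weights = self.route(feats)
        pooled = sum(feats) / len(feats)
        # Control head: mixture of expert outputs -> (steering, acceleration).
        steer = sum(w * g * pooled for w, g in zip(weights, self.experts))
        accel = 0.1 * pooled
        return steer, accel

model = ToyLDM()
steer, accel = model.forward(camera=[0.2, 0.4], lidar=[0.1], state=[0.3])
```

In a real system the scalar gains would be full expert subnetworks and the pooled context a learned embedding, but the data flow (fuse, route, decode actions) follows the same shape.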
2. Training Regimes and Deployment Strategies
E2E-AD training and deployment follow a phased, curriculum-guided approach.
- Phase 1—Imitation Learning (Behavior Cloning):
- Networks are pre-trained on millions of hours of real-world fleet data in “shadow mode” using human-labeled control data and curated interventions.
- Synthetic data augmentation simulates rare weather, lighting, or corner-cases (Kim et al., 28 Oct 2025).
- The loss is the standard ℓ2 distance between predicted and human control actions, L_BC = ‖â − a_human‖₂².
- Example: Tesla FSD V12 is trained directly via behavior cloning; similar strategies are used in PTA-RLHG with transformer architectures (Hu et al., 2024).
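A behavior-cloning loss of this kind is a mean squared error over the action vector; the minimal sketch below assumes a two-dimensional action (steering, acceleration), which is an illustrative choice rather than a detail from the cited systems:

```python
def bc_loss(predicted, human):
    """Mean squared (l2) behavior-cloning loss between action vectors,
    e.g. [steering, acceleration]."""
    assert len(predicted) == len(human)
    return sum((p - h) ** 2 for p, h in zip(predicted, human)) / len(predicted)

# Perfect imitation gives zero loss; deviations are penalized quadratically.
zero = bc_loss(predicted=[0.10, -0.30], human=[0.10, -0.30])
small = bc_loss(predicted=[0.12, -0.30], human=[0.10, -0.30])
```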
- Phase 2—Reinforcement and Edge-Case Learning:
- ADR-tagged real-world edge cases and simulator-generated rare scenarios are prioritized.
- Reinforcement learning based on safety, comfort, and collision-avoidance rewards, often with PPO or SAC, pushes the system beyond the limitations of imitation.
- On-policy and simulated “infinite driving” (as in NVIDIA Cosmos) cover long-tail exposures (Nebot et al., 17 Mar 2026).
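The safety/comfort/collision-avoidance reward mentioned above is typically a weighted sum of shaping terms. The following is a hypothetical reward function, with all weights (`w_safety`, `w_comfort`, `w_progress`) and the clearance threshold `d_min` chosen for illustration, not taken from any cited system:

```python
def driving_reward(collided, min_clearance, jerk, progress,
                   d_min=2.0, w_safety=10.0, w_comfort=0.1, w_progress=1.0):
    """Shaped RL reward: reward route progress, discourage jerky control,
    penalize clearance violations, and heavily penalize collisions."""
    r = w_progress * progress
    r -= w_comfort * abs(jerk)
    if min_clearance < d_min:
        r -= w_safety * (d_min - min_clearance)  # graded safety-margin penalty
    if collided:
        r -= 100.0                               # terminal collision penalty
    return r

safe = driving_reward(collided=False, min_clearance=5.0, jerk=0.0, progress=1.0)
crash = driving_reward(collided=True, min_clearance=5.0, jerk=0.0, progress=1.0)
```

A PPO or SAC agent would maximize the expected discounted sum of such rewards, which is what pushes the policy beyond the imitation-learning distribution.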
- Curriculum and Domain Adaptation:
- Training proceeds from spatially constrained, low-complexity environments to full operational domains. Domain randomization (texture, dynamics) and sim-to-real fine-tuning are standard.
- Multi-region training addresses global deployment (e.g., left-hand/right-hand drive, regulatory signage).
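Curriculum-staged domain randomization can be sketched as a sampler whose randomization ranges widen with the training stage. The parameters and ranges below (`texture_noise`, `road_friction`, `traffic_density`, the stage-gated drive-side flag) are hypothetical examples of the texture/dynamics randomization described above:

```python
import random

def sample_domain(rng, stage):
    """Curriculum-aware domain randomization: later stages widen the
    randomization ranges and unlock multi-region variation."""
    spread = 0.2 * stage  # stage 1 = narrow ranges, stage 3 = wide ranges
    return {
        "texture_noise": rng.uniform(0.0, spread),
        "road_friction": rng.uniform(1.0 - spread, 1.0),
        "traffic_density": rng.uniform(0.1, 0.1 + spread),
        # Left-hand-drive regions only appear in the final curriculum stage.
        "right_hand_drive": rng.random() < 0.5 if stage >= 3 else True,
    }

rng = random.Random(42)
easy = sample_domain(rng, stage=1)
hard = sample_domain(rng, stage=3)
```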
- Continuous Learning and OTA Deployment:
- Deployed LDMs are updated online from fleet-collected behavior and real-time interventions, enabling continuous improvement and fast adaptation (the “data flywheel” effect) (Nebot et al., 17 Mar 2026, Yang et al., 22 May 2025).
- “Supervised E2E” and FSD (L2++):
- These systems perform the entire Dynamic Driving Task under human supervision, offering commercial deployment with the human as an explicit fallback (e.g., Tesla FSD Supervised, Mercedes-NVIDIA, Rivian-VW JV in production by 2026) (Nebot et al., 17 Mar 2026).
3. Safety, Evaluation Metrics, and Operational Evidence
Robust safety architectures and comprehensive evaluation protocols are fundamental.
- Uncertainty and Fallback: Deep ensembles, MC-dropout, or auxiliary uncertainty heads provide real-time confidence estimates. If the estimated uncertainty exceeds a calibrated threshold, the system alerts the driver or switches to a fail-safe stack (as in NVIDIA Cosmos dual-stack architectures) (Nebot et al., 17 Mar 2026).
- Metrics:
- Disengagement rate (DR): Number of human interventions per 1,000 miles.
- Collision rates (major/minor): Miles per collision incident.
- Near-miss rate: Surrogate for tail-risk exposure (e.g., harsh braking events per mile).
- Safety margin violation: Fraction of operational time spent below a minimum safe clearance threshold.
- Reported performance:
- Tesla FSD Supervised: 1 major collision per 5.1 million miles (vs. 0.7M US average) and 1 minor collision per 1.5M miles (vs. 0.23M), roughly a sevenfold reduction in collisions vs. the human baseline (Nebot et al., 17 Mar 2026).
- Normalization: Results are stratified by severity, geography, road type, and user demographics for regulatory and insurance clarity.
Early operational evidence across scenarios shows that E2E learning architectures manage rare and complex distributions with lower collision rates and higher completion fractions than both hand-coded and hybrid modular stacks (Nebot et al., 17 Mar 2026, Jia et al., 2024, Jia et al., 7 Mar 2025).
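The metrics defined above reduce to simple normalized counts; the sketch below implements them directly, with the clearance threshold `d_min=2.0` an assumed illustrative value:

```python
def disengagement_rate(interventions, miles):
    """Human interventions per 1,000 miles driven."""
    return 1000.0 * interventions / miles

def miles_per_collision(miles, collisions):
    """Miles driven per collision incident (inf if collision-free)."""
    return miles / collisions if collisions else float("inf")

def margin_violation_fraction(clearances, d_min=2.0):
    """Fraction of sampled timesteps spent below the minimum safe clearance."""
    return sum(1 for c in clearances if c < d_min) / len(clearances)

dr = disengagement_rate(interventions=4, miles=20_000)
mpc = miles_per_collision(miles=5_100_000, collisions=1)
viol = margin_violation_fraction([3.0, 1.5, 2.5, 0.9])
```

Stratifying these counts by severity, geography, and road type (as described above) is then a matter of grouping the inputs before applying the same formulas.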
4. Innovations in Representation and Scalability
Recent advances target both architectural efficiency and real-world deployability.
- Parallel, Sparse, and Streaming Attention: DriveTransformer eliminates the sequential pipeline, replacing it with unified, layer-wise agent/map/ego parallel queries and sparse cross-attention to sensor tokens, with temporal cross-attention across historical tokens (Jia et al., 7 Mar 2025). This improves gradient flow, training stability, and system throughput.
- Mixture-of-Experts (MoE): Dynamic routing of visual and action experts (as in DriveMoE) enables per-scenario specialization, markedly improving handling of rare maneuvers (e.g., aggressive turns, emergency braking) and increasing closed-loop success rates (Yang et al., 22 May 2025).
- World-Model Integration: Several systems model future hypothetical scene dynamics (“what-if” simulations) for richer, foresight-informed trajectory generation (e.g., MindDrive, NVIDIA Cosmos) (Suna et al., 4 Dec 2025, Nebot et al., 17 Mar 2026).
- Viewpoint-Invariance and Robustness: VR-Drive introduces a feed-forward 3D Gaussian splatting module, with joint reconstruction and distillation training, to ensure planning-stage features are robust under camera viewpoint perturbations; this is critical for mass-manufacturing with diverse sensor layouts (Cho et al., 27 Oct 2025).
- Unsupervised/Self-Supervised Objectives: Eliminating expensive 3D annotations, pipelines like UAD and SSR directly encode angular objectness, temporal dynamics, and navigation-guided sparse representations, accelerating training and enabling scaling to massive unlabeled driving corpora (Guo et al., 2024, Li et al., 2024).
- Auxiliary Reasoning Distillation: Approaches like VLM-AD and E³AD distill human-level reasoning or cognitive priors (from vision-LLMs or EEG) into the E2E backbone, endowing networks with enhanced robustness to rare events (Xu et al., 2024, Niu et al., 3 Nov 2025).
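The sparse cross-attention idea behind architectures like DriveTransformer can be illustrated with a single-query toy: a planning query attends only to its top-k highest-scoring sensor tokens rather than all of them. This is a minimal sketch of the mechanism in general, not DriveTransformer's actual implementation:

```python
import math

def topk_cross_attention(query, keys, values, k=2):
    """Single-query scaled dot-product attention restricted to the
    top-k highest-scoring tokens (sparse cross-attention)."""
    d = len(query)
    scores = [sum(q * kk for q, kk in zip(query, key)) / math.sqrt(d)
              for key in keys]
    # Keep only the k best-matching sensor tokens.
    top = sorted(range(len(keys)), key=lambda i: scores[i], reverse=True)[:k]
    m = max(scores[i] for i in top)
    exps = {i: math.exp(scores[i] - m) for i in top}
    z = sum(exps.values())
    # Weighted sum of the selected values.
    out = [0.0] * len(values[0])
    for i, e in exps.items():
        for j, v in enumerate(values[i]):
            out[j] += (e / z) * v
    return out

query = [1.0, 0.0]
keys = [[1.0, 0.0], [0.0, 1.0], [0.9, 0.1]]
vals = [[1.0], [0.0], [0.5]]
out = topk_cross_attention(query, keys, vals, k=2)
```

With k much smaller than the number of sensor tokens, compute per query drops accordingly, which is the source of the throughput gains cited above.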
5. Role of Synthetic Data, Benchmarking, and Evaluation Protocols
Achieving reliable evaluation and generalization requires advances in synthetic data integration and benchmarking.
- Synthetic Data Generation: SynAD generates synthetic ego-centric scenarios using conditional diffusion models, designates the agent with maximal path displacement as the ego, and aligns synthetic data with real by learning a map-to-BEV transformation, thereby improving coverage of rare hazards (Kim et al., 28 Oct 2025).
- Closed-Loop, Multi-Ability Benchmarks: Bench2Drive offers a standardized, scenario-disentangled protocol evaluating E2E-AD across 44 scenarios, multiple weathers, and towns—an essential contrast to open-loop log-replay (e.g., nuScenes) which cannot evaluate causal interactions or rare-event robustness (Jia et al., 2024).
- Multi-Objective Metrics: State-of-the-art frameworks, especially MindDrive and SUPER-AD, optimize not only for low collision (safety) but also route compliance, comfort, and regulatory adherence (Suna et al., 4 Dec 2025, Ryu et al., 28 Nov 2025). Evaluations now stratify by safety-criticality (NAVSAFE), complexity (NAVHARD), and generalization on unseen towns/weather.
Benchmark results show closed-loop metrics can diverge from open-loop error, emphasizing the need for comprehensive scenario-based and multi-metric assessment (Jia et al., 2024, Jia et al., 7 Mar 2025).
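Scenario-disentangled scoring of the Bench2Drive kind averages success within each scenario type before averaging across types, so rare abilities are not swamped by common ones. The aggregator below is a hypothetical illustration of that principle, not Bench2Drive's official metric:

```python
def multi_ability_score(results):
    """Aggregate closed-loop success per scenario type, then average across
    types so rare scenario types carry equal weight."""
    by_type = {}
    for scenario_type, success in results:
        by_type.setdefault(scenario_type, []).append(success)
    per_type = {t: sum(v) / len(v) for t, v in by_type.items()}
    overall = sum(per_type.values()) / len(per_type)
    return per_type, overall

# Each entry: (scenario type, 1 = route completed without infraction).
runs = [("merge", 1), ("merge", 1), ("merge", 0),
        ("cut_in", 1), ("emergency_brake", 0)]
per_type, overall = multi_ability_score(runs)
```

A plain success average over `runs` would be 0.6, dominated by the common "merge" scenarios; the type-balanced score exposes the total failure on emergency braking.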
6. Commercialization, Societal Impact, and Future Directions
End-to-end autonomous driving is transitioning from research to commercial deployment.
- Deployment and Commercialization: 2025–2026 marks the rollout of supervised E2E-AD (FSD Supervised, L2++) from Tesla, Rivian-VW JV, and Mercedes-NVIDIA. OEMs are increasingly licensing LDMs, focusing on data-collection, fleet learning, and localized adaptation (Nebot et al., 17 Mar 2026).
- Cost and Scalability: Vision-only E2E deployments are estimated to have substantially lower hardware cost than heavy sensor-fusion stacks, owing to latent knowledge transfer and scalable OTA policy updates.
- Beyond Vehicles—Embodied AI: The same E2E-AD principles now underpin E2E learning for humanoid robots (Tesla Optimus, NVIDIA GR00T), unified by world model simulation, modular expert routing, and continuous online adaptation (Nebot et al., 17 Mar 2026).
- Challenges: Persistent open issues include sim-to-real transfer, rare event generalization, transparent and interpretable decision-making, verifiable safety, and efficient real-world on-policy RL.
The outlook suggests that unified E2E large driving models will become foundational, not only for consumer AD but for physically embodied intelligent agents at large, shaped by advances in data-driven representation, multi-objective optimization, and human-compliant reasoning (Nebot et al., 17 Mar 2026, Suna et al., 4 Dec 2025).
References:
- (Nebot et al., 17 Mar 2026) The Era of End-to-End Autonomy: Transitioning from Rule-Based Driving to Large Driving Models
- (Jia et al., 7 Mar 2025) DriveTransformer: Unified Transformer for Scalable End-to-End Autonomous Driving
- (Cho et al., 27 Oct 2025) VR-Drive: Viewpoint-Robust End-to-End Driving with Feed-Forward 3D Gaussian Splatting
- (Yang et al., 22 May 2025) DriveMoE: Mixture-of-Experts for Vision-Language-Action Model in End-to-End Autonomous Driving
- (Hu et al., 2024) Pre-trained Transformer-Enabled Strategies with Human-Guided Fine-Tuning for End-to-end Navigation of Autonomous Vehicles
- (Suna et al., 4 Dec 2025) MindDrive: An All-in-One Framework Bridging World Models and Vision-LLM for End-to-End Autonomous Driving
- (Ryu et al., 28 Nov 2025) SUPER-AD: Semantic Uncertainty-aware Planning for End-to-End Robust Autonomous Driving
- (Zheng et al., 26 Feb 2026) Unleashing the Potential of Diffusion Models for End-to-End Autonomous Driving
- (Niu et al., 3 Nov 2025) Embodied Cognition Augmented End2End Autonomous Driving
- (Guo et al., 2024) End-to-End Autonomous Driving without Costly Modularization and 3D Manual Annotation
- (Jia et al., 2024) Bench2Drive: Towards Multi-Ability Benchmarking of Closed-Loop End-To-End Autonomous Driving