Drive-JEPA Architecture for Autonomous Driving
- Drive-JEPA is a self-supervised world model that predicts masked latent embeddings, enabling efficient perception and planning in autonomous driving.
- It leverages specialized encoders and predictors, such as sparse 3D ConvNets and causal transformers, to process spatiotemporal sensor data.
- Empirical studies show Drive-JEPA outperforms traditional methods in LiDAR occupancy forecasting, trajectory planning, and emission prediction.
Drive-JEPA is a family of self-supervised world model architectures based on the joint-embedding predictive architecture (JEPA) paradigm, specifically adapted for spatiotemporal planning and perception tasks in autonomous driving scenarios. Across its instantiations, Drive-JEPA employs latent-space prediction by learning to forecast future scene embeddings or outcomes conditioned on a context window of sensor inputs, eschewing explicit pixel or point-wise generation. This section surveys the defining characteristics, mathematical principles, major components, regularization techniques, applications, and reported performance of Drive-JEPA as presented in recent literature focused on LiDAR occupancy forecasting, trajectory-centric planning, and latent world model learning (Zhu et al., 13 Feb 2026, Terver et al., 30 Dec 2025, Sundaram et al., 27 Jan 2026, Wang et al., 29 Jan 2026, Dat et al., 4 Jan 2026, Zhu et al., 9 Jan 2025).
1. Theoretical Foundations and Motivation
Drive-JEPA is motivated by the need for data-efficient, scalable, and semantically predictive world models in autonomous driving, where annotation cost and safety-critical planning drive architectural design. JEPA world models, unlike generative or contrastive approaches, operate by predicting masked embeddings in a learned latent space, sidestepping expensive pixel-level reconstruction and offering significant advantages in stability and sample efficiency. Decoupling the learning dynamics from explicit reconstruction avoids known pitfalls such as mode collapse and wasted capacity on noise or irrelevant detail, promoting abstraction and robust planning under partial observability (Zhu et al., 9 Jan 2025, Zhu et al., 13 Feb 2026, Terver et al., 30 Dec 2025).
Predictive Learning Paradigm
The core training task is to forecast, given a context (past sensor data or states), the latent encoding at future time steps or in masked regions. The objective is to align the predicted JEPA embedding with the encoder's output on the masked/future data:

$$\mathcal{L}_{\text{pred}} = \sum_{(t,c)\,:\,M_{t,c}=1} d\big(\hat{z}_{t,c},\, z_{t,c}\big),$$

where $M_{t,c} \in \{0,1\}$ indicates a mask, $z_{t,c}$ is the ground-truth (target) latent, and $\hat{z}_{t,c}$ is the predicted latent at timestep $t$ and grid cell $c$. The similarity metric $d(\cdot,\cdot)$ is typically cosine or $\ell_2$ distance (Zhu et al., 13 Feb 2026).
The predictor operates either as a lightweight convolutional net (for LiDAR/BEV) or a causal transformer (for video and state-action sequences).
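The masked latent alignment objective above can be sketched in a few lines; this is a minimal numpy illustration (assuming cosine distance and a dense, rather than sparse, latent grid), not the papers' implementation:

```python
import numpy as np

def masked_prediction_loss(pred, target, mask):
    """Mean cosine distance between predicted and target latents,
    computed only over masked cells (mask == 1).

    pred, target: (T, C, D) latent arrays; mask: (T, C) binary array."""
    idx = mask.astype(bool)
    p, t = pred[idx], target[idx]                      # (N, D) masked latents
    p = p / np.linalg.norm(p, axis=-1, keepdims=True)  # unit-normalize
    t = t / np.linalg.norm(t, axis=-1, keepdims=True)
    return float(np.mean(1.0 - np.sum(p * t, axis=-1)))
```

Restricting the sum to masked entries is what makes the task predictive rather than reconstructive: context cells never contribute to the loss.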
2. Architectural Components and Data Flow
Input Processing and Masking
Drive-JEPA architectures process spatiotemporal sequences of observations (LiDAR sweeps, camera frames, or low-level engine signals), typically after geometric unification (applying known ego-motion transforms) and spatial voxelization or patching. Group BEV-guided masking is employed for LiDAR: a binary mask is computed over BEV grid cells across all frames, and any cell containing an object in any frame is masked consistently in all frames, strictly preventing ego-motion leakage (Zhu et al., 13 Feb 2026). For video, masking consists of randomly dropping 30–50% of spatiotemporal tokens (Wang et al., 29 Jan 2026).
Masked regions are replaced by learned mask tokens; empty regions use a distinct empty token. This protocol prevents information leakage from context regions and enforces the prediction of genuinely unobserved content.
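The group BEV-guided masking protocol can be sketched as follows; a minimal numpy version assuming boolean per-frame object occupancy on the BEV grid (function names are illustrative):

```python
import numpy as np

def group_bev_mask(object_cells):
    """Group BEV-guided masking: a grid cell is masked in EVERY frame
    if it contains an object in ANY frame, so the predictor cannot
    recover masked content from ego-motion-aligned neighboring frames.

    object_cells: (T, H, W) boolean object occupancy per frame.
    Returns an (H, W) boolean mask shared across all T frames."""
    return object_cells.any(axis=0)

def apply_mask_tokens(feats, mask, mask_token):
    """Replace the features of masked cells with a (learned) mask token.

    feats: (T, H, W, D); mask: (H, W) boolean; mask_token: (D,)."""
    out = feats.copy()
    out[:, mask] = mask_token
    return out
```

Because the mask is computed once over all frames and broadcast across time, a cell occupied in frame $t$ stays hidden in every frame, which is exactly what blocks ego-motion leakage.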
Encoders
- Sparse 3D ConvNet: For LiDAR streams, Drive-JEPA uses a sparse 3D convolutional encoder, followed by a projection along the height axis to produce per-cell BEV embeddings of dimension $D$ (Zhu et al., 13 Feb 2026, Zhu et al., 9 Jan 2025).
- Frozen Transformer Backbones: For image/video data, encoders are typically frozen Vision Transformer (ViT) backbones pretrained on large video datasets with DINOv2, DINOv3, or V-JEPA objectives; proprioceptive and action embeddings are incorporated via shallow MLPs (Terver et al., 30 Dec 2025, Wang et al., 29 Jan 2026).
Predictors
- Convolutional Predictor: Lightweight 3D (or 2D) convolutional nets operate on concatenated context embeddings to predict masked/future cell embeddings in spatial tasks (Zhu et al., 13 Feb 2026, Zhu et al., 9 Jan 2025).
- Transformer Module: For sequence and action-conditioned tasks, a causal transformer processes a sliding window of latent encodings and action embeddings to predict the next state embedding (Terver et al., 30 Dec 2025).
- RSSM (Recurrent State-Space Model): In long-horizon planning (HanoiWorld), the latent state is updated via RNN-based transition, providing memory and stochasticity for partially observed environments (Dat et al., 4 Jan 2026).
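The defining property of the transformer-style predictor above is its causal attention pattern over the sliding window of latent/action tokens. A minimal single-head numpy sketch (projection weights omitted for brevity; not the papers' architecture):

```python
import numpy as np

def causal_attention(x):
    """Single-head causal self-attention over a token sequence x: (T, D).
    Position t attends only to positions <= t, so prediction at step t
    cannot peek at future latents or actions."""
    T, D = x.shape
    scores = x @ x.T / np.sqrt(D)                       # pairwise similarities
    causal = np.tril(np.ones((T, T), dtype=bool))       # lower-triangular mask
    scores = np.where(causal, scores, -np.inf)
    w = np.exp(scores - scores.max(axis=-1, keepdims=True))
    w = w / w.sum(axis=-1, keepdims=True)               # row-wise softmax
    return w @ x
```

The lower-triangular mask is what makes autoregressive rollout valid: outputs for the first $t$ positions are unchanged by any later token.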
3. Training Objectives and Regularization
Masked Embedding Prediction Loss
Masked cell loss functions typically use cosine or $\ell_2$ distance between predicted and target latents, restricted to masked cells or tokens:

$$\mathcal{L}_{\text{mask}} = \frac{1}{|M|} \sum_{(t,c) \in M} \big\| \hat{z}_{t,c} - z_{t,c} \big\|_2^2.$$
Regularization Against Collapse
To prevent trivial or collapsed solutions, Drive-JEPA incorporates strong representation regularizers:
- Variance Regularization: Penalizes low variance across latent channels on non-empty cells:

  $$\mathcal{L}_{\text{var}} = \frac{1}{D} \sum_{d=1}^{D} \max\big(0,\, \gamma - \sqrt{\operatorname{Var}(z_{\cdot,d}) + \epsilon}\big),$$

  where $\gamma$ is a target standard deviation and $\epsilon$ a small constant.
- SIGReg: Spectral-invariant Gram-penalty, which enforces feature diversity without EMA targets (Zhu et al., 13 Feb 2026).
- VICReg-Inspired: In emission modeling, invariance, variance, covariance, and cross-covariance penalties are combined, with hyperparameters controlling their balance (Sundaram et al., 27 Jan 2026):

  $$\mathcal{L} = \lambda\,\mathcal{L}_{\text{inv}} + \mu\,\mathcal{L}_{\text{var}} + \nu\,\mathcal{L}_{\text{cov}} + \rho\,\mathcal{L}_{\text{xcov}}.$$
The moving-average target encoder is used in variance-regularized settings; with SIGReg or spectral methods, encoders may share weights.
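The variance and covariance penalties above follow the standard VICReg form; a minimal numpy sketch of both (illustrative, assuming a flat batch of latents rather than a BEV grid):

```python
import numpy as np

def variance_penalty(z, gamma=1.0, eps=1e-4):
    """Hinge penalty on the per-channel standard deviation of latents
    z: (N, D). Pushes every channel's std above gamma, preventing the
    collapsed solution where all embeddings are identical."""
    std = np.sqrt(z.var(axis=0) + eps)
    return float(np.mean(np.maximum(0.0, gamma - std)))

def covariance_penalty(z):
    """Penalizes off-diagonal entries of the latent covariance matrix
    so channels carry non-redundant information. z: (N, D)."""
    zc = z - z.mean(axis=0)
    n, d = z.shape
    cov = zc.T @ zc / (n - 1)
    off = cov - np.diag(np.diag(cov))
    return float(np.sum(off ** 2) / d)
```

A fully collapsed batch (all latents equal) incurs nearly the maximal variance penalty $\gamma$, while well-spread, decorrelated latents incur none, which is exactly the gradient signal that keeps the representation from degenerating.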
4. Planning and Downstream Decoders
For planning and forecasting tasks, Drive-JEPA’s encoder outputs are repurposed for action-conditioned rollout, trajectory prediction, and occupancy inference.
Occupancy Completion and Forecasting
After representation pretraining, a small stack of 2D convolutional layers decodes concatenated BEV embeddings for the past frames to predict binary future occupancy over an $H$-step horizon, using a per-cell binary cross-entropy loss (Zhu et al., 13 Feb 2026):

$$\mathcal{L}_{\text{occ}} = -\frac{1}{H\,|C|} \sum_{h=1}^{H} \sum_{c \in C} \big[ y_{h,c} \log \hat{y}_{h,c} + (1 - y_{h,c}) \log(1 - \hat{y}_{h,c}) \big].$$
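The decoding head's loss is a plain per-cell binary cross-entropy; a minimal numpy sketch (the convolutional decoder itself is omitted, and shapes are illustrative):

```python
import numpy as np

def occupancy_bce(logits, target, eps=1e-7):
    """Per-cell binary cross-entropy between predicted occupancy
    probabilities (from logits) and binary occupancy targets, averaged
    over all H future frames and BEV grid cells.

    logits, target: (H, Hgrid, Wgrid) arrays; target entries in {0, 1}."""
    p = 1.0 / (1.0 + np.exp(-logits))       # sigmoid to probabilities
    p = np.clip(p, eps, 1.0 - eps)          # numerical stability
    return float(-np.mean(target * np.log(p) + (1.0 - target) * np.log(1.0 - p)))
```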
Latent-Space Planning
In action-conditioned agents, Drive-JEPA supports planning in the latent space by optimizing a candidate action sequence: the predictor is unrolled over the candidate actions and the distance of the final predicted latent to a goal embedding is minimized:

$$a_{1:T}^{*} = \arg\min_{a_{1:T}} \big\| \hat{z}_{T}(a_{1:T}) - z_{\text{goal}} \big\|_2.$$

Optimizers include CEM (Cross-Entropy Method), NeverGrad (NG/CMA-ES), and gradient descent with Adam (Terver et al., 30 Dec 2025).
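Of the optimizers named above, CEM is the simplest to sketch; this is a generic numpy implementation over a black-box `rollout` function (a stand-in for the frozen predictor, which is an assumption here), not the papers' planner:

```python
import numpy as np

def cem_plan(rollout, z0, z_goal, horizon, act_dim,
             iters=10, pop=64, n_elite=8, seed=0):
    """Cross-Entropy Method over action sequences: sample Gaussian
    candidates, unroll the predictor, and refit the sampling
    distribution to the elites whose final latent lies closest to
    the goal embedding.

    rollout(z0, actions) -> final predicted latent, actions: (horizon, act_dim)."""
    rng = np.random.default_rng(seed)
    mu = np.zeros((horizon, act_dim))
    sigma = np.ones((horizon, act_dim))
    for _ in range(iters):
        acts = mu + sigma * rng.standard_normal((pop, horizon, act_dim))
        costs = np.array([np.linalg.norm(rollout(z0, a) - z_goal) for a in acts])
        elites = acts[np.argsort(costs)[:n_elite]]        # lowest-cost candidates
        mu = elites.mean(axis=0)                          # refit Gaussian
        sigma = elites.std(axis=0) + 1e-6
    return mu
```

Because only forward rollouts are needed, CEM works with non-differentiable predictors, whereas the GD/Adam variant requires gradients through the unrolled latent dynamics.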
Multimodal Trajectory Distillation
For end-to-end driving, Drive-JEPA incorporates a proposal-centric planner that generates diverse trajectory candidates, guided by both human and simulator-derived pseudo-teacher trajectories. A momentum-aware selection mechanism integrates comfort-based criteria to enforce smooth, plausible behavior. Distillation loss aggregates both human and high-quality simulated outputs (Wang et al., 29 Jan 2026).
5. Empirical Performance and Ablations
Rigorous ablation studies demonstrate that Drive-JEPA delivers superior or state-of-the-art performance in several domains:
- LiDAR Occupancy Completion/Forecasting: Drive-JEPA pretrained with SIGReg achieves higher IoU_full and IoU_close than training from scratch or variance-regularized pretraining (Zhu et al., 13 Feb 2026).
| Method | IoU_full (%) | IoU_close (%) |
|---|---|---|
| From scratch | 38.56 ±0.19 | 42.87 ±0.17 |
| AD-LiST-JEPA (small, var) | 39.09 ±0.36 | 43.43 ±0.39 |
| AD-LiST-JEPA (small, SIGReg) | 39.35 ±0.24 | 43.70 ±0.24 |
| AD-LiST-JEPA (full, SIGReg) | 39.41 ±0.31 | 43.86 ±0.30 |
- Physical Planning and Manipulation: A two-step latent rollout and a contextual window of 3–5 frames yield the highest planning success rates. Proprioceptive/action embeddings and DINOv3 pretraining are critical in photo-realistic tasks; AdaLN conditioning outperforms feature or sequence concatenation. Model scaling helps on realistic data but not on simple simulation (Terver et al., 30 Dec 2025).
- End-to-End Driving: Pretrained ViT-based V-JEPA encoders with multimodal trajectory distillation and momentum-aware selection outperform prior methods by significant margins (e.g., +3 EPDMS) and set new state-of-the-art on closed-loop evaluation (Wang et al., 29 Jan 2026).
- Emission Prediction: Drive-JEPA with VICReg losses achieves lower RMSE and MAE than LSTM and MLP baselines in transient emission events, while structured pruning and quantization reduce model size and latency with minimal accuracy degradation (Sundaram et al., 27 Jan 2026).
6. Practical Considerations and Implementation
Training Protocols
- Datasets: Waymo Open for LiDAR, synthetic and real-world robotics for video/state, large-scale curated driving video for V-JEPA, and real PEMS/bench emission data.
- GPU/Compute: Typical setups use 8×A100 or H800 GPUs for pretraining; batch sizes and window lengths are selected based on hardware constraints and sample uniqueness.
Regularization and Stability
- SIGReg: Found more robust than variance regularization; eliminates the need for moving-average targets (Zhu et al., 13 Feb 2026).
- Variance/Covariance: Essential to avoid representational collapse and maintain latent diversity, especially over long planning rollouts (Terver et al., 30 Dec 2025, Dat et al., 4 Jan 2026).
- EMA Targets: Used in classic variance-regularized JEPA; SIGReg and spectral penalties can operate with shared-weights online/target encoders.
Efficiency
- All Drive-JEPA modules are designed for computational efficiency, facilitating deployment on-vehicle (for LiDAR) or on embedded targets (for emission control). Pruning and bfloat16 quantization yield up to 50% reductions in inference time and model size (Sundaram et al., 27 Jan 2026).
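The pruning and low-precision steps can be illustrated generically; this numpy sketch shows magnitude-based structured pruning and a simulated bfloat16 cast (numpy has no native bfloat16, so the cast truncates float32 mantissa bits) and is not the paper's deployment pipeline:

```python
import numpy as np

def prune_channels(weight, keep_frac=0.5):
    """Structured pruning: keep the output channels (rows) of a weight
    matrix with the largest L2 norm. Returns the pruned weight and the
    kept row indices (needed to slice the next layer's inputs)."""
    norms = np.linalg.norm(weight, axis=1)
    k = max(1, int(round(keep_frac * weight.shape[0])))
    keep = np.sort(np.argsort(norms)[-k:])
    return weight[keep], keep

def to_bfloat16(x):
    """Simulate bfloat16 by zeroing the low 16 bits of each float32
    value: same 8-bit exponent, mantissa truncated to 7 bits."""
    bits = x.astype(np.float32).view(np.uint32)
    return (bits & np.uint32(0xFFFF0000)).view(np.float32)
```

Keeping the indices of surviving channels is the key bookkeeping step in structured pruning: unlike unstructured sparsity, whole rows disappear, so downstream layers genuinely shrink.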
7. Variants, Limitations, and Extensions
Variants of Drive-JEPA have been proposed for spatial-only prediction (Zhu et al., 9 Jan 2025), spatiotemporal forecasting (Zhu et al., 13 Feb 2026), trajectory-centric planning (Wang et al., 29 Jan 2026), and planning in state-action domains (Terver et al., 30 Dec 2025, Dat et al., 4 Jan 2026). Limitations include:
- Safety Constraints: Current implementations rely on reward shaping; explicit risk-averse value functions or constraint-based objectives remain future work (Dat et al., 4 Jan 2026).
- Domain Generalization: Extending to multi-sensor fusion and robustness to out-of-distribution events are open challenges.
- Explicit Multimodality: Trajectory proposals are enhanced with simulator-generated diversity, but further investigation into uncertainty modeling is needed (Wang et al., 29 Jan 2026).
Potential extensions include integration with language-conditioned affordance models, cooperative V2X world modeling, and embedded real-time deployment in hybrid powertrains (Sundaram et al., 27 Jan 2026, Wang et al., 29 Jan 2026, Dat et al., 4 Jan 2026).
Drive-JEPA architectures formalize a high-performance, latent-predictive, and self-supervised world modeling framework for autonomous systems, supporting improved occupancy completion, trajectory prediction, resource-efficient inference, and robust latent-space planning across driving and control domains (Zhu et al., 13 Feb 2026, Terver et al., 30 Dec 2025, Sundaram et al., 27 Jan 2026, Wang et al., 29 Jan 2026, Dat et al., 4 Jan 2026, Zhu et al., 9 Jan 2025).