Drive-JEPA Framework for Visuo-Motor Control
- The Drive-JEPA framework is a self-supervised, joint-embedding system that integrates visual and proprioceptive encoders with transformer-based dynamics for efficient visuo-motor planning.
- It employs multi-step predictive loss and lightweight planning (using CEM/CMA-ES) to optimize trajectories, achieving robust generalization across simulated and real robotic tasks.
- The architecture unifies training and planning regimes, reducing reliance on extrinsic rewards while delivering improved performance over prior methods like DINO-WM and V-JEPA-2-AC.
Drive-JEPA is a class of world-model-plus-planner frameworks in which goal-directed planning is performed in the joint embedding space of observations and actions. It combines a Joint-Embedding Predictive Architecture (JEPA) with lightweight, sampling-based planners to produce highly data-efficient solutions for visuo-motor control and long-horizon policy optimization. Drive-JEPA architectures are characterized by their use of self-supervised (often fully unsupervised) predictive objectives, transformer-based sequence modeling, and empirical model selection via systematic ablation of training and planning components. The result is a unified latent representation space where optimal trajectories can be found using lightweight optimization, enabling robust generalization across simulated and real-world robotics scenarios (Terver et al., 30 Dec 2025).
1. Drive-JEPA Architecture
Drive-JEPA models are structured around four principal components: encoders, a latent-dynamics predictor, diagnostic decoders, and a latent-space planner.
- Encoders:
Visual observations and, optionally, proprioceptive state vectors are passed through a frozen ViT-based image encoder and a trainable proprioceptive encoder, respectively. Their outputs are concatenated to yield a combined latent state embedding. Actions are encoded via a trainable action encoder.
- Latent-Dynamics Predictor:
A transformer predictor with rotary positional embeddings (RoPE) and Adaptive LayerNorm (AdaLN) conditioning on the action tokens models temporal evolution in the latent space. The transformer is unrolled over a context window of $W$ past steps, using predicted or teacher-forced state tokens with the corresponding actions.
- Diagnostic Decoders:
Optional visual and proprioceptive decoders enable LPIPS reconstruction or prediction error monitoring during diagnostic analysis, but do not directly impact planning performance.
- Planner:
Planning is formulated as trajectory optimization in the latent space over a finite horizon $H$, seeking to minimize the distance between the predicted latent state at the horizon and the embedding of the goal observation. The main optimization method is the Cross-Entropy Method (CEM), supplemented in real-world contexts by CMA-ES via NeverGrad.
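The component flow above can be sketched as a minimal PyTorch module. The class and argument names (`DriveJEPAStep`, `vit`, `proprio_enc`, `action_enc`, `predictor`) are illustrative stand-ins for the frozen ViT encoder, trainable proprioceptive/action encoders, and transformer predictor, not the authors' actual interfaces:

```python
import torch
import torch.nn as nn

class DriveJEPAStep(nn.Module):
    """One latent-dynamics step: encode observation + action, predict next latent.

    Sketch only: module names are hypothetical placeholders for the
    components described in the text.
    """
    def __init__(self, vit, proprio_enc, action_enc, predictor):
        super().__init__()
        self.vit = vit.eval()           # frozen ViT-based image encoder
        for p in self.vit.parameters():
            p.requires_grad_(False)
        self.proprio_enc = proprio_enc  # trainable proprioceptive encoder
        self.action_enc = action_enc    # trainable action encoder
        self.predictor = predictor      # transformer dynamics over latent tokens

    def encode(self, image, proprio):
        # Concatenate visual and proprioceptive embeddings into one latent state.
        z_img = self.vit(image)
        z_pro = self.proprio_enc(proprio)
        return torch.cat([z_img, z_pro], dim=-1)

    def forward(self, z_context, actions):
        # The predictor conditions on encoded action tokens
        # (AdaLN conditioning in the text).
        a = self.action_enc(actions)
        return self.predictor(z_context, a)  # predicted next latent state
```

In practice the predictor would be the RoPE/AdaLN transformer unrolled over the $W$-step context; here it is left abstract so the data flow stays visible.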
2. Training Objectives
Drive-JEPA is trained using a multi-step predictive mean squared error (MSE) computed within the embedding space of frozen encoders. For context length $W$, with frozen-encoder target $z_{t+k}$ and predicted latent state $\hat{z}_{t+k}$ at rollout step $k$, the per-step loss is

$$\mathcal{L}_k = \lVert \hat{z}_{t+k} - z_{t+k} \rVert_2^2.$$

The total loss is the sum over rollout steps up to a maximum $K$ (e.g., $K = 2$ for simulation, $K = 6$ for real-world robotics):

$$\mathcal{L} = \sum_{k=1}^{K} \mathcal{L}_k.$$
Proprioceptive and visual branches are weighted equally. No contrastive or negative sampling is used. Truncated Backpropagation Through Time (TBPTT) is employed, detaching gradients from past predictions to aid stability.
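A minimal sketch of this multi-step objective with TBPTT-style detaching might look as follows; `predictor` and `encode` are hypothetical placeholders for the transformer dynamics and the frozen target encoder, not the authors' actual interfaces:

```python
import torch
import torch.nn.functional as F

def multistep_tbptt_loss(predictor, encode, obs_seq, act_seq, k_max=2):
    """Multi-step latent MSE with truncated backpropagation through time.

    Sketch only: targets come from the frozen encoder, and each rolled-out
    state is detached so gradients do not flow through past predictions.
    """
    with torch.no_grad():                      # frozen-encoder targets
        targets = [encode(o) for o in obs_seq]
    z, loss = targets[0], torch.zeros(())
    for k in range(1, k_max + 1):
        # Detach the rolled-out state before the next prediction (TBPTT).
        z = predictor(z.detach(), act_seq[k - 1])
        loss = loss + F.mse_loss(z, targets[k])
    return loss / k_max
```

No contrastive term or negative sampling appears anywhere in the loop; the objective is plain MSE between predicted and frozen-encoder latents, matching the description above.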
3. Latent-Space Planning Algorithm
Given an initial observation $o_0$ and a goal observation $o_g$, a candidate action sequence $a_{0:H-1}$ is sampled over the planning horizon $H$. The planning cost is the distance between the predicted terminal latent state and the goal embedding,

$$C(a_{0:H-1}) = \lVert \hat{z}_H - z_g \rVert,$$

with an analogous proprioceptive term added if proprioceptive targets are used. Optimization employs CEM: candidate trajectories are sampled, the top-$k$ elites are retained, the sampling mean and diagonal covariance are refit to the elites, and this loop repeats for a fixed number of iterations before the best actions are applied in the environment. Empirically, the $\ell_2$-norm cost outperforms $\ell_1$, and CEM/$\ell_2$ is superior to gradient-based solvers for contact-rich or highly multi-modal tasks.
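The sample/elite/refit cycle can be illustrated with a bare-bones CEM loop over latent rollouts; the `rollout` function and all hyperparameter defaults below are placeholders, not the paper's settings:

```python
import numpy as np

def cem_plan(rollout, z_goal, horizon, act_dim,
             n_samples=64, n_elites=8, n_iters=5, seed=0):
    """Cross-Entropy Method in latent space (sketch).

    `rollout` is a user-supplied function mapping an action sequence of
    shape (horizon, act_dim) to the predicted terminal latent state; it
    stands in for unrolling the learned dynamics from the current latent.
    """
    rng = np.random.default_rng(seed)
    mean = np.zeros((horizon, act_dim))
    std = np.ones((horizon, act_dim))
    for _ in range(n_iters):
        # Sample candidate trajectories around the current distribution.
        samples = mean + std * rng.standard_normal((n_samples, horizon, act_dim))
        # L2 distance between predicted terminal latent and goal embedding.
        costs = np.array([np.linalg.norm(rollout(a) - z_goal) for a in samples])
        # Keep the top-k elites and refit mean / diagonal covariance.
        elites = samples[np.argsort(costs)[:n_elites]]
        mean, std = elites.mean(axis=0), elites.std(axis=0)
    return mean  # refined action sequence; its first action is executed
```

With a toy rollout such as `lambda a: a.sum(axis=0)`, the loop quickly concentrates its sampling distribution on action sequences whose predicted terminal latent matches the goal.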
4. Empirical Results and Ablation Insights
Final Drive-JEPA performance surpasses prior art (DINO-WM, V-JEPA-2-AC) in both navigation and manipulation. Representative success rates (mean ± std):
| Task | DINO-WM | V-JEPA-2-AC | Drive-JEPA |
|---|---|---|---|
| Maze | 81.6 ± 3.4 | — | 83.9 ± 2.3 |
| Wall | 64.1 ± 4.6 | — | 78.8 ± 3.9 |
| Push-T | 66.0 ± 4.7 | — | 70.2 ± 2.8 |
| MW-Reach | 44.8 ± 8.9 | — | 58.2 ± 9.3 |
| MW-Reach-Wall | 35.1 ± 9.4 | — | 41.6 ± 10.0 |
| Robocasa-Reach | 19.1 ± 13.4 | 16.2 ± 8.3 | 25.4 ± 16.6 |
| Robocasa-Place | 21.7 ± 7.2 | 33.1 ± 7.2 | 30.7 ± 8.0 |
| DROID | 39.4 ± 2.1 | 42.9 ± 2.5 | 48.2 ± 1.8 |
Boosts of +14% on Metaworld Reach and +23% on Wall are observed. Ablations highlight the following:
- Multistep rollout loss: a 2-step rollout yields a 10–20% success boost in simulation; more than 2 steps degrades performance with short contexts, whereas up to 6 rollout steps are needed in real robotics.
- Context length: Shorter training contexts (e.g., W = 3) are optimal in simulation; longer contexts are preferable in photorealistic real-world tasks.
- Proprioception: Including the proprio branch increases push/navigation success by 10–20%, with reaching tasks benefiting from additional proprioceptive weighting.
- Visual encoder: DINOv2/v3 image encoders outperform video encoders (V-JEPA) for manipulation; DINOv3 further improves in photorealistic environments.
- Predictor conditioning: AdaLN+RoPE is optimal on average, though feature- or sequence-based conditioning is task-dependent.
- Model scaling: ViT-S/depth-6 is sufficient in simulation; scaling to ViT-L/depth-12 brings an 8–10% improvement in real environments.
- Planner: CEM/L2 is robust across a range of tasks, while gradient descent/Adam solvers are competitive only in smooth, single-modal domains.
5. Distinguishing Features of Drive-JEPA
Several innovations distinguish Drive-JEPA within the JEPA-WM family (Terver et al., 30 Dec 2025):
- Systematic co-tuning of encoder type, context length, rollout horizon, predictor conditioning, model size, and planner.
- Alignment of training and planning regimes: matching context and rollout for sharp, optimizable latent dynamics.
- Dynamic model scaling: lightweight models suffice in simulators, while larger architectures reap gains in the real world.
- Full zero-reward, goal-conditioned, self-supervised training—no extrinsic reward is required for model learning.
- The resultant latent space yields feasible, optimizable trajectories via lightweight latent-space CEM, with robust generalization to real robot navigation and manipulation domains.
6. Relationship to Related Frameworks
Drive-JEPA establishes a new optimum within JEPA-World Model (JEPA-WM) methods by integrating insights from architecture design, training objectives, and planning mechanics. Unlike pixel-reconstruction or contrastive learning world models, Drive-JEPA does not reconstruct raw pixels or generate negative sample pairs; instead, it optimizes readily-computable prediction losses in a structured representation space. In direct empirical comparison, Drive-JEPA's principled ablation-driven methodology leads to superior planning success and transfer, providing a reproducible, scalable solution for both simulated and real-world continuous control challenges (Terver et al., 30 Dec 2025).
7. Implementation Practices and Open Challenges
Drive-JEPA's implementation best practices emphasize:
- Matching the rollout length, context window, and planning horizon to the task domain (simulation vs. real-world).
- Selection of visual encoders and predictor architectures appropriate to task visual complexity.
- Minimal reward engineering: full self-supervised loss regimes allow for rapid domain transfer.
- Lightweight planning (CEM or CMA-ES) in latent space for computational efficiency.
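These practices can be collected into a hypothetical configuration sketch. The keys and the simulation/real-world split mirror settings reported above (W = 3, 2-step vs. 6-step rollouts, ViT-S/depth-6 vs. ViT-L/depth-12, CEM with L2 cost, CMA-ES via NeverGrad in the real world), but the schema itself is illustrative:

```python
# Hypothetical config sketch; key names are illustrative, not the authors'
# actual configuration schema. Values follow the settings reported above.
SIM_CONFIG = {
    "encoder": "ViT-S",       # lightweight model suffices in simulation
    "predictor_depth": 6,
    "context_window": 3,      # W = 3 training context
    "rollout_steps": 2,       # 2-step multi-step loss
    "planner": "CEM",
    "plan_cost": "L2",
}

REAL_CONFIG = {
    "encoder": "ViT-L",       # larger model pays off on real robots
    "predictor_depth": 12,
    "rollout_steps": 6,       # up to 6 rollout steps in real robotics
    "planner": "CMA-ES",      # via NeverGrad for real-world planning
    "plan_cost": "L2",
}
```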
Open challenges include the extension to long-horizon temporal memory, multi-modal input fusion (e.g., camera, LiDAR), and further investigation of hybrid value–prediction shaping or hierarchical latent structures to close remaining gaps in generalization and sample-efficiency (Terver et al., 30 Dec 2025).
Key Reference:
- "What Drives Success in Physical Planning with Joint-Embedding Predictive World Models?" (Terver et al., 30 Dec 2025)