V-JEPA 2-AC: Action-Conditioned Post-training
- The paper demonstrates that post-training a large video model with minimal real robot data achieves accurate zero-shot planning and control without task-specific fine-tuning.
- The methodology pairs a frozen video encoder with a new action-conditioned predictor transformer that uses block-causal attention and is trained with two loss terms on latent video predictions.
- Empirical evaluations reveal that V-JEPA 2-AC outperforms baselines in grasp, reach, and pick-and-place tasks while planning faster and generalizing robustly to unseen scenarios.
Action-Conditioned Post-training (V-JEPA 2-AC) is a methodology for adapting a large, action-free, self-supervised video world model to robotic planning and control tasks by latent post-training on a small collection of real robot interaction data. Developed as an extension of V-JEPA 2—a model pretrained on over a million hours of internet-scale video—V-JEPA 2-AC demonstrates that self-supervised learning at web scale, when suitably adapted, enables accurate prediction and zero-shot planning in the physical world without environment- or task-specific fine-tuning (Assran et al., 11 Jun 2025).
1. Architecture and Model Design
V-JEPA 2-AC is derived by post-training on top of the action-free V-JEPA 2 video encoder $E_\theta$, which is kept frozen, with a new action-conditioned predictor transformer $P_\phi$. The architecture incorporates the following features:
- Input Representation: Each timestep $t$ includes
  - Visual encoding $z_t = E_\theta(x_t)$, with $x_t$ the RGB image,
  - End-effector state $s_t \in \mathbb{R}^7$,
  - Action $a_t \in \mathbb{R}^7$.
- Predictor Structure: $P_\phi$ is a 24-layer, 16-head transformer with 1024 hidden units and GELU non-linearities. Distinct linear “input heads” project encoded patches, states, and actions to the predictor space. Temporal and spatial positional encoding is provided via RoPE (3D for image patches, 1D for actions/states).
- Attention Mechanics: Block-causal attention enables tokens at time $t$ to attend to all tokens at times $t' \le t$.
- Forward Computation: For a sequence $(a_t, s_t, z_t)_{t \le k}$, the predicted next video representation is
$$\hat{z}_{k+1} = P_\phi\big((a_t, s_t, z_t)_{t \le k}\big).$$
- Projection: Predictor outputs are linearly projected back to the encoder latent dimension.
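The block-causal attention pattern described above can be sketched as a boolean mask. This is a minimal illustration, not the authors' implementation: the function name `block_causal_mask` and the simplification that every timestep contributes the same number of tokens are assumptions made here.

```python
import numpy as np

def block_causal_mask(num_steps: int, tokens_per_step: int) -> np.ndarray:
    """Return a boolean mask; entry [i, j] is True if token i may attend to token j.

    Hypothetical helper: assumes each timestep contributes `tokens_per_step`
    tokens (image patches plus action and state tokens) laid out contiguously.
    """
    # Timestep index of every token in the flattened sequence.
    step_of_token = np.repeat(np.arange(num_steps), tokens_per_step)
    # A query at timestep t sees every key whose timestep is <= t,
    # including the other tokens of its own timestep.
    return step_of_token[:, None] >= step_of_token[None, :]

mask = block_causal_mask(num_steps=3, tokens_per_step=2)
print(mask.astype(int))
```

Unlike a token-level causal mask, tokens within the same timestep attend to each other freely; only future timesteps are hidden.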
2. Loss Functions for Post-training
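Action-conditioned post-training employs two losses to optimize the prediction in latent video space. Below, $z_t = E_\theta(x_t)$ is the frozen encoder's representation of frame $x_t$, $s_t$ and $a_t$ are the end-effector state and action, and $P_\phi$ is the predictor:
- Teacher-Forcing Loss (One Step):
$$\mathcal{L}_{\text{teacher-forcing}}(\phi) = \frac{1}{T} \sum_{k=1}^{T} \big\| P_\phi\big((a_t, s_t, z_t)_{t \le k}\big) - z_{k+1} \big\|_1$$
- Short-Rollout Loss (Two Steps):
$$\mathcal{L}_{\text{rollout}}(\phi) = \big\| P_\phi(a_{1:2}; s_1, z_1) - z_3 \big\|_1$$
where $P_\phi(a_{1:2}; s_1, z_1)$ denotes applying the predictor autoregressively for two steps from $(s_1, z_1)$, so that the predictor’s step-2 input includes its own previous prediction.
- Total Loss:
$$\mathcal{L}(\phi) = \mathcal{L}_{\text{teacher-forcing}}(\phi) + \mathcal{L}_{\text{rollout}}(\phi)$$
No additional regularization or adversarial terms are used; an $\ell_1$ distance in the latent space suffices.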
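The two loss terms above can be illustrated on toy data. The sketch below is not the paper's training code: `toy_predictor` is a hypothetical linear stand-in for the predictor transformer, the end-effector state input is dropped for brevity, and the mean-absolute normalization of the $\ell_1$ distance is an assumption.

```python
import numpy as np

def toy_predictor(z_hist, a_hist, W):
    """Hypothetical stand-in for the predictor: next latent from the latest step only."""
    return z_hist[-1] + W @ a_hist[-1]

def l1(pred, target):
    """Mean absolute error between two latent vectors."""
    return float(np.abs(pred - target).mean())

def teacher_forcing_loss(z, a, W):
    """Average one-step error when every input latent is a ground-truth encoding."""
    T = len(a)
    return float(np.mean([l1(toy_predictor(z[:k], a[:k], W), z[k])
                          for k in range(1, T + 1)]))

def rollout_loss(z, a, W, steps=2):
    """Error after feeding the predictor its own predictions for `steps` steps."""
    z_hist = [z[0]]
    for k in range(steps):
        z_hist.append(toy_predictor(np.stack(z_hist), a[: k + 1], W))
    return l1(z_hist[-1], z[steps])

# Toy trajectory generated by known linear latent dynamics.
rng = np.random.default_rng(0)
D, A, T = 8, 7, 4
W_true = rng.normal(size=(D, A))
a = rng.normal(size=(T, A))
z = np.zeros((T + 1, D))
for k in range(T):
    z[k + 1] = z[k] + W_true @ a[k]

total_true = teacher_forcing_loss(z, a, W_true) + rollout_loss(z, a, W_true)
total_bad = teacher_forcing_loss(z, a, 0 * W_true) + rollout_loss(z, a, 0 * W_true)
print(total_true, total_bad)
```

The rollout term differs from the teacher-forcing term only in what the predictor sees at step 2: its own step-1 output rather than the encoder's ground-truth latent.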
3. Data Regime and Preprocessing
Action-conditioned post-training uses the Droid dataset:
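- ~62 hours of Franka Panda teleoperated robot trajectories, left camera RGB only, 4 fps, $256 \times 256$ px, clips of 3–4 seconds.
- State $s_t \in \mathbb{R}^7$: 3D Cartesian position, 3D Euler orientation, 1D gripper open/close.
- Actions $a_t = s_{t+1} - s_t$, the change in end-effector state between adjacent frames.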
- Preprocessing includes discarding sub-4s clips and random resize-crop (aspect ratio in $[0.75, 1.35]$).
- Training batches prepare 16 clips $\times$ 16 frames (per GPU), global batch size 256.
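As a rough illustration of the preprocessing listed above, the sketch below turns a state trajectory into one training sample. It assumes actions are defined as differences between consecutive end-effector states; `make_sample` and the input layout are hypothetical and not taken from the Droid tooling.

```python
import numpy as np

FPS = 4           # frame rate stated in the text
CLIP_FRAMES = 16  # frames per training clip stated in the text

def make_sample(states: np.ndarray):
    """Return (clip_states, actions) for one 16-frame clip, or None if too short.

    `states` is a hypothetical [num_frames, 7] array already sampled at 4 fps:
    xyz position, Euler orientation, gripper opening.
    """
    if len(states) < CLIP_FRAMES:  # shorter than 4 s at 4 fps: discard
        return None
    clip = states[:CLIP_FRAMES]
    # Assumed action definition: change in state between adjacent frames.
    # Note: plain subtraction ignores Euler-angle wraparound.
    actions = clip[1:] - clip[:-1]
    return clip, actions

states = np.cumsum(np.ones((20, 7)), axis=0)  # toy trajectory with unit steps
clip, actions = make_sample(states)
print(clip.shape, actions.shape)
```

A 16-frame clip yields 15 actions, one per pair of adjacent frames.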
4. Planning and Control with V-JEPA 2-AC
Goal-driven planning employs the trained predictor as a latent world model:
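- Encode current and goal observations: $z_k = E_\theta(x_k)$, $z_g = E_\theta(x_g)$, where $E_\theta$ is the frozen encoder.
- Optimize a horizon-$T$ action sequence $\hat{a}_{1:T}$ to minimize
$$\mathcal{E}(\hat{a}_{1:T}; z_k, s_k, z_g) = \big\| P_\phi(\hat{a}_{1:T}; s_k, z_k) - z_g \big\|_1,$$
where $P_\phi(\hat{a}_{1:T}; s_k, z_k)$ is the predictor rolled out for $T$ steps from the current state $s_k$ and latent $z_k$.
- Planning selects:
$$a^{*}_{1:T} = \arg\min_{\hat{a}_{1:T}} \mathcal{E}(\hat{a}_{1:T}; z_k, s_k, z_g)$$
- Optimization is performed via the Cross-Entropy Method (CEM), which repeatedly samples candidate action sequences from a Gaussian, scores them with the energy, and refits the Gaussian to the lowest-energy samples.
At each interaction cycle, only the first action is executed, then replanning occurs with the latest observation and state.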
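The planning loop can be sketched with a generic Cross-Entropy Method over a toy world model. Everything specific here is an assumption for illustration: `rollout` uses hypothetical linear latent dynamics in place of the learned predictor, and the sample count, elite count, and iteration count are arbitrary rather than the paper's settings.

```python
import numpy as np

def rollout(z, actions, B):
    """Hypothetical stand-in for the predictor: linear latent dynamics z' = z + B a."""
    for a in actions:
        z = z + B @ a
    return z

def energy(actions, z_k, z_goal, B):
    """L1 distance between the imagined final latent and the goal latent."""
    return float(np.abs(rollout(z_k, actions, B) - z_goal).sum())

def cem_plan(z_k, z_goal, B, horizon=1, action_dim=7,
             samples=256, elites=32, iters=10, seed=0):
    """Generic CEM: sample action sequences, keep the best, refit the Gaussian."""
    rng = np.random.default_rng(seed)
    mean = np.zeros((horizon, action_dim))
    std = np.ones((horizon, action_dim))
    for _ in range(iters):
        cand = mean + std * rng.standard_normal((samples, horizon, action_dim))
        costs = np.array([energy(c, z_k, z_goal, B) for c in cand])
        best = cand[np.argsort(costs)[:elites]]  # lowest-energy candidates
        mean, std = best.mean(axis=0), best.std(axis=0) + 1e-6
    return mean

# Toy problem: the goal latent is reachable with a single unknown action.
rng = np.random.default_rng(1)
B = rng.normal(size=(16, 7))
z_k = rng.normal(size=16)
z_goal = z_k + B @ (0.5 * rng.normal(size=7))

plan = cem_plan(z_k, z_goal, B)
first_action = plan[0]  # only this action would be executed before replanning
print(energy(plan, z_k, z_goal, B), energy(np.zeros((1, 7)), z_k, z_goal, B))
```

In a receding-horizon loop, `cem_plan` would be called again after each executed action with the newly encoded observation as `z_k`.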
5. Empirical Evaluation
Experiments are conducted on two real Franka Panda setups in labs unseen during post-training. Evaluation comprises four main tasks (10 trials each):
- Reach: move to a visual goal with no object.
- Grasp: pick up a cup or box; single image goal.
- Reach with Object: move to a new pose while holding an object.
- Pick-and-Place: multi-stage, with three sub-goals: pre-grasp, lift, and place.
Results (averaged over both labs) demonstrate superior zero-shot performance of V-JEPA 2-AC over task-specific baselines—Octo (vision-language-action behavioral cloning, fine-tuned with hindsight relabeling) and Cosmos (diffusion video-generation world model, fine-tuned)—especially for manipulation tasks requiring object generalization:
| Method | Reach (no obj) | Grasp (Cup) | Grasp (Box) | Reach w/Obj (Cup) | Reach w/Obj (Box) | Pick-Place (Cup) | Pick-Place (Box) |
|---|---|---|---|---|---|---|---|
| Octo | 100% | 15% | 0% | 15% | 70% | 15% | 10% |
| V-JEPA 2-AC | 100% | 65% | 25% | 75% | 75% | 80% | 65% |
Efficiency comparison with Cosmos (Lab 2) shows that V-JEPA 2-AC plans about 15× faster (16 s/action vs. 4 min/action, i.e., 240 s / 16 s), with equal or superior success rates for both "Grasp" and "Pick-and-Place". A horizon of $T = 1$ suffices for these tasks, indicating greedy planning is adequate for short-horizon objectives.
6. Observed Properties and Limitations
Key qualitative and quantitative observations about V-JEPA 2-AC include:
- Zero-shot Generalization: Robust performance on previously unseen robots, objects, and backgrounds; no fine-tuning in deployment locations.
- Energy Landscape: The goal-conditioned latent energy function is locally convex, enabling effective CEM optimization.
- Implicit Physical Constraints: Model internalizes object constancy; for instance, open-gripper rollouts preserve object stationarity across predicted frames.
- Coordinate Sensitivity: Successful action inference requires accurate camera calibration; errors can be partially corrected offline using linear fits to random action trajectories.
- Long-Horizon Limitations: Rollouts degrade in accuracy for long horizons, limiting planning reliability when multi-step predictions are required without intermediate subgoals.
- Goal Specification: Only image-goal planning is supported; plans conditioned on language goals are not enabled in this version, though future work may address this via LLM alignment.
- Planning Horizon: Horizon $T = 1$ (greedy action) sufficed for the demonstrated manipulation tasks; complex tasks without natural sub-goals will require longer-horizon planning and rollout-robust training regimes.
7. Broader Implications
The development and empirical validation of action-conditioned post-training for V-JEPA 2 establishes the viability of leveraging web-scale visual pretraining for downstream robotic world modeling and planning with minimal physical interaction data. A plausible implication is the reduction of the data-collection burden for real-world deployment of robotic systems, as strong zero-shot transfer emerges from this approach (Assran et al., 11 Jun 2025). The separation of internet-scale self-supervised pretraining and lightweight post-hoc adaptation delineates a new paradigm in scalable robot learning pipelines. Limitations regarding long-horizon accuracy and the lack of language-conditioned planning suggest ongoing research is required to address more complex, instruction-driven, or temporally extended tasks.