V-JEPA 2-AC: Action-Conditioned Post-training
- The paper demonstrates that post-training a large video model with minimal real robot data achieves accurate zero-shot planning and control without task-specific fine-tuning.
- The methodology integrates a frozen video encoder with a new action-conditioned predictor transformer using block-causal attention and dual loss functions to optimize latent video predictions.
- Empirical evaluations reveal that V-JEPA 2-AC outperforms baselines in grasp, reach, and pick-and-place tasks while planning faster and generalizing robustly to unseen scenarios.
Action-Conditioned Post-training (V-JEPA 2-AC) is a methodology for adapting a large, action-free, self-supervised video world model to robotic planning and control tasks by latent post-training on a small collection of real robot interaction data. Developed as an extension of V-JEPA 2—a model pretrained on over a million hours of internet-scale video—V-JEPA 2-AC demonstrates that self-supervised learning at web scale, when suitably adapted, enables accurate prediction and zero-shot planning in the physical world without environment- or task-specific fine-tuning (Assran et al., 11 Jun 2025).
1. Architecture and Model Design
V-JEPA 2-AC is derived by post-training the action-free V-JEPA 2 video encoder $E$, which is frozen, with a new action-conditioned predictor transformer $P_\phi$. The architecture incorporates the following features:
- Input Representation: Each timestep $k$ includes
- Visual encoding $z_k = E(x_k)$, with $x_k$ the RGB image,
- End-effector state $s_k \in \mathbb{R}^7$,
- Action $a_k \in \mathbb{R}^7$.
- Predictor Structure: is a 24-layer, 16-head transformer with 1024 hidden units and GELU non-linearities. Distinct linear “input heads” project encoded patches, states, and actions to the predictor space. Temporal and spatial positional encoding is provided via RoPE (3D for image patches, 1D for actions/states).
- Attention Mechanics: Block-causal attention enables tokens at time $t$ to attend to all tokens at times $\le t$.
- Forward Computation: For a sequence $(z_1, s_1, a_1, \ldots, z_k, s_k, a_k)$, the predicted next video representation is $\hat{z}_{k+1} = P_\phi(z_{\le k}, s_{\le k}, a_{\le k})$.
- Projection: Predictor outputs are linearly projected back to the encoder latent dimension.
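A minimal PyTorch sketch of this interface follows, assuming illustrative names and sizes; positional encodings are omitted for brevity (the actual model uses 3D/1D RoPE) and the default depth is reduced, so this is a sketch of the design, not the released implementation:

```python
import torch
import torch.nn as nn

class ActionConditionedPredictor(nn.Module):
    """Block-causal predictor sketch; names and sizes are hypothetical."""

    def __init__(self, enc_dim=1280, dim=1024, layers=4, heads=16,
                 state_dim=7, action_dim=7):
        super().__init__()
        # Distinct linear "input heads" for patches, states, and actions.
        self.patch_in = nn.Linear(enc_dim, dim)
        self.state_in = nn.Linear(state_dim, dim)
        self.action_in = nn.Linear(action_dim, dim)
        block = nn.TransformerEncoderLayer(dim, heads, 4 * dim,
                                           activation="gelu", batch_first=True)
        self.trunk = nn.TransformerEncoder(block, layers)
        # Project predictor outputs back to the frozen encoder's latent dim.
        self.patch_out = nn.Linear(dim, enc_dim)

    @staticmethod
    def block_causal_mask(T, n_tok, device):
        # Tokens at time t may attend to all tokens at times <= t.
        t_idx = torch.arange(T, device=device).repeat_interleave(n_tok)
        return t_idx[None, :] > t_idx[:, None]  # True = attention blocked

    def forward(self, z, s, a):
        # z: (B, T, P, enc_dim) frozen-encoder patches; s, a: (B, T, 7)
        B, T, P, _ = z.shape
        tokens = torch.cat([self.patch_in(z),
                            self.state_in(s).unsqueeze(2),
                            self.action_in(a).unsqueeze(2)], dim=2)
        tokens = tokens.flatten(1, 2)                      # (B, T*(P+2), dim)
        mask = self.block_causal_mask(T, P + 2, z.device)
        h = self.trunk(tokens, mask=mask)
        h = h.view(B, T, P + 2, -1)[:, :, :P]              # keep patch tokens
        return self.patch_out(h)     # prediction of z_{k+1} at each step k
```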
2. Loss Functions for Post-training
Action-conditioned post-training employs two losses to optimize the prediction in latent video-space:
- Teacher-Forcing Loss (One Step): $\mathcal{L}_{\text{teacher}}(\phi) = \sum_{k=1}^{T-1} \big\| P_\phi(z_{\le k}, s_{\le k}, a_{\le k}) - z_{k+1} \big\|_1$
- Short-Rollout Loss (Two Steps): $\mathcal{L}_{\text{rollout}}(\phi) = \big\| P_\phi(z_{\le k}, \hat{z}_{k+1};\, s_{\le k+1}, a_{\le k+1}) - z_{k+2} \big\|_1$, where the predictor's step-2 input includes its own previous prediction $\hat{z}_{k+1}$.
- Total Loss: $\mathcal{L}(\phi) = \mathcal{L}_{\text{teacher}}(\phi) + \mathcal{L}_{\text{rollout}}(\phi)$
No additional regularization or adversarial terms are used; an $L_1$ distance in the latent space suffices.
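A compact sketch of how the two terms could be computed with the predictor interface above; treating the output at each timestep as the prediction for the next, and reusing ground-truth states at the rollout step, are simplifying assumptions:

```python
import torch
import torch.nn.functional as F

def post_training_loss(predictor, z, s, a):
    # z: (B, T, P, D) frozen V-JEPA 2 features; s, a: (B, T, 7)
    # Teacher forcing: output at timestep k is the prediction for k+1.
    z_hat = predictor(z[:, :-1], s[:, :-1], a[:, :-1])     # (B, T-1, P, D)
    loss_teacher = F.l1_loss(z_hat, z[:, 1:])

    # Two-step rollout at the sequence tail: make a one-step prediction,
    # substitute it for the ground-truth input, then predict and score
    # the final timestep.
    z_hat1 = predictor(z[:, :-2], s[:, :-2], a[:, :-2])[:, -1:]
    z_in = torch.cat([z[:, :-2], z_hat1], dim=1)
    z_hat2 = predictor(z_in, s[:, :-1], a[:, :-1])[:, -1:]
    loss_rollout = F.l1_loss(z_hat2, z[:, -1:])
    return loss_teacher + loss_rollout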
3. Data Regime and Preprocessing
Action-conditioned post-training uses the Droid dataset:
- ~62 hours of Franka Panda teleoperated robot trajectories, left-camera RGB only, sampled at 4 fps; clips of 3–4 seconds.
- State $s_k \in \mathbb{R}^7$: 3D Cartesian position, 3D Euler orientation, 1D gripper open/close.
- Actions $a_k \in \mathbb{R}^7$: per-step changes in end-effector state.
- Preprocessing includes discarding sub-4s clips and random resize-crop augmentation.
- Each per-GPU batch contains 16 clips of 16 frames; the global batch size is 256 clips.
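As an illustration of this data regime, the hypothetical routine below windows a trajectory into fixed-length clips and derives actions as per-step state differences; names and structure are assumptions, not the paper's pipeline:

```python
import numpy as np

def make_clips(frames, states, clip_len=16):
    """frames: (N, H, W, 3) uint8; states: (N, 7) = xyz + Euler + gripper."""
    actions = np.diff(states, axis=0)          # a_k = s_{k+1} - s_k
    clips = []
    for start in range(0, len(frames) - clip_len, clip_len):
        sl = slice(start, start + clip_len)
        clips.append({"frames": frames[sl],
                      "states": states[sl],
                      "actions": actions[sl]})  # action taken at each step k
    return clips
```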
4. Planning and Control with V-JEPA 2-AC
Goal-driven planning employs the trained predictor as a latent world model:
- Encode current and goal observations: $z_k = E(x_k)$, $z_g = E(x_g)$.
- Optimize a horizon-$T$ action sequence $a_{1:T}$ to minimize the latent energy $\mathcal{E}(a_{1:T}) = \big\| \hat{z}_{k+T}(a_{1:T}) - z_g \big\|_1$, where $\hat{z}_{k+T}$ is obtained by autoregressively rolling out $P_\phi$ from $(z_k, s_k)$.
- Planning selects $a^\star_{1:T} = \arg\min_{a_{1:T}} \mathcal{E}(a_{1:T})$; the first action is executed and the procedure repeats in receding-horizon fashion.
- Optimization is performed via the Cross-Entropy Method (CEM):
```
Input: P_phi, z_g, (z_k, s_k), horizon T, iters M, N samples/iter, top-K
Initialize μ^(1)_{1:T} ← 0, Σ^(1)_{1:T} ← I
for i = 1 to M:
    Sample N action sequences a^(j)_{1:T} ~ N(μ^(i), Σ^(i))
    Compute energies E_j = || rollout(P_phi; z_k, s_k, a^(j)_{1:T}) − z_g ||_1
    Update μ^(i+1), Σ^(i+1) from the top-K lowest-energy sequences
Return μ^(M+1)_{1:T}
```
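For concreteness, a runnable NumPy rendering of the same loop; `rollout_energy` stands in for unrolling P_phi over a candidate action sequence and measuring the L1 distance to z_g, and all parameter names and defaults are illustrative:

```python
import numpy as np

def cem_plan(rollout_energy, horizon, action_dim=7,
             iters=10, n_samples=256, top_k=32, seed=0):
    """rollout_energy(a): unrolls the world model over actions a (T, 7) and
    returns the L1 distance between the final predicted representation and
    the goal representation z_g (all names illustrative)."""
    rng = np.random.default_rng(seed)
    mu = np.zeros((horizon, action_dim))
    sigma = np.ones((horizon, action_dim))
    for _ in range(iters):
        # Sample N candidate action sequences from the current Gaussian.
        noise = rng.standard_normal((n_samples, horizon, action_dim))
        samples = mu + sigma * noise
        energies = np.array([rollout_energy(a) for a in samples])
        # Refit the Gaussian to the top-K lowest-energy (elite) sequences.
        elites = samples[np.argsort(energies)[:top_k]]
        mu, sigma = elites.mean(axis=0), elites.std(axis=0) + 1e-6
    return mu  # planned sequence; execute the first action, then replan
```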
5. Empirical Evaluation
Experiments are conducted on two real Franka Panda setups in labs unseen during post-training. Evaluation comprises four main tasks (10 trials each):
- Reach: move to a visual goal with no object.
- Grasp: pick up a cup or box; single image goal.
- Reach with Object: move to a new pose while holding an object.
- Pick-and-Place: multi-stage, with three sub-goals: pre-grasp, lift, and place.
Results (averaged over both labs) demonstrate superior zero-shot performance of V-JEPA 2-AC over task-specific baselines—Octo (vision-language-action behavioral cloning, fine-tuned with hindsight relabeling) and Cosmos (diffusion video-generation world model, fine-tuned)—especially for manipulation tasks requiring object generalization:
| Method | Reach (no obj) | Grasp (Cup) | Grasp (Box) | Reach w/Obj (Cup) | Reach w/Obj (Box) | Pick-Place (Cup) | Pick-Place (Box) |
|---|---|---|---|---|---|---|---|
| Octo | 100% | 15% | 0% | 15% | 70% | 15% | 10% |
| V-JEPA 2-AC | 100% | 65% | 25% | 75% | 75% | 80% | 65% |
Efficiency comparison with Cosmos (Lab 2) shows that V-JEPA 2-AC plans roughly 15× faster (16 s/action vs. 4 min/action), with equal or superior success rates for both "Grasp" and "Pick-and-Place". A planning horizon of $T=1$ suffices for these tasks, indicating greedy action selection is adequate for short-horizon objectives.
6. Observed Properties and Limitations
Key qualitative and quantitative observations about V-JEPA 2-AC include:
- Zero-shot Generalization: Robust performance on previously unseen robots, objects, and backgrounds; no fine-tuning in deployment locations.
- Energy Landscape: The goal-conditioned latent energy function is locally convex, enabling effective CEM optimization.
- Implicit Physical Constraints: The model internalizes physical regularities such as object constancy; for instance, rollouts with an open gripper keep objects stationary across predicted frames.
- Coordinate Sensitivity: Successful action inference requires accurate camera calibration; errors can be partially corrected offline using linear fits to random action trajectories (see the least-squares sketch after this list).
- Long-Horizon Limitations: Rollouts degrade in accuracy for long horizons, limiting planning reliability when multi-step predictions are required without intermediate subgoals.
- Goal Specification: Only image-goal planning is supported; plans conditioned on language goals are not enabled in this version, though future work may address this via LLM alignment.
- Planning Horizon: A horizon of $T=1$ (greedy action selection) sufficed for the demonstrated manipulation tasks; complex tasks without natural sub-goals will require longer-horizon planning and training regimes robust to long rollouts.
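The offline correction mentioned under "Coordinate Sensitivity" can be sketched as an ordinary least-squares fit. The assumption that miscalibration is well approximated by a single linear map, and all function names, are illustrative rather than the paper's exact procedure:

```python
import numpy as np

def fit_action_correction(commanded, observed):
    """commanded, observed: (N, 7) pairs of random commanded actions and
    measured end-effector state deltas. Fits observed ≈ commanded @ A."""
    A, *_ = np.linalg.lstsq(commanded, observed, rcond=None)
    return A  # (7, 7) linear map absorbing the calibration error

def correct_action(a_planned, A):
    # Pre-compensate: find the command whose realized delta matches the
    # plan, i.e. solve c @ A = a_planned (equivalently A^T c = a_planned).
    return np.linalg.solve(A.T, a_planned)
```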
7. Broader Implications
The development and empirical validation of action-conditioned post-training for V-JEPA 2 establishes the viability of leveraging web-scale visual pretraining for downstream robotic world modeling and planning with minimal physical interaction data. A plausible implication is the reduction of the data-collection burden for real-world deployment of robotic systems, as strong zero-shot transfer emerges from this approach (Assran et al., 11 Jun 2025). The separation of internet-scale self-supervised pretraining and lightweight post-hoc adaptation delineates a new paradigm in scalable robot learning pipelines. Limitations regarding long-horizon accuracy and the lack of language-conditioned planning suggest ongoing research is required to address more complex, instruction-driven, or temporally extended tasks.