
V-JEPA 2-AC: Action-Conditioned Post-training

Updated 27 November 2025
  • The paper demonstrates that post-training a large video model with minimal real robot data achieves accurate zero-shot planning and control without task-specific fine-tuning.
  • The methodology integrates a frozen image encoder with a new action-conditioned predictor transformer using block-causal attention and dual loss functions to optimize latent video predictions.
  • Empirical evaluations reveal that V-JEPA 2-AC outperforms baselines in grasp, reach, and pick-and-place tasks while planning faster and generalizing robustly to unseen scenarios.

Action-Conditioned Post-training (V-JEPA 2-AC) is a methodology for adapting a large, action-free, self-supervised video world model to robotic planning and control tasks by latent post-training on a small collection of real robot interaction data. Developed as an extension of V-JEPA 2—a model pretrained on over a million hours of internet-scale video—V-JEPA 2-AC demonstrates that self-supervised learning at web scale, when suitably adapted, enables accurate prediction and zero-shot planning in the physical world without environment- or task-specific fine-tuning (Assran et al., 11 Jun 2025).

1. Architecture and Model Design

V-JEPA 2-AC is obtained by attaching a new action-conditioned predictor transformer $P_\phi$ to the frozen, action-free V-JEPA 2 video encoder $E(\cdot)$ and post-training only the predictor. The architecture incorporates the following features:

  • Input Representation: Each timestep $k$ includes
    • Visual encoding $z_k = E(x_k)$, with $x_k$ the RGB image,
    • End-effector state $s_k \in \mathbb{R}^7$,
    • Action $a_k \in \mathbb{R}^7$.
  • Predictor Structure: $P_\phi$ is a 24-layer, 16-head transformer with 1024 hidden units and GELU non-linearities. Distinct linear “input heads” project encoded patches, states, and actions to the predictor space. Temporal and spatial positional encoding is provided via RoPE (3D for image patches, 1D for actions/states).
  • Attention Mechanics: Block-causal attention lets tokens at time $k$ attend to all tokens at times $t \leq k$ (a mask sketch follows this list).
  • Forward Computation: For a sequence $(z_t, s_t, a_t)_{t \leq k}$, the predicted next video representation is

$$\hat z_{k+1} = P_\phi\big( (z_t, s_t, a_t)_{t=1,\dots,k} \big) \in \mathbb{R}^{H \times W \times D}.$$

  • Projection: Predictor outputs are linearly projected back to the encoder latent dimension.
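
The block-causal pattern can be realized as a boolean attention mask over the flattened per-timestep token sequence. Below is a minimal sketch, assuming each timestep contributes a fixed number of tokens (patch tokens plus one state and one action token); the token counts and helper name are illustrative, not the released implementation.

```python
import torch

def block_causal_mask(num_steps: int, tokens_per_step: int) -> torch.Tensor:
    # Timestep index of every token in the flattened sequence.
    step_ids = torch.arange(num_steps).repeat_interleave(tokens_per_step)
    # Entry [q, k] is True when query token q may attend to key token k,
    # i.e. when the key's timestep is not later than the query's timestep.
    return step_ids[None, :] <= step_ids[:, None]

# Example: 3 timesteps, each with 4 patch tokens + 1 state token + 1 action token.
mask = block_causal_mask(num_steps=3, tokens_per_step=6)
print(mask.shape)  # torch.Size([18, 18])
```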

2. Loss Functions for Post-training

Action-conditioned post-training employs two losses to optimize the prediction in latent video-space:

  • Teacher-Forcing Loss (One Step):

$$\mathcal{L}_{\rm TF}(\phi) = \frac{1}{T}\sum_{k=1}^{T} \left\lVert P_\phi\big( (z_t, s_t, a_t)_{t \leq k} \big) - z_{k+1} \right\rVert_1$$

  • Short-Rollout Loss (Two Steps):

$$\mathcal{L}_{\rm R}(\phi) = \left\lVert P_\phi(a_{1:2}, s_1, z_1) - z_3 \right\rVert_1$$

where the predictor’s step-2 input includes its own previous prediction.

  • Total Loss:

$$\mathcal{L}(\phi) = \mathcal{L}_{\rm TF}(\phi) + \mathcal{L}_{\rm R}(\phi)$$

No additional regularization or adversarial terms are used; an $L_1$ distance in the latent space suffices.
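
As a concrete illustration, the two losses can be written compactly against a frozen-encoder latent sequence. The sketch below assumes a hypothetical pred(latents, states, actions) interface that returns the next latent; it is not the released training code, and mean absolute error stands in for the $L_1$ norm up to a scale factor.

```python
import torch
import torch.nn.functional as F

def post_training_loss(pred, z, s, a):
    """z: (T+1, N, D) frozen-encoder latents; s: (T, 7) states; a: (T, 7) actions."""
    T = a.shape[0]
    # Teacher-forcing loss: one-step predictions from ground-truth prefixes.
    loss_tf = torch.stack([
        F.l1_loss(pred(z[:k], s[:k], a[:k]), z[k]) for k in range(1, T + 1)
    ]).mean()
    # Two-step rollout loss: the second step consumes the model's own prediction,
    # mirroring P_phi(a_{1:2}, s_1, z_1) with only the first state provided.
    z2_hat = pred(z[:1], s[:1], a[:1])
    z3_hat = pred(torch.stack([z[0], z2_hat]), s[:1], a[:2])
    return loss_tf + F.l1_loss(z3_hat, z[2])
```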

3. Data Regime and Preprocessing

Action-conditioned post-training uses the DROID dataset:

  • ~62 hours of teleoperated Franka Panda robot trajectories, left camera RGB only, 4 fps, $256 \times 256$ px, clips of 3–4 seconds.
  • State $s_k \in \mathbb{R}^7$: 3D Cartesian position, 3D Euler orientation, 1D gripper open/close.
  • Actions $a_k = s_{k+1} - s_k$.
  • Preprocessing includes discarding sub-4 s clips and random resize-crop (aspect ratio in $[0.75, 1.35]$); see the sketch after this list.
  • Training batches comprise 16 clips $\times$ 16 frames per GPU, with a global batch size of 256.
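
A minimal preprocessing sketch under the conventions above (delta actions from consecutive states, discarding short clips); the array layout and 4 fps value come from the list, while the function and variable names are assumptions.

```python
import numpy as np

def preprocess_trajectory(states: np.ndarray, fps: float = 4.0):
    """states: (K, 7) end-effector readings (xyz position, Euler angles, gripper)."""
    if states.shape[0] / fps < 4.0:        # discard clips shorter than 4 seconds
        return None
    actions = states[1:] - states[:-1]     # a_k = s_{k+1} - s_k
    return states[:-1], actions            # aligned (K-1, 7) states and actions
```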

4. Planning and Control with V-JEPA 2-AC

Goal-driven planning employs the trained predictor as a latent world model:

  • Encode current and goal observations: $z_k = E(x_k)$, $z_g = E(x_g)$.
  • Optimize a horizon-$T$ action sequence $\hat a_{1:T}$ to minimize

$$\mathcal{E}(\hat a_{1:T}; z_k, s_k, z_g) = \left\lVert P_\phi(\hat a_{1:T}; s_k, z_k) - z_g \right\rVert_1$$

  • Planning selects:

$$a^*_{1:T} = \arg\min_{\hat a_{1:T}} \mathcal{E}(\hat a_{1:T}; z_k, s_k, z_g)$$

  • Optimization is performed via the Cross-Entropy Method (CEM):

Input: P_\phi, z_g, (z_k, s_k), horizon T, iterations M, N samples/iteration, top-K
Initialize μ^{(1)}_{1:T} ← 0, Σ^{(1)}_{1:T} ← I
for i = 1 to M:
    Sample N action sequences from N(μ^{(i)}, Σ^{(i)})
    Compute energies E_j for each sampled sequence
    Update μ^{(i+1)}, Σ^{(i+1)} from the top-K lowest-energy sequences
Return μ^{(M+1)}_{1:T}
At each interaction cycle, only the first action is executed, then replanning occurs with the latest observation and state.
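
A runnable version of this planning loop might look like the following, assuming a hypothetical energy(actions) callable that evaluates $\lVert P_\phi(\hat a_{1:T}; s_k, z_k) - z_g \rVert_1$; the hyperparameter defaults are placeholders rather than the paper's settings.

```python
import numpy as np

def cem_plan(energy, horizon=1, action_dim=7, iters=10, samples=64, top_k=8):
    """Cross-Entropy Method over action sequences of shape (horizon, action_dim)."""
    mu = np.zeros((horizon, action_dim))
    sigma = np.ones((horizon, action_dim))
    for _ in range(iters):
        # Sample candidate action sequences from the current Gaussian.
        candidates = mu + sigma * np.random.randn(samples, horizon, action_dim)
        scores = np.array([energy(a) for a in candidates])
        # Refit the Gaussian to the top-K lowest-energy sequences.
        elites = candidates[np.argsort(scores)[:top_k]]
        mu, sigma = elites.mean(axis=0), elites.std(axis=0) + 1e-6
    return mu  # receding horizon: only mu[0] is executed before replanning
```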

5. Empirical Evaluation

Experiments are conducted on two real Franka Panda setups in labs unseen during post-training. Evaluation comprises four main tasks (10 trials each):

  • Reach: move to a visual goal with no object.
  • Grasp: pick up a cup or box; single image goal.
  • Reach with Object: move to a new pose while holding an object.
  • Pick-and-Place: multi-stage, with three sub-goals: pre-grasp, lift, and place.

Results (averaged over both labs) demonstrate superior zero-shot performance of V-JEPA 2-AC over task-specific baselines—Octo (vision-language-action behavioral cloning, fine-tuned with hindsight relabeling) and Cosmos (diffusion video-generation world model, fine-tuned)—especially for manipulation tasks requiring object generalization:

| Method | Reach (no obj) | Grasp (Cup) | Grasp (Box) | Reach w/ Obj (Cup) | Reach w/ Obj (Box) | Pick-Place (Cup) | Pick-Place (Box) |
|---|---|---|---|---|---|---|---|
| Octo | 100% | 15% | 0% | 15% | 70% | 15% | 10% |
| V-JEPA 2-AC | 100% | 65% | 25% | 75% | 75% | 80% | 65% |

Efficiency comparison with Cosmos (Lab 2) shows that V-JEPA 2-AC plans roughly 15× faster (16 s/action vs. 4 min/action), with equal or superior success rates for both "Grasp" and "Pick-and-Place". A planning horizon of $T = 1$ suffices for these tasks, indicating greedy planning is adequate for short-horizon objectives.

6. Observed Properties and Limitations

Key qualitative and quantitative observations about V-JEPA 2-AC include:

  • Zero-shot Generalization: Robust performance on previously unseen robots, objects, and backgrounds; no fine-tuning in deployment locations.
  • Energy Landscape: The goal-conditioned latent energy function is locally convex, enabling effective CEM optimization.
  • Implicit Physical Constraints: The model internalizes object constancy; for instance, rollouts with an open gripper keep objects stationary across predicted frames.
  • Coordinate Sensitivity: Successful action inference requires accurate camera calibration; errors can be partially corrected offline using linear fits to random action trajectories (a fitting sketch follows this list).
  • Long-Horizon Limitations: Rollouts degrade in accuracy for long horizons, limiting planning reliability when multi-step predictions are required without intermediate subgoals.
  • Goal Specification: Only image-goal planning is supported; plans conditioned on language goals are not enabled in this version, though future work may address this via LLM alignment.
  • Planning Horizon: A horizon of $T = 1$ (greedy action) sufficed for the demonstrated manipulation tasks; complex tasks without natural sub-goals will require longer-horizon planning and rollout-robust training regimes.
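
To illustrate the coordinate-sensitivity point above, the offline correction could take the form of a least-squares linear fit between commanded Cartesian deltas and the displacements actually observed; this sketch is an assumption about the procedure's shape, not the paper's exact recipe.

```python
import numpy as np

def fit_action_correction(commanded: np.ndarray, observed: np.ndarray) -> np.ndarray:
    """commanded, observed: (N, 3) Cartesian deltas from random action trajectories."""
    # Least-squares fit of observed ≈ commanded @ A (A captures the frame mismatch).
    A, *_ = np.linalg.lstsq(commanded, observed, rcond=None)
    return A

def correct_action(desired_delta: np.ndarray, A: np.ndarray) -> np.ndarray:
    # Solve commanded @ A = desired_delta for the command to actually send.
    return np.linalg.solve(A.T, desired_delta)
```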

7. Broader Implications

The development and empirical validation of action-conditioned post-training for V-JEPA 2 establishes the viability of leveraging web-scale visual pretraining for downstream robotic world modeling and planning with minimal physical interaction data. A plausible implication is the reduction of the data-collection burden for real-world deployment of robotic systems, as strong zero-shot transfer emerges from this approach (Assran et al., 11 Jun 2025). The separation of internet-scale self-supervised pretraining and lightweight post-hoc adaptation delineates a new paradigm in scalable robot learning pipelines. Limitations regarding long-horizon accuracy and the lack of language-conditioned planning suggest ongoing research is required to address more complex, instruction-driven, or temporally extended tasks.

References

  • Assran, M., et al. "V-JEPA 2: Self-Supervised Video Models Enable Understanding, Prediction and Planning." arXiv, 11 Jun 2025.
