
Vidarc: Embodied Video Diffusion for Robotic Control

Updated 3 February 2026
  • The paper presents an embodied video diffusion model that achieves significant improvements in success rate and inference latency for robotic manipulation.
  • It leverages an autoregressive diffusion backbone with a masked inverse dynamics module and KV-caching to enable efficient closed-loop control.
  • Robust empirical evaluations and ablations demonstrate enhanced generalization and performance across diverse robotic platforms.

Vidarc (Video Diffusion for Action Reasoning and Closed-loop Control) is an embodied video diffusion model engineered for robotic manipulation in settings characterized by limited data, complex embodiment dynamics, and high demands for temporal and physical reasoning. Distinct from prior video-based world models, Vidarc directly addresses embodiment-specific closed-loop control by integrating an autoregressive video diffusion backbone, a masked inverse dynamics module, and specialized inference mechanisms (KV-caching and re-prefill). The approach demonstrates significant improvements over state-of-the-art baselines in both success rate and inference latency, and exhibits robust generalization across robotic platforms (Feng et al., 19 Dec 2025).

1. Model Architecture and Components

1.1 Autoregressive Video Diffusion Backbone

The central component is an autoregressive transformer-based conditional diffusion model $G$. At each timestep $t$, $G$ predicts the next image frame $x_{t+1}$ conditioned on a natural language instruction $\ell$ and all previous ground-truth observations $o_1, \ldots, o_t$. Conditional diffusion is realized via a flow-matching ODE formulation: the model learns a vector field

v_\theta(x_t, t; c, x_{\mathrm{prev}}) \approx x_1 - x_0,

where

  • $x_1$ is the target (clean) frame,
  • $x_0 \sim \mathcal{N}(0, I)$ is Gaussian noise,
  • $x_t = t\,x_1 + (1-t)\,x_0$, with $t \in [0,1]$ normalized,
  • $c = [\ell, \text{embedding}(o_1, \ldots, o_t)]$,
  • $x_{\mathrm{prev}}$ are the denoised earlier frames.

Inference involves solving

\frac{dx_t}{dt} = v_\theta(x_t, t; c, x_{\mathrm{prev}})

autoregressively to synthesize future frames.
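Sampling from the flow-matching ODE can be sketched with a fixed-step Euler integrator; `v_theta` is a stand-in for the learned vector field (conditioning on $c$ and $x_{\mathrm{prev}}$ is omitted here for brevity):

```python
import numpy as np

def euler_flow_sample(v_theta, x0, n_steps=20):
    """Integrate dx/dt = v_theta(x, t) from t = 0 (noise) to t = 1 (clean frame)
    with fixed-step Euler, the simplest flow-matching ODE solver."""
    x = x0.copy()
    dt = 1.0 / n_steps
    for k in range(n_steps):
        x = x + dt * v_theta(x, k * dt)
    return x

# Toy check: when the field is the constant x1 - x0, Euler lands on x1 exactly.
x0 = np.zeros(4)
x1 = np.array([1.0, -2.0, 0.5, 3.0])
x_hat = euler_flow_sample(lambda x, t: x1 - x0, x0)
```

In practice a small number of solver steps suffices (the paper reports as few as 5 diffusion steps in real-world deployment).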

1.2 Masked Inverse Dynamics Module (MIDM)

Actions are inferred from predicted frames using a two-stage masked inverse dynamics model:

  • The mask predictor $U(x)$ produces a soft mask $m \in [0,1]^{3 \times H \times W}$, targeting robot-arm pixels.
  • The action regressor $R$ receives the frame multiplied elementwise by the hard mask, outputting the predicted action vector $\hat{a} = R(\text{round}(m) \odot x)$.

The MIDM is trained to minimize

\mathcal{L}_\text{action} = \mathbb{E}_{x,a}\big[\ell(\hat{a} - a) + \lambda \|m\|_1\big],

where $\ell(\cdot)$ is the Huber loss and the $L_1$ mask penalty term sparsifies the mask.
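A minimal numpy sketch of this two-term objective and the hard-mask step (function names and shapes are illustrative, not from the paper):

```python
import numpy as np

def huber(r, delta=1.0):
    """Elementwise Huber loss on a residual r."""
    a = np.abs(r)
    return np.where(a <= delta, 0.5 * r ** 2, delta * (a - 0.5 * delta))

def midm_loss(a_hat, a, m, lam=3e-3):
    """Huber action-regression term plus the L1 sparsity penalty on the soft mask m."""
    return huber(a_hat - a).sum() + lam * np.abs(m).sum()

# Hard mask applied to the frame before action regression, as in round(m) ⊙ x:
m = np.array([[0.9, 0.1], [0.6, 0.2]])   # soft mask values in [0, 1]
x = np.ones((2, 2))                      # toy "frame"
x_masked = np.round(m) * x               # binarize, then gate the pixels
```

With zero action residual the loss reduces to the sparsity penalty $\lambda \|m\|_1$, which is what pushes the mask toward covering only arm pixels.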

1.3 Cached Autoregressive Generation and Re-prefill

Vidarc employs KV-caching to avoid recomputation of key/value (KV) pairs for previous frames in the transformer. For closed-loop feedback, "re-prefill" refreshes only the last $n$ KV-cache entries upon new environment observations, dramatically reducing the overhead from $O(t)$ to $O(n)$.
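The re-prefill bookkeeping can be sketched as follows; `KVCache`, `prefill_fn`, and the string-valued cache entries are illustrative stand-ins, not the paper's API:

```python
class KVCache:
    """Toy per-frame KV cache. A full re-prefill would recompute all t entries;
    re-prefill in this style touches only the last n (O(n) instead of O(t))."""
    def __init__(self, entries=None):
        self.entries = list(entries or [])

    def extend(self, new_entries):
        self.entries.extend(new_entries)

    def reprefill(self, observations, prefill_fn, n):
        del self.entries[-n:]                          # invalidate last n frames only
        self.entries.extend(prefill_fn(observations))  # recompute from ground truth

cache = KVCache(f"kv(pred_{i})" for i in range(8))
cache.reprefill(["obs_0", "obs_1"],
                lambda obs: [f"kv({o})" for o in obs], n=2)
```

Only the tail of the cache is recomputed from real observations; all earlier entries stay valid, which is what keeps closed-loop feedback cheap.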

2. Mathematical Formulation

2.1 Diffusion Processes

  • Forward noising: $x_t = t\,x_1 + (1-t)\,x_0$; alternatively, as a discrete-time Markov chain,

q(x_t \mid x_{t-1}) = \mathcal{N}(x_t; \sqrt{\alpha_t}\, x_{t-1}, \beta_t I)

  • Reverse model:

p_\theta(x_{t-1} \mid x_t, a_{1:t}) \propto \exp\big( -\| v_\theta(x_t, t; c, x_{\mathrm{prev}}) - (x_1 - x_0) \|^2 \big)

with $c$ incorporating both instructions and past action encodings.

2.2 Conditioning on Actions

At timestep $t$, the model predicts frame $\hat{x}_{t+1}$, infers action $\hat{a}_t$ from it via the MIDM, and cross-attends to these actions in the context for subsequent time steps, thereby informing $v_\theta$ about intended controls.
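The action-conditioning step amounts to cross-attention from frame tokens onto action embeddings. A single-head numpy sketch (dimensions and names are assumptions for illustration):

```python
import numpy as np

def cross_attend(queries, keys, values):
    """Single-head cross-attention: frame tokens (queries) attend over
    action embeddings (keys/values) to inject intended controls."""
    scores = queries @ keys.T / np.sqrt(queries.shape[-1])
    scores -= scores.max(axis=-1, keepdims=True)     # numerical stability
    weights = np.exp(scores)
    weights /= weights.sum(axis=-1, keepdims=True)   # rows sum to 1
    return weights @ values

rng = np.random.default_rng(0)
frame_tokens = rng.normal(size=(5, 8))   # queries from the predicted frame
action_embs = rng.normal(size=(3, 8))    # keys/values from inferred actions
ctx = cross_attend(frame_tokens, action_embs, action_embs)
```

Each frame token receives a convex combination of the action embeddings, so the predicted controls flow back into the context for the next generation step.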

2.3 Multi-task Training Loss

  • Embodiment-aware video loss:

\mathcal{L}_\text{video} = \mathbb{E}_{x_0, x_1, t, c, x_{\mathrm{prev}}} \left[\big\|(1+\eta\, U(x_1)) \odot \big(v_\theta(x_t, t; c, x_{\mathrm{prev}}) - (x_1 - x_0)\big)\big\|_2^2\right]

with $\eta$ modulating the weighting on action-relevant pixels.

  • IDM loss: as specified above.
  • Total: $\mathcal{L}_\text{total} = \mathcal{L}_\text{video} + \mathcal{L}_\text{IDM}$.
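The embodiment-aware weighting can be sketched independently of sign conventions by operating on the flow-matching residual (`residual` = prediction minus target); names are illustrative:

```python
import numpy as np

def embodiment_weighted_loss(residual, mask_u, eta=3.0):
    """Embodiment-aware weighted MSE: mask_u stands in for U(x1), which
    highlights robot-arm pixels; eta upweights errors on those pixels."""
    w = 1.0 + eta * mask_u
    return np.mean((w * residual) ** 2)

# With a zero mask the loss reduces to plain MSE on the residual.
plain = embodiment_weighted_loss(np.full((2, 2), 2.0), np.zeros((2, 2)))
arm = embodiment_weighted_loss(np.full((2, 2), 2.0), np.ones((2, 2)), eta=3.0)
```

With $\eta = 3$ and the mask fully on, per-pixel errors are scaled by $1 + 3 = 4$, so squared error grows 16-fold on arm pixels relative to background.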

3. Training Regimen and Optimization

3.1 Dataset Construction and Preprocessing

  • Pre-training: $\sim 1$ million video episodes from Egodex, Agibot, RDT, RoboMind (>4 sources) with varied embodiments and camera setups.
  • Fine-tuning: 1,000 RoboTwin simulation episodes (Aloha robot), and 2,307 real-world episodes on Aloha hardware over 219 tasks.
  • Preprocessing: downsample to 10 fps, resize to $736 \times 640$, use classifier-free guidance ($\ell$ is dropped with probability 0.1), and standard augmentations.
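The classifier-free-guidance dropout during training amounts to randomly nulling the instruction; a minimal sketch (the null token and function name are assumptions):

```python
import random

def maybe_drop_instruction(instruction, p_drop=0.1, rng=random):
    """Classifier-free guidance training trick: with probability p_drop,
    replace the language instruction with the empty (null) condition so the
    model also learns an unconditional generation pathway."""
    return "" if rng.random() < p_drop else instruction

random.seed(0)
kept = sum(maybe_drop_instruction("place the cup on the plate") != ""
           for _ in range(10000))  # roughly 90% of samples keep the instruction
```

At inference, the conditional and unconditional pathways can then be blended to trade instruction adherence against sample diversity.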

3.2 Hyperparameter Choices

| Component | Parameter Count | Optimizer / LR | Batch Size | Training Regime |
| --- | --- | --- | --- | --- |
| Video diffusion transformer | ~5B | AdamW, 2e-5 | 128 | 10k pretrain, 4k finetune |
| Masked IDM | ~92M | AdamW, 5e-4, $\lambda = 3\mathrm{e}{-3}$ | 128 | 60k steps |

  • Sampling: chunks of $n = 16$ frames; diffusion steps $T = 20$ (sim), $T = 5$ (real world).

4. Closed-Loop Control Mechanism

At each control-loop iteration $t$:

  1. Generate the next $n$ predicted frames: $\hat{y}_{t+1:t+n} = G(\ell, \text{KV}_\text{cache})$.
  2. Infer the current action $a_t = I(\hat{y}_{t+1})$ via the MIDM.
  3. Execute $a_t$ on the robot and receive $o_{t+1}$ from the environment.
  4. Update the KV cache by replacing the last $n$ entries with ground-truth observation prefills for efficient real-time feedback.

The control loop is formalized as follows:

Initialize KV_cache = G.prefill(o_1)
for t = 1, ..., T:
    ŷ_seq   = G.generate_chunk(ℓ, KV_cache)      # autoregressive with KV caching
    a_t     = I(ŷ_seq[1])                        # inverse dynamics on first frame
    o_{t+1} = Robot.execute(a_t)                 # environment feedback
    KV_cache.pop(n)                              # drop the last n cached entries
    KV_cache.extend(G.prefill(o_{t+1:t+n}))      # re-prefill from observations
end
This unrolling ensures each generated chunk can be efficiently re-grounded in actual observations to maintain closed-loop fidelity.
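The loop above can be exercised end-to-end with stubbed components; every class and function name here is illustrative, not from the paper:

```python
class StubGenerator:
    """Stand-in for the video diffusion model G."""
    def prefill(self, observations):
        return [f"kv({o})" for o in observations]

    def generate_chunk(self, instruction, kv_cache, n=4):
        # Pretend to autoregressively decode n frames from the cached context.
        return [f"frame_{len(kv_cache) + i}" for i in range(n)]

def stub_idm(frame):
    return f"action_for({frame})"        # stand-in for the MIDM I(.)

def stub_robot(action):
    return f"obs_after({action})"        # stand-in environment step

def control_loop(G, I, robot, instruction, o0, steps=3, n=4):
    kv = list(G.prefill([o0]))
    actions = []
    for _ in range(steps):
        frames = G.generate_chunk(instruction, kv, n)  # 1. predict next n frames
        a = I(frames[0])                               # 2. infer current action
        o = robot(a)                                   # 3. execute, observe
        del kv[-min(n, len(kv)):]                      # 4. re-prefill: drop tail...
        kv.extend(G.prefill([o]))                      #    ...recompute from obs
        actions.append(a)
    return actions

acts = control_loop(StubGenerator(), stub_idm, stub_robot, "pick up the cup", "o0")
```

Swapping the stubs for real models leaves the loop structure unchanged: the cache surgery in step 4 is the only coupling between generation and environment feedback.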

5. Empirical Evaluation and Ablations

5.1 Success Rates

| Setting | Pi0.5 (%) | Vidar (%) | Vidarc (%) |
| --- | --- | --- | --- |
| RoboTwin (sim, 14 tasks) | 52.9 | 71.1 | 80.7 |
| Real-world | 41.0 | 39.0 | 56.0 |

Vidarc achieves +9.6 percentage points over Vidar and +27.8 over Pi0.5 in simulated tasks; real-world deployment shows +17 pp over Vidar and +15 pp over Pi0.5.

5.2 Latency and Throughput

Measured on NVIDIA A100, for 64 frames:

| Metric | Pi0.5 | Vidar | Vidarc |
| --- | --- | --- | --- |
| Per-step latency (s) | 0.482 | 34.3 | 3.03 |
| End-to-end cost (s) | 5.76 | 34.3 | 24.2 |

Vidarc's latency is reduced by 91% relative to Vidar.

5.3 Ablation Insights

  • Without embodiment-aware weighting, Vidarc's simulation success rate drops from 80.7% to 74.6%.
  • Without closed-loop (open-loop imagination), success falls to 66.8%.
  • Mask strength $\eta \in \{0, 3, 10\}$: success rates of 74.6%, 80.7%, and 77.1%, indicating stability near $\eta = 3$.

6. Generalization and Robustness

6.1 Cross-Embodiment Generalization

Vidarc maintains robust performance when deployed to previously unseen objects, backgrounds, and robotic embodiments; only minimal recalibration of the mask predictor $U$ is required. In dynamic object relocation tests, Vidarc records a 40% success rate versus 0% for Vidar, demonstrating adaptive error correction in real time.

6.2 Error Correction and Failure Analysis

Open-loop diffusion models tend to accumulate drift in imagined video states, leading to control failure. Vidarc’s closed-loop inference with re-prefill corrects for execution-prediction mismatch at each cycle, re-grounding the model and enabling recovery from perturbations.


Vidarc fuses autoregressive, causal video diffusion modeling with masked action inference and closed-loop, efficient inference strategies. Its embodiment-aware loss and re-grounded inference loop result in marked improvements in both efficacy and computational efficiency for robotic control applications (Feng et al., 19 Dec 2025).
