
Vidarc: Embodied Video Diffusion for Robotic Control

Updated 3 February 2026
  • The paper presents an embodied video diffusion model that achieves significant improvements in success rate and inference latency for robotic manipulation.
  • It leverages an autoregressive diffusion backbone with a masked inverse dynamics module and KV-caching to enable efficient closed-loop control.
  • Robust empirical evaluations and ablations demonstrate enhanced generalization and performance across diverse robotic platforms.

Vidarc (Video Diffusion for Action Reasoning and Closed-loop Control) is an embodied video diffusion model engineered for robotic manipulation in settings characterized by limited data, complex embodiment dynamics, and high demands for temporal and physical reasoning. Distinct from prior video-based world models, Vidarc directly addresses embodiment-specific closed-loop control by integrating an autoregressive video diffusion backbone, a masked inverse dynamics module, and specialized inference mechanisms (KV-caching and re-prefill). The approach demonstrates significant improvements over state-of-the-art baselines in both success rate and inference latency, and exhibits robust generalization across robotic platforms (Feng et al., 19 Dec 2025).

1. Model Architecture and Components

1.1 Autoregressive Video Diffusion Backbone

The central component is an autoregressive transformer-based conditional diffusion model $G$. At each timestep $t$, $G$ predicts the next image frame $x_{t+1}$ conditioned on a natural language instruction $\ell$ and all previous ground-truth observations $o_1, \ldots, o_t$. Conditional diffusion is realized via a flow-matching ODE formulation: the model learns a vector field

v_\theta(x_t, t; c, x_{\mathrm{prev}}) \approx x_1 - x_0,

where

  • $x_1$ is the target (clean) frame,
  • $x_0 \sim \mathcal{N}(0, I)$ is Gaussian noise,
  • $x_t = t\,x_1 + (1-t)\,x_0$, with $t \in [0,1]$ normalized,
  • $c = [\ell, \text{embedding}(o_1, \ldots, o_t)]$,
  • $x_{\mathrm{prev}}$ are the denoised earlier frames.

Inference involves solving

\frac{dx_t}{dt} = v_\theta(x_t, t; c, x_{\mathrm{prev}})

autoregressively to synthesize future frames.
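Sampling from the flow-matching ODE can be sketched with a fixed-step Euler integrator; `v_theta` is a stand-in for the learned vector field (conditioning on $c$ and $x_{\mathrm{prev}}$ is omitted here for brevity):

```python
import numpy as np

def euler_flow_sample(v_theta, x0, n_steps=20):
    """Integrate dx/dt = v_theta(x, t) from t = 0 (noise) to t = 1 (clean frame)
    with fixed-step Euler, the simplest flow-matching ODE solver."""
    x = x0.copy()
    dt = 1.0 / n_steps
    for k in range(n_steps):
        x = x + dt * v_theta(x, k * dt)
    return x

# Toy check: when the field is the constant x1 - x0, Euler lands on x1 exactly.
x0 = np.zeros(4)
x1 = np.array([1.0, -2.0, 0.5, 3.0])
x_hat = euler_flow_sample(lambda x, t: x1 - x0, x0)
```

In practice a small number of solver steps suffices (the paper reports as few as 5 diffusion steps in real-world deployment).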

1.2 Masked Inverse Dynamics Module (MIDM)

Actions are inferred from predicted frames using a two-stage masked inverse dynamics model:

  • The mask predictor $U(x)$ produces a soft mask $m \in [0,1]^{3 \times H \times W}$, targeting robot-arm pixels.
  • The action regressor $R$ receives the frame multiplied elementwise by the hard mask, outputting the predicted action vector $\hat{a} = R(\text{round}(m) \odot x)$.

The MIDM is trained to minimize

\mathcal{L}_\text{action} = \mathbb{E}_{x,a}\big[\ell(\hat{a} - a) + \lambda \|m\|_1\big],

where $\ell(\cdot)$ is the Huber loss and the $L_1$ mask penalty term sparsifies the mask.
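A minimal numpy sketch of this two-term objective and the hard-mask step (function names and shapes are illustrative, not from the paper):

```python
import numpy as np

def huber(r, delta=1.0):
    """Elementwise Huber loss on a residual r."""
    a = np.abs(r)
    return np.where(a <= delta, 0.5 * r ** 2, delta * (a - 0.5 * delta))

def midm_loss(a_hat, a, m, lam=3e-3):
    """Huber action-regression term plus the L1 sparsity penalty on the soft mask m."""
    return huber(a_hat - a).sum() + lam * np.abs(m).sum()

# Hard mask applied to the frame before action regression, as in round(m) ⊙ x:
m = np.array([[0.9, 0.1], [0.6, 0.2]])   # soft mask values in [0, 1]
x = np.ones((2, 2))                      # toy "frame"
x_masked = np.round(m) * x               # binarize, then gate the pixels
```

With zero action residual the loss reduces to the sparsity penalty $\lambda \|m\|_1$, which is what pushes the mask toward covering only arm pixels.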

1.3 Cached Autoregressive Generation and Re-prefill

Vidarc employs KV-caching to avoid recomputation of key/value (KV) pairs for previous frames in the transformer. For closed-loop feedback, "re-prefill" refreshes only the last $n$ KV-cache entries upon new environment observations, dramatically reducing the overhead from $O(t)$ to $O(n)$.
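The re-prefill bookkeeping can be sketched as follows; `KVCache`, `prefill_fn`, and the string-valued cache entries are illustrative stand-ins, not the paper's API:

```python
class KVCache:
    """Toy per-frame KV cache. A full re-prefill would recompute all t entries;
    re-prefill in this style touches only the last n (O(n) instead of O(t))."""
    def __init__(self, entries=None):
        self.entries = list(entries or [])

    def extend(self, new_entries):
        self.entries.extend(new_entries)

    def reprefill(self, observations, prefill_fn, n):
        del self.entries[-n:]                          # invalidate last n frames only
        self.entries.extend(prefill_fn(observations))  # recompute from ground truth

cache = KVCache(f"kv(pred_{i})" for i in range(8))
cache.reprefill(["obs_0", "obs_1"],
                lambda obs: [f"kv({o})" for o in obs], n=2)
```

Only the tail of the cache is recomputed from real observations; all earlier entries stay valid, which is what keeps closed-loop feedback cheap.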

2. Mathematical Formulation

2.1 Diffusion Processes

  • Forward noising: $x_t = t\,x_1 + (1-t)\,x_0$; alternatively, as a discrete-time Markov chain,

q(x_t \mid x_{t-1}) = \mathcal{N}(x_t; \sqrt{\alpha_t}\, x_{t-1}, \beta_t I)

  • Reverse model:

p_\theta(x_{t-1} \mid x_t, a_{1:t}) \propto \exp\big( -\| v_\theta(x_t, t; c, x_{\mathrm{prev}}) - (x_1 - x_0) \|^2 \big)

with $c$ incorporating both instructions and past action encodings.

2.2 Conditioning on Actions

At timestep $t$, the model predicts frame $\hat{x}_{t+1}$, infers action $\hat{a}_t$ from it via the MIDM, and cross-attends to these actions in the context for subsequent time steps, thereby informing $v_\theta$ about intended controls.
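The action-conditioning step amounts to cross-attention from frame tokens onto action embeddings. A single-head numpy sketch (dimensions and names are assumptions for illustration):

```python
import numpy as np

def cross_attend(queries, keys, values):
    """Single-head cross-attention: frame tokens (queries) attend over
    action embeddings (keys/values) to inject intended controls."""
    scores = queries @ keys.T / np.sqrt(queries.shape[-1])
    scores -= scores.max(axis=-1, keepdims=True)     # numerical stability
    weights = np.exp(scores)
    weights /= weights.sum(axis=-1, keepdims=True)   # rows sum to 1
    return weights @ values

rng = np.random.default_rng(0)
frame_tokens = rng.normal(size=(5, 8))   # queries from the predicted frame
action_embs = rng.normal(size=(3, 8))    # keys/values from inferred actions
ctx = cross_attend(frame_tokens, action_embs, action_embs)
```

Each frame token receives a convex combination of the action embeddings, so the predicted controls flow back into the context for the next generation step.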

2.3 Multi-task Training Loss

  • Embodiment-aware video loss:

\mathcal{L}_\text{video} = \mathbb{E}_{x_0, x_1, t, c, x_{\mathrm{prev}}} \left[\big\|(1+\eta\, U(x_1)) \odot \big(v_\theta(x_t, t; c, x_{\mathrm{prev}}) - (x_1 - x_0)\big)\big\|_2^2\right]

with $\eta$ modulating the weighting on action-relevant pixels.

  • IDM loss: as specified above.
  • Total: $\mathcal{L}_\text{total} = \mathcal{L}_\text{video} + \mathcal{L}_\text{IDM}$.
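The embodiment-aware weighting can be sketched independently of sign conventions by operating on the flow-matching residual (`residual` = prediction minus target); names are illustrative:

```python
import numpy as np

def embodiment_weighted_loss(residual, mask_u, eta=3.0):
    """Embodiment-aware weighted MSE: mask_u stands in for U(x1), which
    highlights robot-arm pixels; eta upweights errors on those pixels."""
    w = 1.0 + eta * mask_u
    return np.mean((w * residual) ** 2)

# With a zero mask the loss reduces to plain MSE on the residual.
plain = embodiment_weighted_loss(np.full((2, 2), 2.0), np.zeros((2, 2)))
arm = embodiment_weighted_loss(np.full((2, 2), 2.0), np.ones((2, 2)), eta=3.0)
```

With $\eta = 3$ and the mask fully on, per-pixel errors are scaled by $1 + 3 = 4$, so squared error grows 16-fold on arm pixels relative to background.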

3. Training Regimen and Optimization

3.1 Dataset Construction and Preprocessing

  • Pre-training: $\sim 1$ million video episodes from Egodex, Agibot, RDT, RoboMind (>4 sources) with varied embodiments and camera setups.
  • Fine-tuning: 1,000 RoboTwin simulation episodes (Aloha robot), and 2,307 real-world episodes on Aloha hardware over 219 tasks.
  • Preprocessing: downsample to 10 fps, resize to $736 \times 640$, use classifier-free guidance ($\ell$ is dropped with probability 0.1), and standard augmentations.
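The classifier-free-guidance dropout during training amounts to randomly nulling the instruction; a minimal sketch (the null token and function name are assumptions):

```python
import random

def maybe_drop_instruction(instruction, p_drop=0.1, rng=random):
    """Classifier-free guidance training trick: with probability p_drop,
    replace the language instruction with the empty (null) condition so the
    model also learns an unconditional generation pathway."""
    return "" if rng.random() < p_drop else instruction

random.seed(0)
kept = sum(maybe_drop_instruction("place the cup on the plate") != ""
           for _ in range(10000))  # roughly 90% of samples keep the instruction
```

At inference, the conditional and unconditional pathways can then be blended to trade instruction adherence against sample diversity.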

3.2 Hyperparameter Choices

| Component | Parameter Count | Optimizer / LR | Batch Size | Training Regime |
| --- | --- | --- | --- | --- |
| Video diffusion transformer | ~5B | AdamW, 2e-5 | 128 | 10k pretrain, 4k finetune |
| Masked IDM | ~92M | AdamW, 5e-4, $\lambda = 3\mathrm{e}{-3}$ | 128 | 60k steps |

  • Sampling: chunks of $n = 16$ frames; diffusion steps $T = 20$ (sim), $T = 5$ (real world).

4. Closed-Loop Control Mechanism

At each control-loop iteration $t$:

  1. Generate the next $n$ predicted frames: $\hat{y}_{t+1:t+n} = G(\ell, \text{KV}_\text{cache})$.
  2. Infer the current action $a_t = I(\hat{y}_{t+1})$ via the MIDM.
  3. Execute $a_t$ on the robot and receive $o_{t+1}$ from the environment.
  4. Update the KV cache by replacing the last $n$ entries with ground-truth observation prefills for efficient real-time feedback.

The control loop is formalized as follows:

Initialize KV_cache = G.prefill(o_1)
for t = 1, ..., T:
    ŷ_seq   = G.generate_chunk(ℓ, KV_cache)      # autoregressive with KV caching
    a_t     = I(ŷ_seq[1])                        # inverse dynamics on first frame
    o_{t+1} = Robot.execute(a_t)                 # environment feedback
    KV_cache.pop(n)                              # drop the last n cached entries
    KV_cache.extend(G.prefill(o_{t+1:t+n}))      # re-prefill from observations
end
This unrolling ensures each generated chunk can be efficiently re-grounded in actual observations to maintain closed-loop fidelity.
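The loop above can be exercised end-to-end with stubbed components; every class and function name here is illustrative, not from the paper:

```python
class StubGenerator:
    """Stand-in for the video diffusion model G."""
    def prefill(self, observations):
        return [f"kv({o})" for o in observations]

    def generate_chunk(self, instruction, kv_cache, n=4):
        # Pretend to autoregressively decode n frames from the cached context.
        return [f"frame_{len(kv_cache) + i}" for i in range(n)]

def stub_idm(frame):
    return f"action_for({frame})"        # stand-in for the MIDM I(.)

def stub_robot(action):
    return f"obs_after({action})"        # stand-in environment step

def control_loop(G, I, robot, instruction, o0, steps=3, n=4):
    kv = list(G.prefill([o0]))
    actions = []
    for _ in range(steps):
        frames = G.generate_chunk(instruction, kv, n)  # 1. predict next n frames
        a = I(frames[0])                               # 2. infer current action
        o = robot(a)                                   # 3. execute, observe
        del kv[-min(n, len(kv)):]                      # 4. re-prefill: drop tail...
        kv.extend(G.prefill([o]))                      #    ...recompute from obs
        actions.append(a)
    return actions

acts = control_loop(StubGenerator(), stub_idm, stub_robot, "pick up the cup", "o0")
```

Swapping the stubs for real models leaves the loop structure unchanged: the cache surgery in step 4 is the only coupling between generation and environment feedback.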

5. Empirical Evaluation and Ablations

5.1 Success Rates

| Setting | Pi0.5 (%) | Vidar (%) | Vidarc (%) |
| --- | --- | --- | --- |
| RoboTwin (sim, 14 tasks) | 52.9 | 71.1 | 80.7 |
| Real-world | 41.0 | 39.0 | 56.0 |

Vidarc achieves +9.6 percentage points over Vidar and +27.8 over Pi0.5 in simulated tasks; real-world deployment shows +17 pp over Vidar and +15 pp over Pi0.5.

5.2 Latency and Throughput

Measured on NVIDIA A100, for 64 frames:

| Metric | Pi0.5 | Vidar | Vidarc |
| --- | --- | --- | --- |
| Per-step latency (s) | 0.482 | 34.3 | 3.03 |
| End-to-end cost (s) | 5.76 | 34.3 | 24.2 |

Vidarc's latency is reduced by 91% relative to Vidar.

5.3 Ablation Insights

  • Without embodiment-aware weighting, Vidarc's simulation success rate drops from 80.7% to 74.6%.
  • Without closed-loop (open-loop imagination), success falls to 66.8%.
  • Mask strength $\eta \in \{0, 3, 10\}$: success rates of 74.6%, 80.7%, and 77.1%, indicating stability near $\eta = 3$.

6. Generalization and Robustness

6.1 Cross-Embodiment Generalization

Vidarc maintains robust performance when deployed to previously unseen objects, backgrounds, and robotic embodiments; only minimal recalibration of the mask predictor $U$ is required. In dynamic object relocation tests, Vidarc records a 40% success rate versus 0% for Vidar, demonstrating adaptive error correction in real time.

6.2 Error Correction and Failure Analysis

Open-loop diffusion models tend to accumulate drift in imagined video states, leading to control failure. Vidarc’s closed-loop inference with re-prefill corrects for execution-prediction mismatch at each cycle, re-grounding the model and enabling recovery from perturbations.


Vidarc fuses autoregressive, causal video diffusion modeling with masked action inference and closed-loop, efficient inference strategies. Its embodiment-aware loss and re-grounded inference loop result in marked improvements in both efficacy and computational efficiency for robotic control applications (Feng et al., 19 Dec 2025).
