Vidarc: Embodied Video Diffusion for Robotic Control
- The paper presents an embodied video diffusion model that achieves significant improvements in success rate and inference latency for robotic manipulation.
- It leverages an autoregressive diffusion backbone with a masked inverse dynamics module and KV-caching to enable efficient closed-loop control.
- Robust empirical evaluations and ablations demonstrate enhanced generalization and performance across diverse robotic platforms.
Vidarc (Video Diffusion for Action Reasoning and Closed-loop Control) is an embodied video diffusion model engineered for robotic manipulation in settings characterized by limited data, complex embodiment dynamics, and high demands for temporal and physical reasoning. Distinct from prior video-based world models, Vidarc directly addresses embodiment-specific closed-loop control by integrating an autoregressive video diffusion backbone, a masked inverse dynamics module, and specialized inference mechanisms (KV-caching and re-prefill). The approach demonstrates significant improvements over state-of-the-art baselines in both success rate and inference latency, and exhibits robust generalization across robotic platforms (Feng et al., 19 Dec 2025).
1. Model Architecture and Components
1.1 Autoregressive Video Diffusion Backbone
The central component is an autoregressive transformer-based conditional diffusion model $G_\theta$. At each timestep $t$, $G_\theta$ predicts the next image frame $\hat{y}_{t+1}$ conditioned on a natural language instruction $c$ and all previous ground-truth observations $o_{1:t}$. Conditional diffusion is realized via a flow-matching ODE formulation: the model learns a vector field
$$v_\theta(x_\tau, \tau \mid c, o_{1:t}, \hat{y}_{<t}) \approx x_1 - x_0,$$
where
- $x_1$ is the target (clean) frame,
- $x_0 \sim \mathcal{N}(0, I)$ is Gaussian noise,
- $x_\tau = (1-\tau)\,x_0 + \tau\,x_1$, with $\tau \in [0, 1]$ normalized,
- $x_1 - x_0$ is the target velocity,
- $\hat{y}_{<t}$ are denoised earlier frames.
Inference involves solving the ODE
$$\frac{\mathrm{d}x_\tau}{\mathrm{d}\tau} = v_\theta(x_\tau, \tau \mid c, o_{1:t}), \qquad \tau: 0 \to 1,$$
autoregressively to synthesize future frames.
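The ODE integration above can be sketched with a simple Euler solver. This is a minimal illustration, not the paper's sampler: `v_toy` is a hypothetical stand-in for the learned vector field $v_\theta$, and the step count and conditioning signature are assumptions.

```python
import numpy as np

def sample_frame(v_theta, cond, shape, num_steps=100, rng=None):
    """Euler-integrate the flow-matching ODE dx/dτ = v_θ(x, τ | cond)
    from τ = 0 (pure Gaussian noise) to τ = 1 (synthesized frame)."""
    rng = rng or np.random.default_rng(0)
    x = rng.standard_normal(shape)            # x_0 ~ N(0, I)
    dtau = 1.0 / num_steps
    for k in range(num_steps):
        tau = k * dtau
        x = x + dtau * v_theta(x, tau, cond)  # one Euler step along the field
    return x

# Toy stand-in for the learned field: a pull toward a fixed "clean frame".
target = np.ones((4, 4))
v_toy = lambda x, tau, cond: target - x
frame = sample_frame(v_toy, cond=None, shape=(4, 4))
```

With the toy field, each Euler step contracts the distance to the target, mimicking how the learned field transports noise toward the clean frame.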
1.2 Masked Inverse Dynamics Module (MIDM)
Actions are inferred from predicted frames using a two-stage masked inverse dynamics model:
- The mask predictor produces a soft mask $\hat{m}_t$, targeting robot arm pixels.
- The action regressor receives the elementwise product of the hardened mask and the frame, outputting the predicted action vector $\hat{a}_t$.
The MIDM is trained to minimize
$$\mathcal{L}_{\mathrm{IDM}} = \mathrm{Huber}(\hat{a}_t, a_t) + \lambda\,\|\hat{m}_t\|_1,$$
where $\mathrm{Huber}(\cdot)$ is the Huber loss and the mask penalty term $\lambda\,\|\hat{m}_t\|_1$ sparsifies the mask.
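A minimal sketch of this two-term objective, assuming the standard Huber definition and a mean-absolute-value mask penalty (the weighting `mask_weight` is an assumed hyperparameter, not taken from the paper):

```python
import numpy as np

def huber(r, delta=1.0):
    """Elementwise Huber loss: quadratic near zero, linear in the tails."""
    a = np.abs(r)
    return np.where(a <= delta, 0.5 * a**2, delta * (a - 0.5 * delta))

def midm_loss(soft_mask, pred_action, true_action, mask_weight=1e-3):
    """Sketch of the MIDM objective: Huber regression on the action vector
    plus an L1 penalty that sparsifies the predicted soft mask."""
    action_loss = huber(pred_action - true_action).mean()
    mask_penalty = np.abs(soft_mask).mean()    # sparsity pressure on the mask
    return action_loss + mask_weight * mask_penalty

# With a perfect action prediction, only the mask penalty remains:
loss = midm_loss(np.full((8, 8), 0.5), np.zeros(7), np.zeros(7))
```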
1.3 Cached Autoregressive Generation and Re-prefill
Vidarc employs KV-caching to avoid recomputation of key/value (KV) pairs for previous frames in the transformer. For closed-loop feedback, "re-prefill" refreshes only the last $n$ KV cache entries upon new environment observations, dramatically reducing the per-update prefill cost from recomputing the full observation prefix ($O(t)$ entries) to refreshing only the last $n$ entries ($O(n)$).
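The cache discipline can be illustrated with a toy container (a sketch, not the actual transformer cache; `prefill_ops` is a hypothetical counter standing in for per-frame key/value computation):

```python
from collections import deque

class KVCache:
    """Toy stand-in for the transformer KV cache. Re-prefill replaces only
    the last n entries instead of recomputing the whole observation prefix."""
    def __init__(self):
        self.entries = deque()       # one (key, value) placeholder per frame
        self.prefill_ops = 0         # counts per-frame prefill work performed

    def prefill(self, frames):
        for f in frames:
            self.entries.append(("kv", f))   # real code would store K/V tensors
            self.prefill_ops += 1

    def reprefill(self, new_frames):
        for _ in new_frames:                 # evict the stale last-n entries
            self.entries.pop()
        self.prefill(new_frames)             # recompute only those n entries

cache = KVCache()
cache.prefill(range(64))      # initial context: 64 frames
cache.reprefill(range(4))     # closed-loop update touches only n = 4 frames
```

After the update, only 68 per-frame prefill operations have been performed rather than 128, which is the source of the latency savings reported in Section 5.2.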
2. Mathematical Formulation
2.1 Diffusion Processes
- Forward noising: $x_\tau = (1-\tau)\,x_0 + \tau\,x_1$ under the flow-matching view; alternatively, as a discrete-time Markov chain,
$$q(z_k \mid z_{k-1}) = \mathcal{N}\!\big(z_k;\ \sqrt{1-\beta_k}\,z_{k-1},\ \beta_k I\big),$$
with $z_0$ the clean frame.
- Reverse model:
$$p_\theta(z_{k-1} \mid z_k, c) = \mathcal{N}\!\big(z_{k-1};\ \mu_\theta(z_k, k, c),\ \Sigma_k\big),$$
with the conditioning $c$ incorporating both instructions and past action encodings.
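The discrete forward chain is the standard DDPM form; iterating it is a one-liner. A minimal sketch (the schedule `betas` is illustrative, not the paper's):

```python
import numpy as np

def forward_noise(z0, betas, rng):
    """Iterate the discrete-time forward chain
    z_k = sqrt(1 - β_k) · z_{k-1} + sqrt(β_k) · ε_k,  ε_k ~ N(0, I)."""
    z = z0
    for b in betas:
        z = np.sqrt(1.0 - b) * z + np.sqrt(b) * rng.standard_normal(z.shape)
    return z

# With β = 1 in a single step, the clean signal is fully replaced by noise:
z_noised = forward_noise(np.zeros(3), betas=[1.0], rng=np.random.default_rng(7))
```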
2.2 Conditioning on Actions
At timestep $t$, the model predicts frame $\hat{y}_{t+1}$, infers action $\hat{a}_t$ from it via the MIDM, and cross-attends to these actions in the context for subsequent timesteps, thereby informing $G_\theta$ about intended controls.
2.3 Multi-task Training Loss
- Embodiment-aware video loss:
$$\mathcal{L}_{\mathrm{video}} = \mathbb{E}\big[\, w(m_t)\,\big\| v_\theta(x_\tau, \tau \mid c, o_{1:t}) - (x_1 - x_0) \big\|^2 \,\big],$$
with $w(m_t)$ modulating the weighting on action-relevant pixels.
- IDM loss: $\mathcal{L}_{\mathrm{IDM}}$ as specified above.
- Total: $\mathcal{L} = \mathcal{L}_{\mathrm{video}} + \lambda_{\mathrm{IDM}}\,\mathcal{L}_{\mathrm{IDM}}$.
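The multi-task combination can be sketched as follows. The weighting scheme $w(m) = 1 + (\alpha - 1)m$ and the hyperparameters `alpha` and `lam` are assumptions for illustration, not values from the paper:

```python
import numpy as np

def embodiment_weighted_mse(pred_v, target_v, mask, alpha=2.0):
    """Per-pixel squared error with weight w(m) = 1 + (alpha - 1)·m, so
    action-relevant (masked) pixels contribute alpha times more."""
    weights = 1.0 + (alpha - 1.0) * mask
    return float((weights * (pred_v - target_v) ** 2).mean())

def total_loss(video_loss, idm_loss, lam=1.0):
    """L = L_video + λ · L_IDM (λ is an assumed weighting hyperparameter)."""
    return video_loss + lam * idm_loss

mask = np.zeros((4, 4)); mask[:2] = 1.0   # top half marked action-relevant
lv = embodiment_weighted_mse(np.ones((4, 4)), np.zeros((4, 4)), mask)
```

With half the pixels masked and unit errors everywhere, the weighted mean is 1.5 rather than 1.0, showing how the mask shifts gradient pressure onto the robot arm.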
3. Training Regimen and Optimization
3.1 Dataset Construction and Preprocessing
- Pre-training: millions of video episodes drawn from Egodex, Agibot, RDT, and RoboMind (>4 sources) with varied embodiments and camera setups.
- Fine-tuning: 1,000 RoboTwin simulation episodes (Aloha robot) and 2,307 real-world episodes on Aloha hardware spanning 219 tasks.
- Preprocessing: downsample to 10 fps, resize to a fixed resolution, apply classifier-free guidance (the instruction $c$ is dropped with probability 0.1), and use standard augmentations.
3.2 Hyperparameter Choices
| Component | Parameter Count | Optimizer / LR | Batch Size | Training Regime |
|---|---|---|---|---|
| Video diffusion transformer | 5B | AdamW, 2e-5 | 128 | 10k pretrain, 4k finetune |
| Masked IDM | 92M | AdamW, 5e-4 | 128 | 60k steps |
- Sampling: autoregressive generation in chunks of $\ell$ frames, with the number of diffusion steps set separately for simulation and real-world deployment.
4. Closed-Loop Control Mechanism
At each control loop iteration $t$:
- Generate the next $\ell$ predicted frames $\hat{y}_{t+1:t+\ell}$ with $G_\theta$, conditioned on $c$ and $o_{1:t}$.
- Infer the current action $\hat{a}_t$ via the MIDM.
- Execute $\hat{a}_t$ on the robot; receive $o_{t+1}$ from the environment.
- Update the KV cache by replacing the last $n$ entries with ground-truth observation prefills for efficient real-time feedback.
The control loop is formalized as follows:
```
Initialize KV_cache = G.prefill(o₁)
for t = 1, …, T:
    ŷ_seq   = G.generate_chunk(ℓ, KV_cache)   # autoregressive with KV caching
    a_t     = I(ŷ_seq[1])                      # inverse dynamics
    o_{t+1} = Robot.execute(a_t)               # environment feedback
    KV_cache.pop(n); KV_cache.extend(G.prefill(o_{t+1:t+n}))   # re-prefill
end
```
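The loop above can be made executable with stub components. `StubModel` and `StubRobot` are hypothetical stand-ins for the diffusion model, IDM, and hardware; only the cache discipline (act on the first imagined frame, then re-prefill the last $n$ entries with real observations) mirrors the algorithm:

```python
def control_loop(G, I, robot, o1, T=3, ell=2, n=1):
    """Executable mock of the closed-loop cycle: generate a chunk, act on
    its first frame, then re-prefill the cache with ground-truth feedback."""
    kv_cache = list(G.prefill([o1]))
    actions = []
    for t in range(T):
        y_seq = G.generate_chunk(ell, kv_cache)   # imagined future frames
        a_t = I(y_seq[0])                         # IDM on the first frame
        obs = robot.execute(a_t)                  # environment feedback
        del kv_cache[-n:]                         # evict stale entries
        kv_cache.extend(G.prefill([obs] * n))     # re-prefill only n entries
        actions.append(a_t)
    return actions

class StubModel:
    def prefill(self, frames):
        return [("kv", f) for f in frames]
    def generate_chunk(self, ell, cache):
        return [cache[-1][1]] * ell               # "predict" the last cached frame

class StubRobot:
    def execute(self, a):
        return a + 1                              # deterministic toy dynamics

acts = control_loop(StubModel(), I=lambda f: f, robot=StubRobot(), o1=0)
```

Because each cycle re-grounds the cache on the executed observation, the stub's actions track the (toy) environment state rather than drifting in imagination.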
5. Empirical Evaluation and Ablations
5.1 Success Rates
| Setting | Pi0.5 (%) | Vidar (%) | Vidarc (%) |
|---|---|---|---|
| RoboTwin (sim, 14 tasks) | 52.9 | 71.1 | 80.7 |
| Real-World | 41.0 | 39.0 | 56.0 |
Vidarc achieves +9.6 percentage points over Vidar and +27.8 over Pi0.5 in simulated tasks; real-world deployment shows +17 pp over Vidar and +15 pp over Pi0.5.
5.2 Latency and Throughput
Measured on NVIDIA A100, for 64 frames:
| Metric | Pi0.5 | Vidar | Vidarc |
|---|---|---|---|
| Per-step Latency (s) | 0.482 | 34.3 | 3.03 |
| End-to-end Cost (s) | 5.76 | 34.3 | 24.2 |
Vidarc's latency is reduced by 91% relative to Vidar.
5.3 Ablation Insights
- Without embodiment-aware weighting, Vidarc's simulation success rate drops from 80.7% to 74.6%.
- Without closed-loop (open-loop imagination), success falls to 66.8%.
- Sweeping the mask-strength hyperparameter over three settings yields success rates of 74.6%, 80.7%, and 77.1%, indicating stability near the intermediate value.
6. Generalization and Robustness
6.1 Cross-Embodiment Generalization
Vidarc maintains robust performance when deployed to previously unseen objects, backgrounds, and robotic embodiments; only minimal recalibration of the mask predictor is required. In dynamic object relocation tests, Vidarc records a 40% success rate versus 0% for Vidar, demonstrating adaptive error correction in real time.
6.2 Error Correction and Failure Analysis
Open-loop diffusion models tend to accumulate drift in imagined video states, leading to control failure. Vidarc’s closed-loop inference with re-prefill corrects for execution-prediction mismatch at each cycle, re-grounding the model and enabling recovery from perturbations.
Vidarc fuses autoregressive, causal video diffusion modeling with masked action inference and closed-loop, efficient inference strategies. Its embodiment-aware loss and re-grounded inference loop result in marked improvements in both efficacy and computational efficiency for robotic control applications (Feng et al., 19 Dec 2025).