Vidarc: Embodied Video Diffusion Model for Closed-loop Control

Published 19 Dec 2025 in cs.RO and cs.LG | arXiv:2512.17661v1

Abstract: Robotic arm manipulation in data-scarce settings is a highly challenging task due to the complex embodiment dynamics and diverse contexts. Recent video-based approaches have shown great promise in capturing and transferring the temporal and physical interactions by pre-training on Internet-scale video data. However, such methods are often not optimized for the embodiment-specific closed-loop control, typically suffering from high latency and insufficient grounding. In this paper, we present Vidarc (Video Diffusion for Action Reasoning and Closed-loop Control), a novel autoregressive embodied video diffusion approach augmented by a masked inverse dynamics model. By grounding video predictions with action-relevant masks and incorporating real-time feedback through cached autoregressive generation, Vidarc achieves fast, accurate closed-loop control. Pre-trained on one million cross-embodiment episodes, Vidarc surpasses state-of-the-art baselines, achieving at least a 15% higher success rate in real-world deployment and a 91% reduction in latency. We also highlight its robust generalization and error correction capabilities across previously unseen robotic platforms.

Summary

  • The paper demonstrates that integrating an autoregressive video diffusion model with a masked inverse dynamics mechanism achieves real-time closed-loop robotic manipulation.
  • The method leverages cross-embodiment pretraining and environmental re-prefilling to bridge simulation and real-world performance, significantly improving task success rates.
  • Empirical results reveal up to 17% higher success rates and a 91% reduction in inference latency compared to baselines, underscoring its robustness and efficiency.

Vidarc: Embodied Video Diffusion Model for Closed-loop Robotic Control

Overview and Motivation

Vidarc proposes a novel framework for robotic arm manipulation in environments with limited data, focusing on closed-loop control using an embodied video diffusion model augmented by a masked inverse dynamics mechanism (2512.17661). The main motivation lies in overcoming the limitations of previous video-based systems, which typically provide open-loop predictions and are not optimized for rapid real-time feedback integration or embodiment-specific grounding. Vidarc targets robust, generalizable, and adaptive manipulation by leveraging large-scale video data and specialized mechanisms for fast sensory-action cycles (Figure 1).

Figure 1: Vidarc architecture integrates an autoregressive video diffusion model and a masked inverse dynamics component to enable closed-loop control and efficient error correction after cross-embodiment pre-training and task-specific calibration.

Technical Contributions

Vidarc advances the state-of-the-art through several key components:

  • Autoregressive Video Diffusion Model: A causal, teacher-forced video transformer generates sequential observations conditioned on prior robot states and task language, exploiting key-value caching for low-latency interaction.
  • Masked Inverse Dynamics Model (MIDM): The action predictor focuses on robot-relevant regions, learning spatial masks that both enhance action inference and inform the diffusion model’s loss, ensuring attention to action-critical pixels.
  • Closed-loop Control via Environmental Re-prefilling: To bridge the simulation-reality gap and combat generative drift, Vidarc continually re-injects real-world sensory feedback into its video generation pipeline. This innovation aligns train-time and inference-time distributions, enabling continual correction and adaptation in dynamic environments.
  • Embodiment-aware Diffusion Loss: Mask-based reweighting in the video model’s loss selectively emphasizes regions relevant to manipulation dynamics, improving action-prediction fidelity and countering the tendency of diffusion models to overfit background detail (Figure 2; a minimal sketch of the reweighting follows the figure caption).

    Figure 2: The system’s dual structure: video diffusion transformer predicts observations; MIDM infers actions and provides spatial mask for focused training and physical grounding.
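
To make the role of the MIDM mask concrete, the following PyTorch-style sketch shows one plausible way such a mask could reweight the per-pixel denoising objective. The tensor shapes, the (1 + η·mask) weighting form, and the function name are illustrative assumptions, not the paper's implementation.

```python
import torch

def embodiment_aware_diffusion_loss(eps_pred, eps_target, action_mask, eta=1.0):
    """Mask-reweighted denoising objective (illustrative sketch, not the paper's code).

    eps_pred, eps_target : predicted / target noise, torch tensors of shape (B, T, C, H, W)
    action_mask          : spatial mask in [0, 1] from the masked inverse dynamics
                           model, shape (B, T, 1, H, W); close to 1 where pixels are
                           action-relevant (arm, gripper, target object)
    eta                  : reweighting strength; eta = 0 recovers the plain loss
    """
    per_pixel = (eps_pred - eps_target) ** 2   # standard epsilon-prediction MSE, per pixel
    weights = 1.0 + eta * action_mask          # up-weight action-critical pixels
    return torch.mean(weights * per_pixel)
```

Under this form the background still carries the base weight of 1, so scene context is not discarded; action-relevant regions simply contribute more gradient signal.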

Empirical Results

Vidarc is pre-trained on over one million episodes spanning diverse embodiments and subsequently fine-tuned on unseen robotic platforms. Empirical evaluations demonstrate substantial improvements over prior baselines (Vidar, Pi0.5):

  • On the RoboTwin benchmark (14 tasks): Vidarc achieves an average success rate of 80.7%, exceeding Vidar (71.1%) and Pi0.5 (52.9%), with especially notable gains in bimanual collaboration tasks.
  • In real-world deployments, Vidarc consistently surpasses baselines with up to 17% higher task success rates and displays robust generalization to novel tasks and objects.
  • Action execution is both fast and reactive: Vidarc reduces inference latency by 91% compared to Vidar, approaching the efficiency of specialized VLA models like Pi0.5.
  • The closed-loop mechanism enables error correction not attainable by open-loop models, especially in dynamic scenarios where target object locations change during execution (Figure 3; a schematic of the closed-loop rollout appears after Figure 4).

    Figure 3: Visual comparison of predicted video frames, learned masks, and executed actions illustrates Vidarc’s capacity for real-time error correction under environmental perturbations.


Figure 4: Closed-loop grounding at frame 47 corrects cumulative generative errors, ensuring successful manipulation even with drifted prediction histories.
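
The error correction illustrated in Figures 3 and 4 can be pictured as a simple control loop: generate a short chunk of future frames from the cached context, decode and execute actions with the MIDM, then re-prefill the cache with the real camera observation so drifted predictions are overwritten before the next chunk. The sketch below assumes hypothetical interfaces (video_model.prefill, generate_chunk, reprefill, midm, robot); it is a schematic, not the authors' implementation.

```python
def closed_loop_rollout(video_model, midm, robot, task_text, max_steps=200, chunk_len=8):
    """Schematic closed-loop control with cached generation and re-prefilling (sketch)."""
    obs = robot.get_observation()                    # real camera frame
    cache = video_model.prefill(obs, task_text)      # build the KV cache once

    steps = 0
    while steps < max_steps:
        # 1. Autoregressively predict a short chunk of future frames,
        #    reusing the KV cache so only the new tokens are computed.
        frames, cache = video_model.generate_chunk(cache, num_frames=chunk_len)

        # 2. Decode actions between consecutive frames with the masked
        #    inverse dynamics model and execute them on the robot.
        prev = obs
        for frame in frames:
            action, _mask = midm(prev, frame)
            robot.execute(action)
            prev = frame
            steps += 1

        # 3. Re-prefill: replace the tail of the generated context with the
        #    latest real observation so generative drift cannot accumulate.
        obs = robot.get_observation()
        cache = video_model.reprefill(cache, obs)
```

Because real feedback re-enters the generation context every chunk, a drifted prediction history such as the one corrected at frame 47 in Figure 4 is overwritten rather than compounded, while the cache reuse keeps per-step latency low.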

Analysis of Robustness and Ablations

Ablation studies confirm the contributions of both the embodiment-aware loss and the closed-loop feedback: removing either reduces success rates by 6–14% across benchmarks. Sensitivity analysis on the loss-reweighting parameter η shows that Vidarc maintains high performance across a broad range of values, indicating stable behavior under hyperparameter variation (Figure 5; the reweighted objective is written out after the figure caption).

Figure 5: Persistent artifacts near robot arm in naive video generation models can degrade final task performance; Vidarc’s masking mechanism mitigates such artifacts.
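
For reference, one plausible way to write the reweighted objective whose strength η the sensitivity analysis sweeps is the following (the exact formulation in the paper may differ), with the square taken elementwise:

```latex
\mathcal{L}(\theta) \;=\; \mathbb{E}_{x_0,\, c,\, t,\, \epsilon}\!\left[\, \big(1 + \eta\, M\big) \odot \big(\epsilon - \epsilon_\theta(x_t, t, c)\big)^{2} \,\right]
```

Here M is the MIDM spatial mask, c the language and robot-state conditioning, and η = 0 recovers the unweighted denoising loss.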

Implications and Future Directions

Vidarc’s design has several notable theoretical and practical ramifications:

  • Scalable Cross-Embodiment Transfer: Pretraining on heterogeneous human and robot demonstrations enables rapid adaptation to unseen platforms, addressing the longstanding challenge of embodiment-specific sample inefficiency.
  • End-to-End Visual-Action Decoupling: The integration of MIDM and video generation underscores the viability of decoupling perception and action reasoning in high-dimensional robotic tasks, reducing reliance on fixed action vocabularies.
  • Acceleration and Latency Reduction: By leveraging causal generation and key-value caches, Vidarc approaches the practical latency requirements for real-world, interactive robot deployment, setting a precedent for future work on deploying foundation models in time-constrained control domains.
  • Error Correction and Safety: The closed-loop pipeline proactively grounds predictions after execution, crucial for operating in unpredictable or safety-critical environments (Figure 6).

    Figure 6: Representative samples from Vidarc’s large-scale training/fine-tuning datasets demonstrate diversity across embodiments, contexts, and camera viewpoints.

Conclusion

Vidarc delivers a marked advancement in embodied video-based robotic control by achieving superior generalization, rapid closed-loop responsiveness, and robust error correction in both simulated and real domains. Its methodological contributions—causal autoregressive diffusion, action-relevant masking, and re-prefill-based closed-loop feedback—together address core challenges facing the deployment of foundation models for robot manipulation. These innovations suggest promising future research trajectories in scalable transfer, safety-aware feedback integration, and foundational model adaptation for dynamic environments.

Speculation on Future AI Developments

Looking forward, communication between multiple visual world models, richer multimodal grounding, and further acceleration of diffusion-based inference will be central. There is substantial headroom for architectural optimization (including hybrid autoregressive/non-autoregressive regimes), distillation for hardware efficiency, and leveraging foundation models to expand zero-shot and one-shot learning not just in manipulation but also in mobile, multi-agent, and dexterous domains (Figure 7).

Figure 7: The Aloha robot platform and its hardware specifications demonstrate Vidarc’s deployment context, featuring high-DOF manipulators and multi-view sensing for complex bimanual operations.
