- The paper presents an autoregressive denoising framework that generates interactive and coherent future video sequences.
- Its design integrates a pre-trained video diffusion backbone with an action-aware adapter and noise-augmented history memory.
- Experimental results show superior long-horizon predictive fidelity across benchmarks for autonomous driving and robotic manipulation.
Astra: General Interactive World Model with Autoregressive Denoising
Introduction
The paper "Astra: General Interactive World Model with Autoregressive Denoising" (2512.08931) introduces Astra, a world model that generates future video sequences conditioned on past observations, actions, and real-world context. Building on diffusion transformers, the framework targets long-horizon prediction in diverse settings such as autonomous driving and robotic manipulation. Astra distinguishes itself from prior models through an autoregressive denoising approach that pairs the generative fidelity of diffusion with step-by-step interactivity, letting predictions react to actions as they arrive.
Methodology
Astra's architecture integrates a pre-trained video diffusion backbone enhanced with several key components to ensure both interactivity and consistency in video prediction:
- Autoregressive Denoising Paradigm: At the heart of Astra is its autoregressive denoising process, which merges the extended temporal modeling of autoregression with the fidelity of diffusion models. This dual approach enables the generation of coherent video sequences that respond dynamically to new action inputs.
- Action-Aware Adapter: To enable precise conditioning on agent actions, Astra incorporates an action-aware adapter. This module injects action signals directly into the denoising process, allowing the model to produce predictions that are responsive to user inputs while preserving the generative quality of the model.
- Noise-Augmented History Memory: To balance the model’s responsiveness with temporal coherence, a noise-as-mask strategy is applied to historical frames. This soft corruption reduces over-reliance on past frames, compelling the model to integrate both historical and action cues for prediction.
- Mixture of Action Experts (MoAE): Astra employs a mixture of action experts to handle diverse action modalities—such as camera controls and robotic movements—through a dynamic routing mechanism. This design enhances the model's versatility across multiple real-world tasks.
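The interplay of the first three components can be sketched as a toy rollout loop: each new frame is iteratively denoised while conditioned on noise-corrupted history and the current action embedding. This is a minimal illustration only; `denoiser`, `noise_augment`, the context-window size, and the array shapes are hypothetical stand-ins, not Astra's actual architecture.

```python
import numpy as np

rng = np.random.default_rng(0)

def denoiser(noisy_frame, history, action_emb):
    """Stand-in for the diffusion backbone's denoising step.
    A real model would be a video diffusion transformer; here we
    just nudge the frame toward a history- and action-conditioned
    target (the action-aware adapter's injection, in spirit)."""
    context = history.mean(axis=0) + action_emb          # fused conditioning
    return noisy_frame + 0.5 * (context - noisy_frame)   # move toward context

def noise_augment(history, sigma=0.1):
    """Noise-as-mask: softly corrupt past frames so the model cannot
    copy them verbatim and must also rely on action cues."""
    return history + sigma * rng.standard_normal(history.shape)

def rollout(init_history, actions, num_denoise_steps=4):
    """Autoregressive denoising: sample each frame by iterative
    denoising, then append it to the history for the next step."""
    history = list(init_history)
    frames = []
    for action_emb in actions:
        frame = rng.standard_normal(init_history.shape[1:])  # start from noise
        mem = noise_augment(np.stack(history[-4:]))          # recent context window
        for _ in range(num_denoise_steps):
            frame = denoiser(frame, mem, action_emb)
        history.append(frame)
        frames.append(frame)
    return np.stack(frames)

# toy run: 8x8 "frames", 3 action steps
init = rng.standard_normal((2, 8, 8))
acts = [np.full((8, 8), a) for a in (0.0, 1.0, -1.0)]
video = rollout(init, acts)
print(video.shape)  # (3, 8, 8)
```

The key structural point the sketch preserves is that history enters the denoiser only after corruption, so the action signal retains influence over each prediction.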
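The MoAE routing idea can likewise be sketched as a softmax-gated blend of per-modality experts. The `ActionExpert` and `MoAE` classes, their linear parameterization, and the dimensions below are assumptions for illustration; the paper's actual expert design may differ.

```python
import numpy as np

rng = np.random.default_rng(1)

def softmax(x):
    e = np.exp(x - x.max())
    return e / e.sum()

class ActionExpert:
    """One expert: a small linear map standing in for a module that
    specializes in one action modality (e.g. camera pose vs. robot
    joint commands). Weights are random placeholders."""
    def __init__(self, dim):
        self.w = rng.standard_normal((dim, dim)) / np.sqrt(dim)

    def __call__(self, action):
        return action @ self.w

class MoAE:
    """Mixture of Action Experts: a router scores each expert for the
    incoming action embedding and blends their outputs, so one model
    can serve several action modalities."""
    def __init__(self, dim, num_experts=2):
        self.experts = [ActionExpert(dim) for _ in range(num_experts)]
        self.router = rng.standard_normal((dim, num_experts)) / np.sqrt(dim)

    def __call__(self, action):
        gate = softmax(action @ self.router)             # routing weights over experts
        outs = np.stack([e(action) for e in self.experts])
        return gate @ outs, gate                         # blended embedding + gates

dim = 16
moae = MoAE(dim)
camera_action = rng.standard_normal(dim)
out, gate = moae(camera_action)
```

In a trained system the router would learn to send, say, camera controls and robot movements to different experts; here the gates are random but the routing mechanics are the same.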
Experimental Evaluation
Astra was evaluated on a diverse set of benchmarks, including Sekai, SpatialVID, and nuScenes, where it consistently outperformed state-of-the-art baselines on visual fidelity, long-range prediction, and action alignment. These results indicate that Astra maintains visual coherence and dynamic consistency across extended temporal windows, effectively bridging the gap between video generation and interactive world simulation.
Implications and Future Directions
The development of Astra offers several implications for both practical applications and theoretical advancements in AI. Practically, Astra's ability to provide interactive and consistent video predictions opens new possibilities in autonomous driving, interactive robotics, and cinematic video production. Theoretically, Astra's framework sets a precedent for integrating autoregression with diffusion models, highlighting the potential for combining different generative approaches to achieve more robust world models.
In future work, improving Astra's inference efficiency could significantly enhance its applicability in real-time and latency-sensitive environments. Distillation techniques that reduce inference cost while maintaining fidelity could yield lightweight, scalable variants for interactive use.
Conclusion
Astra represents a significant step forward in the field of generative world models. By successfully integrating autoregressive denoising with action-aware mechanisms, it provides a robust framework for interactive and adaptive video prediction. The results of this research indicate promising directions for developing more general and scalable simulators, potentially transforming applications in exploration, robotics, and beyond.