Astra: Interactive World Model
- Astra is a general interactive world model that generates temporally coherent, action-aligned video predictions across diverse real-world environments.
- The model employs autoregressive denoising diffusion, noise-augmented memory, and a mixture of action experts to balance historical context with responsive action control.
- Empirical results demonstrate Astra’s superior performance in video fidelity, action alignment, and temporal consistency compared to contemporary methods.
Astra is a general-purpose, interactive world model designed to generate temporally coherent and action-aligned video predictions across a range of real-world environments, with applications in domains such as robotics, exploration, and autonomous driving. Its architecture is distinguished by a combination of autoregressive denoising diffusion, temporal causal attention, noise-augmented memory, and flexible, modular action conditioning. Astra addresses key open problems in interactive world modeling by supporting streaming rollout, precise action control, and high-fidelity long-term video simulation across heterogeneous action modalities (Zhu et al., 9 Dec 2025).
1. Model Architecture: Autoregressive Denoising and Temporal Causality
Astra is centered on an autoregressive denoising transformer framework built atop a pretrained video diffusion model (e.g., DiT-style). The video input is partitioned into discrete, non-overlapping "chunks" in a compressed latent space. At each generation step $i$, Astra predicts the next chunk $z_i$, conditioned on all preceding chunks $z_{<i}$, actions $a_{1:i}$, and (optionally) prompts $c$. The generative process factorizes as:

$$p(z_{1:N} \mid a, c) = \prod_{i=1}^{N} p(z_i \mid z_{<i}, a_{1:i}, c)$$

Each conditional distribution is implemented via a denoising flow-matching network $v_\theta$, using velocity supervision over a noisy linear interpolation between the real chunk $z$ and Gaussian noise $\varepsilon$. The flow-matching loss is:

$$\mathcal{L}(\theta) = \mathbb{E}_{i,\,t,\,z_{<i},\,z,\,\varepsilon} \big\| v_\theta(\tilde z_t, t \mid z_{<i}, a_{1:i}, c) - v^*(\tilde z_t, t \mid z_{<i}) \big\|_2^2$$

where $\tilde z_t = (1-t)\,z + t\,\varepsilon$, $t \sim \mathcal{U}[0,1]$, and $v^* = \varepsilon - z$ is the ground-truth interpolation velocity (Zhu et al., 9 Dec 2025).
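The flow-matching objective above can be sketched numerically. Below is a minimal numpy illustration, assuming the standard linear interpolation path $\tilde z_t = (1-t)z + t\varepsilon$ with target velocity $\varepsilon - z$; the exact parameterization and network in Astra may differ, and `predict` here stands in for the learned velocity network.

```python
import numpy as np

rng = np.random.default_rng(0)

def flow_matching_loss(z, eps, t, predict):
    """Flow-matching loss for one latent chunk.

    z:       clean latent chunk, shape (d,)
    eps:     Gaussian noise, shape (d,)
    t:       scalar diffusion time in [0, 1]
    predict: callable (z_t, t) -> predicted velocity (stands in for v_theta)
    """
    z_t = (1.0 - t) * z + t * eps   # linear interpolation path
    v_star = eps - z                # ground-truth velocity d(z_t)/dt
    v_hat = predict(z_t, t)
    return np.mean((v_hat - v_star) ** 2)

# An oracle predictor that returns the true velocity drives the loss to zero.
z = rng.standard_normal(8)
eps = rng.standard_normal(8)
loss = flow_matching_loss(z, eps, 0.3, predict=lambda z_t, t: eps - z)
```

In training, the expectation over $i$, $t$, and noise is approximated by sampling a chunk index, a time, and a noise draw per example.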
Within each Flow Transformer block, Astra enforces chunk-wise causal self-attention, ensuring that attention weights are computed only over current and past chunks:

$$\mathrm{Attn}(Q, K, V) = \mathrm{softmax}\!\left(\frac{QK^\top}{\sqrt{d}} + \log M\right) V$$

where $M$ is a binary causal mask ($\log 0 = -\infty$ blocks future positions).
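The chunk-wise mask can be illustrated with a schematic numpy sketch, assuming full bidirectional attention within a chunk and causal attention across chunks; this is an illustration of the masking pattern, not Astra's actual implementation.

```python
import numpy as np

def chunk_causal_mask(n_frames, chunk_size):
    """Binary mask: query frame q may attend to key frame k iff k's chunk
    index is <= q's chunk index (bidirectional within a chunk)."""
    chunk_id = np.arange(n_frames) // chunk_size
    return (chunk_id[:, None] >= chunk_id[None, :]).astype(np.float64)

def masked_attention(Q, K, V, M):
    """softmax(QK^T / sqrt(d)) V, with disallowed positions forced to -inf."""
    d = Q.shape[-1]
    scores = Q @ K.T / np.sqrt(d)
    scores = np.where(M > 0, scores, -np.inf)
    w = np.exp(scores - scores.max(axis=-1, keepdims=True))
    w = w / w.sum(axis=-1, keepdims=True)
    return w @ V

# 4 frames in chunks of 2: frames 0-1 cannot see frames 2-3.
M = chunk_causal_mask(n_frames=4, chunk_size=2)
out = masked_attention(np.eye(4), np.eye(4), np.eye(4), M)
```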
2. Memory, History, and Balancing Temporal Coherence
Astra introduces noise-augmented history memory as a solution to the "visual inertia" problem: an over-reliance on clean history that impedes responsiveness to new actions. During training, the model corrupts past latent frames by adding Gaussian noise according to a binary mask:

$$\tilde z_{<i} = z_{<i} + m \odot \sigma\,\varepsilon, \qquad \varepsilon \sim \mathcal{N}(0, I)$$

where $z_{<i}$ is the clean latent sequence, $m$ is a binary mask selecting which frames are corrupted, and $\sigma$ is the noise scale. This forces the model to learn to integrate both history and new action input, rather than shortcutting by simply copying prior context. At inference, the clean history is used. Longer context increases temporal consistency but can impair action following; noisy-memory regularization was found to improve the tradeoff between coherence and responsiveness (Zhu et al., 9 Dec 2025).
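A minimal sketch of this corruption step, assuming per-frame Bernoulli masking with probability `p` and a fixed noise scale `sigma` (both hypothetical values, not taken from the paper):

```python
import numpy as np

rng = np.random.default_rng(0)

def corrupt_history(z_hist, p=0.5, sigma=0.3, rng=rng):
    """Noise-augmented memory: add Gaussian noise to a random subset of past
    latent frames, selected by a per-frame binary mask m.

    z_hist: (n_frames, d) clean history latents
    Returns the corrupted history and the mask used.
    """
    n, d = z_hist.shape
    m = (rng.random(n) < p).astype(z_hist.dtype)          # one bit per frame
    eps = rng.standard_normal((n, d)).astype(z_hist.dtype)
    return z_hist + m[:, None] * sigma * eps, m

# Unmasked frames are untouched; masked frames carry sigma-scaled noise.
z = np.zeros((6, 4), dtype=np.float32)
z_noisy, m = corrupt_history(z)
```

At inference the mask would simply be all zeros, recovering the clean history.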
3. Action Conditioning: ACT-Adapter and Mixture of Experts
For precise, heterogeneous action interaction, Astra incorporates an ACT-Adapter along with a Mixture of Action Experts (MoAE) module.
- ACT-Adapter: After each self-attention block, a linear adapter injects the current action embedding $e_{a_i}$ into the latent stream:

$$h \leftarrow W\,(h + e_{a_i})$$

where $W$ is trainable and initialized near the identity. The adapter and self-attention layers are the only components fine-tuned during training; the remainder of the backbone remains frozen.
An action-free guidance (AFG) mechanism sharpens control at inference by extrapolating away from the action-free prediction:

$$\hat v = v_\theta(\tilde z_t, t \mid z_{<i}, \varnothing, c) + w \big( v_\theta(\tilde z_t, t \mid z_{<i}, a_{1:i}, c) - v_\theta(\tilde z_t, t \mid z_{<i}, \varnothing, c) \big)$$

where $w$ is the guidance scale and $\varnothing$ denotes dropped action conditioning.
- Mixture of Action Experts (MoAE): Astra handles multiple action modalities (e.g., camera pose, robot end effector, command tokens). For action modality $m$, raw actions are first projected, $e^{(m)} = P_m\,a^{(m)}$, then routed by a gating network $g$ that selects and aggregates a set of expert MLPs $\{E_k\}$:

$$e = \sum_k g_k\!\left(e^{(m)}\right) E_k\!\left(e^{(m)}\right)$$
The resulting embedding is injected at every ACT-Adapter, with a binary flag marking “past vs. current” action (Zhu et al., 9 Dec 2025).
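The routing pattern described above can be sketched as a toy numpy module: per-modality projections feed a softmax gate over a shared pool of experts. Single linear layers stand in for the expert MLPs, and all dimensions and names here are illustrative, not Astra's.

```python
import numpy as np

rng = np.random.default_rng(0)

def softmax(x):
    e = np.exp(x - x.max())
    return e / e.sum()

class MoAE:
    """Toy Mixture of Action Experts: per-modality projection P_m, then a
    gating network mixes a shared pool of experts E_k."""

    def __init__(self, dims_per_modality, d_embed=16, n_experts=4, rng=rng):
        self.proj = {m: rng.standard_normal((d, d_embed)) * 0.1
                     for m, d in dims_per_modality.items()}
        self.gate = rng.standard_normal((d_embed, n_experts)) * 0.1
        self.experts = rng.standard_normal((n_experts, d_embed, d_embed)) * 0.1

    def __call__(self, modality, action):
        e = action @ self.proj[modality]   # project raw action to shared space
        g = softmax(e @ self.gate)         # gating weights over experts
        return sum(g[k] * (e @ self.experts[k]) for k in range(len(g)))

# Two modalities from the paper's datasets: 12-dim camera pose, 7-dim effector.
moae = MoAE({"camera_pose": 12, "end_effector": 7})
emb = moae("camera_pose", rng.standard_normal(12))
```

Routing through a shared expert pool lets heterogeneous modalities reuse capacity while the gate specializes the mixture per input.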
4. Training, Datasets, and Implementation Details
Astra is initialized from a Wan-2.1 pretrained video diffusion model (30 DiT blocks). The primary hardware configuration uses 8×H800 GPUs with batch size 1; training converges in roughly 24 hours using AdamW for 30 epochs. Training is performed in compressed 3D-VAE latent space (480×832 pixel resolution).
Conditional frame counts per clip are drawn uniformly from 1 to 128, with the target frames set to 33. There is no explicit curriculum beyond random history lengths and noise masks. The model supports real-world, egocentric and multi-view video, camera pose, robotic action, and discrete command conditioning (Zhu et al., 9 Dec 2025).
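The per-clip layout sampling described above is simple to sketch; the constants come from the text, while the function name and mask probability are illustrative.

```python
import numpy as np

rng = np.random.default_rng(0)

TARGET_FRAMES = 33    # fixed target length per clip
MAX_HISTORY = 128     # conditioning frame count drawn uniformly from 1..128

def sample_clip_layout(rng=rng):
    """Pick how many conditioning frames precede the 33 target frames,
    plus a random per-frame noise mask for the history (no curriculum)."""
    n_hist = int(rng.integers(1, MAX_HISTORY + 1))
    noise_mask = rng.random(n_hist) < 0.5   # illustrative mask probability
    return n_hist, TARGET_FRAMES, noise_mask

n_hist, n_target, noise_mask = sample_clip_layout()
```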
Astra is trained jointly across five large datasets, with scenario/action coverage and dataset sizes as follows:
| Dataset | Scenario | Action Modality | #Clips |
|---|---|---|---|
| nuScenes | Autonomous driving | 7-dim vehicle pose | 50,000 |
| Sekai | First-person walking/drone | 12-dim camera pose | 50,000 |
| SpatialVID | In-the-wild, mixed | 7-dim cam/keyboard/mouse | 200,000 |
| RT-1 | Robot manipulation | 7-dim end effector pose | 136,000 |
| Multi-Cam Video | Synthetic multi-view | 12-dim camera pose | 10,000 |
Evaluation is performed on Astra-Bench, holding out 20 samples per dataset (Zhu et al., 9 Dec 2025).
5. Empirical Results and Comparative Evaluation
Astra achieves state-of-the-art performance on action alignment and video fidelity across multiple domains, as measured on Astra-Bench using both human and automated metrics:
| Method | Inst. Fol. | Sub. Cons. | Bg. Cons. | Motion | Aesthetic | Image |
|---|---|---|---|---|---|---|
| Wan-2.1 | 0.061 | 0.854 | 0.903 | 0.958 | 0.489 | 0.691 |
| MatrixGame 2.0 | 0.268 | 0.916 | 0.928 | 0.981 | 0.441 | 0.748 |
| YUME | 0.652 | 0.936 | 0.938 | 0.985 | 0.523 | 0.741 |
| Astra | 0.669 | 0.939 | 0.945 | 0.989 | 0.531 | 0.747 |
Astra also yields the lowest camera motion errors (rotation and translation) on actuation benchmarks, indicating superior instruction following and action alignment.
| Method | RotErr (°) | TransErr (%) | Inst. Fol. |
|---|---|---|---|
| Wan-2.1 | 2.96 | 7.37 | 0.061 |
| YUME | 2.20 | 5.80 | 0.268 |
| MatrixGame | 2.25 | 5.63 | 0.652 |
| NVM | 2.47 | 6.13 | 0.311 |
| Astra | 1.23 | 4.86 | 0.669 |
Qualitative assessments demonstrate Astra's ability to generate smooth, temporally extended egocentric videos for world exploration tasks, responsive robot manipulation sequences, and plausible action-conditioned multi-agent driving trajectories. The model maintains long-horizon stability and avoids drift/error accumulation due to its noisy memory regularization and autoregressive denoising design (Zhu et al., 9 Dec 2025).
6. Comparative Context and Design Lineage
Astra builds upon design elements and insights from prior general world modeling work:
- Pandora (Xiang et al., 12 Jun 2024): Demonstrates the efficacy of hybrid autoregressive LLM backbones with diffusion video decoders, vision–language fusion adapters, and large-scale instruction tuning for controllable, free-text action video simulation.
- PAN (Team et al., 12 Nov 2025): Introduces the Generative Latent Prediction backbone, leveraging large LLMs and a diffusion transformer decoder for grounded, coherent long-horizon simulation and causal language-action conditioning.
- iVideoGPT (Wu et al., 24 May 2024): Emphasizes compressive tokenization, cross-modality token streams, and scalable pretraining for model-based RL and visual planning.
- Worldformer (Ammanabrolu et al., 2021): Proposes structured knowledge-graph-based, set-of-sequences approaches for domain-agnostic, partially observable world modeling, influencing the modularity and flexibility of state–action spaces in Astra.
Astra extends these approaches by introducing noise-augmented memory, mixture-of-action experts for heterogeneous action routing, and parameter-efficient action adapters within a fully autoregressive denoising diffusion backbone. This enables interactive, long-horizon video prediction with high action–state alignment and generality (Zhu et al., 9 Dec 2025).
7. Limitations and Future Directions
Astra encounters computational challenges due to the inherently high resource demands of autoregressive sampling and repeated denoising steps per chunk, which limit its real-time applicability in latency-sensitive settings. Long-term rollout efficiency and horizon extension remain constrained by memory and transformer context scaling.
Key open directions include distillation or low-rank adapter compression to accelerate inference, hierarchical or context-compressed rollouts for longer temporal coherence, and expansion to natural language command modalities and additional sensor streams (e.g., audio, depth, or text). These extend the blueprint for truly general, domain-agnostic world models with structured, interactive, and multimodal state–action simulation (Zhu et al., 9 Dec 2025, Xiang et al., 12 Jun 2024, Team et al., 12 Nov 2025, Wu et al., 24 May 2024, Ammanabrolu et al., 2021).