
Conditional AR Video Generation

Updated 23 December 2025
  • Conditional autoregressive video generation refers to a family of methods that produce video frames sequentially, conditioning each frame on past frames and external signals such as text or images.
  • Modern systems combine masked autoregression with diffusion models, transformer-based planning, and cache-enabled inference to balance visual fidelity and computational efficiency.
  • The approach addresses challenges like exposure bias and scalability in long video sequences, with applications in image-to-video conversion and interactive world modeling.

Conditional autoregressive video generation refers to a family of generative modeling strategies in which each video frame (or block of frames/tokens) is sampled sequentially, conditioned on a combination of (a) previously generated content, (b) explicit conditioning signals (e.g., text, image, action trajectory, or user control), and (c) often, additional context or planning representations. Over the last several years, the principal architectural motifs have evolved from discrete token autoregression to hybrid schemes involving diffusion, transformers, and curriculum-guided inference to balance visual fidelity, computational efficiency, and flexibility of conditioning. This article surveys the major algorithmic foundations, architectural innovations, conditional mechanisms, efficiency optimizations, experimental validations, and open technical challenges in conditional autoregressive video generation with a focus on diffusion-centric and hybrid models.

1. Mathematical Formulations and General Principles

Conditional autoregressive video generation decomposes the conditional joint likelihood of a T-frame video sequence $x_{1:T}$ into a product of transition distributions, each sequentially conditioned on past frames and (possibly) conditioning variables $c$:

$$p(x_{1:T} \mid c) = \prod_{t=1}^{T} p(x_t \mid x_{<t}, c)$$

Here, $c$ can encode arbitrary external input, including images (image-to-video), text, map/layout, or interactive controls (for world models) (Liu et al., 26 Oct 2024, Gao et al., 25 Nov 2024, Deng et al., 18 Jul 2024, Chen et al., 28 May 2025).
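A minimal sketch of how this factorization translates into sequential sampling, assuming a hypothetical `TransitionModel` that stands in for whatever per-frame sampler a given system uses (diffusion head, token decoder, etc.); only the loop structure is meant literally:

```python
import torch


class TransitionModel(torch.nn.Module):
    def __init__(self, frame_dim: int, cond_dim: int):
        super().__init__()
        self.net = torch.nn.Linear(frame_dim + cond_dim, frame_dim)

    @torch.no_grad()
    def sample(self, context: torch.Tensor, cond: torch.Tensor) -> torch.Tensor:
        # context: running summary of x_{<t}; cond: the external signal c (e.g., a text/image embedding)
        mean = self.net(torch.cat([context, cond], dim=-1))
        return mean + 0.1 * torch.randn_like(mean)  # stochastic transition p(x_t | x_{<t}, c)


def generate(model: TransitionModel, cond: torch.Tensor, num_frames: int, frame_dim: int) -> torch.Tensor:
    frames, context = [], torch.zeros(frame_dim)
    for _ in range(num_frames):
        x_t = model.sample(context, cond)           # sample the next frame given the generated past and c
        frames.append(x_t)
        context = torch.stack(frames).mean(dim=0)   # crude summary of the generated past
    return torch.stack(frames)                      # (T, frame_dim)


model = TransitionModel(frame_dim=16, cond_dim=8)
video = generate(model, cond=torch.randn(8), num_frames=5, frame_dim=16)
print(video.shape)  # torch.Size([5, 16])
```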

Models operationalize this factorization at various granularities:

  • Frame-level AR: Each frame $x_t$ is generated conditioned on true/generated $x_{<t}$
  • Patch/token-level AR: Each spatial or spatiotemporal token is generated sequentially
  • Block-wise AR: Video is partitioned into blocks; each block is generated conditioned on prior blocks

Some methods use masking-based autoregression, where the generative process is formulated as predicting masked frames or patches given unmasked (or reference) content. This enables a single model to handle diverse conditional tasks: interpolation, image-to-video, expansion, and arbitrary masking (Liu et al., 26 Oct 2024, Zhou et al., 21 Jan 2025).
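As an illustration of how a single masked-prediction interface covers these tasks, the sketch below builds boolean masks over frames for a few conditioning patterns; the task names and helper function are illustrative, not drawn from any specific codebase:

```python
import torch


def task_mask(task: str, num_frames: int) -> torch.Tensor:
    mask = torch.ones(num_frames, dtype=torch.bool)  # True = frame must be generated
    if task == "image_to_video":
        mask[0] = False                  # first frame given, rest generated
    elif task == "interpolation":
        mask[0] = mask[-1] = False       # endpoints given, middle generated
    elif task == "expansion":
        mask[: num_frames // 2] = False  # first half given, continue the clip
    return mask


for task in ("image_to_video", "interpolation", "expansion"):
    print(task, task_mask(task, num_frames=8).int().tolist())
```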

2. Model Architectures and Conditional Mechanisms

2.1 Planning and Decoding

A prototypical modern architecture, as in MarDini, combines a masked autoregressive (MAR) transformer for temporal planning with a conditional diffusion model (DM) for high-fidelity spatial frame synthesis (a schematic sketch follows the list below):

  • The MAR model operates on low-resolution latents with masked positions, generating per-frame planning signals by attending both spatially and temporally across frames. Transformer depth and parameter allocation are biased toward this stage, enabling computation-intensive spatiotemporal attention (Liu et al., 26 Oct 2024).
  • The DM denoises noisy high-resolution latent frames, conditioned via cross-attention on the MAR planning outputs. Temporal attention in DM is often limited for efficiency.
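The following is a schematic sketch of that two-stage split under simplifying assumptions: a planner with full spatio-temporal attention over low-resolution tokens, and a deliberately tiny stand-in for the spatial denoiser that consumes the planning signals via cross-attention. All module sizes, names, and shapes are illustrative:

```python
import torch
import torch.nn as nn


class TwoStageARStep(nn.Module):
    def __init__(self, plan_dim=64, frame_dim=128, heads=4):
        super().__init__()
        # Heavy stage: spatio-temporal attention over low-res context tokens (the planner).
        self.planner = nn.TransformerEncoder(
            nn.TransformerEncoderLayer(plan_dim, heads, batch_first=True), num_layers=4
        )
        # Light stage: cross-attention injects planning signals into high-res latents.
        self.cross_attn = nn.MultiheadAttention(frame_dim, heads, kdim=plan_dim, vdim=plan_dim, batch_first=True)
        self.denoise = nn.Linear(frame_dim, frame_dim)  # stands in for a spatial DM block

    def forward(self, lowres_tokens, noisy_hires):
        plan = self.planner(lowres_tokens)              # (B, T_ctx, plan_dim) planning signals
        guided, _ = self.cross_attn(noisy_hires, plan, plan)
        return self.denoise(noisy_hires + guided)       # one guided denoising step for the current frame


step = TwoStageARStep()
out = step(torch.randn(2, 8, 64), torch.randn(2, 16, 128))  # 8 low-res context tokens, 16 hi-res patches
print(out.shape)                                             # torch.Size([2, 16, 128])
```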

2.2 Causal Attention and Cache-Enabled Inference

Recent advances such as Ca2-VDM and ViD-GPT enforce causal temporal self-attention in their transformers (lower-triangular attention masking), limiting each time index's receptive field to past and current positions. Coupled with key-value (KV) caching, computation for already-generated context need not be recomputed, yielding inference cost that grows linearly with video length (Gao et al., 25 Nov 2024, Gao et al., 16 Jun 2024).
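A single-head, projection-free sketch of this mechanism, purely for illustration (real systems interleave it with spatial attention and operate on latent patches rather than raw vectors):

```python
import torch
import torch.nn.functional as F


class CachedCausalAttention:
    def __init__(self):
        self.k_cache, self.v_cache = [], []

    def step(self, q, k, v):
        # q, k, v: (new_frames, dim) for the chunk being generated.
        self.k_cache.append(k)
        self.v_cache.append(v)
        K = torch.cat(self.k_cache, dim=0)   # all cached + current keys
        V = torch.cat(self.v_cache, dim=0)
        scores = q @ K.t() / K.shape[-1] ** 0.5
        # Lower-triangular mask within the new chunk; cached positions are always visible.
        T_new, T_all = q.shape[0], K.shape[0]
        offset = T_all - T_new
        causal = torch.arange(T_all)[None, :] <= (torch.arange(T_new)[:, None] + offset)
        scores = scores.masked_fill(~causal, float("-inf"))
        return F.softmax(scores, dim=-1) @ V


attn = CachedCausalAttention()
for chunk in range(3):                        # generate three chunks of 2 frames each
    x = torch.randn(2, 16)
    out = attn.step(x, x, x)                  # past chunks come from the cache, not recomputation
    print(chunk, out.shape)
```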

Ca2-VDM further innovates by splitting temporal and spatial attention, introducing a dedicated, small spatial prefix cache to maintain high-fidelity guidance from a limited context window, and dramatically reducing chunk-wise recomputation via temporal cache sharing.

2.3 Block-wise AR and Interpolation Between AR and Diffusion

ACDiT introduces block-wise conditional diffusion with the Skip-Causal Attention Mask (SCAM): generation alternates between AR-diffusion blocks and context buildup in the KV-cache. By tuning block size, ACDiT interpolates between classical token-wise AR and full-sequence diffusion, enabling both scalability and sample quality (Hu et al., 10 Dec 2024).
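The sketch below constructs a block-wise attention mask in this spirit: bidirectional attention within a block, causal attention to all earlier blocks. It illustrates only the interpolation idea and is not the exact ACDiT/SCAM mask:

```python
import torch


def blockwise_causal_mask(num_blocks: int, block_size: int) -> torch.Tensor:
    T = num_blocks * block_size
    block_id = torch.arange(T) // block_size
    # allowed[i, j] = True if token i may attend to token j.
    allowed = block_id[:, None] >= block_id[None, :]
    return allowed


print(blockwise_causal_mask(num_blocks=3, block_size=2).int())
# block_size=1 recovers token-wise causal AR; block_size=T recovers full-sequence attention,
# matching the interpolation between AR and diffusion described above.
```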

2.4 Tokenization Regimes

  • Continuous latent tokens: VAE-encoded continuous-valued representations (e.g., VideoMAR) enable spatial and temporal extrapolation, efficient storage, and reduced rasterization cost (Yu et al., 17 Jun 2025).
  • Discrete tokens: VQ-VAE or VQGAN encoders quantize each frame into indices; ARCON, EndoGen, and some early systems use this for frame-wise or grid-wise AR generation (see the sketch after this list) (Ming et al., 4 Dec 2024, Liu et al., 23 Jul 2025).
  • Hybrid approaches: MarDini and ARVAE combine continuous and spatially decoupled bottlenecks for efficient, scalable encoding (Shen et al., 12 Dec 2025).
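A minimal sketch of the discrete-token regime referenced above, assuming a placeholder codebook and latent grid (nearest-neighbour quantization only; commitment losses, EMA updates, and decoders are omitted):

```python
import torch


def quantize(latent: torch.Tensor, codebook: torch.Tensor) -> torch.Tensor:
    # latent: (num_patches, dim); codebook: (K, dim). Returns one discrete index per patch.
    dists = torch.cdist(latent, codebook)   # (num_patches, K) pairwise distances
    return dists.argmin(dim=-1)             # tokens that become the AR prediction targets


codebook = torch.randn(512, 32)             # K = 512 codes of dimension 32
frame_latent = torch.randn(64, 32)          # e.g. an 8x8 grid of patch latents for one frame
tokens = quantize(frame_latent, codebook)
print(tokens.shape, tokens.dtype)           # torch.Size([64]) torch.int64
```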

3. Training and Inference Strategies

3.1 Losses and Objectives

  • Diffusion Denoising Loss: Models like MarDini and VideoMAR use stochastic "next-frame" diffusion losses in latent space, sometimes with velocity or noise-prediction heads and classifier-free guidance (a minimal sketch follows this list) (Liu et al., 26 Oct 2024, Yu et al., 17 Jun 2025).
  • Likelihood-based Losses: Normalizing-flow models (e.g., VideoFlow) directly maximize log-likelihood, exploiting invertibility for efficient sampling (Kumar et al., 2019).
  • Adversarial Objectives: AAPT applies post-hoc adversarial training using a student-forcing regime (autoregressive sampling during training), combining relativistic GAN loss and framewise reconstruction to induce sharp, consistent per-frame quality and to address compounding error (Lin et al., 11 Jun 2025).
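As referenced in the first bullet above, here is a hedged sketch of a conditional noise-prediction loss for the next latent frame. The linear-interpolation forward process, the tiny denoiser, and all shapes are placeholder assumptions; real systems use proper noise schedules and large DiT/U-Net backbones:

```python
import torch


def next_frame_diffusion_loss(denoiser, x_next, context, cond):
    # x_next: (B, D) clean latent of the frame to generate;
    # context: (B, D) summary of past frames; cond: (B, D) external conditioning.
    t = torch.rand(x_next.shape[0], 1)                 # random diffusion time in [0, 1]
    noise = torch.randn_like(x_next)
    x_noisy = (1 - t) * x_next + t * noise             # simplified linear-interpolation forward process
    pred = denoiser(torch.cat([x_noisy, context, cond], dim=-1), t)
    return torch.mean((pred - noise) ** 2)             # noise-prediction (epsilon) objective


class TinyDenoiser(torch.nn.Module):
    def __init__(self, dim):
        super().__init__()
        self.net = torch.nn.Linear(3 * dim + 1, dim)   # consumes noisy frame, context, cond, and t

    def forward(self, h, t):
        return self.net(torch.cat([h, t], dim=-1))


B, D = 4, 16
loss = next_frame_diffusion_loss(TinyDenoiser(D), torch.randn(B, D), torch.randn(B, D), torch.randn(B, D))
print(loss.item())
```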

3.2 Teacher Forcing Regimes

MAGI introduces Complete Teacher Forcing (CTF), where masked frames are conditioned on complete (unmasked) observations, aligning training and inference and enabling efficient, low-memory frame-level AR with a diffusion head. Masked Teacher Forcing (MTF), by contrast, produces a train-test mismatch and higher exposure bias. CTF enables $\mathcal{O}(T)$ scaling and long-sequence stability (Zhou et al., 21 Jan 2025).
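A conceptual sketch of the two regimes, showing only how the training context for predicting frame t is assembled; the regime names map onto the distinction above, but the masking rate and helper function are illustrative:

```python
import torch


def make_training_pair(video, t, regime="complete"):
    # video: (T, D) clean latent frames; the task is to predict frame t from frames < t.
    context = video[:t].clone()
    if regime == "masked":
        drop = torch.rand(context.shape[0]) < 0.5   # randomly mask half the context frames
        context[drop] = 0.0                         # context no longer matches what inference will see
    return context, video[t]


video = torch.randn(6, 8)
for regime in ("complete", "masked"):
    ctx, tgt = make_training_pair(video, t=4, regime=regime)
    print(regime, "zeroed context frames:", int((ctx.abs().sum(dim=-1) == 0).sum()))
```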

3.3 Curriculum and Progressive Learning

VideoMAR employs short-to-long curriculum learning for both temporal and spatial scale: initial training on short clips and low spatial resolution, staged scaling to longer durations and higher resolution (Yu et al., 17 Jun 2025). Similarly, ARVAE uses multi-stage training to prevent early overfitting and stabilize long-horizon decay (Shen et al., 12 Dec 2025).
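A sketch of what such a short-to-long curriculum can look like in code; the stage lengths, resolutions, and step counts are placeholders rather than the published schedules:

```python
from dataclasses import dataclass


@dataclass
class Stage:
    num_frames: int
    resolution: int
    steps: int


CURRICULUM = [
    Stage(num_frames=8,  resolution=128, steps=50_000),   # short, low-res warm-up
    Stage(num_frames=16, resolution=256, steps=50_000),
    Stage(num_frames=64, resolution=256, steps=25_000),   # long-horizon fine-tuning
]


def stage_for_step(global_step: int) -> Stage:
    budget = 0
    for stage in CURRICULUM:
        budget += stage.steps
        if global_step < budget:
            return stage
    return CURRICULUM[-1]


print(stage_for_step(60_000))  # Stage(num_frames=16, resolution=256, steps=50000)
```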

4. Conditioning, Extrapolation, and Guidance

4.1 Conditional Scaffold

Conditioning sources include:

  • Text prompts or captions (text-to-video)
  • Reference images or initial frames (image-to-video and interpolation)
  • Maps, layouts, and structural priors (e.g., driving scenes)
  • Interactive controls and action trajectories (world models)

4.2 Extrapolation and Arbitrary Length Generation

Transformers with rotary or block-wise position embeddings can extrapolate to temporally longer (or spatially larger) outputs than seen during training (Yu et al., 17 Jun 2025, Hu et al., 10 Dec 2024). Proper architectural causality and cache strategies support efficient unrolling to hundreds or thousands of frames, as demonstrated in VideoMAR, Streetscapes, and AAPT.

4.3 Specialized Retrievers and Memory

For interactive world modeling tasks, standard AR models accumulate error and are limited by fixed context. VRAG supplements current context with explicit frame retrieval from a memory buffer based on state similarity, training the model to utilize these long-range anchors—critical for spatiotemporal consistency in long sequences (Chen et al., 28 May 2025).
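A sketch of retrieval-augmented conditioning in this spirit: a buffer of past frame embeddings is queried by cosine similarity to the current state, and the top matches are appended to the context as long-range anchors. The buffer layout, similarity metric, and class names are illustrative assumptions, not the exact VRAG design:

```python
import torch
import torch.nn.functional as F


class FrameMemory:
    def __init__(self):
        self.states, self.frames = [], []

    def add(self, state: torch.Tensor, frame: torch.Tensor):
        self.states.append(state)
        self.frames.append(frame)

    def retrieve(self, query: torch.Tensor, k: int = 2):
        if not self.states:
            return []
        sims = F.cosine_similarity(torch.stack(self.states), query[None], dim=-1)
        top = sims.topk(min(k, len(self.frames))).indices
        return [self.frames[i] for i in top]   # long-range anchors appended to the current context


memory = FrameMemory()
for _ in range(10):
    memory.add(state=torch.randn(16), frame=torch.randn(64))
anchors = memory.retrieve(query=torch.randn(16), k=2)
print(len(anchors), anchors[0].shape)          # 2 torch.Size([64])
```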

5. Computational Efficiency, Optimization, and Trade-offs

Conditional AR video generation imposes unique computational trade-offs:

  • Asymmetric parameter allocation (MarDini): Most compute and parameters go to the low-res MAR planner, enabling full spatiotemporal attention; high-res DM is lightweight (Liu et al., 26 Oct 2024).
  • Linear scaling with chunk/block AR: Ca2-VDM and ACDiT use chunked autoregression with KV-cache for temporal context reuse and avoid $\mathcal{O}(T^2)$ complexity (Gao et al., 25 Nov 2024, Hu et al., 10 Dec 2024).
  • Single-step (1-NFE) AR generation: AAPT converts diffusion models into single-step AR generators (one network function evaluation per frame), yielding real-time video streaming up to 1,440 frames on commodity hardware (Lin et al., 11 Jun 2025).
  • Parallel spatial patch generation: VideoMAR masks and reconstructs spatial subsets per frame, yielding high throughput (Yu et al., 17 Jun 2025).

Empirical findings consistently show SOTA FVD/FID scores, large speedups over bidirectional or unoptimized AR schemes, and stable long-sequence generation with these methods.

6. Experimental Results and Comparative Analysis

| Model | Scenario | Dataset | FVD | Unique Features |
|---|---|---|---|---|
| MarDini | Interp./I2V | DAVIS/UCF-101/VIDIM | 99.05/204.2/102.9 | Masked AR planning + diffusion (Liu et al., 26 Oct 2024) |
| Ca2-VDM | Long AR | MSR-VTT/UCF-101 | 181/184.5 | Chunk AR, cache sharing, causal attention (Gao et al., 25 Nov 2024) |
| VideoMAR | I2V, Extrap. | VBench-I2V | 84.82 | Curriculum learning, 3D-RoPE, KV-cache (Yu et al., 17 Jun 2025) |
| ACDiT | Class-cond. AR | UCF-101 | 90 | Blockwise AR-diffusion, SCAM, flexible AR (Hu et al., 10 Dec 2024) |
| ARCON | Driving contin. | nuScenes/BDD100K | 57.6 | Structural + texture tokens, flow warping (Ming et al., 4 Dec 2024) |
| EndoGen | Med. cond. AR | HyperKvasir/SurgVisdom | 507.2/1393.6 | Grid-frame AR, semantic-aware masking (Liu et al., 23 Jul 2025) |
| AAPT | Realtime AR | Diverse | n/a | 1-NFE, adversarial training, student-forcing (Lin et al., 11 Jun 2025) |

Key observations:

  • MarDini, Ca2-VDM, and ACDiT achieve SOTA FVD/FID at $\mathcal{O}(T)$ or $\mathcal{O}(B)$ scaling (with B the number of blocks).
  • KV-cache and causal attention are critical for both speed and long-term coherence.
  • Exposure bias and compounding error, a recurring issue in AR models, are effectively ameliorated by curriculum learning, student-forcing, and memory retrieval strategies.
  • Domain-specific adaptations (EndoGen for medicine, ARCON for driving) leverage architectural AR for domain constraints and measurable downstream improvements.

7. Open Challenges and Future Directions

Despite progress, several technical challenges remain:

  • Exposure bias and compounding error over very long horizons, which curriculum learning, student-forcing, and retrieval-based memory mitigate but do not eliminate.
  • Fixed context windows that constrain long-range spatiotemporal consistency, particularly in interactive world modeling.
  • The computational cost of high-resolution, long-duration generation, even with KV-caching and chunked autoregression.
  • Unifying heterogeneous conditioning signals (text, image, layout, action) within a single autoregressive interface.

In summary, the conditional autoregressive paradigm—including masked, causal, block-wise, and retrieval-augmented strategies—provides a unifying, scalable, and flexible foundation for high-fidelity video generation, supporting both open-ended synthesis and application-driven interactive tasks. Key advances are anchored in architectural causality, cache-enabled inference, hybrid AR-diffusion training, and curriculum-based regularization, collectively pushing the limits of temporal horizon and conditional controllability in generative video modeling (Liu et al., 26 Oct 2024, Gao et al., 25 Nov 2024, Hu et al., 10 Dec 2024, Yu et al., 17 Jun 2025, Zhou et al., 21 Jan 2025, Chen et al., 28 May 2025, Shen et al., 12 Dec 2025, Deng et al., 18 Jul 2024, Ming et al., 4 Dec 2024, Liu et al., 23 Jul 2025, Lin et al., 11 Jun 2025).
