
Autoregressive Video Diffusion Models (AR-VDMs)

Updated 20 December 2025
  • Autoregressive Video Diffusion Models (AR-VDMs) are generative frameworks that produce video frames sequentially with causal conditioning and diffusion denoising.
  • They enable scalable long-range video synthesis, efficient streaming, and modular control through advanced memory and temporal factorization techniques.
  • Key challenges include error accumulation, memory bottlenecks, and balancing compression efficiency with high-fidelity generation.

Autoregressive Video Diffusion Models (AR-VDMs) are a subclass of generative video models in which the video sequence is produced framewise or chunkwise via conditional denoising diffusion, subject to strict temporal causality: each frame or latent group is generated conditioned only on past observations and/or auxiliary inputs. This formalism enables scalable long-range synthesis, efficient streaming, and modular control, while presenting unique architectural and theoretical challenges concerning error accumulation, memory capacity, compression, and generation efficiency. AR-VDMs subsume a range of predecessor designs, including teacher-forced bidirectional VDMs, causal transformers, streaming residual diffusions, diffusion-token AR-LMs, and masked AR-planning pipelines.

1. Model Architectures and Temporal Factorization

Autoregressive video diffusion models employ distinct architectural mechanisms to ensure causal generation. The common factorization is

$$p(x_{1:T}) = \prod_{t=1}^{T} p(x_t \mid x_{<t}, c_t),$$

where $x_t$ denotes the latent (or pixel) state of frame $t$ and $c_t$ can encode text, trajectory, reference frames, or domain priors (Li et al., 5 Dec 2024, Yang et al., 2022, Gao et al., 16 Jun 2024, Zhao et al., 9 Oct 2025). Within each generative step, the model applies a multi-step or few-step diffusion denoising chain over $S$ discrete timesteps, starting from noise and iteratively reconstructing the clean frame conditioned on prior context.
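
To make the factorization concrete, the sketch below mirrors it as a two-level loop: an outer causal loop over frames and an inner denoising chain per frame. The `model` callable and tensor shapes are illustrative assumptions, not the interface of any cited system; Section 4 sketches the streaming variant with caching.

```python
import torch

def sample_video(model, conds, T, S=50, shape=(16, 32, 32)):
    """Minimal autoregressive sampler mirroring p(x_{1:T}) = prod_t p(x_t | x_{<t}, c_t).

    `model(x, s, past, cond)` is assumed to return the partially denoised frame at
    diffusion step s, given only already-generated past frames and conditioning c_t.
    """
    frames = []
    for t in range(T):                       # outer loop: strict temporal causality
        x = torch.randn(shape)               # each frame's chain starts from pure noise
        for s in reversed(range(S)):         # inner S-step denoising chain
            x = model(x, s, past=frames, cond=conds[t])
        frames.append(x.detach())
    return torch.stack(frames)               # (T, C, H, W) latent video
```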

Several variants exist:

  • Continuous Token AR-LMs: DiCoDe (Li et al., 5 Dec 2024) introduces “diffusion-compressed deep tokens” trained with a diffusion decoder; AR-LMs (GPT, Llama) autoregressively model these tokens, yielding roughly 1000× sequence compression versus conventional VQ methods.
  • Causal Attention Transformers: ViD-GPT (Gao et al., 16 Jun 2024) and Ca²-VDM (Gao et al., 25 Nov 2024) employ strictly lower-triangular attention masks, ensuring each token or frame attends only to earlier positions and transforming per-clip bidirectional VDMs into scalable causal generators (a minimal masking sketch follows this list).
  • Residual Correction Decomposition: Compress-RNN or U-Net predictors synthesize a deterministic next-frame guess, augmented by stochastic diffusion-generated residuals (Yang et al., 2022).
  • Masked AR Planning: MarDini (Liu et al., 26 Oct 2024) and ARLON (Li et al., 27 Oct 2024) decouple coarse temporal planning (via AR-transformers over VQ codes) from high-res spatial generation (diffusion denoising conditioned on planning signals).
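
As referenced in the Causal Attention Transformers item above, causal masking can be illustrated with a block-causal frame mask, in which tokens attend within their own frame and to all earlier frames, applied with PyTorch's scaled dot-product attention. The shapes below are illustrative assumptions, not those of any particular model.

```python
import torch
import torch.nn.functional as F

def causal_frame_mask(num_frames: int, tokens_per_frame: int) -> torch.Tensor:
    """Boolean mask M[i, j] = True iff query token i may attend to key token j,
    i.e. iff token j belongs to the same or an earlier frame."""
    frame_idx = torch.arange(num_frames).repeat_interleave(tokens_per_frame)
    return frame_idx.unsqueeze(1) >= frame_idx.unsqueeze(0)

# Illustrative shapes: 4 frames x 8 tokens per frame, 2 heads, 64-dim keys/values.
q = k = v = torch.randn(1, 2, 4 * 8, 64)
mask = causal_frame_mask(num_frames=4, tokens_per_frame=8)
out = F.scaled_dot_product_attention(q, k, v, attn_mask=mask)  # block-causal attention
```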

Hybrid architectures incorporate stateful memory: RAD (Chen et al., 17 Nov 2025) and VideoSSM (Yu et al., 4 Dec 2025) fuse windowed local caches with global recurrent (LSTM, SSM) or compressed representations to maintain long-range consistency.

2. Diffusion Frameworks and Training Regimes

The diffusion backbone is typically a discretized stochastic process, Markovian in timestep $t$:

$$q(x_t \mid x_{t-1}) = \mathcal{N}\!\left(x_t;\ \sqrt{1-\beta_t}\, x_{t-1},\ \beta_t I\right),$$

with closed-form marginalization

$$x_t = \sqrt{\bar\alpha_t}\, x_0 + \sqrt{1-\bar\alpha_t}\, \epsilon, \qquad \epsilon \sim \mathcal{N}(0, I),$$

where $\bar\alpha_t = \prod_{s=1}^{t} (1-\beta_s)$ (Li et al., 5 Dec 2024, Zhao et al., 9 Oct 2025, Yang et al., 2022).
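
The closed-form marginal translates directly into code; the sketch below uses an illustrative linear β-schedule (the schedule values and tensor shapes are assumptions, not taken from any cited model).

```python
import torch

S = 1000                                    # number of diffusion timesteps
betas = torch.linspace(1e-4, 0.02, S)       # illustrative linear schedule
alpha_bar = torch.cumprod(1.0 - betas, 0)   # \bar{alpha}_t = prod_{s<=t} (1 - beta_s)

def q_sample(x0: torch.Tensor, t: int, eps: torch.Tensor) -> torch.Tensor:
    """Sample x_t ~ q(x_t | x_0) via the closed-form marginal."""
    return alpha_bar[t].sqrt() * x0 + (1.0 - alpha_bar[t]).sqrt() * eps

x0 = torch.randn(16, 32, 32)                # a clean frame latent
eps = torch.randn_like(x0)
x_t = q_sample(x0, t=500, eps=eps)          # noised latent at timestep 500
```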

Reverse denoising uses either $\epsilon$-prediction or $x_0$-estimation, optionally parameterized for each frame $i$ by past context and the current timestep embedding. Training objectives include standard MSE losses, variational lower bounds, or score matching (Yang et al., 2022).
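
A minimal teacher-forced ε-prediction training step might look as follows; `model` is a placeholder denoiser, and `q_sample`/`num_steps` stand for the marginal sampler and step count from the previous sketch. This is a hedged sketch of the generic objective, not any cited model's training loop.

```python
import torch
import torch.nn.functional as F

def training_step(model, optimizer, q_sample, num_steps, video, cond):
    """One teacher-forced epsilon-prediction step: noise one frame via the closed-form
    marginal and regress the injected noise, conditioning only on that frame's clean past."""
    T = video.shape[0]
    t_frame = torch.randint(1, T, (1,)).item()      # which frame to denoise
    s = torch.randint(0, num_steps, (1,)).item()    # which diffusion timestep
    eps = torch.randn_like(video[t_frame])
    x_s = q_sample(video[t_frame], s, eps)          # forward-noised frame latent
    eps_hat = model(x_s, s, past=video[:t_frame], cond=cond)
    loss = F.mse_loss(eps_hat, eps)                 # standard epsilon-prediction MSE
    optimizer.zero_grad()
    loss.backward()
    optimizer.step()
    return loss.item()
```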

Many AR-VDMs introduce additional regularization or scheduling, such as teacher-forced training on clean past context and progressive noise schedules across frames.

3. Causal Attention, Memory, and Compression Mechanisms

Architectural innovations focus on efficient temporal context:

| Model | Temporal Mechanism | Memory Handling |
| --- | --- | --- |
| ViD-GPT | Causal attention, KV-cache | Reuse of all past KV features |
| Ca²-VDM | Causal attention, cache sharing | Fixed-size FIFO, O(KlPₘₐₓ) cost |
| RAD / VideoSSM | LSTM/SSM hybrid | Sliding window + state compression |
| AR-Diffusion | Temporal causal attention | Non-decreasing timesteps |
| DiCoDe | AR LLM over tokens | 1000× deep-token compression |
| MarDini | Masked AR planning over VQ | Asymmetric high/low-res attention |

Notably, cache reuse in Ca²-VDM (Gao et al., 25 Nov 2024) converts AR generation complexity from quadratic to linear in sequence length, while memory-augmented models (RAD, VideoSSM (Chen et al., 17 Nov 2025, Yu et al., 4 Dec 2025)) enable hundreds to thousands of temporally coherent frames via fusion of local attention with state-space or LSTM compression.
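
The cache-reuse idea can be sketched as a fixed-size FIFO of per-frame key/value features; this is a schematic analogue of the mechanism (class and method names are assumptions), not the Ca²-VDM implementation.

```python
from collections import deque
import torch

class FrameKVCache:
    """Fixed-size FIFO of per-frame key/value features. Keeping the cache bounded
    makes per-frame attention cost constant, so total generation cost grows
    linearly (rather than quadratically) with video length."""
    def __init__(self, max_frames: int):
        self.kv = deque(maxlen=max_frames)

    def append(self, k: torch.Tensor, v: torch.Tensor) -> None:
        self.kv.append((k, v))              # oldest frame is evicted automatically

    def context(self):
        if not self.kv:
            return None, None
        ks, vs = zip(*self.kv)
        return torch.cat(ks, dim=-2), torch.cat(vs, dim=-2)  # concat along token axis
```

In such a scheme, each finished frame's keys and values are appended once and then reused by all later frames and denoising steps, so past-frame features are never recomputed.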

Compression bottlenecks are analytically demonstrated in Meta-ARVDM (Wang et al., 12 Mar 2025): the KL-divergence between generated and true videos grows with both error accumulation and an unavoidable “memory bottleneck.” Practical architectures mitigate this by prepending, channel-concatenating, or cross-attending to compressed summaries of past frames, with ablation studies confirming empirical trade-offs.
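
As one hedged illustration of such compressed summaries (the names and pooling choices below are assumptions, not the Meta-ARVDM design), past latents can be averaged into a fixed-size map and channel-concatenated with the current noisy latent:

```python
import torch
import torch.nn.functional as F

def compress_history(past: torch.Tensor, out_hw: int = 8) -> torch.Tensor:
    """Compress past frame latents (T, C, H, W) into a single (C, out_hw, out_hw) summary
    by averaging over time and pooling spatially."""
    pooled = past.mean(dim=0, keepdim=True)              # temporal average -> (1, C, H, W)
    return F.adaptive_avg_pool2d(pooled, out_hw)[0]      # (C, out_hw, out_hw)

def condition_on_memory(x_noisy: torch.Tensor, past: torch.Tensor) -> torch.Tensor:
    """Channel-concatenate an upsampled history summary with the current noisy latent."""
    mem = compress_history(past)
    mem = F.interpolate(mem.unsqueeze(0), size=x_noisy.shape[-2:], mode="nearest")[0]
    return torch.cat([x_noisy, mem], dim=0)              # (2C, H, W) denoiser input
```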

4. Sampling Pipelines, Inference Efficiency, and Streaming Generation

AR-VDMs support variable-length and streaming inference by virtue of their causal structure. Typical sampling involves the following steps (a schematic loop is sketched after the list):

  1. Prepare conditioning (text prompt, motion trajectory, previous frame cache).
  2. For each new frame/chunk:
    • Sample the initial latent from $\mathcal{N}(0, I)$, or re-corrupt previous frames for progressive schedules.
    • Iteratively denoise via diffusion steps, using cached context and adaptive attention.
    • Update memory or cache (local or global as applicable).
  3. Concatenate or stitch frames to assemble video sequence.
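
Putting these steps together, a schematic streaming loop might look as follows; `model.denoise`, `model.encode_kv`, and `recorrupt_from` are assumed helper names, and `cache` can be the FrameKVCache sketch from Section 3. This is a sketch of the generic pipeline, not a specific system.

```python
import torch

def stream_generate(model, cache, conds, num_frames, S=50, shape=(16, 32, 32),
                    recorrupt_from=None, recorrupt_level=None):
    """Streaming AR generation: per frame, start from noise (or a re-corrupted previous
    frame), run S cached-context denoising steps, then update the cache."""
    video = []
    for t in range(num_frames):
        if recorrupt_from is not None and video:
            # progressive schedule: partially re-noise the previous frame as the init
            x = recorrupt_from(video[-1], recorrupt_level)
        else:
            x = torch.randn(shape)                      # fresh Gaussian init
        k_ctx, v_ctx = cache.context()                  # cached keys/values of past frames
        for s in reversed(range(S)):
            x = model.denoise(x, s, k_ctx, v_ctx, conds[t])
        k, v = model.encode_kv(x)                       # features of the finished frame
        cache.append(k, v)                              # update memory (FIFO eviction)
        video.append(x)
        yield x                                         # frames can be streamed as produced
    # the caller may torch.stack the yielded frames to assemble the full clip
```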

Optimizations include KV-cache reuse across frames and denoising steps, few-step denoising chains, and asynchronous non-decreasing timestep schedules across frames (as in AR-Diffusion).

5. Evaluation Methods and Empirical Results

State-of-the-art AR-VDMs are evaluated on short-clip (MSR-VTT, UCF-101), long-form (minute-scale), and action-conditioned benchmarks (DMLab, Minecraft). Metrics include:

  • Fréchet Video Distance (FVD): Lower is better; models such as Ca²-VDM (Gao et al., 25 Nov 2024), AR-Diffusion (Sun et al., 10 Mar 2025), and DiCoDe (Li et al., 5 Dec 2024) consistently achieve or approach SOTA FVD scores across datasets (e.g., Ca²-VDM: FVD=181 on MSR-VTT; AR-Diffusion: FVD₁₆=186.6 on UCF-101). A minimal Fréchet-distance computation is sketched after this list.
  • CLIPSIM/IS scores: Semantic alignment to prompt; DiCoDe (Li et al., 5 Dec 2024) and ART·V (Weng et al., 2023) are competitive with much larger pretrain baselines.
  • Motion Consistency/Smoothness: AR-Drag (Zhao et al., 9 Oct 2025): 4.37 motion consistency, latency 0.44s, clearly exceeding previous controllable motion VDMs.
  • Temporal continuity and drift: ViD-GPT (Gao et al., 16 Jun 2024) introduces Step-FVD/ΔEdgeFD for chunkwise drift analysis; qualitative plots show flat frame-difference curves and lower stepwise FVD versus bidirectional or less causal baselines.
  • Streaming and interpolation: MarDini (Liu et al., 26 Oct 2024) achieves SOTA FVD=99.05 for interpolation (17 frames @512) and sub-second per-frame latency without image pretraining.
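
For reference, FVD is the Fréchet distance between Gaussians fit to clip-level features (the standard protocol uses I3D features); the sketch below computes that distance from two (N, D) feature matrices and is agnostic to the feature extractor.

```python
import numpy as np
from scipy import linalg

def frechet_distance(feats_real: np.ndarray, feats_gen: np.ndarray) -> float:
    """Fréchet distance between Gaussians fit to (N, D) feature matrices:
    ||mu1 - mu2||^2 + Tr(S1 + S2 - 2 (S1 S2)^{1/2})."""
    mu1, mu2 = feats_real.mean(0), feats_gen.mean(0)
    s1 = np.cov(feats_real, rowvar=False)
    s2 = np.cov(feats_gen, rowvar=False)
    covmean = linalg.sqrtm(s1 @ s2)
    if np.iscomplexobj(covmean):
        covmean = covmean.real                 # discard numerical imaginary parts
    return float(((mu1 - mu2) ** 2).sum() + np.trace(s1 + s2 - 2.0 * covmean))
```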

Theoretical analysis (Meta-ARVDM (Wang et al., 12 Mar 2025)) reveals a Pareto frontier: increased context reduces memory bottleneck at the expense of error accumulation and efficiency.

6. Limitations, Open Problems, and Prospective Directions

Several persistent limitations arise from both theory and empirical studies:

  • Memory Bottleneck: Information-theoretically unavoidable unless unbounded context is accommodated; compressing the memory budget trades global consistency against error propagation (Wang et al., 12 Mar 2025).
  • Error Accumulation: AR sampling accumulates KL-divergence with rollout length; mitigated by richer caches, state-space memory, or progressive noise schedules (Xie et al., 10 Oct 2024, Yu et al., 4 Dec 2025). A minimal schedule sketch follows this list.
  • Scene boundaries and stochastic motion: Tokenizer architecture (e.g., DiCoDe (Li et al., 5 Dec 2024)) assumes smooth reconstructibility between head/tail tokens; hard scene cuts or stochastic dynamics may degrade results.
  • Data and domain bias: WebVid/YouTube sources lead to imbalance (nonrigid > rigid scenes); long-form rigid motion modeling remains weaker (Li et al., 5 Dec 2024, Li et al., 27 Oct 2024).
  • Scalability: Fixed context windows or linear memory capacity saturate at minute-scale; hybrid or multi-scale memory modules remain an active research area (Yu et al., 4 Dec 2025, Chen et al., 17 Nov 2025).
  • Efficient adaptation and robustness: Noise schedule adaptation, adaptive compression, LoRA-style feedforward refiners (e.g., AutoRefiner (Yu et al., 12 Dec 2025)), and robust teacher/student pipelines are evolving to minimize artifacts under long autoregressive rollouts.
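
As a hedged illustration of the progressive-noise idea (not the exact schedule of any cited paper), frames later in the current window can be assigned monotonically non-decreasing noise levels, so the model always conditions on a gently graded context rather than a hard clean/noisy boundary:

```python
import torch

def progressive_noise_levels(window: int, s_min: int = 0, s_max: int = 999) -> torch.Tensor:
    """Monotonically non-decreasing diffusion timesteps across a window of frames:
    the oldest frame is (nearly) clean, the newest is (nearly) pure noise."""
    return torch.linspace(s_min, s_max, window).round().long()

levels = progressive_noise_levels(window=8)   # e.g. tensor([  0, 143, 285, ..., 999])
```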

Future directions target hybrid and multi-scale memory modules, adaptive noise schedules and compression, and robust refinement of long autoregressive rollouts.

7. Representative Models and Comparative Table

| Model | AR Mechanism | Core Innovation | Notable Results |
| --- | --- | --- | --- |
| DiCoDe (Li et al., 5 Dec 2024) | AR-LM over diffusion-compressed tokens | 1000× token compression | Minute-long, scalable video; FVD=367 (16f) |
| Ca²-VDM (Gao et al., 25 Nov 2024) | Causal attention, cache sharing | Linear complexity, KV reuse | FVD=181 (MSR-VTT); 52s / 80 frames |
| RAD (Chen et al., 17 Nov 2025) | DiT+LSTM hybrid, pre-fetch | Frame-wise AR, memory fusion | Improved SSIM/LPIPS; >1000 frames |
| MarDini (Liu et al., 26 Oct 2024) | Masked AR planning, asymmetric | S-T planning/generation split | SOTA FVD interpolation; 0.5s/frame |
| AR-Diffusion (Sun et al., 10 Mar 2025) | Non-decreasing steps, causal | FoPP/AD scheduler, async generation | FVD=40.8 (Sky); best cross-domain |
| ARLON (Li et al., 27 Oct 2024) | AR VQ-VAE + DiT fusion | Norm-based semantic injection | Top dynamics/consistency/efficiency |
| VideoSSM (Yu et al., 4 Dec 2025) | State-space global memory | Hybrid SSM + local cache | Best minute-scale fidelity/stability |
| AutoRefiner (Yu et al., 12 Dec 2025) | Pathwise reflective LoRA | Context-sensitive noise refinement | +0.7 VBench, 6 fps, no reward hacking |

Each model’s strengths and empirical advances are defined by architectural choices concerning temporal causality, memory handling, tokenization/compression, and cache efficiency.


Autoregressive video diffusion models have redefined the scalability and fidelity envelope for generative video synthesis, achieving minute-long coherent motion, interactive real-time control, and streaming-friendly architectures. Their foundation in causal modeling, efficient memory, and compressor-guided AR mechanisms continues to drive active research at the intersection of multimodal generation, memory theory, and scalable inference.
