Causal Autoregressive Diffusion (CARD)

Updated 1 February 2026
  • CARD is a unified framework that integrates autoregressive and diffusion methods using strictly causal conditioning.
  • It employs lower-triangular masking and soft-tailed strategies to ensure efficient, stable, and sequential generation across modalities.
  • The approach demonstrates enhanced quality and speed in language, image, video, and control benchmarks via cache sharing and autoregressive decoding.

Causal Autoregressive Diffusion (CARD) is a framework that unifies autoregressive modeling with diffusion-based generative processes under strictly causal conditioning. The CARD paradigm factors complex generative processes into a sequence of steps where each state depends only on its causal history, ensuring that information is processed in a forward, autoregressive manner aligned with a diffusion trajectory. This approach has been adopted across a range of domains—including language modeling, image synthesis, video generation, and visuomotor policy learning—yielding architectures that combine the stability and training efficiency of autoregressive models with the multimodal generative flexibility of diffusion processes (Ruan et al., 29 Jan 2026, Gao et al., 29 May 2025, Xu et al., 2024, Gao et al., 2024, Ma et al., 17 Jun 2025).

1. Foundational Principles of CARD

CARD reinterprets the standard diffusion process—traditionally a Markov chain with iterative, often bidirectional, denoising—as a strictly causal, autoregressive procedure. In the CARD framework, the generative model predicts each token, pixel group, or temporally local observation based solely on the progressively constructed causal prefix, matching the sequentiality of autoregressive models while maintaining the stepwise refinement of diffusion.

Key innovations distinguishing CARD from prior hybrid or bidirectional approaches include:

  • Reformulating the forward (noising) and reverse (denoising) steps under strict lower-triangular (causal) attention masks.
  • Enabling interpretations where each autoregressive step aligns with a diffusion timestep, admitting either discrete or continuous time parameterizations.
  • Allowing dense, per-step, or per-token supervision via the diffusion objective while retaining efficient, parallelizable inference under certain masking or cache regimes.

This conceptual framework enables unification of per-token cross-entropy losses (as in ARMs) with diffusion-motivated score matching or denoising losses, and supports strict left-to-right (or past-to-future) information flow in masking, attention, and cache design (Ruan et al., 29 Jan 2026).

2. Core Methodologies and Masking Mechanisms

Forward and Reverse Processes

CARD generalizes the forward noising process—such as Gaussian or masked corruption—so that at each "causal" step, noise or masking is applied sequentially, and denoising predictions are conditioned only on the causal history. For example, in discrete language modeling, tokens may be replaced with a special mask token [MASK] under a continuous-time schedule α(t), and the denoising model reconstructs clean tokens using only previous, unmasked positions (Ruan et al., 29 Jan 2026).
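A minimal sketch of this forward masking process, assuming a linear schedule α(t) = 1 − t and a placeholder mask-token id (both illustrative choices, not the papers' exact parameterization):

```python
import numpy as np

MASK_ID = -1  # hypothetical mask-token id

def alpha(t):
    """Assumed linear schedule: probability a token stays clean at time t."""
    return 1.0 - t

def forward_mask(tokens, t, rng):
    """Replace each token with MASK_ID independently with prob 1 - alpha(t)."""
    tokens = np.asarray(tokens)
    keep = rng.random(tokens.shape) < alpha(t)
    return np.where(keep, tokens, MASK_ID)

rng = np.random.default_rng(0)
x = np.arange(10)
print(forward_mask(x, 0.0, rng))  # t = 0: nothing is masked
print(forward_mask(x, 1.0, rng))  # t = 1: everything is masked
```

The denoiser is then trained to reconstruct the clean tokens at masked positions, conditioning only on unmasked positions earlier in the sequence.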

Strict Causal Masking

A lower-triangular attention mask M_{ij} is universally employed, ensuring that generation and denoising at position i can only directly attend to positions j ≤ i (or j < i), making all generation strictly autoregressive:

M_{ij} = \begin{cases} 0, & j \le i \\ -\infty, & j > i \end{cases}

This guarantee of unidirectional information flow is central to causal diffusion instantiations in autoregressive LLMs, image transformers, and video generation (Ruan et al., 29 Jan 2026, Xu et al., 2024, Gao et al., 2024).
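The lower-triangular mask above can be built directly as an additive bias on the attention logits; a minimal NumPy sketch:

```python
import numpy as np

def causal_mask(n):
    """Additive attention mask: 0 where j <= i (visible past, including
    the current position), -inf where j > i (blocked future)."""
    i = np.arange(n)[:, None]
    j = np.arange(n)[None, :]
    return np.where(j <= i, 0.0, -np.inf)

M = causal_mask(4)
# Adding M to the attention logits before softmax gives future positions
# zero weight, enforcing strictly unidirectional information flow.
```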

Soft-Tailed and Local Context-Preserving Masks

CARD introduces tailored masking strategies—such as soft-tailed masking—to ensure that for early sequence positions, some local clean context is always present, mitigating instability arising from wholly masked prefixes. These methods selectively mask a variable tail window or sample masked positions within bounded segments of the sequence (Ruan et al., 29 Jan 2026).
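One way such a strategy might be realized, as an illustrative sketch only—the guaranteed-clean-prefix length and tail mask rate here are assumptions, not the paper's exact scheme:

```python
import numpy as np

MASK_ID = -1  # hypothetical mask-token id

def soft_tail_mask(tokens, t, min_clean_prefix, rng):
    """Mask tokens only in the tail of the sequence, keeping at least
    `min_clean_prefix` clean positions so no denoising step is ever
    conditioned on a wholly masked prefix."""
    tokens = np.asarray(tokens).copy()
    n = len(tokens)
    start = min(min_clean_prefix, n)
    mask = rng.random(n - start) < t  # mask rate grows with diffusion time t
    tokens[start:][mask] = MASK_ID
    return tokens

rng = np.random.default_rng(0)
out = soft_tail_mask(np.arange(8), 1.0, 3, rng)
# the first 3 positions are always clean; the tail is fully masked at t = 1
```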

3. Architectures and Sampling Algorithms

CARD implementations maintain architectural and sampling simplicity by leveraging "vanilla" autoregressive transformer backbones, attention masking, and standard tokenization or patchification procedures. This is exemplified in several domains:

Image Synthesis

In D-AR (Diffusion via Autoregressive Models), the image is encoded by a transformer and tokenized into a 1D sequence of discrete codes via vector quantization. Tokens are grouped to correspond with successive diffusion denoising intervals:

  • AR Factorization: P(z_1,\dots,z_N) = \prod_{i=1}^N P(z_i \mid z_{<i})
  • Sampling: Sequential token generation, with each new group triggering a diffusion step that recovers progressively finer image details.
  • Streaming Previews: Intermediate token sets can be decoded directly for consistent partial previews, reflecting the evolving state along the diffusion trajectory (Gao et al., 29 May 2025).
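The group-triggered sampling loop can be sketched as follows; `ToyModel` and its `next_token`/`diffusion_step` interface are hypothetical stand-ins for a trained D-AR model:

```python
class ToyModel:
    """Stand-in for a trained D-AR model (hypothetical interface)."""
    def next_token(self, prefix):
        return len(prefix)          # deterministic dummy token
    def diffusion_step(self, preview, tokens):
        return list(tokens)         # "refinement" = snapshot of the tokens

def generate_streaming(model, n_groups, group_size):
    """Each completed token group triggers one diffusion step, yielding
    progressively refined previews (coarse to fine) along the trajectory."""
    tokens, preview = [], None
    for _ in range(n_groups):
        for _ in range(group_size):
            tokens.append(model.next_token(tokens))
        preview = model.diffusion_step(preview, tokens)
        yield preview

previews = list(generate_streaming(ToyModel(), n_groups=3, group_size=2))
# previews grow group by group: [0, 1], [0, 1, 2, 3], [0, 1, 2, 3, 4, 5]
```

The key structural point is that every preview is decodable from a causal prefix alone, which is what makes streaming previews consistent with the final sample.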

Video Generation

Ca2-VDM and MSC decompose long videos into autoregressive blocks or frames, using strictly causal temporal attention. Cache sharing and unidirectional key/value reuse reduce computational redundancy:

  • Autoregressive Decomposition: p_\theta(z_{0:L}) = \prod_{k=0}^{K-1} p_\theta(z_0^{P_k:P_{k+l}} \mid z_0^{0:P_k})
  • Causal Attention: Lower-triangular temporal masking, with spatial attention windows possibly incorporating short, recent prefix frames.
  • Cache Sharing: Prefix K/V caches are computed once per chunk and reused across all denoising steps, scaling inference from O(K^3 T) to O(KT) (Gao et al., 2024, Xu et al., 2024).
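The saving from cache sharing can be illustrated by counting prefix K/V computations; this toy cost model illustrates the asymptotics only and is not the papers' exact accounting:

```python
def attention_cost(num_chunks, steps_per_chunk, cache_sharing):
    """Count prefix key/value computations in chunk-wise AR generation.
    Without sharing, every denoising step recomputes the whole prefix;
    with sharing, each chunk's K/V is computed once and reused."""
    cost = 0
    for k in range(num_chunks):
        prefix_len = k  # number of chunks already generated
        if cache_sharing:
            cost += prefix_len                    # computed once per chunk
        else:
            cost += prefix_len * steps_per_chunk  # recomputed every step
    return cost

print(attention_cost(8, 50, cache_sharing=False))  # 1400
print(attention_cost(8, 50, cache_sharing=True))   # 28
```

Because the attention is strictly causal, a chunk's keys and values never change once generated, which is what makes this one-time computation sound.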

Language Modeling

The CARD LLM extends masked diffusion approaches by supporting per-token, parallel denoising under strict autoregressive masking. Dynamic parallel decoding is unlocked, with blocks of [MASK] tokens denoised in parallel up to a confidence threshold, then committed and extended further using KV caching for throughput (Ruan et al., 29 Jan 2026).
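A minimal sketch of confidence-thresholded parallel commitment, assuming per-position predictive distributions and using −1 as a placeholder [MASK] id:

```python
import numpy as np

def parallel_decode_step(probs, threshold):
    """Commit, in one pass, every masked position whose top predicted
    probability clears `threshold`; the rest stay masked for the next pass."""
    conf = probs.max(axis=-1)          # per-position confidence
    tokens = probs.argmax(axis=-1)     # per-position greedy prediction
    committed = conf >= threshold
    return np.where(committed, tokens, -1), committed  # -1 = still [MASK]

# Three masked positions over a 2-token vocabulary:
probs = np.array([[0.9, 0.1], [0.55, 0.45], [0.2, 0.8]])
tokens, committed = parallel_decode_step(probs, threshold=0.7)
# tokens -> [0, -1, 1]: only positions 0 and 2 are confident enough to commit
```

Committed tokens are then appended to the KV cache and the remaining masked positions are re-denoised in the next pass, which is where the decoding-throughput gain comes from.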

4. Computational and Practical Considerations

Efficiency and Scaling

CARD models achieve ARM-level training speed due to the use of fully causal masks and per-token supervision, with no additional overhead from bidirectional or block-multihead masking. Empirically:

  • CARD LLMs train as fast as ARMs (1.0×), unlike block diffusion LMs or full-attention masked LMs, which incur 1.5–3× training time increases (Ruan et al., 29 Jan 2026).
  • Cache-optimized video models reduce inference complexity from O(K^3 T) (bidirectional, block-wise) to O(KT), enabling longitudinally consistent long-form generation with manageable memory and runtime (Gao et al., 2024).

Quality and Robustness

Across image, video, and policy domains, CARD models match or exceed the performance of classic AR or diffusion baselines while providing additional benefits:

  • D-AR-XL obtains FID = 2.09 on ImageNet 256×256, outperforming size-matched LlamaGen-XL and rivaling significantly larger AR models (Gao et al., 29 May 2025).
  • Ca2-VDM achieves state-of-the-art FVD (184.5) on UCF-101 and maintains temporal consistency in long autoregressive video synthesis (Gao et al., 2024).
  • Causal Diffusion Policy (CDP) in robot policy learning delivers consistently higher success rates in Adroit and RoboFactory benchmarks, with particular robustness under degraded input quality and long-horizon tasks (Ma et al., 17 Jun 2025).
  • CARD LLMs attain higher zero-shot accuracy and lower perplexity than discrete diffusion baselines (53.2% vs. 47.5% zero-shot; PPL = 34.4 vs. 49.8) while maintaining training efficiency (Ruan et al., 29 Jan 2026).

5. Empirical Results and Benchmark Summaries

A cross-domain summary of CARD results includes:

| Domain | Model | Efficiency Metric | Quality Metric | Benchmark |
|---|---|---|---|---|
| Language | CARD | 1.0× ARM speed; 1.7–4.0× generation speedup | PPL = 34.4 | HellaSwag, MMLU, ARC, LM1B, etc. |
| Image | D-AR-XL | 1.0× vanilla AR | FID = 2.09 | ImageNet 256×256 |
| Video | Ca2-VDM | Inference up to 3× faster | FVD = 184.5 | UCF-101, MSR-VTT |
| Policy/Control | CDP | Real-time, chunk-wise KV cache | Success +5–20 pp | Adroit, RoboFactory, Meta-World |

Notably, Ca2-VDM and MSC demonstrate that causal autoregressive diffusion enables efficient, context-preserving, and computationally tractable video synthesis, with cache sharing and multi-scale spatio-temporal attention further reducing resource requirements (Xu et al., 2024, Gao et al., 2024).

6. Limitations and Open Challenges

Recognized constraints of current CARD frameworks include:

  • Tokenization Complexity: In image synthesis, the tokenizer (e.g., for D-AR) entails substantial parameter overhead (≈300M) and necessitates separate training of a diffusion decoder (Gao et al., 29 May 2025).
  • Cache Design: While cache sharing amortizes cost, handling variable-length or cyclic positional encodings in long-horizon generation remains an engineering challenge (Gao et al., 2024).
  • Causal Pretraining: Most VDM backbones are pretrained with bidirectional masks; performance could be enhanced with causal pretraining (Gao et al., 2024).
  • Long-Range Temporal Drift: Autoregressive video models still suffer gradual drift in very long sequences, motivating hybrid strategies with hierarchical caches or global refreshes (Gao et al., 2024).
  • Masking/Context Tuning: Hyperparameters for soft-tailed masking, context weighting, and chunk sizes must be tuned to trade off signal, stability, and computational cost (Ruan et al., 29 Jan 2026, Gao et al., 29 May 2025).

A plausible implication is that future research will explore adaptive masking, dynamic block sizing, curriculum strategies for variable-length context, and more general context fusion to extend the robustness and efficiency of CARD methods across broader domains.

7. Synthesis and Impact

CARD has emerged as a unifying framework for efficient, high-quality sequence modeling across modalities. By enforcing strict causality at every stage—through masking, autoregressive conditioning, and cache orchestration—CARD delivers ARM-grade data usage, diffusion-level sample quality, and generation throughput far exceeding classic ARMs when dynamic block decoding and cache reuse are leveraged (Ruan et al., 29 Jan 2026).

Experimental studies consistently demonstrate that proper realization of causal autoregressive diffusion leads to state-of-the-art or near state-of-the-art performance on major language, image, and video benchmarks, and robust visuomotor policy acquisition. Ongoing development focuses on optimizing cache utility, context weighting, multi-scale architecture blending, and causal pretraining, indicating CARD's sustained relevance for next-generation generative modeling (Ruan et al., 29 Jan 2026, Gao et al., 29 May 2025, Xu et al., 2024, Gao et al., 2024, Ma et al., 17 Jun 2025).
