Hybrid AR-Diffusion Models

Updated 18 August 2025
  • Hybrid AR-Diffusion models are generative architectures that unify order-agnostic autoregressive and absorbing diffusion techniques for flexible data generation.
  • They use a diffusion-like training objective that predicts masked tokens in parallel, reducing computational cost and bypassing strict sequential constraints.
  • The models achieve strong performance in tasks like lossless compression by leveraging adaptive scheduling and dynamic programming for efficient, parallel token prediction.

Hybrid AR-Diffusion models are a class of generative models that integrate autoregressive (AR) mechanisms and diffusion processes into a flexible architecture for discrete and continuous data generation, one that generalizes and unifies classical order-agnostic AR models and modern discrete diffusion models. The canonical instance, Autoregressive Diffusion Models (ARDMs), encapsulates both paradigms by predicting masked tokens in a randomized order via a diffusion-like objective, sidestepping the strict architectural and generative constraints of purely sequential or purely parallel models.

1. Foundations and Model Architecture

ARDMs are defined as a strict generalization of order-agnostic autoregressive models (OA-ARMs) and absorbing-state discrete diffusion models. Formally, rather than factorizing the joint likelihood strictly sequentially as

$$\log p(x) = \sum_{t=1}^{D} \log p(x_t \mid x_{1:t-1}),$$

as in standard ARMs, ARDMs use a diffusion-like objective that predicts a group of masked or "absorbed" tokens simultaneously, conditional on a random subset of observed variables. This is achieved by rewriting the OA-ARM lower bound as

$$\log p(x) \geq \mathbb{E}_{t \sim \mathcal{U}(\{1, \ldots, D\})}\bigl[\, D \cdot \mathcal{L}_t \,\bigr],$$

where

$$\mathcal{L}_t = \frac{1}{D-t+1}\, \mathbb{E}_{\sigma \sim \mathcal{U}(S_D)} \sum_{k \in \sigma(\geq t)} \log p\bigl(x_k \mid x_{\sigma(<t)}\bigr).$$

The neural network $f$ that parameterizes the conditional distributions ingests an input vector with explicit binary masking: $x_{\text{input}} = m \odot x + (1-m) \odot x_a$, where $x_a$ is the absorbing token. Causal masking is not required architecturally; dependence on masked versus observed dimensions is explicitly indicated via $m$. For higher-order dependencies, a staged generation procedure can be used, wherein the most significant bits or coarse structures are generated first and successive ARDM steps upscale them to finer representations.
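As a concrete illustration of this masking scheme, the following is a minimal PyTorch sketch that builds the masked input from integer token ids. The absorbing-token id and the network interface `f(x_in, mask)` are illustrative assumptions, not taken from any reference implementation.

```python
import torch

ABSORB_ID = 0  # hypothetical token id reserved for the absorbing state x_a

def ardm_input(x: torch.Tensor, m: torch.Tensor) -> torch.Tensor:
    """Build x_input = m * x + (1 - m) * x_a for integer tokens x of shape
    (B, D) and a binary mask m of shape (B, D), with 1 = observed, 0 = absorbed."""
    return torch.where(m.bool(), x, torch.full_like(x, ABSORB_ID))

# The mask is passed to the (non-causal) network alongside the masked input,
# so the model can tell observed from absorbed positions without a causal mask:
#   logits = f(ardm_input(x, m), m)   # shape (B, D, vocab_size)
```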

2. Training Regime and Computational Efficiency

ARDMs are trained without causal masking by averaging the objective over all possible token orderings and time steps. For a given $(x, t, \sigma)$ triplet, the network is trained to predict all tokens whose permutation index is at or after $t$, i.e., every currently masked token. Each model update thus covers multiple prediction subtasks per optimization step, unlike standard AR models, which sum over $D$ steps but train only on next-token prediction at each one. This reweighting is crucial for efficiency on high-dimensional data, as the objective decouples the cost per update from the data dimensionality. As a result, ARDMs avoid the computational bottleneck of strict sequential sampling and can be trained with efficient minibatch schedules.
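A single objective estimate can be sketched as below, a minimal version assuming the same hypothetical interface `f(x_in, mask) -> logits`; the factor $D/(D-t+1)$ converts the summed loss over masked positions into an unbiased estimate of $D \cdot \mathcal{L}_t$.

```python
import torch
import torch.nn.functional as F

def ardm_training_loss(f, x: torch.Tensor, absorb_id: int) -> torch.Tensor:
    """One ARDM objective estimate for a batch x of token ids, shape (B, D).

    Samples a step t and a random ordering per example, masks every position
    whose permutation index is >= t, and reweights the summed masked
    cross-entropy by D / (D - t + 1), matching the E_t[D * L_t] bound."""
    B, D = x.shape
    t = torch.randint(1, D + 1, (B, 1), device=x.device)              # step index in {1, ..., D}
    ranks = torch.argsort(torch.rand(B, D, device=x.device), dim=-1)  # uniformly random ranks
    observed = ranks < (t - 1)                                        # first t-1 positions of the ordering

    x_in = torch.where(observed, x, torch.full_like(x, absorb_id))    # absorb the rest
    logits = f(x_in, observed)                                        # (B, D, vocab_size)

    nll = F.cross_entropy(logits.transpose(1, 2), x, reduction="none")  # (B, D) per-token NLL
    masked_nll = (nll * (~observed)).sum(dim=-1)                      # sum over absorbed positions
    reweight = D / (D - t.squeeze(1) + 1).to(nll.dtype)
    return (reweight * masked_nll).mean()                             # negative of the D * L_t estimate
```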

3. Parallel Generation and Adaptive Scheduling

Unlike standard ARMs, which are constrained to strict left-to-right generation (one token per network call), ARDMs can produce $k$ tokens per generative step, conditional on the set of previously unmasked tokens. The log-likelihood contribution for a set of masked tokens is identical in expectation under the uniform distribution over orderings, so the marginal cost for $k$ tokens is $k \cdot \mathcal{L}_t$. A dynamic programming routine (detailed in the paper) constructs an optimal trade-off between speed (fewer network calls) and accuracy (likelihood), effectively adapting to any user-specified sampling or runtime budget. In practice, ARDMs can use 20–50 parallel generative steps on high-dimensional inputs and achieve performance comparable to models requiring hundreds or thousands of steps.
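A rough sketch of such a budget-constrained schedule is given below. It runs a simple dynamic program over estimated per-step losses $\mathcal{L}_t$ using the cost model above (emitting $k$ tokens at step $t$ costs $k \cdot \mathcal{L}_t$ in expectation); this is an illustrative reconstruction under those assumptions, not the paper's exact routine.

```python
import numpy as np

def parallel_schedule(L: np.ndarray, budget: int):
    """Choose group sizes k_1, ..., k_B (B <= budget, summing to D) that
    minimise the total expected NLL, where emitting k tokens at step t costs
    k * L[t]. L[t] is the expected per-token NLL with t tokens observed."""
    D = len(L)
    INF = float("inf")
    # best[b, t]: minimal cost to fill positions t..D-1 using at most b calls
    best = np.full((budget + 1, D + 1), INF)
    best[:, D] = 0.0
    choice = np.zeros((budget + 1, D + 1), dtype=int)
    for b in range(1, budget + 1):
        for t in range(D - 1, -1, -1):
            for k in range(1, D - t + 1):
                cost = k * L[t] + best[b - 1, t + k]
                if cost < best[b, t]:
                    best[b, t], choice[b, t] = cost, k
    # Walk back through the table to recover the schedule.
    steps, t, b = [], 0, budget
    while t < D:
        k = int(choice[b, t])
        steps.append(k)
        t, b = t + k, b - 1
    return steps, float(best[budget, 0])

# Example: losses that shrink as context grows favour small groups early
# and large groups late.
# steps, total = parallel_schedule(np.linspace(5.0, 1.0, 64), budget=8)
```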

Parallel Generation Table

| Model | Typical Steps | Tokens per Step | Typical Speedup |
| --- | --- | --- | --- |
| AR (naive) | $D$ | 1 | baseline |
| Discrete diffusion (absorbing) | $\sim 1000$ | 1 | slow (per-step network cost) |
| ARDM | $D$ (max) | $k$ ($k \gg 1$) | $\sim k\times$ (only $\sim D/k$ network calls) |

ARDMs are thus suitable for latency-sensitive applications where generation cost must be amortized over large blocks of tokens or dimensions.
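For generation itself, a minimal sampling loop under the same hypothetical network interface might look as follows: one random ordering is drawn, and each network call fills the next group of still-absorbed positions according to a schedule such as the one produced by the dynamic-programming sketch above.

```python
import torch

@torch.no_grad()
def ardm_sample(f, schedule, D: int, absorb_id: int, device="cpu") -> torch.Tensor:
    """Generate one length-D sequence. `schedule` is a list of group sizes
    summing to D; each entry is the number of tokens filled by one call."""
    x = torch.full((1, D), absorb_id, dtype=torch.long, device=device)
    observed = torch.zeros(1, D, dtype=torch.bool, device=device)
    order = torch.randperm(D, device=device)        # random generation order sigma
    pos = 0
    for k in schedule:
        idx = order[pos:pos + k]                    # next k positions in the ordering
        logits = f(x, observed)                     # (1, D, vocab_size)
        probs = torch.softmax(logits[0, idx], dim=-1)
        x[0, idx] = torch.multinomial(probs, num_samples=1).squeeze(-1)
        observed[0, idx] = True
        pos += k
    return x
```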

4. Applications in Lossless Compression

ARDMs possess an explicit factorization of $p(x)$, so negative log-likelihoods (in nats or bits per dimension) are directly usable as entropy coding rates. ARDMs can be used with range coders such as rANS, achieving competitive or superior bits per dimension (bpd) on benchmarks such as CIFAR-10: e.g., 2.71 bpd (with the upscaling variant) versus 3.26 bpd for IDF++ and bits-back baselines. Notably, ARDMs can compress individual data points, overcoming the overhead of joint dataset coding (bits-back). Parallel scheduling enables compression and decompression with a modest number of network calls, which is crucial for scalable or online compression tasks.
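As a sketch of how the explicit factorization yields code lengths, the function below accumulates the ideal number of bits for a single example under a fixed ordering and step schedule (both of which must be shared with the decoder). A real system would feed each predictive distribution to an entropy coder such as rANS, which realises these lengths up to a small overhead; the coder itself and the interface of `f` are assumptions here.

```python
import math
import torch

@torch.no_grad()
def ardm_code_length_bits(f, x: torch.Tensor, schedule, absorb_id: int) -> float:
    """Ideal code length in bits for one example x of shape (1, D), coding each
    group in `schedule` in parallel from the model's predictive distribution."""
    _, D = x.shape
    x_in = torch.full_like(x, absorb_id)
    observed = torch.zeros_like(x, dtype=torch.bool)
    order = torch.randperm(D, device=x.device)   # must be shared with the decoder (e.g. common seed)
    bits, pos = 0.0, 0
    for k in schedule:
        idx = order[pos:pos + k]
        log_probs = torch.log_softmax(f(x_in, observed), dim=-1)     # (1, D, vocab)
        token_lp = log_probs[0, idx].gather(-1, x[0, idx, None]).squeeze(-1)
        bits += float(-token_lp.sum() / math.log(2))                 # nats -> bits
        x_in[0, idx] = x[0, idx]                                     # reveal the coded tokens
        observed[0, idx] = True
        pos += k
    return bits  # divide by D for bits per dimension
```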

5. Comparison and Theoretical Connections to Discrete Diffusion Models

Absorbing-state discrete diffusion models (e.g., D3PM-absorbing) require hundreds to thousands of reverse steps, slowly inverting a stochastic masking process. ARDMs represent the infinite-time (continuous-time) limit of these models and, by contrast, achieve similar NLL or generative quality with far fewer steps. For instance, in modeling sequences of 250 tokens, ARDM attains $\sim 1.43$ bits/character in 250 steps, whereas D3PM-absorbing uses 1000 steps for similar performance. The efficiency is inherited from the reformulated, reweighted loss (single-step objective) and dynamic programming for parallel prediction. The architectural relaxation, decoupling the model from strict causality, enables scalable deployment on high-dimensional and structured data.

6. Hybridization Potential and Model Extensions

ARDMs, as a unification of AR and diffusion mechanisms, readily lend themselves to hybrid models. Possible directions include:

  • Integrating latent diffusion layers for initial structure (e.g., conditional or unconditional diffusion on coarse features), followed by ARDM for fine texture refinement.
  • Multi-stage, coarse-to-fine resolution via upscaling with ARDM at each stage; early stages flexibly predict global components while successive stages add detail.
  • Using ARDM to seed diffusion or iterative refinement networks, leveraging the equivalence to absorbing diffusion (as shown in the Appendix) for initialization and quick convergent sampling.
  • Layerwise hybrids, where classical AR decoding is interleaved with blocks of masked parallel denoising, connecting transformer block scheduling with ARDM's masking formalism.

This flexible architecture supports novel applications (e.g., compressed generative modeling, multimodal transformers), efficient conditional sampling, and highly adaptive generative regimes bridging the dichotomy between strict sequentiality and unfettered parallelism.


ARDMs establish a principled and practically efficient merger of autoregressive and diffusion-based generative modeling. By relaxing causal constraints, using a diffusion-like training objective, and enabling optimally scheduled parallel prediction and loss computation, they provide a blueprint for scalable, high-fidelity, and adaptive generative systems. The framework generalizes classical AR and absorbing diffusion models, while furnishing strong empirical results (notably in lossless compression) and a robust theoretical foundation for hybrid model development.