
Masked Auto-regressive Diffusion Models

Updated 26 November 2025
  • Masked Auto-regressive Diffusion Models are deep generative models that merge autoregressive and diffusion-based denoising techniques to control generation order.
  • They leverage coordinate-wise masking schedules and loss decomposition to enable adaptive decoding orders and efficient token reconstruction.
  • Extensions to hierarchical, video, and reinforcement learning applications, along with distillation methods, demonstrate improved sample quality and accelerated inference.

Masked Auto-regressive Diffusion Models (MAR) are a versatile class of deep generative models that unify the strengths of autoregressive sequence modeling and diffusion-based denoising, providing fine-grained control over generation order and offering advantages in diverse discrete and continuous domains. The following sections detail their mathematical underpinnings, formal equivalence to learned-order autoregressive models, algorithmic frameworks, recent hierarchical and video extensions, and acceleration via distillation for efficient applications including reinforcement learning.

1. Mathematical Foundations and Loss Decomposition

Masked Diffusion Models (MDMs), including MAR, operate by progressively masking (corrupting) tokens in a sequence $x_0=(x_0^1,\dots,x_0^L)$ according to a continuous-time process, followed by reconstruction of the original tokens using learned conditional distributions. The forward process employs a masking schedule $\alpha(t)$ with $t\in[0,1]$, determining the rate at which each token is masked. The evidence lower bound (ELBO) in the $T\to\infty$ limit admits a continuous-time formulation:

$$-\mathbb{E}_{x_0\sim p_{\mathrm{data}}}\bigl[\log p_\theta(x_0)\bigr] \;\le\; \mathcal{L}_{\mathrm{MDM}} = \int_{0}^{1} \frac{t}{1-t} \sum_{\ell=1}^{L} \mathbb{E}_{x_t\sim q(\cdot\mid x_0)}\bigl[-\log p_\theta(x_0^\ell \mid x_t,\, t)\bigr]\, dt$$

where $q(x_t \mid x_0)$ is the forward masking process and $p_\theta(x_0^\ell \mid x_t, t)$ the learned reconstruction probability.

By promoting the scalar $t$ to a vector $t=(t_1,\ldots,t_L)$ with independent coordinate-wise masking schedules $\alpha_\ell(t_\ell)$, the ELBO generalizes to:

$$\mathcal{L}_{\mathrm{MDM}} = \int_{[0,1]^L} \sum_{\ell=1}^{L} \frac{t_\ell}{1 - t_\ell}\, \mathbb{E}_{x_t\sim q(\cdot\mid x_0)}\bigl[-\log p_\theta(x_0^\ell \mid x_t)\bigr]\, \prod_{j=1}^{L}\bigl(-\alpha'_j(t_j)\bigr)\, dt$$

where $\alpha'_j$ denotes the derivative of $\alpha_j$. The product form describes the joint density over the transition times $t_\ell$ at which each token becomes unmasked (Garg et al., 24 Nov 2025).
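
To make the coordinate-wise schedule concrete, the short sketch below (NumPy, with a hypothetical helper name `sample_transition_times`) uses the example parameterization $\alpha_\ell(t) = 1 - t^{w_\ell}$ from Section 3 and draws per-token transition times by inverse-transform sampling from the density $-\alpha'_\ell(t) = w_\ell\, t^{w_\ell - 1}$. It illustrates the formulas above and is not code from the cited work.

```python
import numpy as np

def sample_transition_times(w, num_samples, rng=None):
    """Inverse-transform sample of per-token transition times t*_l (a sketch).

    Under alpha_l(t) = 1 - t**w_l, the CDF of t*_l is
    P(t*_l <= t) = 1 - alpha_l(t) = t**w_l, so t*_l = u**(1/w_l) with u ~ U(0,1).
    w: array of shape (L,) with positive schedule exponents.
    Returns an array of shape (num_samples, L).
    """
    rng = np.random.default_rng() if rng is None else rng
    u = rng.uniform(size=(num_samples, len(w)))
    return u ** (1.0 / np.asarray(w))

# Example: three tokens with different schedule exponents (hypothetical values).
w = np.array([0.5, 1.0, 4.0])
t_star = sample_transition_times(w, num_samples=100_000)
print(t_star.mean(axis=0))   # approx. the analytic means E[t*_l] = w_l/(w_l+1) = [1/3, 1/2, 4/5]
```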

2. Equivalence to Learned-Order Autoregressive Models

The introduction of coordinate-wise schedules enables the emergence of a non-uniform distribution over decoding orders. The time $t^*_\ell$ at which token $\ell$ transitions is distributed with cumulative distribution function $P(t^*_\ell \le t) = 1 - \alpha_\ell(t)$ and density $f_\ell(t) = -\alpha'_\ell(t)$ (Proposition 2.1). Sorting the $t^*_\ell$ yields a permutation $\pi$ representing a decoding order.

The central result (Proposition 1.1) is the decomposition:

$$\mathcal{L}_{\mathrm{MDM}} = \sum_{\pi\in S_L} P(\pi)\,\Bigl[-\sum_{i=1}^{L} \mathbb{E}_{t^*_{\pi(i)}\mid \pi}\,\log p_\theta\bigl(x_0^{\pi(i)} \mid x_0^{\pi(<i)},\, t^*_{\pi(i)}\bigr)\Bigr]$$

where $P(\pi)$ is defined by integrating over the joint transition times consistent with permutation $\pi$. When the model omits explicit time conditioning, this reduces to a mixture of standard AR losses:

$$\mathcal{L}_{\mathrm{MDM}} = \sum_{\pi\in S_L} P(\pi)\,\mathcal{L}_{\mathrm{AR}(\pi)}$$

thus establishing that MDMs with multivariate (learnable) noise schedules are mathematically equivalent to a mixture of autoregressive models over learned orders (Garg et al., 24 Nov 2025).
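
The mixture weights $P(\pi)$ can be made tangible by Monte Carlo: sample the per-token transition times and record the permutation obtained by sorting them. The snippet below is a hypothetical illustration reusing `sample_transition_times` from the Section 1 sketch; identical schedules give an approximately uniform distribution over the $3! = 6$ orders of a three-token sequence, while coordinate-wise exponents concentrate probability mass on particular orders.

```python
from collections import Counter
import numpy as np

def estimate_order_distribution(w, num_samples=200_000, rng=None):
    """Monte Carlo estimate of P(pi): sort each draw of transition times."""
    t_star = sample_transition_times(w, num_samples, rng)     # (num_samples, L)
    orders = np.argsort(t_star, axis=1)                       # one permutation per draw
    counts = Counter(tuple(int(i) for i in row) for row in orders)
    return {perm: n / num_samples for perm, n in counts.items()}

# Identical schedules: P(pi) is approximately uniform over the 3! = 6 orders.
print(estimate_order_distribution(np.array([1.0, 1.0, 1.0])))
# Coordinate-wise exponents: P(pi) becomes non-uniform, i.e., some orders are preferred.
print(estimate_order_distribution(np.array([0.5, 1.0, 4.0])))
```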

3. Order-Aware Training Algorithms and Noise Schedule Optimization

The mask schedule for each coordinate, $\alpha_\ell(t)$, is parameterized and optimized jointly with the model parameters. An example parameterization is $\alpha_\ell(t) = 1 - t^{w_\ell}$ with $w_\ell$ learned. Training proceeds as follows:

  1. Initialize model weights $\theta$ and schedule parameters $\{w_\ell\}$.
  2. For each sample $x_0$, draw $u_\ell \sim \mathrm{Uniform}[0,1]$ and set $t_\ell = \alpha_\ell^{-1}(1 - u_\ell)$.
  3. Construct the masked sequence $x_t$ by masking each token $\ell$ with probability $1 - \alpha_\ell(t_\ell)$ (i.e., draw a fresh $u'_\ell \sim \mathrm{Uniform}[0,1]$ and mask iff $u'_\ell > \alpha_\ell(t_\ell)$).
  4. Compute the weighted reconstruction losses and backpropagate through both $\theta$ and $\{w_\ell\}$, using a suitable gradient estimator for the schedule parameters (e.g., RLOO [Kool et al., 2019]).

This framework enables the discovery of favorable, task-adaptive decoding orders (Garg et al., 24 Nov 2025).
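
A minimal sketch of one such training step is given below, written in PyTorch with a hypothetical `denoiser(x_t, t)` standing in for the MDM network. It treats the schedule exponents $w_\ell$ as fixed (jointly learning them needs the score-function estimator mentioned in step 4, which is omitted here), and it masks each token with probability $1 - \alpha_\ell(t_\ell)$; it is an illustration of the steps above, not the reference implementation of Garg et al.

```python
import torch
import torch.nn.functional as F

def mdm_training_step(denoiser, x0, w, mask_id, optimizer, eps=1e-3):
    """One coordinate-wise masked-diffusion training step (a sketch).

    denoiser(x_t, t) -> logits of shape (B, L, V); a hypothetical stand-in for
    the MDM network.  w: (L,) positive exponents of alpha_l(t) = 1 - t**w_l,
    treated as fixed here; learning them jointly requires a score-function
    estimator such as RLOO, which is omitted for brevity.
    """
    B, L = x0.shape
    # Step 2: inverse-transform sample per-token times, t_l = alpha_l^{-1}(1 - u_l).
    u = torch.rand(B, L).clamp(eps, 1.0 - eps)
    t = (u ** (1.0 / w)).clamp(max=1.0 - eps)
    # Step 3: mask token l with probability 1 - alpha_l(t_l) = t_l**w_l.
    mask = torch.rand(B, L) < (t ** w)
    x_t = torch.where(mask, torch.full_like(x0, mask_id), x0)

    # Step 4: weighted reconstruction loss on masked positions.
    logits = denoiser(x_t, t)                                  # (B, L, V)
    ce = F.cross_entropy(logits.reshape(-1, logits.size(-1)),
                         x0.reshape(-1), reduction="none").view(B, L)
    weight = t / (1.0 - t)                                     # ELBO weight from Section 1
    loss = (weight * ce * mask.float()).sum() / mask.sum().clamp_min(1)

    optimizer.zero_grad()
    loss.backward()
    optimizer.step()
    return loss.item()
```

Averaging over masked positions is one common normalization choice and is not prescribed by the cited paper.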

4. Hierarchical, Video, and Scalable Extensions

Hierarchical MAR

Hi-MAR introduces a two-stage generation hierarchy leveraging low-resolution token pivots. The joint over tokens is factorized as

$$p(Z^{\text{low}}, Z^{\text{high}}) = \prod_{i=1}^{N} p\bigl(z_i^{\text{low}} \mid z_{<i}^{\text{low}}\bigr)\, \prod_{j=1}^{M} p\bigl(z_j^{\text{high}} \mid Z^{\text{low}}, z_{<j}^{\text{high}}\bigr)$$

Generation proceeds by first predicting low-res pivots, then refining to high-res tokens, using scale-aware Transformers and diffusion-denoising heads to propagate global structure efficiently. Ablation and benchmark results indicate substantial improvements in sample quality and inference time over single-scale AR and baseline MAR approaches (Zheng et al., 26 May 2025).
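
As an illustration only, the factorization above can be read off as a two-stage ancestral sampler. The sketch below abstracts away Hi-MAR's scale-aware Transformers, masked parallel decoding, and diffusion-denoising heads, and assumes two hypothetical callables `low_model` and `high_model` that return next-token probabilities.

```python
import torch

@torch.no_grad()
def hierarchical_sample(low_model, high_model, num_low, num_high):
    """Two-stage ancestral sampling following the Hi-MAR factorization (a sketch).

    low_model(z_low) and high_model(z_low, z_high) are hypothetical callables
    returning a 1-D tensor of next-token probabilities.
    """
    z_low = []
    for _ in range(num_low):                       # stage 1: low-resolution pivot tokens
        probs = low_model(z_low)                   # p(z_i^low | z_<i^low)
        z_low.append(torch.multinomial(probs, 1).item())

    z_high = []
    for _ in range(num_high):                      # stage 2: high-resolution tokens
        probs = high_model(z_low, z_high)          # p(z_j^high | Z^low, z_<j^high)
        z_high.append(torch.multinomial(probs, 1).item())
    return z_low, z_high
```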

Video Generation: MarDini

MarDini adapts MAR/MDM for scalable video generation using a low-res MAR planner for temporal structure and a lightweight diffusion generator for high-res frames. The architecture divides spatio-temporal modeling into computationally feasible components, applying masked reconstructions at low-res and denoising via DDPM at high-res. This enables efficient, versatile conditioning for interpolation, expansion, and image-to-video tasks, yielding state-of-the-art interpolation metrics and substantial compute savings (Liu et al., 26 Oct 2024).

5. Acceleration, Distillation, and Reinforcement Learning

Standard MAR models suffer from prohibitive inference times due to their nested outer AR and inner diffusion chains (e.g., $K$ AR steps $\times$ $T$ diffusion steps). MARVAL (Masked Auto-regressive Variational Acceleration) addresses this by distilling the diffusion chain in each AR step to a single generator pass. The procedure uses a score-based variational objective (GSIM) that encourages the distilled model to implicitly match the teacher's conditional distribution, enabling over 30-fold inference speedup with minimal quality degradation (e.g., FID = 2.00 at $>30\times$ speedup for ImageNet 256x256) (Gu et al., 19 Nov 2025).
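
The arithmetic behind the acceleration is simple: the teacher spends roughly $K \times T$ denoiser evaluations per sample, while the distilled student spends roughly $K$. The toy sketch below uses hypothetical step counts purely to illustrate this ratio; the speedup actually reported for MARVAL depends on the model and sampler configuration.

```python
def denoiser_evals(ar_steps, diffusion_steps_per_ar_step):
    """Network forward passes per generated sample (illustrative only)."""
    return ar_steps * diffusion_steps_per_ar_step

# Hypothetical step counts, chosen only to illustrate the K x T vs. K ratio.
teacher = denoiser_evals(ar_steps=64, diffusion_steps_per_ar_step=100)   # 6400 passes
student = denoiser_evals(ar_steps=64, diffusion_steps_per_ar_step=1)     # 64 passes
print(teacher / student)   # 100.0 -- the source of the wall-clock savings
```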

MARVAL-RL extends this acceleration into practical reinforcement learning settings by treating the MARVAL generator as a stochastic policy. RL fine-tuning is performed end-to-end with differentiable reward models, leading to measurable gains in alignment metrics like CLIP and image-reward scores (Gu et al., 19 Nov 2025).

6. Empirical Validation and Order Learning

Experiments demonstrate that learned decoding orders, as induced via multivariate masking schedules, produce lower validation loss and may yield minor but consistent improvements in data fidelity compared to fixed-order MDMs. Visualization of schedules reveals nonuniform “unmasking” across coordinates, implicating the choice of masking policy as a significant degree of freedom for optimization (Garg et al., 24 Nov 2025). Empirical studies in Hi-MAR, MarDini, and MARVAL further confirm that refining the ordering, scale hierarchy, and efficient decoding jointly enhances both sample quality and computational efficiency (Zheng et al., 26 May 2025, Liu et al., 26 Oct 2024, Gu et al., 19 Nov 2025).

7. Limitations and Future Directions

While MAR-based models offer unprecedented flexibility in generative modeling—accommodating arbitrary decoding orders, hierarchies, and scalable video conditioning—they face ongoing challenges in efficient RL post-training, memory scaling for reward models, and one-shot distillation of the entire AR/diffusion chain (Gu et al., 19 Nov 2025). Future research aims at meta-diffusion order selection, lighter-weight RL-compatible reward models, and applying the GSIM+RL blueprint to other AR-diffusion hybrids, including multi-modal and text-to-image domains.

