Block Autoregressive Diffusion Models

Updated 24 March 2026
  • Block Autoregressive Diffusion is a generative modeling approach that partitions sequences into contiguous blocks, blending autoregressive factorization with iterative diffusion refinement.
  • It employs specialized attention masking and hybrid transformer architectures to enforce causal context across blocks while enabling parallel, within-block denoising.
  • Empirical results across text, vision, video, audio, and graphs demonstrate improved fidelity and efficiency compared to pure autoregressive or diffusion methods.

Block Autoregressive Diffusion refers to a family of generative modeling techniques that interpolate between classical autoregressive (AR) and diffusion models by operating on contiguous "blocks" of content. At the core, these models factorize the joint distribution over a sequence (text, image, video, graph, etc.) into a product of conditional distributions over non-overlapping blocks, modeling each block’s conditional via an inner diffusion process while conditioning on all previously generated blocks. This approach combines the tractable likelihood optimization and flexibility of AR models with the iterative refinement, parallelism, and high-fidelity sample quality of diffusion techniques. Block-autoregressive diffusion architectures have found application across modalities, including natural language, vision, video, audio, and graphs, and form a foundational component of contemporary high-performance generative systems.

1. Mathematical Foundations and Factorization

Block autoregressive diffusion models partition an input sequence $x_{1:L}$ (or a permutation-equivariant structure, e.g., a graph) into $B$ non-overlapping, contiguous blocks, $x = [x^1, x^2, \ldots, x^B]$, where $x^i$ may represent a span of tokens, pixels/patches, audio codes, or nodes/edges. The model factorizes the joint likelihood as

$$p_\theta(x) = \prod_{i=1}^{B} p_\theta(x^i \mid x^{<i}).$$

Each block's conditional $p_\theta(x^i \mid x^{<i})$ is realized as a diffusion process: the block is initialized with pure noise (Gaussian noise in continuous domains, or tokenwise masking in discrete domains) and iteratively denoised over $T$ steps, with the reverse transition at each step $t$ conditioned on all previously denoised blocks $x^{<i}$.

Standard formulations include:

  • Continuous domains: forward noising $q(x_t^i \mid x_{t-1}^i) = \mathcal{N}(x_t^i; \sqrt{\alpha_t}\, x_{t-1}^i, (1-\alpha_t) I)$, and reverse $p_\theta(x_{t-1}^i \mid x_t^i, x^{<i}) = \mathcal{N}(\mu_\theta(x_t^i, t; c_{<i}), \beta_t I)$ (Hu et al., 2024).
  • Discrete domains: masking diffusion $q(x_t^i \mid x_0^i) = \text{Cat}(\alpha_t x_0^i + (1-\alpha_t) e_{\text{[MASK]}})$; the reverse $p_\theta(x_{t-1}^i \mid x_t^i, x^{<i})$ is parameterized by a transformer or higher-order equivariant network (Arriola et al., 12 Mar 2025, Zhao et al., 2024).
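The two forward processes above can be illustrated with a minimal, standard-library sketch. The function names, the `MASK` symbol, and the toy inputs are assumptions made for illustration; they are not taken from the cited papers.

```python
import math
import random

MASK = "[MASK]"  # hypothetical mask symbol for the discrete case


def gaussian_forward_step(x_prev, alpha_t, rng):
    """One step of q(x_t | x_{t-1}) = N(sqrt(alpha_t) * x_{t-1}, (1 - alpha_t) I)."""
    scale = math.sqrt(alpha_t)
    std = math.sqrt(1.0 - alpha_t)
    return [scale * v + std * rng.gauss(0.0, 1.0) for v in x_prev]


def mask_forward(x0, alpha_t, rng):
    """Sample q(x_t | x_0): each token survives with probability alpha_t, else becomes MASK."""
    return [tok if rng.random() < alpha_t else MASK for tok in x0]


rng = random.Random(0)
block = [0.5, -1.0, 2.0]                    # toy continuous block
tokens = ["the", "cat", "sat", "down"]      # toy discrete block

noisy_block = gaussian_forward_step(block, alpha_t=0.9, rng=rng)
masked_tokens = mask_forward(tokens, alpha_t=0.5, rng=rng)
print(noisy_block)
print(masked_tokens)
```

With `alpha_t = 1` both functions return their input unchanged, and with `alpha_t = 0` the masking process replaces every token, which matches the two endpoints of the noise schedule.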

This semi-autoregressive factorization interpolates between pure AR ($B = L$, block size 1) and pure diffusion ($B = 1$, all content generated in parallel) (Arriola et al., 12 Mar 2025, Tian et al., 7 Dec 2025).
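As a rough sketch of the factorization and its two endpoints, the snippet below partitions a sequence into blocks and sums per-block conditional log-probabilities. `cond_log_prob` is a hypothetical callable standing in for the per-block diffusion model; in practice that term is usually only bounded through an ELBO rather than computed exactly.

```python
import math


def partition_into_blocks(seq, block_size):
    """Split seq into contiguous, non-overlapping blocks of at most block_size items."""
    if block_size < 1:
        raise ValueError("block_size must be >= 1")
    return [seq[i:i + block_size] for i in range(0, len(seq), block_size)]


def block_log_likelihood(blocks, cond_log_prob):
    """Accumulate sum_i log p(x^i | x^{<i}) over blocks, left to right."""
    total = 0.0
    context = []
    for block in blocks:
        total += cond_log_prob(block, context)
        context = context + block  # earlier blocks become context for the next one
    return total


def uniform_log_prob(block, context, vocab_size=10):
    """Toy conditional (assumption): every token is uniform over vocab_size symbols."""
    return -len(block) * math.log(vocab_size)


seq = list(range(8))
ar_blocks = partition_into_blocks(seq, 1)          # B = L: token-by-token AR
mid_blocks = partition_into_blocks(seq, 4)         # B = 2: block autoregressive
full_blocks = partition_into_blocks(seq, len(seq))  # B = 1: single diffusion pass
print(len(ar_blocks), len(mid_blocks), len(full_blocks))
print(block_log_likelihood(mid_blocks, uniform_log_prob))
```

Under the toy uniform conditional the total log-likelihood is the same for every block size; only the number of sequential conditioning steps changes.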

2. Model Architectures and Attention Masking

A defining architectural feature is the use of attention masks that enforce blockwise dependencies. For transformers, this leads to block-causal or hybrid attention masks:

  • "Skip-Causal Attention Mask" (SCAM): noisy tokens in a block attend to all clean tokens in prior blocks and to themselves; clean tokens remain AR-masked (Hu et al., 2024).
  • Context-causal or block-diagonal masks: enforce AR flow of context across blocks, with flexible attention within the active block (Tian et al., 7 Dec 2025).
  • Linear attention for video: cumulative key/value computations yield constant-memory KV caches for arbitrarily long sequences (Chen et al., 29 Sep 2025).
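The masking patterns described above can be sketched as boolean matrices (`True` means the query row may attend to the key column). The first function builds a plain block-causal mask. The second follows the verbal description of the skip-causal mask over a concatenated `[clean; noisy]` layout; that layout and the function names are assumptions for illustration, not the exact implementation of any cited model.

```python
def block_causal_mask(seq_len, block_size):
    """Query q may attend to key k iff k's block is not later than q's block."""
    blk = [i // block_size for i in range(seq_len)]
    return [[blk[k] <= blk[q] for k in range(seq_len)] for q in range(seq_len)]


def clean_noisy_mask(seq_len, block_size):
    """Mask over [clean tokens ; noisy tokens], each of length seq_len (assumed layout)."""
    blk = [i // block_size for i in range(seq_len)]
    n = seq_len
    mask = [[False] * (2 * n) for _ in range(2 * n)]
    for q in range(n):
        for k in range(n):
            mask[q][k] = blk[k] <= blk[q]            # clean -> clean: block-causal
            mask[n + q][k] = blk[k] < blk[q]         # noisy -> clean: strictly earlier blocks
            mask[n + q][n + k] = blk[k] == blk[q]    # noisy -> noisy: same block only
    return mask


def show(mask):
    """Print a mask as rows of 1s and 0s."""
    for row in mask:
        print("".join("1" if allowed else "0" for allowed in row))


causal = block_causal_mask(4, 2)
hybrid = clean_noisy_mask(4, 2)
show(causal)
print()
show(hybrid)
```

In the hybrid mask the clean rows never attend to noisy columns, so the clean context is unaffected by the noise level of the block being denoised.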

Some models, such as MADFormer, vertically mix AR and diffusion layers to balance global structure and local refinement (Chen et al., 9 Jun 2025).

3. Training Objectives and Algorithms

Training combines a diffusion-based noise-prediction or ELBO loss within each block with the AR-factorized likelihood across blocks. For continuous latents the objective is $\mathcal{L} = \mathbb{E}_{i, t, \epsilon} \| \epsilon_\theta(n_i^{(t)}; t, c_{<i}) - \epsilon \|^2$, where $n_i^{(t)}$ denotes block $i$ noised to step $t$ and $c_{<i}$ the clean context from earlier blocks; for discrete blocks it is a negative ELBO or cross-entropy over masked tokens (Hu et al., 2024, Arriola et al., 12 Mar 2025, Tian et al., 7 Dec 2025). Efficient implementations interleave clean and noisy views and employ vectorized blockwise masking so that all blocks can be trained in a single forward pass. Variance-reduction strategies, such as asynchronous blockwise noise scheduling (ABNS), effective mask ratio scaling (EMRS), and a blockwise beta noise curriculum, stabilize and accelerate convergence (Cheng et al., 16 Dec 2025).
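A Monte Carlo estimate of the continuous noise-prediction objective might look like the sketch below. It assumes the standard closed-form noising $n = \sqrt{\bar\alpha_t}\, x + \sqrt{1-\bar\alpha_t}\, \epsilon$ with $\bar\alpha_t = \prod_{s \le t} \alpha_s$; `eps_model`, the toy schedule, and the two toy predictors are hypothetical stand-ins for a trained network.

```python
import math
import random


def cumulative_alpha(alphas, t):
    """abar_t = product of alpha_s for s = 1..t (t is 1-indexed)."""
    return math.prod(alphas[:t])


def noise_prediction_loss(blocks, alphas, eps_model, rng, num_samples=64):
    """Estimate E_{i,t,eps} || eps_model(noised block; t, context) - eps ||^2."""
    total = 0.0
    for _ in range(num_samples):
        i = rng.randrange(len(blocks))        # sample a block index
        t = rng.randint(1, len(alphas))       # sample a diffusion step
        abar = cumulative_alpha(alphas, t)
        eps = [rng.gauss(0.0, 1.0) for _ in blocks[i]]
        noised = [math.sqrt(abar) * x + math.sqrt(1.0 - abar) * e
                  for x, e in zip(blocks[i], eps)]
        context = blocks[:i]                  # clean earlier blocks
        pred = eps_model(noised, t, context)
        total += sum((p - e) ** 2 for p, e in zip(pred, eps))
    return total / num_samples


blocks = [[0.1, 0.2], [0.3, -0.4], [1.0, 0.5]]   # toy data, three blocks of size 2
alphas = [0.99, 0.95, 0.90, 0.80]                # toy schedule (assumption)


def zero_model(noised, t, context):
    """Baseline that always predicts zero noise."""
    return [0.0 for _ in noised]


def oracle_model(noised, t, context):
    """Sanity check only: inverts the noising using the true clean block."""
    x0 = blocks[len(context)]
    abar = cumulative_alpha(alphas, t)
    return [(n - math.sqrt(abar) * x) / math.sqrt(1.0 - abar)
            for n, x in zip(noised, x0)]


zero_loss = noise_prediction_loss(blocks, alphas, zero_model, random.Random(0))
oracle_loss = noise_prediction_loss(blocks, alphas, oracle_model, random.Random(0))
print(zero_loss, oracle_loss)
```

The oracle cheats by looking up the clean block, so its loss is numerically zero; it is included only to confirm that the loss is minimized by the true noise.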

In AR checkpoint adaptation, an auxiliary AR loss reuses next-token targets on the context and a gradual curriculum increases block size to maximize data and knowledge retention (Tian et al., 7 Dec 2025). For dynamic block length (CtrlDiff), reinforcement learning optimizes block size selection for the quality–efficiency trade-off (Huang et al., 20 May 2025).

4. Inference and Decoding Procedures

Sampling proceeds autoregressively across blocks, looping $B$ times, while within-block denoising is executed either in parallel (all tokens at once) or with iterative confidence-based refinement.
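The decoding loop can be sketched for the discrete (masked) case as follows. `predict` is a hypothetical callable returning a `(token, confidence)` proposal for every position in the active block; the rule of committing the most confident masked positions first is one simple instance of confidence-based refinement, not a specific published schedule.

```python
import math

MASK = None  # placeholder for a not-yet-decoded position


def decode_block(block_size, context, predict, steps):
    """Denoise one block: each step commits the most confident masked positions."""
    block = [MASK] * block_size
    per_step = math.ceil(block_size / steps)
    while any(tok is MASK for tok in block):
        proposals = predict(block, context)
        masked = [p for p in range(block_size) if block[p] is MASK]
        masked.sort(key=lambda p: proposals[p][1], reverse=True)
        for p in masked[:per_step]:
            block[p] = proposals[p][0]
    return block


def generate(num_blocks, block_size, predict, steps):
    """Outer AR loop over blocks; each finished block joins the context."""
    context = []
    for _ in range(num_blocks):
        context.extend(decode_block(block_size, context, predict, steps))
    return context


calls = {"n": 0}


def toy_predict(block, context):
    """Toy model (assumption): proposes each position's absolute index as its token."""
    calls["n"] += 1
    start = len(context)
    return [(start + p, 1.0 / (1 + p)) for p in range(len(block))]


output = generate(num_blocks=3, block_size=4, predict=toy_predict, steps=2)
print(output)
print(calls["n"])
```

With three blocks of four tokens and two denoising steps per block, the toy model is called six times instead of the twelve calls token-by-token decoding would need.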

For efficient language and vision-language inference, blockwise diffusion typically yields a $D/S$ speedup (block size $D$, denoising steps $S$ per block) over token-by-token AR, with measured 2–2.5× wall-clock acceleration at comparable quality (Zeng et al., 17 Dec 2025, Wu et al., 30 Sep 2025). In video and audio, constant-memory caching (e.g., SANA-Video) permits minute-long generation at fixed resource cost (Chen et al., 29 Sep 2025), and block autoregressive inference can extend arbitrarily ("streaming" generation) (Team et al., 25 Nov 2025, Zhang et al., 28 Nov 2025).
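The $D/S$ figure follows from counting sequential network calls, as in the sketch below. It assumes every forward pass costs the same regardless of how many tokens it processes, which is an idealization; the sequence length, block size, and step count are illustrative values, not measurements from the cited work.

```python
import math


def ar_forward_passes(seq_len):
    """Token-by-token AR decoding needs one pass per token."""
    return seq_len


def block_diffusion_forward_passes(seq_len, block_size, steps_per_block):
    """Block decoding needs steps_per_block passes for each of ceil(L / D) blocks."""
    num_blocks = math.ceil(seq_len / block_size)
    return num_blocks * steps_per_block


def ideal_speedup(seq_len, block_size, steps_per_block):
    """Ratio of sequential passes; equals D / S when D divides L."""
    return ar_forward_passes(seq_len) / block_diffusion_forward_passes(
        seq_len, block_size, steps_per_block
    )


speedup = ideal_speedup(seq_len=1024, block_size=32, steps_per_block=16)
print(speedup)  # 1024 / (32 blocks * 16 steps) = 2.0
```

Real wall-clock gains depend on per-pass cost, cache reuse, and hardware utilization, so the ratio is best read as an upper-level estimate rather than a guarantee.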

5. Empirical Performance and Trade-offs

Block autoregressive diffusion yields strong performance across modalities:

  • Vision (ImageNet 256×256): ACDiT achieves FID ≈ 2.4–2.5, matching full-sequence diffusion and outperforming discrete AR baselines (Hu et al., 2024).
  • Video: BlockVid delivers up to 22.2% and 19.4% improvement in long-horizon coherence metrics on LV-Bench versus strong AR and diffusion competitors (Zhang et al., 28 Nov 2025).
  • Language: Block-diffusion LMs (e.g., BD3-LM) close much of the perplexity gap to AR, and dynamic block-length adaptation (CtrlDiff) narrows the gap further while enabling controllability (Arriola et al., 12 Mar 2025, Huang et al., 20 May 2025).
  • Speech: DiSTAR leverages AR block prediction and diffusion infilling for robust long-form, zero-shot speech synthesis at state-of-the-art accuracy (Song et al., 14 Oct 2025).
  • Graph Generation: PARD achieves SOTA (e.g., QM9, MOSES) with efficient permutation-invariance and parallel blockwise training via a partial order over graph elements (Zhao et al., 2024).

Block size and diffusion step count control the latency-quality trade-off: small blocks increase AR chain length but reduce per-block compute and permit long context; large blocks approach full-sequence diffusion (high fidelity, slower inference). Empirical studies find block sizes of 4–16 often optimal for vision and text (Hu et al., 2024, Arriola et al., 12 Mar 2025).

6. Domain-specific Adaptations and Extensions

7. Limitations, Design Implications, and Future Directions

Block autoregressive diffusion models introduce additional hyperparameters (block size, diffusion steps per block), and block boundaries can introduce minor artifacts or coherence breaks if not managed (e.g., via chunkwise shuffling/noise blending in videos) (Zhang et al., 28 Nov 2025). Cache growth with sequence length can be mitigated through sparsification, offloading, or learned quantization (Team et al., 25 Nov 2025). Dynamic block scheduling and multi-stage refinement ("draft-then-refine") further address irreversibility and local myopia, closing the performance gap to pure autoregressive models while reducing inference complexity (Ma et al., 20 Jan 2026).

Future work targets unified modeling across modalities, scalable world simulation, dynamic block allocation, and advanced training curricula to further shrink inefficiencies and unlock new generative capabilities (Hu et al., 2024, Team et al., 25 Nov 2025, Arriola et al., 12 Mar 2025).


Block autoregressive diffusion thus provides a powerful, theoretically grounded, and empirically validated path toward generative models that combine the fidelity and parallelism of diffusion processes with the flexibility, context handling, and incremental control of autoregressive architectures (Hu et al., 2024, Arriola et al., 12 Mar 2025, Tian et al., 7 Dec 2025, Zeng et al., 17 Dec 2025, Wu et al., 30 Sep 2025, Team et al., 25 Nov 2025).
