
Block-Wise Diffusion in Generative Models

Updated 4 February 2026
  • Block-wise diffusion is a generative modeling paradigm that partitions input sequences or images into blocks for independent diffusion and autoregressive dependency modeling.
  • It enables parallel processing and efficient inference while maintaining global context via block-level strategies, balancing performance with resource use.
  • Applications span language, vision, and video generation, demonstrating empirical gains in speed, parameter efficiency, and controllability across benchmarks.

Block-Wise Diffusion

Block-wise diffusion refers to a general class of generative modeling techniques in which a sequence or structure is partitioned into contiguous blocks, each block is processed independently or semi-independently via diffusion mechanisms (typically denoising or masking in the discrete or continuous domain), and block-level dependencies are modeled autoregressively, semi-autoregressively, or with tailored architectural strategies. This paradigm encompasses and unifies several advances across language modeling, vision-language learning, video and image generation, and neural network architecture search. Block-wise diffusion achieves a balance between global flexibility and efficient, fine-grained, or parallel computation, and its specific design choices critically determine modeling fidelity, controllability, parameter/memory efficiency, and inference speed.

1. Core Principles and Motivations

Block-wise diffusion was introduced to mitigate scalability and alignment challenges in both discrete and continuous diffusion modeling. Classical diffusion models, although effective for parallel denoising and global reasoning, suffer from expensive inference passes and mismatch with sequential generation objectives. In contrast, autoregressive (AR) models offer natural left-to-right dependencies and easy likelihood computation but are limited to sequential, token-by-token inference. Block-wise diffusion interpolates between these extremes:

  • Block Partitioning: The input (text, image, video, or internal feature) is split into $B$ non-overlapping blocks of uniform or dynamic size. For a sequence $\mathbf{x}^{1:L}$, blocks are $\mathbf{x}^{(b)} = \mathbf{x}^{[(b-1)L'+1\,:\,bL']}$ with $B = L / L'$ for block size $L'$ (Arriola et al., 12 Mar 2025).
  • Intra-block Diffusion: Each block undergoes independent or locally conditioned forward corruption (e.g., masking, Gaussian noise) and reverse denoising, often with bidirectional attention or local receptive field (Tian et al., 7 Dec 2025, Arriola et al., 12 Mar 2025).
  • Inter-block Dependency: Blocks are generated or denoised autoregressively or semi-autoregressively. That is, the output for block $b$ depends on the denoised/clean outputs from all prior blocks, enforcing causal dependencies across blocks (Arriola et al., 12 Mar 2025, Huang et al., 20 May 2025).
  • Objective Alignment: Training objectives are tailored to mirror the block-wise inference process (e.g., block-level loss, masking strategy), closing the train–inference gap (Sun et al., 27 Aug 2025).
  • Parallelism and Efficiency: Within each block, inference and backpropagation can be parallelized across tokens or spatial locations, mitigating the AR bottleneck (Tian et al., 7 Dec 2025, Arriola et al., 12 Mar 2025).
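As a concrete illustration of the partitioning step, splitting a sequence into $B = L / L'$ contiguous blocks is a simple reshape. This is a minimal sketch; the `block_size` value and the strict divisibility assumption are illustrative choices, not fixed by the papers cited above (real systems may pad or pick block sizes dynamically):

```python
import numpy as np

def partition_blocks(x, block_size):
    """Split a 1-D token sequence into B = L / L' contiguous blocks.

    Assumes len(x) is divisible by block_size; real systems pad the
    sequence or choose the block size dynamically.
    """
    L = len(x)
    assert L % block_size == 0, "sequence length must be a multiple of block size"
    B = L // block_size
    return np.asarray(x).reshape(B, block_size)

tokens = np.arange(12)              # a toy sequence of 12 token ids
blocks = partition_blocks(tokens, block_size=4)
print(blocks.shape)                 # (3, 4): B = 3 blocks of length L' = 4
print(blocks[1])                    # second block: tokens 4..7
```

Intra-block diffusion then operates on each row of `blocks`, while inter-block dependencies run down the rows.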

This design paradigm enables control over the trade-off between global context modeling, local reasoning, sample efficiency, memory usage, and inference speed.

2. Mathematical Framework and Objectives

The formalism of block-wise diffusion extends the variational objective and transition dynamics of standard diffusion models to block-structured data. For discrete text, the training objective for a block $\mathbf{x}^{(b)}$ conditioned on its clean prefix is typically:

$$\mathcal{L}_{\text{BD}} = \sum_{b=1}^B \mathbb{E}_{t \sim [0,1]} \mathbb{E}_{q}\left[\frac{t'}{1-t} \log p_\theta\left(\mathbf{x}^{(b)} \mid \mathbf{x}^{(b)}_t, \mathbf{x}^{<b}\right)\right]$$

(Arriola et al., 12 Mar 2025)
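A toy Monte-Carlo estimate of this per-block objective can make the structure concrete: each block is corrupted independently at masking rate $t$, and the denoiser scores the clean tokens at the masked positions given the noisy block and the clean prefix. The sketch below is an assumption-laden illustration, not the cited implementation; `toy_model_logprobs` stands in for $p_\theta$ with a uniform distribution, and the simple $1/t$ weight replaces the schedule-dependent factor:

```python
import numpy as np

rng = np.random.default_rng(0)

V = 8        # toy vocabulary size
MASK = V     # absorbing "mask" token id, outside the vocabulary 0..V-1

def toy_model_logprobs(noisy_block, prefix):
    # Placeholder denoiser: a uniform predictive distribution over the
    # vocabulary at every position. A real p_theta conditions on both the
    # noisy block x_t^{(b)} and the clean prefix x^{<b}.
    return np.full((len(noisy_block), V), -np.log(V))

def blockwise_diffusion_loss(blocks, t):
    """One-sample estimate of a block-diffusion NELBO at masking rate t:
    mask each block independently, score the clean tokens at masked
    positions, and sum the weighted per-block negative log-likelihoods."""
    total = 0.0
    for b, block in enumerate(blocks):
        prefix = blocks[:b].reshape(-1)        # clean prior blocks
        masked = rng.random(len(block)) < t    # forward corruption
        noisy = np.where(masked, MASK, block)
        logp = toy_model_logprobs(noisy, prefix)
        # NLL of the clean tokens at masked positions, with a simple
        # 1/t importance weight from the discrete-diffusion NELBO.
        total += -(1.0 / t) * logp[masked, block[masked]].sum()
    return total

blocks = np.arange(12).reshape(3, 4) % V
loss = blockwise_diffusion_loss(blocks, t=0.5)
print(loss >= 0.0)   # NLL terms are non-negative
```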

For diffusion in block-wise vision-language transformers, blocks $x^b$ of length $L'$ are corrupted by block-specific masking rates, and the per-block negative evidence lower bound (NELBO) is

$$\mathcal{L} = \mathbb{E}_{x, b, t}\left[ -\frac{1}{t} \sum_{\ell \in \mathcal{M}_t^b} \log p_\theta\left(x_{0,\ell}^b \mid x_t^b, x_0^{<b}\right)\right]$$

(Cheng et al., 16 Dec 2025)

Blockwise SFT for LLMs instead sharpens alignment by restricting masking and loss computation to one active block per step, keeping the prefix fixed and the future blocks hidden, minimizing

$$\mathbb{E}_{x, a}\sum_{t=1}^T \omega_t\; \mathbb{E}_{z_t \sim q_t(\cdot|x)} \left[-\sum_{i \in \mathcal{I}_a} \log p_\theta(x_i \mid z_t, t)\right]$$

(Sun et al., 27 Aug 2025)
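The input construction behind this objective (clean prefix, partially masked active block, fully hidden suffix, loss indices $\mathcal{I}_a$ restricted to the active block) can be sketched as follows. The sentinel ids `PAD` and `MASK` and the function name are illustrative assumptions, not the cited paper's API:

```python
import numpy as np

rng = np.random.default_rng(1)

PAD, MASK = -1, -2   # sentinel ids for the hidden suffix and the mask token

def blockwise_sft_input(x, block_size, active, t):
    """Build one Blockwise-SFT training example: the prefix stays clean,
    the active block is partially masked at rate t, and the suffix is
    hidden entirely. Loss indices cover only the active block."""
    x = np.asarray(x).copy()
    start, end = active * block_size, (active + 1) * block_size
    masked = rng.random(end - start) < t
    x[start:end][masked] = MASK        # intra-block corruption
    x[end:] = PAD                      # future blocks hidden from the model
    loss_idx = np.arange(start, end)   # cross-entropy only on active block
    return x, loss_idx

x = np.arange(12)
z, loss_idx = blockwise_sft_input(x, block_size=4, active=1, t=0.5)
print(z[:4])      # prefix block left clean
print(z[8:])      # suffix blocks all hidden (PAD)
print(loss_idx)   # [4 5 6 7]
```

Gradients are then taken only through the cross-entropy at `loss_idx`, which is exactly the block-local granularity the decoding process uses.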

Empirical studies demonstrate that closely matching the granularity of the training loss (block-local) to the granularity of the decoding/generation process is critical for achieving strong likelihoods and downstream accuracy (Sun et al., 27 Aug 2025, Tian et al., 7 Dec 2025).

3. Training and Inference Algorithms

Block-wise diffusion models support diverse algorithmic workflows:

  • Blockwise Factorization and Masking: Discrete diffusion is applied within each block using masking (absorbing state, Bernoulli) schedules, with the reverse network predicting token-level probabilities conditioned on both noisy block inputs and block-level AR prefixes (Arriola et al., 12 Mar 2025, Tian et al., 7 Dec 2025).
  • Training Pseudocode:
    • Blockwise SFT: Partition into blocks, sample an active block, mask the suffix, carry out cross-entropy loss on only the active block, and update model parameters via block-local gradients; see the pseudocode in (Sun et al., 27 Aug 2025).
    • Context-causal adaptation: Gradually increase the block size during adaptation from AR to block-diffusion models, leveraging both diffusion-style loss and auxiliary AR loss (Tian et al., 7 Dec 2025).
  • Inference Pseudocode:
    • At each block step, initialize the current block with full noise (or mask), refine via T denoising steps in parallel over positions, then commit the generated block and move to the next, using cached keys/values for prior blocks for efficient self-attention (Arriola et al., 12 Mar 2025, Tian et al., 7 Dec 2025).
  • Dynamic Block Size: In CtrlDiff, block length is chosen dynamically via an RL-trained policy, optimizing a reward balancing fluency and efficiency (Huang et al., 20 May 2025).
  • Draft-then-Refine: Diffusion-in-Diffusion employs a two-stage block-wise system: a fast draft with small blocks and then global bidirectional refinement on low-confidence tokens (Ma et al., 20 Jan 2026).
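The block-level inference loop described above can be sketched as a confidence-ordered unmasking procedure. This is a toy illustration under stated assumptions: `toy_denoiser` returns random logits in place of a real $p_\theta$, the half-per-step unmasking schedule is an arbitrary choice, and KV caching is only indicated by passing the committed blocks as context:

```python
import numpy as np

rng = np.random.default_rng(2)
V = 8
MASK = V   # absorbing mask id, outside the vocabulary 0..V-1

def toy_denoiser(block, committed):
    # Placeholder for p_theta: random logits over the vocabulary. A real
    # denoiser attends bidirectionally within the block and causally over
    # cached keys/values of the committed prior blocks.
    return rng.random((len(block), V))

def generate_blockwise(num_blocks, block_size, steps):
    """Semi-autoregressive sampling: each block starts fully masked, is
    refined for `steps` denoising iterations in parallel over its
    positions, then committed before the next block begins."""
    committed = []
    for b in range(num_blocks):
        block = np.full(block_size, MASK)
        for step in range(steps):
            masked_idx = np.where(block == MASK)[0]
            if masked_idx.size == 0:
                break
            logits = toy_denoiser(block, committed)
            proposal = logits.argmax(axis=-1)
            conf = logits.max(axis=-1)
            # Commit the most confident half each step, everything on the last.
            k = masked_idx.size if step == steps - 1 else max(1, masked_idx.size // 2)
            order = masked_idx[np.argsort(-conf[masked_idx])]
            block[order[:k]] = proposal[order[:k]]
        committed.append(block)
    return np.concatenate(committed)

out = generate_blockwise(num_blocks=3, block_size=4, steps=2)
print(out.shape)   # (12,)
```

The inner loop is fully parallel over positions within a block, while the outer loop preserves the causal block order.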

4. Empirical Performance and Ablative Analyses

Block-wise diffusion has been applied to large language modeling, vision-language understanding, and generative media tasks with substantial empirical gains compared to both standard AR and diffusion baselines:

  • Language Modeling: On GSM8K, blockwise SFT reaches 76% pass@1, outperforming classical SFT (68%), and on MATH500, 34% vs 30% (Sun et al., 27 Aug 2025). NBDiff-7B achieves state-of-the-art results among 7B-class diffusion LLMs, with up to 91.9% pass@1 on GSM8K and 84.3% on MATH500 (Tian et al., 7 Dec 2025).
  • Data Efficiency: Draft-then-refine achieves PPL=21.9 on OpenWebText using just 26% of the fine-tuning budget of baseline block diffusion (PPL=25.7) (Ma et al., 20 Jan 2026).
  • Vision-Language: SDAR-VL with asynchronous blockwise noise, effective mask ratio scaling, and beta noise curriculum surpasses prior diffusion and AR baselines on key VLU benchmarks, with up to 25% reduction in effective training steps and added stability (Cheng et al., 16 Dec 2025).
  • Ablation Results: Training-inference block size mismatch degrades performance, and noise or leakage in prefix/suffix harms accuracy (Sun et al., 27 Aug 2025, Tian et al., 7 Dec 2025).
  • Flexibility: Adaptive block size selection and controllable guidance extend blockwise diffusion to scenarios requiring variable output lengths and explicit attribute control (Huang et al., 20 May 2025).
  • Controllability: Classifier-guided conditioning enables fine-grained control (e.g., for sentiment or attribute) on a per-block basis during diffusion sampling (Huang et al., 20 May 2025).

5. Block-wise Diffusion for Efficiency and Acceleration

Block-wise structure underpins inference acceleration and memory efficiency, especially in large-scale diffusion transformers:

  • Interval Caching: CorGi caches low-contribution transformer blocks across denoising intervals based on per-block CKA scores, reusing rather than recomputing redundant features and protecting salient tokens for text-to-image tasks (CorGi+). This yields ≈2× speedup with minimal FID/LPIPS quality loss (Son et al., 30 Dec 2025).
  • Dynamic Feature Reuse: BWCache for video diffusion triggers caching based on block-similarity (cosine similarity threshold), achieving up to 2.24× latency reduction with negligible loss in VBench or LPIPS (Cui et al., 17 Sep 2025).
  • Structure-Aware Caching: BlockDance identifies “Structurally Similar Spatio-Temporal” (STSS) features—blocks whose outputs change negligibly at adjacent steps—and skips recomputation. BlockDance-Ada further learns an instance-dependent policy for cache/reuse, achieving 25–50% speedup across major video/image backbones (Zhang et al., 20 Mar 2025).
  • Blockwise Partitioned Training: DiffusionBlocks partitions the network itself into independently trained residual blocks, each handling a specific noise interval for generative diffusion. Training memory is reduced by 1/B for B blocks, with no loss in FID or MAUVE (Shing et al., 17 Jun 2025).
  • NAS for Structural Redundancy Removal: DiffNAS leverages blockwise distillation and local neural architecture search per block to remove block-level redundancy in UNet backbones, cutting MACs and parameters up to ∼50% with on-par FID (Tang et al., 2023).
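The similarity-triggered reuse idea shared by these caching schemes can be sketched in a few lines: a block's output is recomputed only when its input has drifted beyond a cosine-similarity threshold. This is a minimal sketch of the dynamic-reuse mechanism, not any cited system's implementation; the class name, threshold value, and `heavy_block` stand-in are all assumptions:

```python
import numpy as np

def cosine_sim(a, b):
    a, b = a.ravel(), b.ravel()
    return float(a @ b / (np.linalg.norm(a) * np.linalg.norm(b) + 1e-12))

class BlockCache:
    """Reuse a transformer block's output across denoising steps when its
    input barely changed (cosine similarity above a threshold)."""
    def __init__(self, threshold=0.999):
        self.threshold = threshold
        self.last_input = None
        self.last_output = None
        self.hits = 0

    def __call__(self, block_fn, x):
        if (self.last_input is not None
                and cosine_sim(x, self.last_input) >= self.threshold):
            self.hits += 1
            return self.last_output          # reuse cached features
        out = block_fn(x)                    # recompute and refresh cache
        self.last_input, self.last_output = x.copy(), out
        return out

heavy_block = lambda x: np.tanh(x)           # stand-in for a transformer block
cache = BlockCache(threshold=0.999)
x = np.ones(16)
y1 = cache(heavy_block, x)
y2 = cache(heavy_block, x + 1e-6)            # near-identical input: cache hit
print(cache.hits)                            # 1
```

Per-block CKA scores (CorGi) or learned policies (BlockDance-Ada) replace the fixed cosine threshold in the cited systems, but the cache/recompute decision has this same shape.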

6. Parameter Efficiency, Retrieval, and Conditional Generation

Blockwise diffusion underlies strategies for model compression and structured generation:

  • Parameter-Efficient Generation: RISSOLE divides VQ-GAN latents into $b$ blocks, using a single low-capacity U-Net with blockwise retrieval-guided conditioning. This reduces model size by 4–10×, achieving FID of 9.82 on CelebA64 and 12.93 on ImageNet100, outperforming previous patchwise or baseline RDMs (Mukherjee et al., 2024).
  • Retrieval-Augmented Coherence: For each block position, nearest neighbour blocks from a retrieval database are used as conditioning, and context is fused additively before denoising. This yields strong coherence and sample quality without cross-attention or explicit positional encodings (Mukherjee et al., 2024).
  • Blockwise Diffusion with Internal Diffusion Graphs: Some approaches (e.g., Diff-ResNet) exploit blockwise diffusion at the architectural level to promote intra-class tightness and inter-class separability via explicit graph structures and ODE-inspired splitting (Wang et al., 2021).
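The additive retrieval fusion described above can be sketched as follows: the nearest database neighbour of the previously generated block is added to the current noisy block before denoising, with no cross-attention or positional encoding. This is an illustrative sketch, not RISSOLE's implementation; `denoise_fn` is a placeholder for the shared low-capacity U-Net, and the L2 nearest-neighbour search and toy latent dimensions are assumptions:

```python
import numpy as np

rng = np.random.default_rng(3)

def nearest_block(query, database):
    """Return the database block closest to the query in L2 distance."""
    d = np.linalg.norm(database - query, axis=1)
    return database[d.argmin()]

def retrieval_conditioned_denoise(noisy_block, prev_block, database, denoise_fn):
    """Additive fusion of retrieval context: the neighbour of the
    previously generated block conditions the current block's denoising."""
    context = nearest_block(prev_block, database)
    return denoise_fn(noisy_block + context)   # fuse context additively

database = rng.standard_normal((100, 8))       # retrieval set of latent blocks
prev = database[42] + 0.01 * rng.standard_normal(8)
noisy = rng.standard_normal(8)
out = retrieval_conditioned_denoise(noisy, prev, database, denoise_fn=np.tanh)
print(out.shape)   # (8,)
```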

7. Limitations, Theoretical Guarantees, and Future Directions

Despite robust empirical progress, blockwise diffusion faces open challenges:

  • Train–Inference Alignment: Gains are contingent on precise matching between training granularity, loss scaling, and inference blocksize; mismatch can sharply degrade end-task performance (Sun et al., 27 Aug 2025, Tian et al., 7 Dec 2025).
  • Error Propagation and Myopia: Autoregressive blockwise generation can accumulate long-range errors; bidirectional “draft–refine” strategies and snapshot remasking are effective for correcting such myopia (Ma et al., 20 Jan 2026).
  • Memory vs. Quality Trade-offs: While blockwise partitioning (e.g., DiffusionBlocks) reduces memory, overly aggressive splitting may impair modeling of global structure or inter-block dependencies (Shing et al., 17 Jun 2025).
  • Parallelism and O(K×M) Latency: Blockwise approaches, especially in language modeling, can suffer higher per-sample latency compared to O(K) AR models for long contexts unless block-level parallelism and optimized denoisers are employed (Shing et al., 17 Jun 2025).
  • Theory: The blockwise surrogate loss yields unbiased gradients and provable upper bounds on the blockwise negative log-likelihood, while prefix leakage and random global masking introduce gradient bias; these theoretical guarantees are formalized in (Sun et al., 27 Aug 2025).

Block-wise diffusion constitutes a fundamental axis of innovation in modern diffusion modeling, unifying advances in parallelization, efficient supervision, controllable and retrieval-based generation, and resource/loss scaling techniques across domains. Its principled mathematical framework allows flexible trade-offs and enables application-specific tailoring of inference and efficiency strategies, cementing it as a crucial paradigm for scalable or controllable generative modeling and efficient deep network training (Arriola et al., 12 Mar 2025, Sun et al., 27 Aug 2025, Tian et al., 7 Dec 2025, Cheng et al., 16 Dec 2025, Son et al., 30 Dec 2025, Shing et al., 17 Jun 2025, Tang et al., 2023, Ma et al., 20 Jan 2026, Mukherjee et al., 2024, Huang et al., 20 May 2025, Wang et al., 2021).
