
Block Diffusion CPT

Updated 23 January 2026
  • The paper demonstrates that combining masked diffusion with auxiliary AR supervision significantly enhances training compute efficiency, yielding up to +4.0 percentage points improvement on benchmarks.
  • Block Diffusion CPT is a training paradigm that bridges autoregressive causality and bidirectional reasoning through a dynamic block-size curriculum and context-causal attention masks.
  • Empirical results confirm that CPT improves performance across language, mathematics, and code modeling while ensuring efficient token utilization and stable gradient behavior.

Block Diffusion Continual Pretraining (CPT) is a training paradigm that enables efficient adaptation of pretrained autoregressive (AR) LLMs to block-wise bidirectional diffusion LLMs (DLMs). CPT leverages masked diffusion objectives and dynamic attention masks, bridging the divergent autoregressive causality and block-wise bidirectionality found in DLM architectures. CPT has been shown to substantially improve training compute efficiency, enable full context utilization, and generate high-quality base models for further supervised fine-tuning or instruction tuning across language, mathematics, and code domains (Tian et al., 7 Dec 2025, Fan et al., 22 Jan 2026).

1. Principled Overview of Block Diffusion CPT

CPT formalizes AR-to-block-diffusion adaptation as a monotonic curriculum, from autoregressive modeling (block size b = 1) to block-wise diffusion (block size b > 1), under a unified attention mask and loss framework. In this schema, a model initialized from a pretrained AR checkpoint is adapted via continual pretraining using block-structured noising, denoising, and bidirectional reasoning dynamics. The key CPT innovations include:

  • Context-causal attention masking that preserves strict left-to-right causality in the committed context while enabling full bidirectionality within the active block.
  • Parallel adaptation in which all blocks in a sequence are trained in one forward/backward pass, enabling efficient gradient computation and KV-cache reuse.
  • Auxiliary AR supervision to retain AR modeling strengths, maximize data utilization, and maintain consistency with the inherited knowledge and inference protocol.
  • Gradual block-size curriculum that incrementally expands the reasoning scope from AR (b = 1) to block diffusion (b up to 32), with a scheduled annealing of the AR loss weight.
  • Structured masking and loss design compatible with both language and code modeling (Tian et al., 7 Dec 2025, Fan et al., 22 Jan 2026).

2. Attention Masking and Training Objectives

The core of CPT is the context-causal attention mask. A sequence x_{1:L} is decomposed into a "committed context" C = {1, …, s - 1} and an "active block" B = {s, …, e} of size b = e - s + 1. The binary mask A(i, j) is precisely:

  • A(i, j) = 1 if i ∈ C, j ∈ C, j ≤ i (causal attention within context)
  • A(i, j) = 1 if i ∈ B, j ∈ C (block tokens attend to the entire prefix)
  • A(i, j) = 1 if i ∈ B, j ∈ B (bidirectional within block)
  • A(i, j) = 0 otherwise
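The three cases above can be sketched as a dense 0/1 matrix; the function name and the use of 0-indexing are our own, and a real implementation would build this vectorized on the accelerator rather than with Python loops:

```python
import numpy as np

def context_causal_mask(L: int, s: int, e: int) -> np.ndarray:
    """Context-causal attention mask for a sequence of length L with
    committed context C = {0, ..., s-1} and active block B = {s, ..., e}
    (0-indexed here; the text uses 1-indexing).

    A[i, j] = 1 means position i may attend to position j.
    """
    A = np.zeros((L, L), dtype=np.int8)
    for i in range(L):
        for j in range(L):
            in_ctx_i, in_ctx_j = i < s, j < s
            in_blk_i, in_blk_j = s <= i <= e, s <= j <= e
            if in_ctx_i and in_ctx_j and j <= i:   # causal within context
                A[i, j] = 1
            elif in_blk_i and in_ctx_j:            # block attends to full prefix
                A[i, j] = 1
            elif in_blk_i and in_blk_j:            # bidirectional within block
                A[i, j] = 1
    return A
```

For example, `context_causal_mask(6, 2, 5)` keeps positions 0–1 strictly causal while positions 2–5 attend to each other and to the prefix, but the context never attends into the block.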

During CPT, model training optimizes a compound objective:

  • Masked diffusion (block-denoising) loss L_MDM(θ) on all masked tokens in the diffusion view.
  • Auxiliary autoregressive (AR) loss L_AR(θ) on the clean context, with

L_{\mathrm{AR}}(\theta) = \mathbb{E}\left[ -\sum_{i \in C} \log p_\theta(x_{i+1} \mid x_{\leq i};\, M_{CC}) \right]

Joint training is performed with mixing weight λ (typically λ = 0.5 for small blocks, annealed towards 0 as b → b_max):

L_{\mathrm{total}}(\theta) = L_{\mathrm{MDM}}(\theta) + \lambda\, L_{\mathrm{AR}}(\theta)

Empirical evidence indicates that adding L_AR improves retention of AR knowledge and benefits finetuning performance: ablations show a +4.0 percentage-point increase in average benchmark score over CPT without the AR term (Tian et al., 7 Dec 2025).
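The compound objective can be sketched as follows; `total_loss` and its arguments are hypothetical names standing in for per-token log-probabilities the model produces under the two views:

```python
import numpy as np

def total_loss(logp_mdm_masked, logp_ar_context, lam=0.5):
    """Joint CPT objective L_total = L_MDM + lam * L_AR (sketch).

    logp_mdm_masked: log-probs of the masked tokens in the diffusion view.
    logp_ar_context: next-token log-probs on the clean context under M_CC.
    lam: mixing weight, ~0.5 for small blocks, annealed to 0 as b -> b_max.
    """
    l_mdm = -np.mean(logp_mdm_masked)   # masked-diffusion NLL
    l_ar = -np.mean(logp_ar_context)    # auxiliary AR NLL
    return l_mdm + lam * l_ar
```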

3. Algorithmic Workflow and Block-Size Growth

CPT is realized via a parallel adaptation workflow. Each training iteration entails:

  1. Block size calculation: b is set via a growth schedule:

b(s) = \min\left\{ b_{\mathrm{max}},\ b_0\, r^{\left\lfloor \max(0,\, s - s_0)/\Delta \right\rfloor} \right\}

where b_0 = 1, b_max ≈ 32, r = 2, and Δ ≈ 14,000–20,000.
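With these constants the schedule can be written directly; the default `delta` below is one illustrative pick from the reported 14k–20k range:

```python
def block_size(step, b0=1, b_max=32, r=2, s0=0, delta=15_000):
    """Growth schedule b(s) = min(b_max, b0 * r**floor(max(0, s - s0)/delta)).

    With these defaults the block size doubles every `delta` steps until
    it saturates at b_max. (delta = 15_000 is illustrative; the text
    reports a 14,000-20,000 range.)
    """
    return min(b_max, b0 * r ** (max(0, step - s0) // delta))
```

For example, the schedule stays at b = 1 through step 14,999, jumps to 2 at step 15,000, and saturates at 32 after enough doublings.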

  2. Sequence partitioning: the sampled token sequence is divided into contiguous blocks of size b.
  3. Diffusion step sampling: a noising step t ∈ [0, 1] is drawn, with a noise mask generated per block.
  4. Noised view creation: tokens in each block are masked/stochastically corrupted based on the mask schedule.
  5. Parallel loss computation: both diffusion and AR objectives are computed simultaneously, leveraging dynamic attention masks that ensure consistency between training and inference (Tian et al., 7 Dec 2025).

A pseudocode representation is given explicitly in (Tian et al., 7 Dec 2025), emphasizing the efficiency and seamless curriculum from AR to block-diffusion.
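As an independent illustration (not the paper's pseudocode), steps 1–4 of the iteration can be sketched as below; `cpt_step` and `mask_id` are hypothetical names, and step 5 (the loss computation) needs a model so it is omitted:

```python
import random

def cpt_step(tokens, step, delta=15_000, b_max=32, mask_id=-1):
    """One CPT iteration, steps 1-4 only (illustrative sketch).

    Returns the current block size and the noised view of the sequence.
    """
    rng = random.Random(step)                        # deterministic per step
    b = min(b_max, 2 ** (step // delta))             # 1. block-size schedule
    blocks = [range(i, min(i + b, len(tokens)))
              for i in range(0, len(tokens), b)]     # 2. partition into blocks
    t = rng.random()                                 # 3. sample diffusion step
    u = 1.0 - t                                      # linear mask rate u(t)
    noised = list(tokens)
    for blk in blocks:                               # 4. noise each block
        for i in blk:
            if rng.random() < u:
                noised[i] = mask_id
    return b, noised
```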

For code modeling, especially in Stable-DiffCoder (Fan et al., 22 Jan 2026), a block-clipped noise schedule is used to guarantee nontrivial corruption per block. The block size is fixed (e.g., B = 4), and a tailored stepwise warmup on the corruption level (rather than on the attention mask) mitigates loss/gradient spikes arising from sudden objective and mask changes when transferring from AR checkpoints.

4. Mathematical Formulation and Implementation Specifics

Let x^0_{1:n} be a clean token sequence. The diffusion process operates as a Markov chain:

x^0 \rightarrow x^1 \rightarrow \cdots \rightarrow x^T

with continuous or discrete t as the noising step. The standard global linear schedule u(t) = 1 - t (the fraction of tokens masked) is refined with a block clip:

u_{\mathrm{blk}}(t) = \min\left(1,\ \max\left(u(t),\ 1/B\right)\right)

At each training step:

  • Sample t
  • Select a random block ℬ ⊂ {1, …, N} of size B
  • Independently mask each i ∈ ℬ with probability u_blk(t)
  • If ℬ has m = 0 masked tokens, a fallback forces masking of one random token in ℬ
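These steps can be sketched as follows, assuming the linear schedule u(t) = 1 - t; the function names are our own:

```python
import random

def u_blk(t: float, B: int) -> float:
    """Block-clipped linear schedule: u(t) = 1 - t, clipped so each token
    in a block of size B is masked with probability at least 1/B."""
    return min(1.0, max(1.0 - t, 1.0 / B))

def mask_block(block_ids, t, B, rng):
    """Independently mask each position with probability u_blk(t); if no
    token is hit, force-mask one at random (the fallback described above)."""
    p = u_blk(t, B)
    masked = [i for i in block_ids if rng.random() < p]
    if not masked:                       # never leave a block fully clean
        masked = [rng.choice(block_ids)]
    return masked
```

Note that for B = 4 and t close to 1, the clip keeps the per-token mask rate at 0.25 instead of letting it collapse toward zero.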

For stable adaptation (especially AR → DLM), a warmup schedule gradually increases the corruption level over the initial S_warmup steps, with loss weighting w(t) = 1/u_blk(t) applied only after warmup. This design reduces gradient spikes and ensures loss alignment with AR continuation. Empirically, it produces a smooth, root-shaped learning curve and improves stability (Fan et al., 22 Jan 2026).
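A hedged sketch of this warmup logic, assuming a linear ramp on the corruption level (the exact ramp shape is not specified in this summary):

```python
def corruption_and_weight(step, t, B, s_warmup):
    """Return (corruption level, loss weight) for a training step.

    During the first s_warmup steps the corruption is scaled up linearly
    from 0 to its full block-clipped value; the inverse weight
    w = 1/u_blk(t) is applied only once warmup has finished.
    Illustrative sketch; the linear ramp is an assumption.
    """
    u_full = min(1.0, max(1.0 - t, 1.0 / B))   # u_blk(t)
    ramp = min(1.0, step / s_warmup)           # 0 -> 1 over warmup
    u = ramp * u_full
    w = 1.0 / u_full if step >= s_warmup else 1.0
    return u, w
```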

5. Empirical and Benchmark Performance

NBDiff-7B (Base and Instruct) and Stable-DiffCoder-8B-Instruct, both trained with block diffusion CPT, demonstrate substantial gains against both autoregressive and previous DLM baselines:

| Model | Avg. Acc. (Gen/Math/Code) | GSM8K | MATH500 | MMLU-Pro | HumanEval | MBPP | MultiPL-E |
|---|---|---|---|---|---|---|---|
| NBDiff-7B-Base (Tian et al., 7 Dec 2025) | 64.3% | 79.6% | 46.0% | 52.7% | | | |
| Dream-Base-7B (baseline) | 60.0% | 77.8% | 39.6% | 48.2% | | | |
| NBDiff-7B-Instruct | 78.8% | 91.9% | 81.7% | 87.8% | | | |
| SDAR-8B | 74.0% | 91.3% | 78.6% | 78.7% | | | |
| Stable-DiffCoder-8B-Instruct (Fan et al., 22 Jan 2026) | | | | | +1–3 pp | +1–3 pp | |

(Blank cells are not reported in this summary; "+1–3 pp" denotes absolute improvement over the AR baseline.)

Empirical results indicate:

  • Block diffusion CPT yields consistent absolute improvements (+1–3 pp) on a broad suite of code benchmarks (HumanEval, MBPP, CRUXEval, MultiPL-E) over the AR baseline (Fan et al., 22 Jan 2026).
  • For language and mathematics, NBDiff-7B achieves +4.3–6.4 points improvement over strong DLM competitors on aggregate and domain-specific benchmarks (Tian et al., 7 Dec 2025).
  • Ablation studies show that plain finetuning lags behind CPT; each key component (AR loss, block-growth) contributes substantial gains.
  • Particularly significant improvements are observed in low-resource coding languages (e.g., PHP, C#) and in tasks requiring any-order or editing operations (Fan et al., 22 Jan 2026).

6. Hyperparameter Choices and Ablation Insights

Key hyperparameters in CPT, as established by empirical ablations:

  • Block size: Small blocks (B = 4 for code; b_max ≈ 32 for language) optimize the balance between AR-aligned contexts and the degree of data augmentation provided by diffusion masking. Large blocks dilute effective context and slow learning (Tian et al., 7 Dec 2025, Fan et al., 22 Jan 2026).
  • Warmup length: A few thousand steps, sufficient to ramp up maximum corruption smoothly, are required for training stability.
  • Mixing weight λ: Initial value ≈ 0.5, annealed to zero as block size increases.
  • Noise schedule: Linear u(t), block-clipped to ensure at least one mask per block, avoiding wasted compute.
  • Training budget: For Stable-DiffCoder, 1.3T tokens (160k steps, batch size 512); for NBDiff-7B, similar allocations suffice to reach state-of-the-art (Tian et al., 7 Dec 2025, Fan et al., 22 Jan 2026).
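For reference, the hyperparameters above can be collected into a single configuration sketch; the field names are illustrative, and the warmup value is only indicative ("a few thousand steps"):

```python
# Hypothetical CPT configuration consolidating the hyperparameters above.
# Field names are illustrative; values come from the text where reported.
CPT_CONFIG = {
    "block_size_code": 4,          # fixed B for code (Stable-DiffCoder)
    "block_size_max_lang": 32,     # b_max for language (NBDiff)
    "growth_ratio": 2,             # block size doubles each interval
    "growth_interval": 15_000,     # steps per doubling (~14k-20k reported)
    "ar_mix_weight": 0.5,          # lambda, annealed to 0 as b -> b_max
    "noise_schedule": "linear",    # u(t) = 1 - t, block-clipped to >= 1/B
    "warmup_steps": 3_000,         # "a few thousand"; exact value illustrative
    "train_tokens": 1.3e12,        # Stable-DiffCoder budget
    "train_steps": 160_000,
    "batch_size": 512,
}
```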

Ablation studies reveal that omitting block clipping leads to high fractions of zero-mask steps, and skipping warmup results in gradient norm spikes exceeding 10× the baseline. Both are necessary for maintaining stable transfer from AR to block diffusion (Fan et al., 22 Jan 2026).
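The zero-mask failure mode is easy to quantify: without clipping, a block of size B escapes masking entirely with probability (1 - u)^B, which approaches 1 as u(t) → 0. A quick check:

```python
def p_zero_mask(u: float, B: int) -> float:
    """Probability that none of the B tokens in a block are masked when
    each is masked independently with probability u (no block clipping)."""
    return (1.0 - u) ** B
```

At u = 0.01 and B = 4 this is about 0.96, i.e. nearly all such steps contribute no denoising signal for the block, which is the waste the block clip removes.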

7. Practical Integration and Recommendations

CPT has pragmatic advantages for large model training and deployment, including:

  • Compute efficiency: By reusing mature AR checkpoints, CPT eliminates the need for full-scratch block-diffusion pretraining.
  • Off-the-shelf AR compatibility: No extra parameters are introduced; a causal LM head for L_AR and dynamic mask logic suffice.
  • Flexible context extension: The same CPT curriculum generalizes to longer context settings (e.g., 32K) without fundamental changes.
  • Token utilization: The parallel two-view approach ensures KV-cache reuse and maximizes supervision density.
  • Quality-latency tradeoff: Practically, b_max ≈ 32, doubling block size every ~15k steps, yields strong outcomes; fine-tuning continues in standard AR mode for downstream tasks (Tian et al., 7 Dec 2025, Fan et al., 22 Jan 2026).

Block diffusion CPT constitutes a principled and empirically validated procedure for advancing AR LLMs to high-throughput, block-diffusion LLMs, consistently improving performance across general language, mathematics, and code modeling with a fraction of the training resources previously required.
