
Block Diffusion CPT

Updated 23 January 2026
  • The paper demonstrates that combining masked diffusion with auxiliary AR supervision significantly enhances training compute efficiency, yielding up to +4.0 percentage points improvement on benchmarks.
  • Block Diffusion CPT is a training paradigm that bridges autoregressive causality and bidirectional reasoning through a dynamic block-size curriculum and context-causal attention masks.
  • Empirical results confirm that CPT improves performance across language, mathematics, and code modeling while ensuring efficient token utilization and stable gradient behavior.

Block Diffusion Continual Pretraining (CPT) is a training paradigm that enables efficient adaptation of pretrained autoregressive (AR) LLMs to block-wise bidirectional diffusion LLMs (DLMs). CPT leverages masked diffusion objectives and dynamic attention masks, bridging the divergent autoregressive causality and block-wise bidirectionality found in DLM architectures. CPT has been shown to substantially improve training compute efficiency, enable full context utilization, and generate high-quality base models for further supervised fine-tuning or instruction tuning across language, mathematics, and code domains (Tian et al., 7 Dec 2025, Fan et al., 22 Jan 2026).

1. Principled Overview of Block Diffusion CPT

CPT formalizes AR-to-block-diffusion adaptation as a monotonic curriculum, from autoregressive modeling (block size b = 1) to block-wise diffusion (block size b > 1), under a unified attention mask and loss framework. In this schema, a model initialized from a pretrained AR checkpoint is adapted via continual pretraining using block-structured noising, denoising, and bidirectional reasoning dynamics. The key CPT innovations include:

  • Context-causal attention masking that preserves strict left-to-right causality in the committed context while enabling full bidirectionality within the active block.
  • Parallel adaptation in which all blocks in a sequence are trained in one forward/backward pass, enabling efficient gradient computation and KV-cache reuse.
  • Auxiliary AR supervision to retain AR modeling strengths, maximize data utilization, and maintain consistency with the inherited knowledge and inference protocol.
  • Gradual block-size curriculum that incrementally expands the reasoning scope from AR (b = 1) to block diffusion (b up to 32), with a scheduled annealing of the AR loss weight.
  • Structured masking and loss design compatible with both language and code modeling (Tian et al., 7 Dec 2025, Fan et al., 22 Jan 2026).

2. Attention Masking and Training Objectives

The core of CPT is the context-causal attention mask. A sequence x_{1:L} is decomposed into a "committed context" C = {1, …, s - 1} and an "active block" B = {s, …, e} of size b = e - s + 1. The binary mask A(i, j) is precisely:

  • A(i, j) = 1 if i ∈ C, j ∈ C, j ≤ i (causal attention within context)
  • A(i, j) = 1 if i ∈ B, j ∈ C (block tokens attend to the entire prefix)
  • A(i, j) = 1 if i ∈ B, j ∈ B (bidirectional within block)
  • A(i, j) = 0 otherwise
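The three cases above can be sketched as a dense 0/1 matrix; the function name and the use of 0-indexing are our own, and a real implementation would build this vectorized on the accelerator rather than with Python loops:

```python
import numpy as np

def context_causal_mask(L: int, s: int, e: int) -> np.ndarray:
    """Context-causal attention mask for a sequence of length L with
    committed context C = {0, ..., s-1} and active block B = {s, ..., e}
    (0-indexed here; the text uses 1-indexing).

    A[i, j] = 1 means position i may attend to position j.
    """
    A = np.zeros((L, L), dtype=np.int8)
    for i in range(L):
        for j in range(L):
            in_ctx_i, in_ctx_j = i < s, j < s
            in_blk_i, in_blk_j = s <= i <= e, s <= j <= e
            if in_ctx_i and in_ctx_j and j <= i:   # causal within context
                A[i, j] = 1
            elif in_blk_i and in_ctx_j:            # block attends to full prefix
                A[i, j] = 1
            elif in_blk_i and in_blk_j:            # bidirectional within block
                A[i, j] = 1
    return A
```

For example, `context_causal_mask(6, 2, 5)` keeps positions 0–1 strictly causal while positions 2–5 attend to each other and to the prefix, but the context never attends into the block.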

During CPT, model training optimizes a compound objective:

  • Masked diffusion (block-denoising) loss L_MDM(θ) on all masked tokens in the diffusion view.
  • Auxiliary autoregressive (AR) loss L_AR(θ) on the clean context, with

L_{\mathrm{AR}}(\theta) = \mathbb{E}\left[ -\sum_{i \in C} \log p_\theta(x_{i+1} \mid x_{\leq i};\, M_{CC}) \right]

Joint training is performed with mixing weight λ (typically λ = 0.5 for small blocks, annealed towards 0 as b → b_max):

L_{\mathrm{total}}(\theta) = L_{\mathrm{MDM}}(\theta) + \lambda\, L_{\mathrm{AR}}(\theta)

Empirical evidence indicates that adding L_AR improves retention of AR knowledge and benefits finetuning performance: ablations show a +4.0 percentage-point increase in average benchmark score over CPT without the AR term (Tian et al., 7 Dec 2025).
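The compound objective can be sketched as follows; `total_loss` and its arguments are hypothetical names standing in for per-token log-probabilities the model produces under the two views:

```python
import numpy as np

def total_loss(logp_mdm_masked, logp_ar_context, lam=0.5):
    """Joint CPT objective L_total = L_MDM + lam * L_AR (sketch).

    logp_mdm_masked: log-probs of the masked tokens in the diffusion view.
    logp_ar_context: next-token log-probs on the clean context under M_CC.
    lam: mixing weight, ~0.5 for small blocks, annealed to 0 as b -> b_max.
    """
    l_mdm = -np.mean(logp_mdm_masked)   # masked-diffusion NLL
    l_ar = -np.mean(logp_ar_context)    # auxiliary AR NLL
    return l_mdm + lam * l_ar
```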

3. Algorithmic Workflow and Block-Size Growth

CPT is realized via a parallel adaptation workflow. Each training iteration entails:

  1. Block size calculation: b is set via a growth schedule:

b(s) = \min\left\{ b_{\mathrm{max}},\ b_0\, r^{\left\lfloor \max(0,\, s - s_0)/\Delta \right\rfloor} \right\}

where b_0 = 1, b_max ≈ 32, r = 2, and Δ ≈ 14,000–20,000.
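With these constants the schedule can be written directly; the default `delta` below is one illustrative pick from the reported 14k–20k range:

```python
def block_size(step, b0=1, b_max=32, r=2, s0=0, delta=15_000):
    """Growth schedule b(s) = min(b_max, b0 * r**floor(max(0, s - s0)/delta)).

    With these defaults the block size doubles every `delta` steps until
    it saturates at b_max. (delta = 15_000 is illustrative; the text
    reports a 14,000-20,000 range.)
    """
    return min(b_max, b0 * r ** (max(0, step - s0) // delta))
```

For example, the schedule stays at b = 1 through step 14,999, jumps to 2 at step 15,000, and saturates at 32 after enough doublings.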

  2. Sequence partitioning: the sampled token sequence is divided into contiguous blocks of size b.
  3. Diffusion step sampling: a noising step t ∈ [0, 1] is drawn, with a noise mask generated per block.
  4. Noised view creation: tokens in each block are masked/stochastically corrupted based on the mask schedule.
  5. Parallel loss computation: both diffusion and AR objectives are computed simultaneously, leveraging dynamic attention masks that ensure consistency between training and inference (Tian et al., 7 Dec 2025).

A pseudocode representation is given explicitly in (Tian et al., 7 Dec 2025), emphasizing the efficiency and seamless curriculum from AR to block-diffusion.
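As an independent illustration (not the paper's pseudocode), steps 1–4 of the iteration can be sketched as below; `cpt_step` and `mask_id` are hypothetical names, and step 5 (the loss computation) needs a model so it is omitted:

```python
import random

def cpt_step(tokens, step, delta=15_000, b_max=32, mask_id=-1):
    """One CPT iteration, steps 1-4 only (illustrative sketch).

    Returns the current block size and the noised view of the sequence.
    """
    rng = random.Random(step)                        # deterministic per step
    b = min(b_max, 2 ** (step // delta))             # 1. block-size schedule
    blocks = [range(i, min(i + b, len(tokens)))
              for i in range(0, len(tokens), b)]     # 2. partition into blocks
    t = rng.random()                                 # 3. sample diffusion step
    u = 1.0 - t                                      # linear mask rate u(t)
    noised = list(tokens)
    for blk in blocks:                               # 4. noise each block
        for i in blk:
            if rng.random() < u:
                noised[i] = mask_id
    return b, noised
```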

For code modeling, especially in Stable-DiffCoder (Fan et al., 22 Jan 2026), a block-clipped noise schedule is used to guarantee nontrivial corruption per block. The block size is fixed (e.g., B = 4), and a tailored stepwise warmup on the corruption level (rather than on the attention mask) mitigates loss/gradient spikes arising from sudden objective and mask changes when transferring from AR checkpoints.

4. Mathematical Formulation and Implementation Specifics

Let x^0_{1:n} be a clean token sequence. The diffusion process operates as a Markov chain:

x^0 \rightarrow x^1 \rightarrow \cdots \rightarrow x^T

with continuous or discrete t as the noising step. The standard global linear schedule u(t) = 1 - t (the fraction of tokens masked) is refined with a block clip:

u_{\mathrm{blk}}(t) = \min\left(1,\ \max\left(u(t),\ 1/B\right)\right)

At each training step:

  • Sample t
  • Select a random block ℬ ⊂ {1, …, N} of size B
  • Independently mask each i ∈ ℬ with probability u_blk(t)
  • If ℬ has m = 0 masked tokens, a fallback forces masking of one random token in ℬ
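These steps can be sketched as follows, assuming the linear schedule u(t) = 1 - t; the function names are our own:

```python
import random

def u_blk(t: float, B: int) -> float:
    """Block-clipped linear schedule: u(t) = 1 - t, clipped so each token
    in a block of size B is masked with probability at least 1/B."""
    return min(1.0, max(1.0 - t, 1.0 / B))

def mask_block(block_ids, t, B, rng):
    """Independently mask each position with probability u_blk(t); if no
    token is hit, force-mask one at random (the fallback described above)."""
    p = u_blk(t, B)
    masked = [i for i in block_ids if rng.random() < p]
    if not masked:                       # never leave a block fully clean
        masked = [rng.choice(block_ids)]
    return masked
```

Note that for B = 4 and t close to 1, the clip keeps the per-token mask rate at 0.25 instead of letting it collapse toward zero.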

For stable adaptation (especially AR → DLM), a warmup schedule gradually increases the corruption level over the initial S_warmup steps, with loss weighting w(t) = 1/u_blk(t) applied only after warmup. This design reduces gradient spikes and ensures loss alignment with AR continuation. Empirically, it produces a smooth, root-shaped learning curve and improves stability (Fan et al., 22 Jan 2026).
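A hedged sketch of this warmup logic, assuming a linear ramp on the corruption level (the exact ramp shape is not specified in this summary):

```python
def corruption_and_weight(step, t, B, s_warmup):
    """Return (corruption level, loss weight) for a training step.

    During the first s_warmup steps the corruption is scaled up linearly
    from 0 to its full block-clipped value; the inverse weight
    w = 1/u_blk(t) is applied only once warmup has finished.
    Illustrative sketch; the linear ramp is an assumption.
    """
    u_full = min(1.0, max(1.0 - t, 1.0 / B))   # u_blk(t)
    ramp = min(1.0, step / s_warmup)           # 0 -> 1 over warmup
    u = ramp * u_full
    w = 1.0 / u_full if step >= s_warmup else 1.0
    return u, w
```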

5. Empirical and Benchmark Performance

NBDiff-7B (Base and Instruct) and Stable-DiffCoder-8B-Instruct, both trained with block diffusion CPT, demonstrate substantial gains against both autoregressive and previous DLM baselines:

| Model | Avg. Acc. (Gen/Math/Code) | GSM8K | MATH500 | MMLU-Pro | HumanEval | MBPP | MultiPL-E |
|---|---|---|---|---|---|---|---|
| NBDiff-7B-Base (Tian et al., 7 Dec 2025) | 64.3% | 79.6% | 46.0% | 52.7% | | | |
| Dream-Base-7B (baseline) | 60.0% | 77.8% | 39.6% | 48.2% | | | |
| NBDiff-7B-Instruct | 78.8% | 91.9% | 81.7% | 87.8% | | | |
| SDAR-8B | 74.0% | 91.3% | 78.6% | 78.7% | | | |
| Stable-DiffCoder-8B-Instruct (Fan et al., 22 Jan 2026) | | | | | +1–3 pp | +1–3 pp | |

(Blank cells are not reported in this summary; "+1–3 pp" denotes absolute improvement over the AR baseline.)

Empirical results indicate:

  • Block diffusion CPT yields consistent absolute improvements (+1–3 pp) on a broad suite of code benchmarks (HumanEval, MBPP, CRUXEval, MultiPL-E) over the AR baseline (Fan et al., 22 Jan 2026).
  • For language and mathematics, NBDiff-7B achieves +4.3–6.4 points improvement over strong DLM competitors on aggregate and domain-specific benchmarks (Tian et al., 7 Dec 2025).
  • Ablation studies show that plain finetuning lags behind CPT; each key component (AR loss, block-growth) contributes substantial gains.
  • Particularly significant improvements are observed in low-resource coding languages (e.g., PHP, C#) and in tasks requiring any-order or editing operations (Fan et al., 22 Jan 2026).

6. Hyperparameter Choices and Ablation Insights

Key hyperparameters in CPT, as established by empirical ablations:

  • Block size: Small blocks (B = 4 for code; b_max ≈ 32 for language) optimize the balance between AR-aligned contexts and the degree of data augmentation provided by diffusion masking. Large blocks dilute effective context and slow learning (Tian et al., 7 Dec 2025, Fan et al., 22 Jan 2026).
  • Warmup length: A few thousand steps, sufficient to ramp up maximum corruption smoothly, are required for training stability.
  • Mixing weight λ: Initial value ≈ 0.5, annealed to zero as block size increases.
  • Noise schedule: Linear u(t), block-clipped to ensure at least one mask per block, avoiding wasted compute.
  • Training budget: For Stable-DiffCoder, 1.3T tokens (160k steps, batch size 512); for NBDiff-7B, similar allocations suffice to reach state-of-the-art (Tian et al., 7 Dec 2025, Fan et al., 22 Jan 2026).
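For reference, the hyperparameters above can be collected into a single configuration sketch; the field names are illustrative, and the warmup value is only indicative ("a few thousand steps"):

```python
# Hypothetical CPT configuration consolidating the hyperparameters above.
# Field names are illustrative; values come from the text where reported.
CPT_CONFIG = {
    "block_size_code": 4,          # fixed B for code (Stable-DiffCoder)
    "block_size_max_lang": 32,     # b_max for language (NBDiff)
    "growth_ratio": 2,             # block size doubles each interval
    "growth_interval": 15_000,     # steps per doubling (~14k-20k reported)
    "ar_mix_weight": 0.5,          # lambda, annealed to 0 as b -> b_max
    "noise_schedule": "linear",    # u(t) = 1 - t, block-clipped to >= 1/B
    "warmup_steps": 3_000,         # "a few thousand"; exact value illustrative
    "train_tokens": 1.3e12,        # Stable-DiffCoder budget
    "train_steps": 160_000,
    "batch_size": 512,
}
```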

Ablation studies reveal that omitting block clipping leads to high fractions of zero-mask steps, and skipping warmup results in gradient norm spikes exceeding 10× the baseline. Both are necessary for maintaining stable transfer from AR to block diffusion (Fan et al., 22 Jan 2026).
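The zero-mask failure mode is easy to quantify: without clipping, a block of size B escapes masking entirely with probability (1 - u)^B, which approaches 1 as u(t) → 0. A quick check:

```python
def p_zero_mask(u: float, B: int) -> float:
    """Probability that none of the B tokens in a block are masked when
    each is masked independently with probability u (no block clipping)."""
    return (1.0 - u) ** B
```

At u = 0.01 and B = 4 this is about 0.96, i.e. nearly all such steps contribute no denoising signal for the block, which is the waste the block clip removes.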

7. Practical Integration and Recommendations

CPT has pragmatic advantages for large model training and deployment, including:

  • Compute efficiency: By reusing mature AR checkpoints, CPT eliminates the need for full-scratch block-diffusion pretraining.
  • Off-the-shelf AR compatibility: No extra parameters are introduced; a causal LM head for L_AR and dynamic mask logic suffice.
  • Flexible context extension: The same CPT curriculum generalizes to longer context settings (e.g., 32K) without fundamental changes.
  • Token utilization: The parallel two-view approach ensures KV-cache reuse and maximizes supervision density.
  • Quality-latency tradeoff: Practically, b_max ≈ 32, doubling block size every ~15k steps, yields strong outcomes; fine-tuning continues in standard AR mode for downstream tasks (Tian et al., 7 Dec 2025, Fan et al., 22 Jan 2026).

Block diffusion CPT constitutes a principled and empirically validated procedure for advancing AR LLMs to high-throughput, block-diffusion LLMs, consistently improving performance across general language, mathematics, and code modeling with a fraction of the training resources previously required.
