NBDiff-7B: A 7B-scale Block-Diffusion LLM

Updated 14 December 2025
  • NBDiff-7B is a 7-billion parameter block-diffusion LLM re-adapted from a pre-trained AR model, featuring context-causal masking and curriculum-based block-size growth.
  • It employs an innovative parallel denoising workflow with auxiliary AR supervision, achieving state-of-the-art results on math, code, and reasoning benchmarks.
  • The model supports flexible generation methods including in-filling and bidirectional token dependency modeling, balancing compute efficiency with high performance.

NBDiff-7B is a 7-billion-parameter block-diffusion LLM (DLM) constructed by a principled adaptation of a pre-trained autoregressive (AR) checkpoint (Pangu-Embedded-7B) to a block-diffusion generation paradigm. It integrates a unique context-causal attention mask, a parallelized denoising adaptation workflow, auxiliary AR supervision for knowledge retention, and a gradual block-size expansion curriculum. NBDiff-7B demonstrates state-of-the-art performance among 7B-scale diffusion LLMs on general, mathematical, and code generation benchmarks, providing a compute-efficient alternative to training DLMs from scratch (Tian et al., 7 Dec 2025).

1. Model Definition and Architecture

NBDiff-7B employs a Transformer backbone identical in configuration to its AR predecessor (32 layers, 4096 hidden size, 32 attention heads, feed-forward dimension 16384, vocabulary ≈64k), re-purposed as a block-discrete diffusion denoising network. The model operates over token sequences partitioned into blocks, refining partially masked (noised) sequences through iterative denoising in discrete or continuous time. At inference, block-wise parallel generation and intra-block bidirectional reasoning are enabled by the block-diffusion mask, allowing accelerated sequence synthesis and in-block token dependency modeling (Tian et al., 7 Dec 2025).
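
For concreteness, the reported backbone dimensions can be collected into a minimal configuration sketch; all field names below are illustrative and not taken from the paper's code.

```python
from dataclasses import dataclass

@dataclass
class NBDiffConfig:
    """Backbone dimensions reported for NBDiff-7B (field names are illustrative)."""
    num_layers: int = 32            # Transformer layers
    hidden_size: int = 4096         # model width
    num_attention_heads: int = 32   # 4096 / 32 = 128-dim heads
    ffn_hidden_size: int = 16384    # feed-forward inner dimension
    vocab_size: int = 64_000        # ~64k tokens (exact size not specified here)
    max_block_size: int = 32        # b_max reached at the end of adaptation

config = NBDiffConfig()
```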

The forward (noising) process is based on an absorbing-mask discrete diffusion: for continuous time $t$, each block of tokens is independently masked with probability $1-\alpha_t$ (typically, $\alpha_t = 1-t$). The reverse (denoising) network, $p_\theta(x_0 \mid x_t)$, is trained to reconstruct the clean sequence $x_0$ from a block-masked $x_t$ by minimizing a weighted cross-entropy loss. At inference, token sequences can be generated or infilled in parallel (by blocks), and the model supports arbitrary positional update schemes (Ye et al., 21 Aug 2025).
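
A minimal PyTorch sketch of the absorbing-mask noising step and the weighted denoising cross-entropy is shown below. The `denoiser` interface, the mask-token id, and the per-token (rather than per-block) masking are assumptions for simplicity; NBDiff-7B applies the noise blockwise under the context-causal mask described next.

```python
import torch
import torch.nn.functional as F

MASK_ID = 64_000  # hypothetical id of the absorbing [MASK] token

def masked_denoising_loss(denoiser, x0, t, mask_id=MASK_ID):
    """Absorbing-mask diffusion loss (sketch): mask tokens with prob 1 - alpha_t = t,
    then train the denoiser to reconstruct the clean tokens at masked positions.

    x0: (batch, seq) clean token ids; t: (batch, 1) noise levels in (0, 1].
    """
    alpha_t = 1.0 - t                                    # linear schedule alpha_t = 1 - t
    noise = torch.rand(x0.shape, device=x0.device)
    is_masked = noise < (1.0 - alpha_t)                  # mask each token w.p. 1 - alpha_t
    xt = torch.where(is_masked, torch.full_like(x0, mask_id), x0)

    logits = denoiser(xt)                                # (batch, seq, vocab)
    ce = F.cross_entropy(logits.transpose(1, 2), x0, reduction="none")  # (batch, seq)
    # Only masked positions are supervised; 1/t weighting is the usual MDM choice.
    weight = is_masked.float() / t.clamp(min=1e-4)
    return (weight * ce).sum() / is_masked.float().sum().clamp(min=1.0)
```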

2. Context-Causal Attention Mask

The central architectural innovation in NBDiff-7B is the context-causal mask, which preserves strict left-to-right (AR) causality among committed (clean context) tokens while unlocking bidirectional attention within the currently active block. The full attention mask over the concatenation $[x_t \,\|\, x]$ (noised plus clean tokens) admits three sub-masks:

  • $\mathbf{M}_{\mathrm{BD}}$ (block-diagonal): enables full attention among tokens within the same active block of size $b$ in $x_t$.
  • $\mathbf{M}_{\mathrm{OBC}}$ (offset block-causal): allows noised tokens in a block to attend to clean context tokens only from strictly preceding blocks, maintaining temporal causality across blocks.
  • $\mathbf{M}_{\mathrm{CC}}$ (context-causal): a standard lower-triangular mask enforces causal attention in the clean context.

Explicitly, when $b=1$, this mask reduces to the AR (left-to-right) form; as $b$ increases (up to $b=32$), intra-block bidirectionality is incrementally unlocked, enabling a smooth adaptation between AR and full block-diffusion (Tian et al., 7 Dec 2025).
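
The combined mask is straightforward to assemble from its three pieces. The sketch below assumes a boolean True-means-attend convention and a quadrant layout over the concatenated $[x_t \,\|\, x]$ input (the noised copy first, then the clean copy); details of the paper's implementation may differ.

```python
import torch

def context_causal_mask(seq_len: int, block_size: int) -> torch.Tensor:
    """Boolean attention mask over [x_t || x] (noised copy first, then clean copy).
    True = attention allowed. Illustrative quadrant layout."""
    L, b = seq_len, block_size
    blk = torch.arange(L) // b                                  # block index per position

    # M_BD: full (bidirectional) attention inside the same active block of x_t.
    m_bd = blk[:, None] == blk[None, :]
    # M_OBC: noised tokens attend only to clean tokens of strictly earlier blocks.
    m_obc = blk[:, None] > blk[None, :]
    # M_CC: standard causal (lower-triangular) mask over the clean context.
    m_cc = torch.tril(torch.ones(L, L)).bool()

    top = torch.cat([m_bd, m_obc], dim=1)                       # queries from x_t
    bottom = torch.cat([torch.zeros(L, L, dtype=torch.bool), m_cc], dim=1)  # clean queries
    return torch.cat([top, bottom], dim=0)                      # (2L, 2L)

# With block_size = 1 the x_t rows reduce to "self + strictly earlier clean tokens",
# i.e. the AR form; larger block_size progressively unlocks intra-block attention.
```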

3. Adaptation Procedure: AR to Block-Diffusion

NBDiff-7B introduces a four-part adaptation recipe to transform a pre-trained AR model into a block-diffusion LLM:

  1. Context-causal mask: as described, enabling intra-block bidirectionality while maintaining causal context supervision.
  2. Parallel training scheme: all blocks are processed jointly in each forward pass, maximizing the efficiency and supervision density of the adaptation procedure.
  3. Auxiliary AR loss: a next-token prediction loss on the committed context branch (under $\mathbf{M}_{\mathrm{CC}}$), yielding supervision $\mathcal{L}_{\mathrm{AR}}$ that is added to the masked denoising loss $\mathcal{L}_{\mathrm{MDM}}$ with weight $\lambda = 0.5$:

$$\mathcal{L}_{\mathrm{total}}(\theta) = \mathcal{L}_{\mathrm{MDM}}(\theta) + \lambda\,\mathcal{L}_{\mathrm{AR}}(\theta)$$

This term densifies the learning signal and preserves the AR knowledge of the base model through adaptation; a code sketch of the combined objective appears after this list.

  4. Gradual block-size growth: the block size $b$ is increased from $b_0 = 1$ (pure AR, strictly causal) toward $b_{\max} = 32$ according to a curriculum:

$$b(s) = \min\{\, b_{\max},\; b_0 \cdot r^{\lfloor (s - s_0)/\Delta \rfloor} \,\}$$

Here, $s$ is the adaptation step, $s_0$ the step at which growth begins, $r$ the growth base (typically 2), and $\Delta$ the interval between block-size doublings.
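
A minimal sketch of the combined objective from step 3, assuming a hypothetical model object that exposes the block denoiser and a causal clean-context branch, and reusing the `masked_denoising_loss` sketch from Section 1; only the weight $\lambda = 0.5$ is taken from the paper.

```python
import torch.nn.functional as F

LAMBDA_AR = 0.5  # AR loss weight reported for NBDiff-7B

def total_loss(model, x0, t):
    """L_total = L_MDM + lambda * L_AR (the interface of `model` is assumed)."""
    # Masked block-denoising objective (see the sketch in Section 1).
    l_mdm = masked_denoising_loss(model.denoiser, x0, t)

    # Auxiliary AR loss: next-token prediction on the clean-context branch under M_CC.
    ar_logits = model.context_branch(x0)                  # (batch, seq, vocab), causal
    l_ar = F.cross_entropy(
        ar_logits[:, :-1].transpose(1, 2),                # prediction for position i+1
        x0[:, 1:],                                        # shifted targets
    )
    return l_mdm + LAMBDA_AR * l_ar
```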

Pseudocode details for the workflow and efficiency optimizations—including parallel block supervision and co-scheduling of inference step counts and AR loss weight—appear in (Tian et al., 7 Dec 2025).
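
As a small illustration of the step-4 curriculum (not the paper's pseudocode; the doubling interval $\Delta$ below is a placeholder, since its value is not given here):

```python
def block_size(step: int, b0: int = 1, b_max: int = 32,
               r: int = 2, s0: int = 0, delta: int = 1000) -> int:
    """Block-size curriculum b(s) = min(b_max, b0 * r ** floor((s - s0) / delta))."""
    if step < s0:
        return b0
    return min(b_max, b0 * r ** ((step - s0) // delta))

# With these placeholder values, b doubles every 1000 adaptation steps:
# b(0) = 1, b(1000) = 2, b(2000) = 4, ..., b(5000) = 32, and stays at 32 thereafter.
```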

4. Training, Fine-tuning, and Inference

The end-to-end adaptation process proceeds in stages:

  • Pretraining-adaptation: Starting from an AR checkpoint, NBDiff-7B is adapted using ≈800 million tokens with an 8K context window, followed by context extension to 32K tokens for a further 100 billion tokens. The masked denoising loss and auxiliary AR loss are co-optimized throughout.
  • Supervised fine-tuning (SFT): An additional ≈10 billion tokens are used for instruction-tuning, employing the same masked denoising and AR objectives on supervised data.

During inference, block-diffusion decoding allows flexible generation:

  • The user selects the number of refinement steps $T$, trading off between output quality and decoding speed.
  • The model natively supports generation, infilling (mask in the middle), and arbitrary positional update orders due to the non-causal intra-block procedure.
  • The amortized architecture (concatenated $x_t$ and $x$ inputs, joint KV-cache) ensures efficient parallel generation (Ye et al., 21 Aug 2025).
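
A high-level sketch of block-wise decoding with $T$ refinement steps per block follows, assuming a hypothetical `model(...)` that returns per-position logits and using a simple confidence-based unmasking schedule; real decoders add KV caching, sampling temperature, and more careful schedules.

```python
import torch

MASK_ID = 64_000  # hypothetical [MASK] token id

@torch.no_grad()
def generate(model, prompt_ids, num_blocks, block_size=32, num_steps=16,
             mask_id=MASK_ID):
    """Generate num_blocks blocks after the prompt; each block starts fully masked
    and is refined for num_steps parallel denoising passes before being committed."""
    seq = prompt_ids.clone()                                  # (1, prompt_len)
    for _ in range(num_blocks):
        block = torch.full((1, block_size), mask_id, dtype=seq.dtype)
        for step in range(num_steps):
            logits = model(torch.cat([seq, block], dim=1))    # (1, len, vocab)
            block_logits = logits[:, -block_size:]
            preds = block_logits.argmax(dim=-1)               # greedy proposals
            conf = block_logits.softmax(dim=-1).max(dim=-1).values

            # Commit the most confident still-masked tokens so that after step s
            # roughly block_size * (s + 1) / num_steps positions are filled.
            target = block_size * (step + 1) // num_steps
            k = target - int((block != mask_id).sum())
            if k <= 0:
                continue
            conf = conf.masked_fill(block != mask_id, float("-inf"))
            top = conf.topk(k, dim=-1).indices
            block = block.scatter(1, top, preds.gather(1, top))
        seq = torch.cat([seq, block], dim=1)                  # commit the finished block
    return seq
```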

5. Empirical Performance and Comparisons

NBDiff-7B achieves state-of-the-art metrics among 7B-class DLMs:

  • On knowledge, math, and code benchmarks (NBDiff-7B-Base versus Dream-v0 (Ye et al., 21 Aug 2025) and LLaDA-MoE-7B): average 64.3% vs. 60.0% and 54.3%; GSM8K 79.6% vs. 77.8%; MATH500 46.0% vs. 39.6%; MMLU-Pro 52.7% vs. 48.2%.
  • On instruction-following and SFT datasets (NBDiff-7B-Instruct vs. SDAR-8B): average 78.8% vs. 74.0%; GSM8K 91.9%; MATH500 84.3%; HumanEval 87.8%; MBPP 84.1%.

Ablation experiments confirm the necessity of both the AR auxiliary loss (+4.0% absolute improvement) and gradual block-size growth for optimal transfer and performance.

These findings establish that NBDiff-7B not only retains the long-context reasoning and knowledge inherited from its AR progenitor but also significantly surpasses previous 7B-scale diffusion LLMs on benchmarks of general knowledge, reasoning, mathematics, and programming (Tian et al., 7 Dec 2025).

6. Relation to Dream 7B and Prior Diffusion LLMs

NBDiff-7B builds directly on the design principles established by Dream 7B and other recent DLMs (Ye et al., 21 Aug 2025). The general architecture, diffusion process, and discrete denoising tasks closely follow Dream 7B, but NBDiff-7B introduces a structured adaptation procedure from AR to block-diffusion, rather than direct AR weight transplant or logit masking. Whereas Dream 7B relies on AR-initialization and context-adaptive reweighting (CART), NBDiff-7B formalizes the AR-to-block path via curriculum-based block growth and integrated AR supervision, ensuring both empirical gains and theoretical alignment of inductive biases (Tian et al., 7 Dec 2025).

A tabular summary of key differences:

| Model | Diffusion Masking | AR Transfer Mechanism | Curriculum |
|---|---|---|---|
| Dream 7B | Full-sequence, token-wise | Weight init + shift | None |
| NBDiff-7B | Block (context-causal) | Parallel adaptation + AR loss | Block growth |

7. Limitations and Future Directions

While NBDiff-7B achieves high efficiency and benchmark performance among discrete DLMs, diffusion inference requires multiple model passes per output (one per refinement step), resulting in slower decoding than classical AR models at high-quality settings. Empirically, block-diffusion parallelism amortizes this cost for moderate block sizes and step counts (e.g., $b = 32$, $T \sim 10$–$20$), but absolute performance is still constrained by memory and latency trade-offs (Tian et al., 7 Dec 2025).
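
As a rough illustration with assumed numbers (not from the paper): generating 1024 tokens autoregressively takes 1024 sequential forward passes, while block diffusion with $b = 32$ and $T = 16$ takes $(1024/32) \times 16 = 512$ denoiser passes. Pushing $T$ toward $b$ for higher quality erodes this advantage, and at $T = b$ the pass count matches AR decoding while each pass still recomputes the whole active block.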

Hyperparameter sensitivity—including block size schedule, AR loss weight, and optimizer parameters—remains critical for adaptation efficacy. Further research is warranted in extending context lengths, optimizing diffusion schedules, developing advanced post-training techniques (e.g., diffusion-aware RL, LoRA), and specialized domain adaptation for planning, theorem proving, and long-horizon reasoning tasks (Tian et al., 7 Dec 2025, Ye et al., 21 Aug 2025).
