TraDo-8B-Instruct: Diffusion LLM with TraceRL
- TraDo-8B-Instruct is an 8-billion-parameter diffusion language model that uses a novel block-attention design to iteratively unmask token blocks for enhanced reasoning and parallel decoding.
- It employs the TraceRL framework to integrate trajectory-aware reinforcement learning, aggregating token-level rewards over full inference trajectories for refined optimization.
- Empirical benchmarks show state-of-the-art performance in math and coding tasks, with significant improvements over established autoregressive models.
TraDo-8B-Instruct is an 8-billion-parameter diffusion LLM (DLM) designed to excel in complex mathematical reasoning and coding, leveraging a novel trajectory-aware reinforcement learning framework (TraceRL) and block-attention architecture. In contrast to conventional autoregressive (AR) models, TraDo-8B-Instruct applies iterative denoising to masked token blocks, integrating fine-grained RL signals over full inference trajectories. This approach yields state-of-the-art accuracy and efficient generation, demonstrated by substantial improvements over leading AR models across advanced math benchmarks. The architecture, training paradigms, and open-source framework position TraDo-8B-Instruct as a reference model for the diffusion LLM family.
1. Diffusion Block-Attention Model Structure
TraDo-8B-Instruct is built on a block-attention DLM architecture, where token sequences are partitioned into fixed-size blocks. Each block is iteratively unmasked via a denoising process, supported by bidirectional attention that conditions on all visible tokens within each block. This design supports both semi-autoregressive and parallel decoding, enabled by an accelerated key-value (KV) cache mechanism that efficiently slices sequences into windowed subsets.
Key architectural features include:
- Block-attention mechanism: Attention is organized over blocks, so the current block conditions on all previously completed blocks while attending bidirectionally within itself, allowing information to flow from finished blocks into the block being denoised.
- Sampling strategies: The model supports both static sampling (one token per step) and dynamic sampling (parallel unmasking of all tokens above a confidence threshold); see the decoding sketch after this list.
- Hybrid AR and diffusion adaptation: Training starts from an AR backbone with subsequent diffusion-based adaptation, preserving reasoning skills and enabling longer chain-of-thought (CoT) solutions.
- Open-source framework: A unified codebase (https://github.com/Gen-Verse/dLLM-RL) provides support for block diffusion models, full-attention variants, AR-based models, and integration with practical inference engines.
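To make the blockwise decoding concrete, the following is a minimal PyTorch sketch of a single block's denoising loop under both sampling modes. The `model` interface (returning per-position logits for the visible prefix plus the partially masked block), the confidence threshold, and the commit rules are simplifying assumptions for illustration, not the released implementation.

```python
import torch

def decode_block(model, tokens, block_start, block_size, mask_id,
                 dynamic=True, conf_threshold=0.9):
    """Iteratively unmask one block; earlier blocks stay fixed (semi-AR order).

    Assumes `model(tokens)` returns logits of shape [seq_len, vocab] for the
    visible prefix plus the current, partially masked block.
    """
    block = slice(block_start, block_start + block_size)
    while (tokens[block] == mask_id).any():
        logits = model(tokens)
        probs = torch.softmax(logits[block], dim=-1)
        conf, pred = probs.max(dim=-1)
        masked = tokens[block] == mask_id
        if dynamic:
            # Dynamic sampling: commit every masked position whose confidence
            # clears the threshold (falling back to the single best position).
            commit = masked & (conf >= conf_threshold)
            if not commit.any():
                commit = masked & (conf == conf[masked].max())
        else:
            # Static sampling: commit exactly one token per denoising step.
            idx = torch.where(masked)[0][conf[masked].argmax()]
            commit = torch.zeros_like(masked)
            commit[idx] = True
        new_block = tokens[block].clone()
        new_block[commit] = pred[commit]
        tokens[block] = new_block
    return tokens
```

In static mode the loop commits one token per forward pass; in dynamic mode several high-confidence tokens can be committed in parallel, which is the source of the speed-ups reported below.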
2. TraceRL: Trajectory-Aware Reinforcement Learning
TraceRL is a reinforcement learning protocol optimized for diffusion LLMs, incorporating both sequence-level and token-level feedback from the full inference trajectory:
- Trajectory definition: The model's inference is recorded as a trajectory $\tau = \big(z^{(1)}, z^{(2)}, \dots, z^{(T)}\big)$, where $z^{(t)}$ is the set of tokens unmasked (generated) at denoising step $t$.
- Shrinkage aggregation: A shrinkage parameter is used to aggregate trajectory steps for computational efficiency, reducing the number of forward passes without sacrificing trace fidelity.
- Diffusion-based value model: Token-wise advantages are computed with a diffusion-based value model $V_\phi$, which is regressed toward observed rewards with a clipped objective of the form
  $$\mathcal{L}_{\mathrm{value}} = \mathbb{E}_t\Big[\max\Big(\big(V_\phi(s_t) - R_t\big)^2,\ \big(\mathrm{clip}\big(V_\phi(s_t),\, V_{\mathrm{old}}(s_t) - \epsilon,\, V_{\mathrm{old}}(s_t) + \epsilon\big) - R_t\big)^2\Big)\Big],$$
  where $V_\phi$ is the value model, $R_t$ is the reward target at step $t$, and the clip operation stabilizes updates.
- Policy gradient objective: The policy is improved with a clipped surrogate over the trajectory's token-level terms,
  $$\mathcal{L}_{\mathrm{policy}} = -\,\mathbb{E}_t\Big[\min\Big(r_t(\theta)\,\hat{A}_t,\ \mathrm{clip}\big(r_t(\theta),\, 1 - \epsilon,\, 1 + \epsilon\big)\,\hat{A}_t\Big)\Big],$$
  where the clipped importance ratio $r_t(\theta)$ defines the clipped surrogate function and $\hat{A}_t$ is the advantage.
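Schematically, the two objectives can be written in PyTorch as follows. The flat per-token tensors (gathered from the recorded trajectory, possibly after shrinkage aggregation) and names such as `old_logp` are assumptions made for illustration rather than the reference TraceRL code.

```python
import torch

def tracerl_losses(logp, old_logp, advantages, values, old_values, returns,
                   clip_eps=0.2):
    """Token-level clipped objectives over a recorded diffusion trajectory.

    All inputs are flat tensors over the tokens unmasked along the trajectory,
    optionally after shrinkage-aggregating several denoising steps into one.
    """
    # Clipped policy surrogate on each unmasked token's importance ratio.
    ratio = torch.exp(logp - old_logp)
    unclipped = ratio * advantages
    clipped = torch.clamp(ratio, 1.0 - clip_eps, 1.0 + clip_eps) * advantages
    policy_loss = -torch.min(unclipped, clipped).mean()

    # Clipped regression of the diffusion-based value model toward returns.
    values_clipped = old_values + torch.clamp(values - old_values,
                                              -clip_eps, clip_eps)
    value_loss = torch.max((values - returns) ** 2,
                           (values_clipped - returns) ** 2).mean()
    return policy_loss, value_loss
```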
3. Training Pipeline and Curriculum Learning
TraDo-8B-Instruct training proceeds via two distinct phases:
- Semi-autoregressive supervised fine-tuning (SFT): The model learns blockwise, left-to-right token prediction (see the sketch at the end of this section) using an objective of the form
  $$\mathcal{L}_{\mathrm{SFT}} = -\,\mathbb{E}_{x,\,\mathcal{M}}\left[\sum_{b=1}^{\lceil L/B \rceil}\ \sum_{i \in \mathcal{M}_b} \log p_\theta\!\big(x_i \mid c,\ x_{<bB},\ \tilde{x}_b\big)\right],$$
  where $x$ is the input sequence, $c$ is the prompt, $L$ is the sequence length, $B$ is the block size, $\mathcal{M}_b$ denotes the masked positions of block $b$, and $\tilde{x}_b$ is the partially masked current block.
- TraceRL post-training: The RL protocol records full inference traces, computes shrunken advantages, and applies token-/step-wise rewards for policy improvement.
The curriculum includes block-size enlargement, which TraceRL accommodates flexibly, yielding efficient parallel generation and improved sampling diversity.
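As an assumed, concrete rendering of the semi-autoregressive SFT objective above, the sketch below masks a random subset of the current block, hides all later blocks, and scores only the masked positions; the exact masking schedule and loss weighting of the actual training recipe may differ.

```python
import torch
import torch.nn.functional as F

def semi_ar_sft_loss(model, input_ids, prompt_len, block_size, mask_id):
    """Blockwise, left-to-right masked prediction: for each response block,
    mask a random subset of its tokens, condition on the prompt plus all
    earlier (fully visible) blocks, and score only the masked positions."""
    total, count = 0.0, 0
    seq_len = input_ids.size(0)
    for start in range(prompt_len, seq_len, block_size):
        end = min(start + block_size, seq_len)
        corrupted = input_ids.clone()
        rate = torch.rand(()).clamp(min=0.1)      # random masking rate per block
        mask = torch.rand(end - start) < rate
        if not mask.any():
            continue
        corrupted[start:end][mask] = mask_id
        corrupted[end:] = mask_id                 # hide all later blocks
        logits = model(corrupted)                 # [seq_len, vocab]
        total = total + F.cross_entropy(logits[start:end][mask],
                                        input_ids[start:end][mask],
                                        reduction="sum")
        count += int(mask.sum())
    return total / max(count, 1)
```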
4. Benchmarking and Empirical Results
TraDo-8B-Instruct is evaluated on demanding mathematical and coding benchmarks:
- MATH500: Achieves 78.5% accuracy (static sampling) and 75.5% (dynamic sampling), a 6.1% relative improvement over Qwen2.5-7B-Instruct and a 51.3% relative improvement over Llama3.1-8B-Instruct under static sampling.
- AIME2024 and LiveCodeBench: Consistently outperforms similar-sized AR models on both benchmarks.
- Long-CoT variant: Through curriculum learning, "TraDo-8B-Thinking" achieves an 18.1% relative gain over Qwen2.5-7B-Instruct on MATH500 for multi-step reasoning.
These gains are supported by speed-up statistics in dynamic sampling, such as reduced average steps and increased token parallelism (see Table 2 in (Wang et al., 8 Sep 2025)).
| Model | Static Acc (%) | Dynamic Acc (%) | Math Benchmark |
|---|---|---|---|
| TraDo-8B-Instruct | 78.5 | 75.5 | MATH500 |
| Qwen2.5-7B-Instruct | ~74 | ~71 | MATH500 |
| Llama3.1-8B-Instruct | 51.9 | 6.7 | MATH500 |
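For clarity on how the relative-improvement figures above are derived from these (approximate) static scores:

```python
# Relative improvement = (new - baseline) / baseline, using the static
# MATH500 accuracies from the table (the Qwen2.5 score is approximate).
trado, qwen, llama = 78.5, 74.0, 51.9
print(f"vs Qwen2.5-7B-Instruct:  {100 * (trado - qwen) / qwen:.1f}%")    # ~6.1%
print(f"vs Llama3.1-8B-Instruct: {100 * (trado - llama) / llama:.1f}%")  # ~51.3%
```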
5. Algorithmic Framework and Implementation
The open-source dLLM-RL framework supports construction, training, and inference for diffusion LLMs:
- Multi-architecture support: Models can utilize block-attention, full-attention, AR-based, and hybrid designs.
- Post-training optimization: Includes TraceRL, random masking RL, and associated baseline methods.
- Accelerated inference: Implements fast KV cache strategies and parallel sequence slice processing (a schematic sketch follows at the end of this section).
- Benchmarks and evaluation modules: Directly integrates mathematics, coding, and RL tasks for reproducibility and deployment.
Codebase and model weights are available at https://github.com/Gen-Verse/dLLM-RL, covering both research and applied use cases.
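To indicate what a fast KV cache for block diffusion decoding can look like, here is a schematic sketch; the class and method names are hypothetical and do not correspond to the dLLM-RL API. Keys and values of completed blocks are computed once and frozen, while only the active block's window is recomputed at each denoising step.

```python
import torch

class BlockKVCache:
    """Sketch of a block-level KV cache for semi-autoregressive diffusion
    decoding: finished blocks contribute fixed keys/values, and attention for
    the active block runs over a small, freshly recomputed window."""

    def __init__(self):
        self.frozen_k, self.frozen_v = [], []   # one tensor pair per finished block

    def freeze_block(self, k, v):
        # Called once when a block is fully unmasked.
        self.frozen_k.append(k)
        self.frozen_v.append(v)

    def attend_active(self, q, k_active, v_active):
        # Single-head attention for the active block: cached context plus the
        # recomputed keys/values of the current window only.
        k = torch.cat(self.frozen_k + [k_active], dim=0)
        v = torch.cat(self.frozen_v + [v_active], dim=0)
        scores = (q @ k.transpose(0, 1)) / k.size(-1) ** 0.5
        return torch.softmax(scores, dim=-1) @ v
```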
6. Context, Impact, and Outlook
The TraceRL framework and the TraDo series of DLMs mark a methodological advance in LLM training, moving beyond AR-exclusive RL by exploiting full trajectory information and supporting both token-level and sequence-level optimization. The documented empirical improvements over established models (Qwen2.5-7B-Instruct, Llama3.1-8B-Instruct) demonstrate both the effectiveness of trajectory-aware RL in diffusion architectures and the promise of block-attention strategies for long-form reasoning and parallel sampling.
A plausible implication is that diffusion LLMs, coupled with trajectory-level RL, may offer superior solutions for tasks requiring long, internally consistent reasoning chains, and can be further scaled via curriculum learning, block adjustment, and advanced reward shaping. The code and model release facilitates reproducibility and domain adaptation, supporting rapid progress in both research and practical deployment.