TraDo-8B-Instruct: Diffusion LLM with TraceRL
- TraDo-8B-Instruct is an 8-billion-parameter diffusion language model that uses a novel block-attention design to iteratively unmask token blocks for enhanced reasoning and parallel decoding.
- It employs the TraceRL framework to integrate trajectory-aware reinforcement learning, aggregating token-level rewards over full inference trajectories for refined optimization.
- Empirical benchmarks show state-of-the-art performance in math and coding tasks, with significant improvements over established autoregressive models.
TraDo-8B-Instruct is an 8-billion-parameter diffusion LLM (DLM) designed to excel in complex mathematical reasoning and coding, leveraging a novel trajectory-aware reinforcement learning framework (TraceRL) and block-attention architecture. In contrast to conventional autoregressive (AR) models, TraDo-8B-Instruct applies iterative denoising to masked token blocks, integrating fine-grained RL signals over full inference trajectories. This approach yields state-of-the-art accuracy and efficient generation, demonstrated by substantial improvements over leading AR models across advanced math benchmarks. The architecture, training paradigms, and open-source framework position TraDo-8B-Instruct as a reference model for the diffusion LLM family.
1. Diffusion Block-Attention Model Structure
TraDo-8B-Instruct is built on a block-attention DLM architecture, where token sequences are partitioned into fixed-size blocks. Each block is iteratively unmasked via a denoising process, supported by bidirectional attention that conditions on all visible tokens within each block. This design supports both semi-autoregressive and parallel decoding, enabled by an accelerated key-value (KV) cache mechanism that efficiently slices sequences into windowed subsets.
Key architectural features include:
- Block-attention mechanism: Attention is organized over blocks, so the current block conditions on all previously completed blocks while attending bidirectionally within itself, allowing information to flow from finished blocks into the block being denoised.
- Sampling strategies: The model supports both static sampling (one token per step) and dynamic sampling (parallel unmasking of all tokens above a confidence threshold); see the decoding sketch after this list.
- Hybrid AR and diffusion adaptation: Training starts from an AR backbone with subsequent diffusion-based adaptation, preserving reasoning skills and enabling longer chain-of-thought (CoT) solutions.
- Open-source framework: A unified codebase (https://github.com/Gen-Verse/dLLM-RL) provides support for block diffusion models, full-attention variants, AR-based models, and integration with practical inference engines.
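To make the blockwise decoding concrete, the following is a minimal PyTorch sketch of a single block's denoising loop under both sampling modes. The `model` interface (returning per-position logits for the visible prefix plus the partially masked block), the confidence threshold, and the commit rules are simplifying assumptions for illustration, not the released implementation.

```python
import torch

def decode_block(model, tokens, block_start, block_size, mask_id,
                 dynamic=True, conf_threshold=0.9):
    """Iteratively unmask one block; earlier blocks stay fixed (semi-AR order).

    Assumes `model(tokens)` returns logits of shape [seq_len, vocab] for the
    visible prefix plus the current, partially masked block.
    """
    block = slice(block_start, block_start + block_size)
    while (tokens[block] == mask_id).any():
        logits = model(tokens)
        probs = torch.softmax(logits[block], dim=-1)
        conf, pred = probs.max(dim=-1)
        masked = tokens[block] == mask_id
        if dynamic:
            # Dynamic sampling: commit every masked position whose confidence
            # clears the threshold (falling back to the single best position).
            commit = masked & (conf >= conf_threshold)
            if not commit.any():
                commit = masked & (conf == conf[masked].max())
        else:
            # Static sampling: commit exactly one token per denoising step.
            idx = torch.where(masked)[0][conf[masked].argmax()]
            commit = torch.zeros_like(masked)
            commit[idx] = True
        new_block = tokens[block].clone()
        new_block[commit] = pred[commit]
        tokens[block] = new_block
    return tokens
```

In static mode the loop commits one token per forward pass; in dynamic mode several high-confidence tokens can be committed in parallel, which is the source of the speed-ups reported below.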
2. TraceRL: Trajectory-Aware Reinforcement Learning
TraceRL is a reinforcement learning protocol optimized for diffusion LLMs, incorporating both sequence-level and token-level feedback from the full inference trajectory:
- Trajectory definition: The model's inference is recorded as a trajectory $\tau = \big(z^{(1)}, z^{(2)}, \dots, z^{(T)}\big)$, where $z^{(t)}$ is the set of tokens unmasked (generated) at denoising step $t$.
- Shrinkage aggregation: A shrinkage parameter is used to aggregate trajectory steps for computational efficiency, reducing the number of forward passes without sacrificing trace fidelity.
- Diffusion-based value model: Token-wise advantages are computed with a diffusion-based value model $V_\phi$, which is regressed toward observed rewards with a clipped objective of the form
  $$\mathcal{L}_{\mathrm{value}} = \mathbb{E}_t\Big[\max\Big(\big(V_\phi(s_t) - R_t\big)^2,\ \big(\mathrm{clip}\big(V_\phi(s_t),\, V_{\mathrm{old}}(s_t) - \epsilon,\, V_{\mathrm{old}}(s_t) + \epsilon\big) - R_t\big)^2\Big)\Big],$$
  where $V_\phi$ is the value model, $R_t$ is the reward target at step $t$, and the clip operation stabilizes updates.
- Policy gradient objective: The policy is improved with a clipped surrogate over the trajectory's token-level terms,
  $$\mathcal{L}_{\mathrm{policy}} = -\,\mathbb{E}_t\Big[\min\Big(r_t(\theta)\,\hat{A}_t,\ \mathrm{clip}\big(r_t(\theta),\, 1 - \epsilon,\, 1 + \epsilon\big)\,\hat{A}_t\Big)\Big],$$
  where the clipped importance ratio $r_t(\theta)$ defines the clipped surrogate function and $\hat{A}_t$ is the advantage.
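Schematically, the two objectives can be written in PyTorch as follows. The flat per-token tensors (gathered from the recorded trajectory, possibly after shrinkage aggregation) and names such as `old_logp` are assumptions made for illustration rather than the reference TraceRL code.

```python
import torch

def tracerl_losses(logp, old_logp, advantages, values, old_values, returns,
                   clip_eps=0.2):
    """Token-level clipped objectives over a recorded diffusion trajectory.

    All inputs are flat tensors over the tokens unmasked along the trajectory,
    optionally after shrinkage-aggregating several denoising steps into one.
    """
    # Clipped policy surrogate on each unmasked token's importance ratio.
    ratio = torch.exp(logp - old_logp)
    unclipped = ratio * advantages
    clipped = torch.clamp(ratio, 1.0 - clip_eps, 1.0 + clip_eps) * advantages
    policy_loss = -torch.min(unclipped, clipped).mean()

    # Clipped regression of the diffusion-based value model toward returns.
    values_clipped = old_values + torch.clamp(values - old_values,
                                              -clip_eps, clip_eps)
    value_loss = torch.max((values - returns) ** 2,
                           (values_clipped - returns) ** 2).mean()
    return policy_loss, value_loss
```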
3. Training Pipeline and Curriculum Learning
TraDo-8B-Instruct training proceeds via two distinct phases:
- Semi-autoregressive supervised fine-tuning (SFT): The model learns blockwise, left-to-right token prediction (see the sketch at the end of this section) using an objective of the form
  $$\mathcal{L}_{\mathrm{SFT}} = -\,\mathbb{E}_{x,\,\mathcal{M}}\left[\sum_{b=1}^{\lceil L/B \rceil}\ \sum_{i \in \mathcal{M}_b} \log p_\theta\!\big(x_i \mid c,\ x_{<bB},\ \tilde{x}_b\big)\right],$$
  where $x$ is the input sequence, $c$ is the prompt, $L$ is the sequence length, $B$ is the block size, $\mathcal{M}_b$ denotes the masked positions of block $b$, and $\tilde{x}_b$ is the partially masked current block.
- TraceRL post-training: The RL protocol records full inference traces, computes shrunken advantages, and applies token-/step-wise rewards for policy improvement.
The curriculum includes block-size enlargement, which TraceRL accommodates flexibly, yielding efficient parallel generation and improved sampling diversity.
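As an assumed, concrete rendering of the semi-autoregressive SFT objective above, the sketch below masks a random subset of the current block, hides all later blocks, and scores only the masked positions; the exact masking schedule and loss weighting of the actual training recipe may differ.

```python
import torch
import torch.nn.functional as F

def semi_ar_sft_loss(model, input_ids, prompt_len, block_size, mask_id):
    """Blockwise, left-to-right masked prediction: for each response block,
    mask a random subset of its tokens, condition on the prompt plus all
    earlier (fully visible) blocks, and score only the masked positions."""
    total, count = 0.0, 0
    seq_len = input_ids.size(0)
    for start in range(prompt_len, seq_len, block_size):
        end = min(start + block_size, seq_len)
        corrupted = input_ids.clone()
        rate = torch.rand(()).clamp(min=0.1)      # random masking rate per block
        mask = torch.rand(end - start) < rate
        if not mask.any():
            continue
        corrupted[start:end][mask] = mask_id
        corrupted[end:] = mask_id                 # hide all later blocks
        logits = model(corrupted)                 # [seq_len, vocab]
        total = total + F.cross_entropy(logits[start:end][mask],
                                        input_ids[start:end][mask],
                                        reduction="sum")
        count += int(mask.sum())
    return total / max(count, 1)
```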
4. Benchmarking and Empirical Results
TraDo-8B-Instruct is evaluated on demanding mathematical and coding benchmarks:
- MATH500: Achieves 78.5% accuracy (static sampling) and 75.5% (dynamic sampling), a 6.1% relative improvement over Qwen2.5-7B-Instruct and a 51.3% relative improvement over Llama3.1-8B-Instruct under static sampling.
- AIME2024 and LiveCodeBench: Consistently outperforms similar-sized AR models on both benchmarks.
- Long-CoT variant: Through curriculum learning, "TraDo-8B-Thinking" achieves an 18.1% relative gain over Qwen2.5-7B-Instruct on MATH500 for multi-step reasoning.
These gains are supported by speed-up statistics in dynamic sampling, such as reduced average steps and increased token parallelism (see Table 2 in (Wang et al., 8 Sep 2025)).
| Model | Static Acc (%) | Dynamic Acc (%) | Math Benchmark |
|---|---|---|---|
| TraDo-8B-Instruct | 78.5 | 75.5 | MATH500 |
| Qwen2.5-7B-Instruct | ~74 | ~71 | MATH500 |
| Llama3.1-8B-Instruct | 51.9 | 6.7 | MATH500 |
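For clarity on how the relative-improvement figures above are derived from these (approximate) static scores:

```python
# Relative improvement = (new - baseline) / baseline, using the static
# MATH500 accuracies from the table (the Qwen2.5 score is approximate).
trado, qwen, llama = 78.5, 74.0, 51.9
print(f"vs Qwen2.5-7B-Instruct:  {100 * (trado - qwen) / qwen:.1f}%")    # ~6.1%
print(f"vs Llama3.1-8B-Instruct: {100 * (trado - llama) / llama:.1f}%")  # ~51.3%
```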
5. Algorithmic Framework and Implementation
The open-source dLLM-RL framework supports construction, training, and inference for diffusion LLMs:
- Multi-architecture support: Models can utilize block-attention, full-attention, AR-based, and hybrid designs.
- Post-training optimization: Includes TraceRL, random masking RL, and associated baseline methods.
- Accelerated inference: Implements fast KV cache strategies and parallel sequence slice processing (a schematic sketch follows at the end of this section).
- Benchmarks and evaluation modules: Directly integrates mathematics, coding, and RL tasks for reproducibility and deployment.
Codebase and model weights are available at https://github.com/Gen-Verse/dLLM-RL, covering both research and applied use cases.
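To indicate what a fast KV cache for block diffusion decoding can look like, here is a schematic sketch; the class and method names are hypothetical and do not correspond to the dLLM-RL API. Keys and values of completed blocks are computed once and frozen, while only the active block's window is recomputed at each denoising step.

```python
import torch

class BlockKVCache:
    """Sketch of a block-level KV cache for semi-autoregressive diffusion
    decoding: finished blocks contribute fixed keys/values, and attention for
    the active block runs over a small, freshly recomputed window."""

    def __init__(self):
        self.frozen_k, self.frozen_v = [], []   # one tensor pair per finished block

    def freeze_block(self, k, v):
        # Called once when a block is fully unmasked.
        self.frozen_k.append(k)
        self.frozen_v.append(v)

    def attend_active(self, q, k_active, v_active):
        # Single-head attention for the active block: cached context plus the
        # recomputed keys/values of the current window only.
        k = torch.cat(self.frozen_k + [k_active], dim=0)
        v = torch.cat(self.frozen_v + [v_active], dim=0)
        scores = (q @ k.transpose(0, 1)) / k.size(-1) ** 0.5
        return torch.softmax(scores, dim=-1) @ v
```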
6. Context, Impact, and Outlook
The TraceRL framework and the TraDo series of DLMs mark a methodological advance in LLM training, moving beyond AR-exclusive RL by exploiting full trajectory information and supporting both token-level and sequence-level optimization. The documented empirical improvements over established models (Qwen2.5-7B-Instruct, Llama3.1-8B-Instruct) demonstrate both the effectiveness of trajectory-aware RL in diffusion architectures and the promise of block-attention strategies for long-form reasoning and parallel sampling.
A plausible implication is that diffusion LLMs, coupled with trajectory-level RL, may offer superior solutions for tasks requiring long, internally consistent reasoning chains, and can be further scaled via curriculum learning, block adjustment, and advanced reward shaping. The code and model release facilitates reproducibility and domain adaptation, supporting rapid progress in both research and practical deployment.