TraDo-8B-Instruct: Diffusion LLM with TraceRL

Updated 9 September 2025
  • TraDo-8B-Instruct is an 8-billion-parameter diffusion language model that uses a novel block-attention design to iteratively unmask token blocks for enhanced reasoning and parallel decoding.
  • It employs the TraceRL framework to integrate trajectory-aware reinforcement learning, aggregating token-level rewards over full inference trajectories for refined optimization.
  • Empirical benchmarks show state-of-the-art performance in math and coding tasks, with significant improvements over established autoregressive models.

TraDo-8B-Instruct is an 8-billion-parameter diffusion LLM (DLM) designed to excel in complex mathematical reasoning and coding, leveraging a novel trajectory-aware reinforcement learning framework (TraceRL) and block-attention architecture. In contrast to conventional autoregressive (AR) models, TraDo-8B-Instruct applies iterative denoising to masked token blocks, integrating fine-grained RL signals over full inference trajectories. This approach yields state-of-the-art accuracy and efficient generation, demonstrated by substantial improvements over leading AR models across advanced math benchmarks. The architecture, training paradigms, and open-source framework position TraDo-8B-Instruct as a reference model for the diffusion LLM family.

1. Diffusion Block-Attention Model Structure

TraDo-8B-Instruct is built on a block-attention DLM architecture, where token sequences are partitioned into fixed-size blocks (e.g., $B = 4$). Each block is iteratively unmasked via a denoising process, supported by bidirectional attention that conditions on all visible tokens within each block. This design supports both semi-autoregressive and parallel decoding, enabled by an accelerated key-value (KV) cache mechanism that efficiently slices sequences into windowed subsets.
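
To make the attention pattern concrete, the following is a minimal sketch, assuming a PyTorch-style tensor API (the helper name and exact mask layout are illustrative, not taken from the released code): tokens attend bidirectionally within their own block and causally to all earlier blocks.

```python
import torch

def block_causal_mask(seq_len, block_size=4):
    """Boolean (seq_len, seq_len) attention mask for block attention:
    a query may attend to any key in its own block (bidirectional) or in
    an earlier block, but never to a later block. True = attention allowed."""
    block_ids = torch.arange(seq_len) // block_size           # block index of each position
    return block_ids.unsqueeze(1) >= block_ids.unsqueeze(0)   # query block >= key block
```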

Key architectural features include:

  • Block-attention mechanism: Each attention head processes blocks, facilitating information flow between previously completed blocks and the current block.
  • Sampling strategies: The model supports both static sampling (one token per step) and dynamic sampling (parallel unmasking of every position whose confidence exceeds a threshold); a sketch of the dynamic rule follows this list.
  • Hybrid AR and diffusion adaptation: Training starts from an AR backbone with subsequent diffusion-based adaptation, preserving reasoning skills and enabling longer chain-of-thought (CoT) solutions.
  • Open-source framework: A unified codebase (https://github.com/Gen-Verse/dLLM-RL) provides support for block diffusion models, full-attention variants, AR-based models, and integration with practical inference engines.
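
A minimal sketch of the dynamic-sampling rule, assuming a PyTorch-style interface (the function name, tensor shapes, and the single-token fallback are assumptions for illustration, not the dLLM-RL implementation):

```python
import torch

def dynamic_unmask_block(logits, block_mask, threshold=0.9):
    """One denoising step over a block: unmask every still-masked position whose
    top-token confidence exceeds `threshold`, falling back to the single most
    confident masked position so each step makes progress.

    logits:     (block_len, vocab) predictions for the current block
    block_mask: (block_len,) bool tensor, True where the token is still masked
    Returns (token_ids, positions) to unmask at this step.
    """
    probs = torch.softmax(logits, dim=-1)
    conf, token_ids = probs.max(dim=-1)              # per-position confidence and argmax token
    conf = conf.masked_fill(~block_mask, -1.0)       # ignore already-unmasked slots

    positions = (conf >= threshold).nonzero(as_tuple=True)[0]
    if positions.numel() == 0:                       # static behaviour: one token per step
        positions = conf.argmax().unsqueeze(0)
    return token_ids[positions], positions
```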

2. TraceRL: Trajectory-Aware Reinforcement Learning

TraceRL is a reinforcement learning protocol optimized for diffusion LLMs, incorporating both sequence-level and token-level feedback from the full inference trajectory:

  • Trajectory definition: The model produces $\tau = (\tau(1), \tau(2), \ldots, \tau(|\tau|))$, where $\tau(t)$ is the set of tokens generated at step $t$.
  • Shrinkage aggregation: A shrinkage parameter $s$ is used to aggregate trajectory steps for computational efficiency, reducing the number of forward passes without sacrificing trace fidelity.
  • Diffusion-based value model: Token-wise advantages are computed and regressed using:

\mathcal{J}_{\text{value}}(\theta_v) = \frac{1}{2}\, \mathbb{E}_{\tau}\left[\frac{1}{|\tau|} \sum_j \max\!\left(\big(V_{\theta_v}(\tau)_j - R_j\big)^2,\ \big(V_j^{(\text{clip})} - R_j\big)^2\right)\right]

where $V_{\theta_v}$ is the value model, $R_j$ the reward, and the "clip" term stabilizes updates. The policy is then optimized with a trace-aware clipped surrogate objective:

\mathcal{J}_{\text{policy}}(\theta_p) = \mathbb{E}\left[\sum_i \sum_t \sum_{o_k \in \tau_i^s(t)} \frac{1}{|\tau_i^s(t)|}\, C_\varepsilon\!\left(\frac{\pi_{\theta_p}\!\left(o_k \mid \tau_i^s(1:t-1)\right)}{\pi_{\text{old}}\!\left(o_k \mid \tau_i^s(1:t-1)\right)},\ A_i\right)\right] - \beta\, \mathrm{KL}\!\left[\pi_{\theta_p} \,\|\, \pi_{\text{old}}\right]

where $C_\varepsilon$ is a clipped surrogate function and $A_i$ the advantage.
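
A minimal PyTorch-style sketch of how the per-step-normalized clipped surrogate could be computed for one trajectory (the function name, tensor layout, and the crude sample-based KL estimate are assumptions for illustration, not the paper's implementation):

```python
import torch

def trace_rl_policy_loss(logp_new, logp_old, advantages, step_ids, eps=0.2, beta=0.01):
    """Sketch of the TraceRL policy objective for a single trajectory.

    logp_new, logp_old: (num_tokens,) log-probs of each generated token under the
                        current and behaviour policies, given the trace prefix
    advantages:         (num_tokens,) sequence-level advantage A_i broadcast to tokens
    step_ids:           (num_tokens,) index of the (shrunken) trajectory step that
                        produced each token, used for the 1/|tau^s(t)| normalization
    Returns a scalar loss to minimize (negative of the surrogate objective).
    """
    ratio = torch.exp(logp_new - logp_old)
    clipped = torch.clamp(ratio, 1.0 - eps, 1.0 + eps)
    surrogate = torch.minimum(ratio * advantages, clipped * advantages)  # clipped surrogate C_eps

    # normalize each token's contribution by the number of tokens in its step
    step_sizes = torch.bincount(step_ids).float()[step_ids]
    objective = (surrogate / step_sizes).sum()

    # crude sample-based estimate of the KL penalty between new and old policies
    kl = (logp_old - logp_new).mean()
    return -(objective - beta * kl)
```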

3. Training Pipeline and Curriculum Learning

TraDo-8B-Instruct training proceeds via two distinct phases:

  • Semi-autoregressive supervised fine-tuning (SFT): The model learns blockwise, left-to-right token prediction using the objective

\mathcal{J}_{\text{semi}}(x, Q, \theta) = \sum_{i=1}^{\lceil L/B \rceil} \mathcal{J}_{\text{full}}\!\left( x^{\left((i-1)B \,:\, \min(iB,\, L)\right)},\ \left[Q,\ x^{\left(0 \,:\, (i-1)B\right)}\right],\ \theta \right)

where $x$ is the input sequence, $Q$ is the prompt, $L$ is the sequence length, and $B$ is the block size; a code sketch of this blockwise objective appears at the end of this section.

  • TraceRL post-training: The RL protocol records full inference traces, computes shrunken advantages, and applies token-/step-wise rewards for policy improvement.

The curriculum includes block-size enlargement (e.g., $B = 4 \to 8$), to which TraceRL adapts flexibly, yielding efficient parallel generation and improved sampling diversity.
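
A minimal sketch of the blockwise SFT objective $\mathcal{J}_{\text{semi}}$, assuming a model that maps token ids to logits; it masks one full block at a time and omits the randomized masking ratios of the full diffusion loss (names and the masking scheme are illustrative, not the paper's exact recipe):

```python
import math
import torch
import torch.nn.functional as F

def semi_ar_sft_loss(model, prompt_ids, target_ids, block_size=4, mask_id=0):
    """For each block: mask its tokens, condition on the prompt plus all earlier
    (clean) blocks, and score the denoising predictions on the masked positions.

    `model` is assumed to map token ids (1, seq_len) to logits (1, seq_len, vocab).
    """
    L = target_ids.numel()
    num_blocks = math.ceil(L / block_size)
    losses = []
    for i in range(num_blocks):
        lo, hi = i * block_size, min((i + 1) * block_size, L)
        context = torch.cat([prompt_ids, target_ids[:lo]])        # Q and earlier blocks stay clean
        masked_block = torch.full((hi - lo,), mask_id, dtype=torch.long)
        inputs = torch.cat([context, masked_block]).unsqueeze(0)
        logits = model(inputs)[0, -(hi - lo):]                    # predictions for the masked block
        losses.append(F.cross_entropy(logits, target_ids[lo:hi]))
    return torch.stack(losses).mean()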

4. Benchmarking and Empirical Results

TraDo-8B-Instruct is evaluated on demanding mathematical and coding benchmarks:

  • MATH500: Achieves 78.5% accuracy (static) and 75.5% (dynamic), translating to a 6.1% relative improvement over Qwen2.5-7B-Instruct and 51.3% over Llama3.1-8B-Instruct.
  • AIME2024 and LiveCodeBench: Consistently outperforms similar-sized AR models, despite its smaller parameter count.
  • Long-CoT variant: Through curriculum learning, "TraDo-8B-Thinking" achieves an 18.1% relative gain over Qwen2.5-7B-Instruct on MATH500 for multi-step reasoning.

These gains are supported by speed-up statistics in dynamic sampling, such as reduced average steps and increased token parallelism (see Table 2 in (Wang et al., 8 Sep 2025)).

| Model | Static Acc (%) | Dynamic Acc (%) | Math Benchmark |
|---|---|---|---|
| TraDo-8B-Instruct | 78.5 | 75.5 | MATH500 |
| Qwen2.5-7B-Instruct | ~74 | ~71 | MATH500 |
| Llama3.1-8B-Instruct | 51.9 | 6.7 | MATH500 |
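
Taking the table's approximate entries at face value, the quoted relative improvements follow directly from the static-accuracy column:

\frac{78.5 - 74.0}{74.0} \approx 6.1\%, \qquad \frac{78.5 - 51.9}{51.9} \approx 51.3\%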

5. Algorithmic Framework and Implementation

The open-source dLLM framework supports construction, training, and inference for diffusion LLMs:

  • Multi-architecture support: Models can utilize block-attention, full-attention, AR-based, and hybrid designs.
  • Post-training optimization: Includes TraceRL, random masking RL, and associated baseline methods.
  • Accelerated inference: Implements fast KV cache strategies and parallel sequence slice processing.
  • Benchmarks and evaluation modules: Directly integrates mathematics, coding, and RL tasks for reproducibility and deployment.

Codebase and model weights are available at https://github.com/Gen-Verse/dLLM-RL, covering both research and applied use cases.

6. Context, Impact, and Outlook

The TraceRL framework and the TraDo series of DLMs signify a methodological advance in LLM training: moving beyond AR-exclusive RL by exploiting full-trajectory information and supporting both token-level and sequence-level optimization. The documented empirical improvements over established models (Qwen2.5-7B-Instruct, Llama3.1-8B-Instruct) demonstrate both the effectiveness of trace-aware RL in diffusion architectures and the promise of block-attention strategies for long-form reasoning and parallel sampling.

A plausible implication is that diffusion LLMs, coupled with trajectory-level RL, may offer superior solutions for tasks requiring long, internally consistent reasoning chains, and can be further scaled via curriculum learning, block adjustment, and advanced reward shaping. The code and model release facilitates reproducibility and domain adaptation, supporting rapid progress in both research and practical deployment.
