
TraDo-4B-Instruct Diffusion LLM with TraceRL

Updated 9 September 2025
  • TraDo-4B-Instruct is a diffusion-based language model with 4B parameters that uses a trajectory-aware RL framework for advanced instruction following.
  • It employs TraceRL to assign process-level rewards during iterative denoising, improving sample efficiency and accuracy over autoregressive methods.
  • The model achieves state-of-the-art results in mathematical and coding benchmarks, outperforming larger AR-based instruction-tuned models.

TraDo-4B-Instruct is a 4B-parameter diffusion LLM (DLM) optimized for instruction following and complex reasoning through a trajectory-aware reinforcement learning framework (TraceRL) (Wang et al., 8 Sep 2025). Distinct from autoregressive models, DLMs generate sets of tokens via iterative, probabilistic denoising steps. TraDo-4B-Instruct leverages a tailored RL scheme with process-level reward assignment and a diffusion-specific value model, resulting in state-of-the-art performance in mathematical and coding tasks relative to comparably sized and larger AR-based instruction-tuned models.

1. TraceRL: Trajectory-Aware RL for Diffusion LLMs

TraceRL is a reinforcement learning method designed for DLMs that makes the generation trajectory explicit and exploits it. Instead of conventional RL approaches, where reward is allocated at the sequence level, TraceRL decomposes inference into a trajectory of intermediate "trace steps." Each trace step $\tau_t$ consists of the set of tokens generated at denoising step $t$.

A shrinkage parameter $s$ aggregates $s$ consecutive steps, reducing the number of forward passes necessary for policy updates and improving computational efficiency without sacrificing granularity of credit assignment. The RL objective is based on PPO, with clipped policy ratios and KL regularization, applied at each trace step. Let $\pi_{\theta}$ and $\pi_{\text{old}}$ denote the updated and reference policies. The clipped objective over the $K$ trace steps can be written as
$$\sum_{j=1}^{K} \min\left( \frac{\pi_\theta(\tau_j \mid \tau_{j-1})}{\pi_{\text{old}}(\tau_j \mid \tau_{j-1})}\, A_j,\; \operatorname{clip}\left(\frac{\pi_\theta(\tau_j \mid \tau_{j-1})}{\pi_{\text{old}}(\tau_j \mid \tau_{j-1})},\, 1-\epsilon,\, 1+\epsilon\right) A_j \right)$$
where $A_j$ is the process-level advantage for trace step $j$, possibly computed using the diffusion value model as a baseline.
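
As a minimal sketch of how a trajectory can be split into trace steps and scored, the snippet below groups denoising steps under the shrinkage parameter $s$ and sums token log-probabilities to obtain $\log \pi(\tau_j \mid \tau_{j-1})$ for each trace step. The helper names and data layout are illustrative assumptions, not the released implementation.

```python
import torch

def group_into_trace_steps(step_token_positions, s):
    """Aggregate s consecutive denoising steps into one trace step,
    mirroring the shrinkage parameter described above.

    step_token_positions: list of lists; entry t holds the sequence
        positions of the tokens decoded at denoising step t.
    """
    trace_steps = []
    for i in range(0, len(step_token_positions), s):
        merged = [p for step in step_token_positions[i:i + s] for p in step]
        trace_steps.append(merged)
    return trace_steps

def trace_step_logprobs(token_logprobs, trace_steps):
    """log pi(tau_j | tau_{j-1}) factorizes over the tokens revealed in
    trace step j, so it is the sum of those tokens' log-probabilities.

    token_logprobs: (seq_len,) tensor of per-token log-probabilities.
    Returns a (K,) tensor with one entry per trace step.
    """
    return torch.stack([token_logprobs[positions].sum() for positions in trace_steps])
```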

The diffusion value model outputs token-wise (or step-wise) value estimates conditioned on the trace prefix, providing a baseline that reduces variance, as shown in Proposition 1 of (Wang et al., 8 Sep 2025). The process-level return for trace step $j$ is
$$R_j = r_j + \sum_{k=1}^{|\tau| - t_j} \gamma^{k}\, \frac{1}{|\tau(t_j + k)|} \sum_{l \in \tau(t_j + k)} r_l$$
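
For concreteness, here is a small sketch of this return computation, under the simplifying assumption that $r_j$ is the mean token reward of trace step $j$ itself; variable names are illustrative.

```python
def process_level_returns(token_rewards_per_step, gamma=1.0):
    """Compute R_j = r_j + sum_{k>=1} gamma^k * mean(token rewards of step j+k).

    token_rewards_per_step: list whose j-th entry is the list of token-level
        rewards r_l for trace step j. As a simplifying assumption, r_j is
        taken to be the mean token reward of step j itself.
    """
    step_means = [sum(step) / len(step) for step in token_rewards_per_step]
    returns = []
    for j in range(len(step_means)):
        future = sum(gamma ** k * step_means[j + k]
                     for k in range(1, len(step_means) - j))
        returns.append(step_means[j] + future)
    return returns

# Example with three trace steps and no discounting (gamma = 1):
# process_level_returns([[1.0, 0.0], [0.5], [1.0, 1.0, 1.0]])
# -> [2.0, 1.5, 1.0]
```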

2. Model Performance and Benchmarking

TraDo-4B-Instruct achieves high accuracy in mathematical reasoning and code generation. Notably:

| Model | GSM8K | MATH500 | AIME2024 | LiveCodeBench | LiveBench |
|---|---|---|---|---|---|
| TraDo-4B-Instruct | 89.5 | 75.6 | 64.3 | 81.4 | 69.8 |
| Qwen2.5-7B-Instruct | 89.8 | 74.0 | 61.1 | 79.5 | 67.7 |
| Llama3.1-8B-Instruct | 86.2 | 51.9 | 43.9 | 53.8 | 43.2 |

Metrics correspond to dynamic decoding accuracy in % (from Table 1 of (Wang et al., 8 Sep 2025)). Despite its smaller size, TraDo-4B-Instruct outperforms Qwen2.5-7B-Instruct (7B AR) on every benchmark except GSM8K and surpasses Llama3.1-8B-Instruct (8B AR) across the board.

On MATH500, the TraDo-8B-Thinking variant (a long chain-of-thought DLM derived via curriculum learning and TraceRL) achieves an 18.1% relative accuracy gain over Qwen2.5-7B-Instruct.
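
To make the relative (rather than absolute) nature of this gain concrete, and assuming the baseline is the 74.0% MATH500 accuracy of Qwen2.5-7B-Instruct reported in the table above:
$$\text{relative gain} = \frac{\text{acc}_{\text{TraDo-8B-Thinking}} - \text{acc}_{\text{Qwen}}}{\text{acc}_{\text{Qwen}}} = 0.181 \;\Longrightarrow\; \text{acc}_{\text{TraDo-8B-Thinking}} \approx 1.181 \times 74.0\% \approx 87.4\%$$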

3. Process-Level Credit Assignment and Value Modeling

The use of process-level rewards and the diffusion value model is central to TraceRL. Unlike methods that assign reward only at sequence completion, process-level rewards supervise partial generations, allowing RL updates to reflect decisions at each trace step.

The advantage $A_j$ for each trace step is computed as
$$A_j = R_j - V(\tau_{<j})$$
where $R_j$ is the cumulative process-level reward for trace step $j$ and $V(\tau_{<j})$ is the baseline given by the diffusion-conditioned value network.
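
A hedged sketch of this advantage computation with a value-head baseline follows; the pooled prefix representation, the value-head architecture, and the joint value regression loss are assumptions for illustration, not the paper's exact design.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class DiffusionValueHead(nn.Module):
    """Hypothetical value head mapping a pooled hidden state of the trace
    prefix tau_{<j} to a scalar baseline V(tau_{<j}); the backbone encoder
    producing prefix_hidden is assumed and not shown."""

    def __init__(self, hidden_dim):
        super().__init__()
        self.v = nn.Linear(hidden_dim, 1)

    def forward(self, prefix_hidden):             # (K, hidden_dim)
        return self.v(prefix_hidden).squeeze(-1)  # (K,)

def trace_advantages(returns, prefix_hidden, value_head):
    """A_j = R_j - V(tau_{<j}). The baseline is detached for the policy
    update; the value head is trained with a separate regression loss."""
    values = value_head(prefix_hidden)            # (K,)
    advantages = returns - values.detach()
    value_loss = F.mse_loss(values, returns)
    return advantages, value_loss
```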

This fine-grained reward structure enhances sample efficiency and stabilizes RL by reducing variance in policy updates. The method supports consistent credit assignment whether operating at token-wise or block-wise granularity.

4. Flexible Block Diffusion and Accelerated Inference

TraDo-4B-Instruct supports block diffusion architectures with flexible block size adaptation. TraceRL enables transition from models trained with small blocks ($B=4$) to larger blocks ($B=8$ or more), improving sampling flexibility and inference speed. The adaptation protocol first collects rollouts on small blocks, then enlarges the block size during RL fine-tuning.
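
The adaptation protocol can be sketched as a two-stage schedule. Here sample_rollouts and tracerl_update are caller-supplied placeholders, and the block sizes and iteration counts are illustrative rather than the paper's settings.

```python
def adapt_block_size(policy, prompts, sample_rollouts, tracerl_update,
                     small_block=4, large_block=8,
                     warmup_iters=100, total_iters=300):
    """Two-stage schedule: collect rollouts with the small block size first,
    then continue RL fine-tuning with the larger one.

    sample_rollouts(policy, prompts, block_size) and
    tracerl_update(policy, rollouts) are caller-supplied routines.
    """
    for it in range(total_iters):
        block_size = small_block if it < warmup_iters else large_block
        rollouts = sample_rollouts(policy, prompts, block_size=block_size)
        tracerl_update(policy, rollouts)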

Accelerated inference is achieved via KV-cache techniques for full-attention DLMs and the JetEngine engine for block diffusion models. The accelerated decoding exploits longer KV-cache horizons and block-wise parallelization, minimizing latency for both sampling and RL rollouts.
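
Conceptually, block-wise decoding with caching proceeds as below. This is a schematic sketch only: the model methods (encode_prompt, init_masked_block, denoise_step, commit_block) are hypothetical and do not correspond to JetEngine's actual API.

```python
def block_diffusion_decode(model, prompt_ids, num_blocks, block_size, denoise_steps):
    """Decode one block at a time, caching keys/values of committed blocks
    so earlier context is not re-encoded at every denoising step."""
    kv_cache = model.encode_prompt(prompt_ids)           # hypothetical cache object
    output = []
    for _ in range(num_blocks):
        block = model.init_masked_block(block_size)      # start from a fully masked block
        for _ in range(denoise_steps):
            block = model.denoise_step(block, kv_cache)  # attends to cached context only
        kv_cache = model.commit_block(block, kv_cache)   # append the block's KV to the cache
        output.extend(block)
    return output
```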

5. Curriculum Learning and Long Chain-of-Thought Reasoning

TraceRL is extended to curriculum learning, progressively growing trace step length and complexity during RL training, which enables DLMs to generate extended chains of thought (CoT). The long-CoT DLM, TraDo-8B-Thinking, outperforms previous state-of-the-art models such as Qwen2.5-7B-Instruct on multi-step reasoning benchmarks.

The curriculum approach incrementally increases problem difficulty and reasoning chain length, allowing the DLM to learn robust multi-step strategies under process-level reward assignment.
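
A minimal sketch of such a curriculum schedule is shown below; the difficulty buckets and trace-length caps are illustrative assumptions, not the paper's exact stages.

```python
# Illustrative curriculum: problems are bucketed by difficulty and the cap on
# reasoning-trace length grows across RL stages.
CURRICULUM = [
    {"difficulty": "easy",   "max_trace_tokens": 512},
    {"difficulty": "medium", "max_trace_tokens": 1024},
    {"difficulty": "hard",   "max_trace_tokens": 4096},
]

def curriculum_stages(dataset):
    """Yield (problems, length cap) for each stage; dataset entries are
    assumed to carry a 'difficulty' label."""
    for stage in CURRICULUM:
        problems = [ex for ex in dataset if ex["difficulty"] == stage["difficulty"]]
        yield problems, stage["max_trace_tokens"]
```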

6. Open-Source Training, Deployment, and Practical Use

An open-source implementation is released with code for supervised fine-tuning, multiple RL schemes (including TraceRL), and inference engines for diverse DLM architectures. The framework supports mathematics, code synthesis, and general instruction tasks, integrating process-reward value modeling, block size adaptation, and accelerated decoding routines.

Standard fine-tuning and RL are conducted using the objective (Equation (1) in (Wang et al., 8 Sep 2025)):
$$\mathcal{L}_{\text{TraceRL}} = \mathbb{E}_{\tau}\left[ \sum_{j} \min\left(\text{policy ratio} \cdot A_j,\; \text{clipped ratio} \cdot A_j\right) - \lambda_{\text{KL}}\, \mathrm{KL}\!\left[\pi_{\theta}(\tau_j \mid \tau_{j-1}) \,\|\, \pi_{\text{old}}(\tau_j \mid \tau_{j-1})\right] \right]$$
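
A hedged PyTorch sketch of this objective as a loss to minimize, assuming per-trace-step log-probabilities, advantages, and a per-step KL estimate are already available (e.g., from the helpers sketched in Section 1); the coefficient defaults are illustrative.

```python
import torch

def tracerl_loss(logp_new, logp_old, advantages, kl_per_step, eps=0.2, lam_kl=0.01):
    """Negated form of Equation (1): per-trace-step clipped surrogate minus a
    KL penalty, summed over trace steps and returned as a loss to minimize.

    logp_new, logp_old: (K,) log pi_theta(tau_j | tau_{j-1}) and the old-policy
        counterpart.
    kl_per_step: (K,) estimate of KL[pi_theta || pi_old] per trace step.
    """
    ratio = torch.exp(logp_new - logp_old)
    surrogate = torch.minimum(ratio * advantages,
                              torch.clamp(ratio, 1 - eps, 1 + eps) * advantages)
    objective = (surrogate - lam_kl * kl_per_step).sum()
    return -objective  # minimizing this maximizes the TraceRL objective
```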

Experiments are reproducible and extendable for customized DLMs and RL algorithms.

7. Prospects and Significance

TraDo-4B-Instruct demonstrates that trajectory-aware RL with process-level reward allocation and diffusion-based value modeling enables compact DLMs to compete with or outperform larger AR models on highly structured reasoning tasks. The framework’s flexibility in block size, accelerated inference, and curriculum learning expands the practical deployment landscape for DLMs.

This suggests that RL alignment for generative LLMs benefits from close linkage between training trajectories and inference traces, and that diffusion models can efficiently absorb multi-step reasoning capabilities under process-level supervision.

Emerging research directions include deeper integration of curriculum learning, further scalability of block adaptation methods, and extension to multimodal and generalized alignment tasks.

References
  • Wang et al., 8 Sep 2025.