
TraDo-4B-Instruct Diffusion LLM with TraceRL

Updated 9 September 2025
  • TraDo-4B-Instruct is a diffusion-based language model with 4B parameters that uses a trajectory-aware RL framework for advanced instruction following.
  • It employs TraceRL to assign process-level rewards during iterative denoising, improving sample efficiency and accuracy over autoregressive methods.
  • The model achieves state-of-the-art results in mathematical and coding benchmarks, outperforming larger AR-based instruction-tuned models.

TraDo-4B-Instruct is a 4B-parameter diffusion LLM (DLM) optimized for instruction following and complex reasoning through a trajectory-aware reinforcement learning framework (TraceRL) (Wang et al., 8 Sep 2025). Distinct from autoregressive models, DLMs generate sets of tokens via iterative, probabilistic denoising steps. TraDo-4B-Instruct leverages a tailored RL scheme with process-level reward assignment and a diffusion-specific value model, resulting in state-of-the-art performance in mathematical and coding tasks relative to comparably sized and larger AR-based instruction-tuned models.

1. TraceRL: Trajectory-Aware RL for Diffusion LLMs

TraceRL is a reinforcement learning method designed for DLMs that makes the generation trajectory explicit and exploits it. Instead of conventional RL approaches, where reward is allocated at the sequence level, TraceRL decomposes inference into a trajectory of intermediate "trace steps." Each trace step $\tau_t$ consists of the set of tokens generated at denoising step $t$.

A shrinkage parameter $s$ aggregates $s$ consecutive steps, reducing the number of forward passes necessary for policy updates and improving computational efficiency without sacrificing granularity of credit assignment. The RL objective is based on PPO, with clipped policy ratios and KL regularization, applied at each trace step. Let $\pi_{\theta}$ and $\pi_{\text{old}}$ denote the updated and reference policies. The clipped objective over the $K$ trace steps can be written as
$$\sum_{j=1}^{K} \min\left( \frac{\pi_\theta(\tau_j \mid \tau_{j-1})}{\pi_{\text{old}}(\tau_j \mid \tau_{j-1})}\, A_j,\; \operatorname{clip}\left(\frac{\pi_\theta(\tau_j \mid \tau_{j-1})}{\pi_{\text{old}}(\tau_j \mid \tau_{j-1})},\, 1-\epsilon,\, 1+\epsilon\right) A_j \right)$$
where $A_j$ is the process-level advantage for trace step $j$, possibly computed using the diffusion value model as a baseline.
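
As a minimal sketch of how a trajectory can be split into trace steps and scored, the snippet below groups denoising steps under the shrinkage parameter $s$ and sums token log-probabilities to obtain $\log \pi(\tau_j \mid \tau_{j-1})$ for each trace step. The helper names and data layout are illustrative assumptions, not the released implementation.

```python
import torch

def group_into_trace_steps(step_token_positions, s):
    """Aggregate s consecutive denoising steps into one trace step,
    mirroring the shrinkage parameter described above.

    step_token_positions: list of lists; entry t holds the sequence
        positions of the tokens decoded at denoising step t.
    """
    trace_steps = []
    for i in range(0, len(step_token_positions), s):
        merged = [p for step in step_token_positions[i:i + s] for p in step]
        trace_steps.append(merged)
    return trace_steps

def trace_step_logprobs(token_logprobs, trace_steps):
    """log pi(tau_j | tau_{j-1}) factorizes over the tokens revealed in
    trace step j, so it is the sum of those tokens' log-probabilities.

    token_logprobs: (seq_len,) tensor of per-token log-probabilities.
    Returns a (K,) tensor with one entry per trace step.
    """
    return torch.stack([token_logprobs[positions].sum() for positions in trace_steps])
```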

The diffusion value model outputs token-wise (or step-wise) value estimates conditioned on the trace prefix, providing a baseline that reduces variance, as shown in Proposition 1 of (Wang et al., 8 Sep 2025). The process-level return for trace step $j$ is
$$R_j = r_j + \sum_{k=1}^{|\tau| - t_j} \gamma^{k}\, \frac{1}{|\tau(t_j + k)|} \sum_{l \in \tau(t_j + k)} r_l$$
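
For concreteness, here is a small sketch of this return computation, under the simplifying assumption that $r_j$ is the mean token reward of trace step $j$ itself; variable names are illustrative.

```python
def process_level_returns(token_rewards_per_step, gamma=1.0):
    """Compute R_j = r_j + sum_{k>=1} gamma^k * mean(token rewards of step j+k).

    token_rewards_per_step: list whose j-th entry is the list of token-level
        rewards r_l for trace step j. As a simplifying assumption, r_j is
        taken to be the mean token reward of step j itself.
    """
    step_means = [sum(step) / len(step) for step in token_rewards_per_step]
    returns = []
    for j in range(len(step_means)):
        future = sum(gamma ** k * step_means[j + k]
                     for k in range(1, len(step_means) - j))
        returns.append(step_means[j] + future)
    return returns

# Example with three trace steps and no discounting (gamma = 1):
# process_level_returns([[1.0, 0.0], [0.5], [1.0, 1.0, 1.0]])
# -> [2.0, 1.5, 1.0]
```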

2. Model Performance and Benchmarking

TraDo-4B-Instruct achieves high accuracy in mathematical reasoning and code generation. Notably:

| Model | GSM8K | MATH500 | AIME2024 | LiveCodeBench | LiveBench |
|---|---|---|---|---|---|
| TraDo-4B-Instruct | 89.5 | 75.6 | 64.3 | 81.4 | 69.8 |
| Qwen2.5-7B-Instruct | 89.8 | 74.0 | 61.1 | 79.5 | 67.7 |
| Llama3.1-8B-Instruct | 86.2 | 51.9 | 43.9 | 53.8 | 43.2 |

Metrics correspond to dynamic decoding accuracy in % (from Table 1 of (Wang et al., 8 Sep 2025)). Despite its smaller size, TraDo-4B-Instruct outperforms Qwen2.5-7B-Instruct (7B AR) on every benchmark except GSM8K and surpasses Llama3.1-8B-Instruct (8B AR) across the board.

On MATH500, the TraDo-8B-Thinking variant (a long chain-of-thought DLM derived via curriculum learning and TraceRL) achieves an 18.1% relative accuracy gain over Qwen2.5-7B-Instruct.
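
To make the relative (rather than absolute) nature of this gain concrete, and assuming the baseline is the 74.0% MATH500 accuracy of Qwen2.5-7B-Instruct reported in the table above:
$$\text{relative gain} = \frac{\text{acc}_{\text{TraDo-8B-Thinking}} - \text{acc}_{\text{Qwen}}}{\text{acc}_{\text{Qwen}}} = 0.181 \;\Longrightarrow\; \text{acc}_{\text{TraDo-8B-Thinking}} \approx 1.181 \times 74.0\% \approx 87.4\%$$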

3. Process-Level Credit Assignment and Value Modeling

The use of process-level rewards and the diffusion value model is central to TraceRL. Unlike methods that assign reward only at sequence completion, process-level rewards supervise partial generations, allowing RL updates to reflect decisions at each trace step.

The advantage $A_j$ for each trace step is computed as
$$A_j = R_j - V(\tau_{<j})$$
where $R_j$ is the cumulative process-level reward for trace step $j$ and $V(\tau_{<j})$ is the baseline given by the diffusion-conditioned value network.
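
A hedged sketch of this advantage computation with a value-head baseline follows; the pooled prefix representation, the value-head architecture, and the joint value regression loss are assumptions for illustration, not the paper's exact design.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class DiffusionValueHead(nn.Module):
    """Hypothetical value head mapping a pooled hidden state of the trace
    prefix tau_{<j} to a scalar baseline V(tau_{<j}); the backbone encoder
    producing prefix_hidden is assumed and not shown."""

    def __init__(self, hidden_dim):
        super().__init__()
        self.v = nn.Linear(hidden_dim, 1)

    def forward(self, prefix_hidden):             # (K, hidden_dim)
        return self.v(prefix_hidden).squeeze(-1)  # (K,)

def trace_advantages(returns, prefix_hidden, value_head):
    """A_j = R_j - V(tau_{<j}). The baseline is detached for the policy
    update; the value head is trained with a separate regression loss."""
    values = value_head(prefix_hidden)            # (K,)
    advantages = returns - values.detach()
    value_loss = F.mse_loss(values, returns)
    return advantages, value_loss
```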

This fine-grained reward structure enhances sample efficiency and stabilizes RL by reducing variance in policy updates. The method supports consistent credit assignment whether operating at token-wise or block-wise granularity.

4. Flexible Block Diffusion and Accelerated Inference

TraDo-4B-Instruct supports block diffusion architectures with flexible block size adaptation. TraceRL enables transition from models trained with small blocks ($B=4$) to larger blocks ($B=8$ or more), improving sampling flexibility and inference speed. The adaptation protocol first collects rollouts on small blocks, then enlarges the block size during RL fine-tuning.
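
The adaptation protocol can be sketched as a two-stage schedule. Here sample_rollouts and tracerl_update are caller-supplied placeholders, and the block sizes and iteration counts are illustrative rather than the paper's settings.

```python
def adapt_block_size(policy, prompts, sample_rollouts, tracerl_update,
                     small_block=4, large_block=8,
                     warmup_iters=100, total_iters=300):
    """Two-stage schedule: collect rollouts with the small block size first,
    then continue RL fine-tuning with the larger one.

    sample_rollouts(policy, prompts, block_size) and
    tracerl_update(policy, rollouts) are caller-supplied routines.
    """
    for it in range(total_iters):
        block_size = small_block if it < warmup_iters else large_block
        rollouts = sample_rollouts(policy, prompts, block_size=block_size)
        tracerl_update(policy, rollouts)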

Accelerated inference is achieved via KV-cache techniques for full-attention DLMs and the JetEngine engine for block diffusion models. The accelerated decoding exploits longer KV-cache horizons and block-wise parallelization, minimizing latency for both sampling and RL rollouts.
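
Conceptually, block-wise decoding with caching proceeds as below. This is a schematic sketch only: the model methods (encode_prompt, init_masked_block, denoise_step, commit_block) are hypothetical and do not correspond to JetEngine's actual API.

```python
def block_diffusion_decode(model, prompt_ids, num_blocks, block_size, denoise_steps):
    """Decode one block at a time, caching keys/values of committed blocks
    so earlier context is not re-encoded at every denoising step."""
    kv_cache = model.encode_prompt(prompt_ids)           # hypothetical cache object
    output = []
    for _ in range(num_blocks):
        block = model.init_masked_block(block_size)      # start from a fully masked block
        for _ in range(denoise_steps):
            block = model.denoise_step(block, kv_cache)  # attends to cached context only
        kv_cache = model.commit_block(block, kv_cache)   # append the block's KV to the cache
        output.extend(block)
    return output
```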

5. Curriculum Learning and Long Chain-of-Thought Reasoning

TraceRL is extended to curriculum learning, progressively growing trace step length and complexity during RL training, which enables DLMs to generate extended chains of thought (CoT). The long-CoT DLM, TraDo-8B-Thinking, outperforms previous state-of-the-art models such as Qwen2.5-7B-Instruct on multi-step reasoning benchmarks.

The curriculum approach incrementally increases problem difficulty and reasoning chain length, allowing the DLM to learn robust multi-step strategies under process-level reward assignment.
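
A minimal sketch of such a curriculum schedule is shown below; the difficulty buckets and trace-length caps are illustrative assumptions, not the paper's exact stages.

```python
# Illustrative curriculum: problems are bucketed by difficulty and the cap on
# reasoning-trace length grows across RL stages.
CURRICULUM = [
    {"difficulty": "easy",   "max_trace_tokens": 512},
    {"difficulty": "medium", "max_trace_tokens": 1024},
    {"difficulty": "hard",   "max_trace_tokens": 4096},
]

def curriculum_stages(dataset):
    """Yield (problems, length cap) for each stage; dataset entries are
    assumed to carry a 'difficulty' label."""
    for stage in CURRICULUM:
        problems = [ex for ex in dataset if ex["difficulty"] == stage["difficulty"]]
        yield problems, stage["max_trace_tokens"]
```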

6. Open-Source Training, Deployment, and Practical Use

An open-source implementation is released with code for supervised fine-tuning, multiple RL schemes (including TraceRL), and inference engines for diverse DLM architectures. The framework supports mathematics, code synthesis, and general instruction tasks, integrating process-reward value modeling, block size adaptation, and accelerated decoding routines.

Standard fine-tuning and RL are conducted using the objective (Equation (1) in (Wang et al., 8 Sep 2025)):
$$\mathcal{L}_{\text{TraceRL}} = \mathbb{E}_{\tau}\left[ \sum_{j} \min\left(\text{policy ratio} \cdot A_j,\; \text{clipped ratio} \cdot A_j\right) - \lambda_{\text{KL}}\, \mathrm{KL}\!\left[\pi_{\theta}(\tau_j \mid \tau_{j-1}) \,\|\, \pi_{\text{old}}(\tau_j \mid \tau_{j-1})\right] \right]$$
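
A hedged PyTorch sketch of this objective as a loss to minimize, assuming per-trace-step log-probabilities, advantages, and a per-step KL estimate are already available (e.g., from the helpers sketched in Section 1); the coefficient defaults are illustrative.

```python
import torch

def tracerl_loss(logp_new, logp_old, advantages, kl_per_step, eps=0.2, lam_kl=0.01):
    """Negated form of Equation (1): per-trace-step clipped surrogate minus a
    KL penalty, summed over trace steps and returned as a loss to minimize.

    logp_new, logp_old: (K,) log pi_theta(tau_j | tau_{j-1}) and the old-policy
        counterpart.
    kl_per_step: (K,) estimate of KL[pi_theta || pi_old] per trace step.
    """
    ratio = torch.exp(logp_new - logp_old)
    surrogate = torch.minimum(ratio * advantages,
                              torch.clamp(ratio, 1 - eps, 1 + eps) * advantages)
    objective = (surrogate - lam_kl * kl_per_step).sum()
    return -objective  # minimizing this maximizes the TraceRL objective
```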

Experiments are reproducible and extendable for customized DLMs and RL algorithms.

7. Prospects and Significance

TraDo-4B-Instruct demonstrates that trajectory-aware RL with process-level reward allocation and diffusion-based value modeling enables compact DLMs to compete with or outperform larger AR models on highly structured reasoning tasks. The framework’s flexibility in block size, accelerated inference, and curriculum learning expands the practical deployment landscape for DLMs.

This suggests that RL alignment for generative LLMs benefits from close linkage between training trajectories and inference traces, and that diffusion models can efficiently absorb multi-step reasoning capabilities under process-level supervision.

Emerging research directions include deeper integration of curriculum learning, further scalability of block adaptation methods, and extension to multimodal and generalized alignment tasks.

References
  • Wang et al., 8 Sep 2025.