ESPO: ELBO-based Sequence-level Policy Optimization
- The paper introduces ESPO, a reinforcement learning framework that treats entire sequence generation as a single decision using an ELBO surrogate to overcome token-level limitations.
- It employs importance ratio normalization, PPO-style clipping, and robust quadratic KL regularization to ensure stable and effective policy updates.
- Empirical results show significant gains in tasks like planning, math, and coding, highlighting ESPO’s scalability and practical advantages.
ELBO-based Sequence-level Policy Optimization (ESPO) is a reinforcement learning (RL) framework designed for fine-tuning diffusion LLMs (DLMs) by treating entire sequence generation as a single atomic decision, using the evidence lower bound (ELBO) as a tractable proxy for the intractable sequence-level likelihood. ESPO addresses key incompatibilities between RL algorithms designed for autoregressive LLMs and the non-autoregressive, iterative denoising characteristic of DLMs, enabling principled and stable policy optimization in domains such as mathematical reasoning, coding, and planning (Ou et al., 3 Dec 2025, Rojas et al., 9 Oct 2025).
1. Formal Objective and Theoretical Foundation
Standard RL policy gradients optimize

$$J(\theta) = \mathbb{E}_{x \sim \mathcal{D},\; y \sim \pi_\theta(\cdot \mid x)}\big[\, r(x, y) \,\big],$$

where $r(x, y)$ is a reward for completion $y$ given prompt $x$. Token-level RL objectives are directly applicable to autoregressive models due to their explicit conditional factorization $\pi_\theta(y \mid x) = \prod_k \pi_\theta(y^k \mid y^{<k}, x)$. However, DLMs lack such a factorization: sequence likelihood is defined implicitly through diffusion-style denoising, making per-token RL inapplicable.
ESPO overcomes this by directly optimizing sequence-level rewards. The policy is updated via an off-policy group-relative (GRPO-style) surrogate, but with the complete output sequence treated as a single decision:

$$\mathcal{J}_{\text{ESPO}}(\theta) = \mathbb{E}\!\left[\frac{1}{G}\sum_{i=1}^{G} \min\Big(\rho_i(\theta)\, A_i,\; \operatorname{clip}\big(\rho_i(\theta),\, 1-\epsilon,\, 1+\epsilon\big)\, A_i\Big)\right],$$

where the sequence-level importance ratio is

$$\rho_i(\theta) = \frac{\pi_\theta(y_i \mid x)}{\pi_{\theta_{\text{old}}}(y_i \mid x)}.$$
Direct computation of $\log \pi_\theta(y \mid x)$ is intractable. Instead, ESPO uses the standard evidence lower bound (ELBO) in a $t$-masked form:

$$\mathcal{B}_\theta(y \mid x) = \mathbb{E}_{t \sim \mathcal{U}(0,1),\; y_t \sim q(y_t \mid y)}\!\left[\frac{1}{t}\sum_{k:\, y_t^k = \texttt{[MASK]}} \log \pi_\theta\big(y^k \mid y_t, x\big)\right],$$

with the guarantee $\mathcal{B}_\theta(y \mid x) \le \log \pi_\theta(y \mid x)$. Substituting $\mathcal{B}_\theta$ for $\log \pi_\theta$ yields a practical, lower-bounding sequence-likelihood proxy.
Sequence-level RL with ELBO surrogates thus restores formal consistency (avoiding token-level decompositions) and enables non-autoregressive policy optimization.
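For concreteness, the following PyTorch-style sketch shows how such an ELBO proxy can be estimated for a single completion. The `model(prompt_ids, noisy_y)` denoiser interface and the tensor shapes are illustrative assumptions, not the papers' implementation.

```python
import torch

def elbo_estimate(model, prompt_ids, y_ids, mask_id, n_samples=1):
    """Monte Carlo estimate of the masked-diffusion ELBO B_theta(y | x),
    a lower bound on log pi_theta(y | x).

    Assumes a hypothetical denoiser `model(prompt_ids, noisy_y)` returning
    per-position logits of shape (len(y), vocab) for the response span.
    """
    L = y_ids.shape[-1]
    total = torch.zeros(())
    for _ in range(n_samples):
        t = torch.rand(())                                  # mask fraction t ~ U(0, 1)
        mask = torch.rand(L) < t                            # random masking pattern y_t
        noisy_y = torch.where(mask, torch.full_like(y_ids, mask_id), y_ids)
        logits = model(prompt_ids, noisy_y)                 # (L, vocab)
        logp = torch.log_softmax(logits, dim=-1)
        token_lp = logp.gather(-1, y_ids.unsqueeze(-1)).squeeze(-1)
        # Only masked positions contribute, re-weighted by 1/t.
        total = total + (token_lp * mask).sum() / t.clamp_min(1e-3)
    return total / n_samples
```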
2. Importance-Ratio Normalization and Clipping
Because the difference $\mathcal{B}_\theta(y \mid x) - \mathcal{B}_{\theta_{\text{old}}}(y \mid x)$ grows with sequence length $|y|$, the naive sequence ratio

$$\rho_i(\theta) = \exp\big(\mathcal{B}_\theta(y_i \mid x) - \mathcal{B}_{\theta_{\text{old}}}(y_i \mid x)\big)$$

can produce extreme values for long sequences. To remedy this, ESPO normalizes per token:

$$\rho_i(\theta) = \exp\!\left(\frac{1}{|y_i|}\Big(\mathcal{B}_\theta(y_i \mid x) - \mathcal{B}_{\theta_{\text{old}}}(y_i \mid x)\Big)\right).$$

This stabilization is critical for practical training. Further, PPO-style clipping restricts $\rho_i(\theta)$ to $[1-\epsilon,\, 1+\epsilon]$ during the surrogate loss calculation, preventing instability from rare large-ratio outliers.
Ablations report that per-token normalization is necessary: without it, ratios explode or vanish, undermining optimization (Ou et al., 3 Dec 2025).
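A minimal sketch of the length-normalized, clipped sequence-level surrogate described above (PyTorch-style; tensor names and shapes are illustrative):

```python
import torch

def espo_surrogate(elbo_new, elbo_old, advantages, lengths, eps=0.2):
    """Clipped GRPO-style surrogate with per-token normalized sequence ratios.

    elbo_new, elbo_old: ELBO estimates under pi_theta and pi_theta_old, shape (G,)
    advantages:         centered group-relative advantages A_i, shape (G,)
    lengths:            response lengths |y_i|, shape (G,)
    """
    # Normalizing the log-ratio by |y_i| keeps rho_i in a sane range
    # even for long sequences.
    log_ratio = (elbo_new - elbo_old.detach()) / lengths
    ratio = torch.exp(log_ratio)
    clipped = torch.clamp(ratio, 1.0 - eps, 1.0 + eps)
    # PPO-style pessimistic objective; negate so it can be minimized.
    return -torch.min(ratio * advantages, clipped * advantages).mean()
```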
3. Efficient ELBO Estimation and Variance Reduction
Estimating $\mathcal{B}_\theta(y \mid x)$ for each sequence is computationally demanding. The ELBO is defined as an expectation over continuous mask fractions $t$ and random masking patterns $y_t$, with

$$\mathcal{B}_\theta(y \mid x) = \int_0^1 \mathbb{E}_{y_t \sim q(y_t \mid y,\, t)}\!\left[\frac{1}{t}\sum_{k:\, y_t^k = \texttt{[MASK]}} \log \pi_\theta\big(y^k \mid y_t, x\big)\right] dt.$$

Naive double Monte Carlo estimation suffers from severe variance: the variance contributed by sampling the mask fraction $t$ dominates, and many network forward passes per sequence are required for stable estimates (Rojas et al., 9 Oct 2025).
Semi-deterministic Monte Carlo (SDMC) / Quadrature Scheme: ESPO and its extensions (e.g., Group Diffusion Policy Optimization, GDPO) employ deterministic quadrature for the integral over $t$ and minimal Monte Carlo sampling over masking patterns, so that

$$\hat{\mathcal{B}}_\theta(y \mid x) = \sum_{j=1}^{N} w_j\, \hat{\ell}_\theta(y, t_j, x), \qquad \hat{\ell}_\theta(y, t_j, x) = \frac{1}{t_j}\sum_{k:\, y_{t_j}^k = \texttt{[MASK]}} \log \pi_\theta\big(y^k \mid y_{t_j}, x\big),$$

with only a few quadrature points (at most $3$) and a single inner mask sample per point, leading to dramatically lower estimator variance without additional compute.
Variance Decomposition and Theoretical Guarantees: The total mean squared error of the SDMC estimator decomposes into a Monte Carlo variance term scaling as $O(1/(NM))$ in the number of quadrature points $N$ and mask samples $M$ per point, plus a quadrature bias term that decays rapidly with $N$ (midpoint/trapezoidal: $O(N^{-2})$; Simpson: $O(N^{-4})$). This ensures practical, low-variance estimates with few network evaluations (Rojas et al., 9 Oct 2025).
| Estimator | Variance Term | Bias Term |
|---|---|---|
| Riemann (generic) | $O(1/(NM))$ | $O(N^{-1})$ |
| Smooth/Quadrature (midpoint, trapezoidal, Simpson) | $O(1/(NM))$ | $O(N^{-2})$ to $O(N^{-4})$ |
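The semi-deterministic scheme above can be sketched as follows (PyTorch-style, reusing the hypothetical `model` interface from the earlier sketch). The equal-weight interior grid over $t$ is an assumption chosen for simplicity; the papers analyze midpoint, trapezoidal, and Simpson rules.

```python
import torch

def sdmc_elbo(model, prompt_ids, y_ids, mask_id, n_points=3):
    """Semi-deterministic Monte Carlo ELBO estimate: deterministic nodes over
    the mask fraction t, one random mask sample per node."""
    L = y_ids.shape[-1]
    # Interior nodes avoid the singular 1/t weight at t = 0.
    ts = torch.linspace(0.0, 1.0, n_points + 2)[1:-1]
    weights = torch.full((n_points,), 1.0 / n_points)       # equal quadrature weights
    est = torch.zeros(())
    for t, w in zip(ts, weights):
        mask = torch.rand(L) < t                            # single mask sample at node t
        noisy_y = torch.where(mask, torch.full_like(y_ids, mask_id), y_ids)
        logp = torch.log_softmax(model(prompt_ids, noisy_y), dim=-1)
        token_lp = logp.gather(-1, y_ids.unsqueeze(-1)).squeeze(-1)
        est = est + w * (token_lp * mask).sum() / t
    return est
```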
4. Robust KL Regularization
A KL regularizer penalizes deviation of $\pi_\theta$ from a reference policy $\pi_{\text{ref}}$ to maintain conservative policy updates. While exponential-based KL estimators (e.g., $r - 1 - \log r$ with $r = \exp\big(\mathcal{B}_{\text{ref}}(y \mid x) - \mathcal{B}_\theta(y \mid x)\big)$) can be unstable due to $\exp(\cdot)$ effects for long sequences, ESPO instead applies the quadratic estimator:

$$\hat{D}_{\mathrm{KL}}\big(\pi_\theta \,\|\, \pi_{\text{ref}}\big) = \tfrac{1}{2}\big(\mathcal{B}_\theta(y \mid x) - \mathcal{B}_{\text{ref}}(y \mid x)\big)^2.$$

This provides unbiased gradients for the KL divergence, is free of exponentials, and remains robust for extended outputs (Ou et al., 3 Dec 2025).
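A sketch of the quadratic penalty with ELBOs substituted for log-likelihoods (whether any additional length normalization is applied is left open here):

```python
import torch

def quadratic_kl(elbo_theta, elbo_ref):
    """k2-style quadratic KL surrogate on ELBO gaps: exponential-free,
    so it stays finite even for very long outputs."""
    return 0.5 * ((elbo_theta - elbo_ref) ** 2).mean()
```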
5. Algorithmic Workflow and Practical Implementation
A typical ESPO (or GDPO) RL iteration proceeds as follows:
- For each prompt $x$, sample $G$ completions $\{y_i\}_{i=1}^{G}$ from the behavior policy $\pi_{\theta_{\text{old}}}$ via a diffusion sampler.
- Compute rewards $r_i = r(x, y_i)$ and centered advantages $A_i = r_i - \frac{1}{G}\sum_{j=1}^{G} r_j$.
- For each $y_i$, estimate $\mathcal{B}_\theta(y_i \mid x)$ and $\mathcal{B}_{\theta_{\text{old}}}(y_i \mid x)$ using SDMC quadrature.
- Compute length-normalized, clipped importance ratios and the robust quadratic KL divergence.
- Formulate the surrogate loss as the mean of clipped, importance-weighted advantages, add the KL penalty, and take a gradient step; a schematic sketch of this loop follows the hyperparameter list below.

Key training hyperparameters include the group size $G$ (e.g., $6$–$16$), the number of quadrature points $N$ (up to $3$), the number of Monte Carlo masks per quadrature point (up to $2$), the GRPO clipping parameter $\epsilon$, and the KL penalty weight $\beta$.
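Putting the pieces together, the sketch below outlines one such iteration for a single prompt, reusing the hypothetical helpers from the earlier sketches (`sdmc_elbo`, `espo_surrogate`, `quadratic_kl`); the `sampler` and `reward_fn` callables are placeholders, not the papers' implementation.

```python
import torch

def espo_step(model, old_model, ref_model, sampler, reward_fn,
              prompt_ids, optimizer, mask_id, G=8, eps=0.2, beta=0.01):
    """One schematic ESPO/GDPO iteration for a single prompt (illustrative)."""
    # 1. Sample G completions from the behavior policy's diffusion sampler.
    ys = [sampler(old_model, prompt_ids) for _ in range(G)]
    lengths = torch.tensor([float(y.shape[-1]) for y in ys])

    # 2. Rewards and centered, group-relative advantages.
    rewards = torch.tensor([float(reward_fn(prompt_ids, y)) for y in ys])
    advantages = rewards - rewards.mean()

    # 3. SDMC ELBO estimates under the current, behavior, and reference policies.
    b_new = torch.stack([sdmc_elbo(model, prompt_ids, y, mask_id) for y in ys])
    with torch.no_grad():
        b_old = torch.stack([sdmc_elbo(old_model, prompt_ids, y, mask_id) for y in ys])
        b_ref = torch.stack([sdmc_elbo(ref_model, prompt_ids, y, mask_id) for y in ys])

    # 4.-5. Clipped surrogate plus quadratic KL penalty, then one gradient step.
    loss = espo_surrogate(b_new, b_old, advantages, lengths, eps) \
           + beta * quadratic_kl(b_new, b_ref)
    optimizer.zero_grad()
    loss.backward()
    optimizer.step()
    return loss.item()
```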
6. Empirical Results and Benchmark Performance
On mathematical reasoning, planning, and code-generation tasks, ESPO demonstrates consistent, often dramatic, improvements over token-level RL baselines and one-step unmasking methods (diffu-GRPO).
| Method | GSM8K | MATH | Countdown | Sudoku | HumanEval-avg | MBPP-avg |
|---|---|---|---|---|---|---|
| Base | 75.9 | 37.0 | 18.7 | 15.7 | 37.8 | 37.8 |
| +d1 (GRPO) | 78.0 | 37.7 | 33.9 | 22.2 | 37.2 | 36.5 |
| +wd1 | 80.1 | 36.9 | 48.3 | 23.1 | 37.2 | 36.5 |
| +ESPO | 82.0 | 39.5 | 81.0 | 86.0 | 40.1 | 45.4 |
| Δ vs base | +6.1 | +2.5 | +62.3 | +70.3 | +2.3 | +7.6 |
These results show especially large gains in planning (Countdown and Sudoku: +62.3 and +70.3 points, respectively) and consistent improvements in math and coding, with evaluation extending to long generation lengths, confirming scalability (Ou et al., 3 Dec 2025, Rojas et al., 9 Oct 2025). Sequence-level optimization with the ELBO proxy consistently outperforms token-level proxies (mean-field or token-ELBO).
7. Limitations, Extensions, and Future Directions
- The practical efficacy of ESPO is closely tied to the tightness of the ELBO as a likelihood surrogate; a plausible implication is that large variance or bias in the ELBO estimate could disrupt optimization for some distributions, though no such breakdown was observed empirically.
- While ESPO’s convergence is empirically robust, no formal convergence guarantees are established beyond those of standard PPO.
- Future directions include integrating learned value functions, extending to multimodal diffusion LLMs, and exploring cost reductions via distillation or adaptive masking.
- For tasks demanding long-range, sequence-level coherence (e.g., planning, full-program synthesis), ESPO’s atomic sequence-level optimization provides a distinct practical advantage over token-level denoising RL methods.
This framework establishes sequence-level RL based on ELBO surrogates as a new paradigm for RL in diffusion LLMs, combining theoretical justification, stable optimization, and large empirical gains across challenging domains (Ou et al., 3 Dec 2025, Rojas et al., 9 Oct 2025).