ESPO: ELBO-based Sequence-level Policy Optimization
- The paper introduces ESPO, a reinforcement learning framework that treats entire sequence generation as a single decision using an ELBO surrogate to overcome token-level limitations.
- It employs importance ratio normalization, PPO-style clipping, and robust quadratic KL regularization to ensure stable and effective policy updates.
- Empirical results show significant gains in tasks like planning, math, and coding, highlighting ESPO’s scalability and practical advantages.
ELBO-based Sequence-level Policy Optimization (ESPO) is a reinforcement learning (RL) framework designed for fine-tuning diffusion LLMs (DLMs) by treating entire sequence generation as a single atomic decision, using the evidence lower bound (ELBO) as a tractable proxy for the intractable sequence-level likelihood. ESPO addresses key incompatibilities between RL algorithms designed for autoregressive LLMs and the non-autoregressive, iterative denoising characteristic of DLMs, enabling principled and stable policy optimization in domains such as mathematical reasoning, coding, and planning (Ou et al., 3 Dec 2025, Rojas et al., 9 Oct 2025).
1. Formal Objective and Theoretical Foundation
Standard RL policy gradients optimize

$$J(\theta) = \mathbb{E}_{x \sim \mathcal{D},\; y \sim \pi_\theta(\cdot \mid x)}\big[\, r(x, y) \,\big],$$

where $r(x, y)$ is a reward for completion $y$ given prompt $x$. Token-level RL objectives are directly applicable to autoregressive models due to their explicit conditional factorization $\pi_\theta(y \mid x) = \prod_k \pi_\theta(y^k \mid y^{<k}, x)$. However, DLMs lack such a factorization: sequence likelihood is defined implicitly through diffusion-style denoising, making per-token RL inapplicable.
ESPO overcomes this by directly optimizing sequence-level rewards. The policy is updated via an off-policy group-relative (GRPO-style) surrogate, but with the complete output sequence treated as a single decision:

$$\mathcal{J}_{\text{ESPO}}(\theta) = \mathbb{E}\!\left[\frac{1}{G}\sum_{i=1}^{G} \min\Big(\rho_i(\theta)\, A_i,\; \operatorname{clip}\big(\rho_i(\theta),\, 1-\epsilon,\, 1+\epsilon\big)\, A_i\Big)\right],$$

where the sequence-level importance ratio is

$$\rho_i(\theta) = \frac{\pi_\theta(y_i \mid x)}{\pi_{\theta_{\text{old}}}(y_i \mid x)}.$$
Direct computation of $\log \pi_\theta(y \mid x)$ is intractable. Instead, ESPO uses the standard evidence lower bound (ELBO) in a $t$-masked form:

$$\mathcal{B}_\theta(y \mid x) = \mathbb{E}_{t \sim \mathcal{U}(0,1),\; y_t \sim q(y_t \mid y)}\!\left[\frac{1}{t}\sum_{k:\, y_t^k = \texttt{[MASK]}} \log \pi_\theta\big(y^k \mid y_t, x\big)\right],$$

with the guarantee $\mathcal{B}_\theta(y \mid x) \le \log \pi_\theta(y \mid x)$. Substituting $\mathcal{B}_\theta$ for $\log \pi_\theta$ yields a practical, lower-bounding sequence-likelihood proxy.
Sequence-level RL with ELBO surrogates thus restores formal consistency (avoiding token-level decompositions) and enables non-autoregressive policy optimization.
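For concreteness, the following PyTorch-style sketch shows how such an ELBO proxy can be estimated for a single completion. The `model(prompt_ids, noisy_y)` denoiser interface and the tensor shapes are illustrative assumptions, not the papers' implementation.

```python
import torch

def elbo_estimate(model, prompt_ids, y_ids, mask_id, n_samples=1):
    """Monte Carlo estimate of the masked-diffusion ELBO B_theta(y | x),
    a lower bound on log pi_theta(y | x).

    Assumes a hypothetical denoiser `model(prompt_ids, noisy_y)` returning
    per-position logits of shape (len(y), vocab) for the response span.
    """
    L = y_ids.shape[-1]
    total = torch.zeros(())
    for _ in range(n_samples):
        t = torch.rand(())                                  # mask fraction t ~ U(0, 1)
        mask = torch.rand(L) < t                            # random masking pattern y_t
        noisy_y = torch.where(mask, torch.full_like(y_ids, mask_id), y_ids)
        logits = model(prompt_ids, noisy_y)                 # (L, vocab)
        logp = torch.log_softmax(logits, dim=-1)
        token_lp = logp.gather(-1, y_ids.unsqueeze(-1)).squeeze(-1)
        # Only masked positions contribute, re-weighted by 1/t.
        total = total + (token_lp * mask).sum() / t.clamp_min(1e-3)
    return total / n_samples
```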
2. Importance-Ratio Normalization and Clipping
Because the difference $\mathcal{B}_\theta(y \mid x) - \mathcal{B}_{\theta_{\text{old}}}(y \mid x)$ grows with sequence length $|y|$, the naive sequence ratio

$$\rho_i(\theta) = \exp\big(\mathcal{B}_\theta(y_i \mid x) - \mathcal{B}_{\theta_{\text{old}}}(y_i \mid x)\big)$$

can produce extreme values for long sequences. To remedy this, ESPO normalizes per token:

$$\rho_i(\theta) = \exp\!\left(\frac{1}{|y_i|}\Big(\mathcal{B}_\theta(y_i \mid x) - \mathcal{B}_{\theta_{\text{old}}}(y_i \mid x)\Big)\right).$$

This stabilization is critical for practical training. Further, PPO-style clipping restricts $\rho_i(\theta)$ to $[1-\epsilon,\, 1+\epsilon]$ during the surrogate loss calculation, preventing instability from rare large-ratio outliers.
Ablations report that per-token normalization is necessary: without it, ratios explode or vanish, undermining optimization (Ou et al., 3 Dec 2025).
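A minimal sketch of the length-normalized, clipped sequence-level surrogate described above (PyTorch-style; tensor names and shapes are illustrative):

```python
import torch

def espo_surrogate(elbo_new, elbo_old, advantages, lengths, eps=0.2):
    """Clipped GRPO-style surrogate with per-token normalized sequence ratios.

    elbo_new, elbo_old: ELBO estimates under pi_theta and pi_theta_old, shape (G,)
    advantages:         centered group-relative advantages A_i, shape (G,)
    lengths:            response lengths |y_i|, shape (G,)
    """
    # Normalizing the log-ratio by |y_i| keeps rho_i in a sane range
    # even for long sequences.
    log_ratio = (elbo_new - elbo_old.detach()) / lengths
    ratio = torch.exp(log_ratio)
    clipped = torch.clamp(ratio, 1.0 - eps, 1.0 + eps)
    # PPO-style pessimistic objective; negate so it can be minimized.
    return -torch.min(ratio * advantages, clipped * advantages).mean()
```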
3. Efficient ELBO Estimation and Variance Reduction
Estimating $\mathcal{B}_\theta(y \mid x)$ for each sequence is computationally demanding. The ELBO is defined as an expectation over continuous mask fractions $t$ and random masking patterns $y_t$, with

$$\mathcal{B}_\theta(y \mid x) = \int_0^1 \mathbb{E}_{y_t \sim q(y_t \mid y,\, t)}\!\left[\frac{1}{t}\sum_{k:\, y_t^k = \texttt{[MASK]}} \log \pi_\theta\big(y^k \mid y_t, x\big)\right] dt.$$

Naive double Monte Carlo estimation suffers from severe variance: the variance contributed by sampling the mask fraction $t$ dominates, and many network forward passes per sequence are required for stable estimates (Rojas et al., 9 Oct 2025).
Semi-deterministic Monte Carlo (SDMC) / Quadrature Scheme: ESPO and its extensions (e.g., Group Diffusion Policy Optimization, GDPO) employ deterministic quadrature for the integral over $t$ and minimal Monte Carlo sampling over masking patterns, so that

$$\hat{\mathcal{B}}_\theta(y \mid x) = \sum_{j=1}^{N} w_j\, \hat{\ell}_\theta(y, t_j, x), \qquad \hat{\ell}_\theta(y, t_j, x) = \frac{1}{t_j}\sum_{k:\, y_{t_j}^k = \texttt{[MASK]}} \log \pi_\theta\big(y^k \mid y_{t_j}, x\big),$$

with only a few quadrature points (at most $3$) and a single inner mask sample per point, leading to dramatically lower estimator variance without additional compute.
Variance Decomposition and Theoretical Guarantees: The total mean squared error of the SDMC estimator decomposes into a Monte Carlo variance term scaling as $O(1/(NM))$ in the number of quadrature points $N$ and mask samples $M$ per point, plus a quadrature bias term that decays rapidly with $N$ (midpoint/trapezoidal: $O(N^{-2})$; Simpson: $O(N^{-4})$). This ensures practical, low-variance estimates with few network evaluations (Rojas et al., 9 Oct 2025).
| Estimator | Variance Term | Bias Term |
|---|---|---|
| Riemann (generic) | $O(1/(NM))$ | $O(N^{-1})$ |
| Smooth/Quadrature (midpoint, trapezoidal, Simpson) | $O(1/(NM))$ | $O(N^{-2})$ to $O(N^{-4})$ |
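The semi-deterministic scheme above can be sketched as follows (PyTorch-style, reusing the hypothetical `model` interface from the earlier sketch). The equal-weight interior grid over $t$ is an assumption chosen for simplicity; the papers analyze midpoint, trapezoidal, and Simpson rules.

```python
import torch

def sdmc_elbo(model, prompt_ids, y_ids, mask_id, n_points=3):
    """Semi-deterministic Monte Carlo ELBO estimate: deterministic nodes over
    the mask fraction t, one random mask sample per node."""
    L = y_ids.shape[-1]
    # Interior nodes avoid the singular 1/t weight at t = 0.
    ts = torch.linspace(0.0, 1.0, n_points + 2)[1:-1]
    weights = torch.full((n_points,), 1.0 / n_points)       # equal quadrature weights
    est = torch.zeros(())
    for t, w in zip(ts, weights):
        mask = torch.rand(L) < t                            # single mask sample at node t
        noisy_y = torch.where(mask, torch.full_like(y_ids, mask_id), y_ids)
        logp = torch.log_softmax(model(prompt_ids, noisy_y), dim=-1)
        token_lp = logp.gather(-1, y_ids.unsqueeze(-1)).squeeze(-1)
        est = est + w * (token_lp * mask).sum() / t
    return est
```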
4. Robust KL Regularization
A KL regularizer penalizes deviation of $\pi_\theta$ from a reference policy $\pi_{\text{ref}}$ to maintain conservative policy updates. While exponential-based KL estimators (e.g., $r - 1 - \log r$ with $r = \exp\big(\mathcal{B}_{\text{ref}}(y \mid x) - \mathcal{B}_\theta(y \mid x)\big)$) can be unstable due to $\exp(\cdot)$ effects for long sequences, ESPO instead applies the quadratic estimator:

$$\hat{D}_{\mathrm{KL}}\big(\pi_\theta \,\|\, \pi_{\text{ref}}\big) = \tfrac{1}{2}\big(\mathcal{B}_\theta(y \mid x) - \mathcal{B}_{\text{ref}}(y \mid x)\big)^2.$$

This provides unbiased gradients for the KL divergence, is free of exponentials, and remains robust for extended outputs (Ou et al., 3 Dec 2025).
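A sketch of the quadratic penalty with ELBOs substituted for log-likelihoods (whether any additional length normalization is applied is left open here):

```python
import torch

def quadratic_kl(elbo_theta, elbo_ref):
    """k2-style quadratic KL surrogate on ELBO gaps: exponential-free,
    so it stays finite even for very long outputs."""
    return 0.5 * ((elbo_theta - elbo_ref) ** 2).mean()
```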
5. Algorithmic Workflow and Practical Implementation
A typical ESPO (or GDPO) RL iteration proceeds as follows:
- For each prompt $x$, sample $G$ completions $\{y_i\}_{i=1}^{G}$ from the behavior policy $\pi_{\theta_{\text{old}}}$ via a diffusion sampler.
- Compute rewards $r_i = r(x, y_i)$ and centered advantages $A_i = r_i - \frac{1}{G}\sum_{j=1}^{G} r_j$.
- For each $y_i$, estimate $\mathcal{B}_\theta(y_i \mid x)$ and $\mathcal{B}_{\theta_{\text{old}}}(y_i \mid x)$ using SDMC quadrature.
- Compute length-normalized, clipped importance ratios and the robust quadratic KL divergence.
- Formulate the surrogate loss as the mean of clipped, importance-weighted advantages, add the KL penalty, and take a gradient step; a schematic sketch of this loop follows the hyperparameter list below.

Key training hyperparameters include the group size $G$ (e.g., $6$–$16$), the number of quadrature points $N$ (up to $3$), the number of Monte Carlo masks per quadrature point (up to $2$), the GRPO clipping parameter $\epsilon$, and the KL penalty weight $\beta$.
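Putting the pieces together, the sketch below outlines one such iteration for a single prompt, reusing the hypothetical helpers from the earlier sketches (`sdmc_elbo`, `espo_surrogate`, `quadratic_kl`); the `sampler` and `reward_fn` callables are placeholders, not the papers' implementation.

```python
import torch

def espo_step(model, old_model, ref_model, sampler, reward_fn,
              prompt_ids, optimizer, mask_id, G=8, eps=0.2, beta=0.01):
    """One schematic ESPO/GDPO iteration for a single prompt (illustrative)."""
    # 1. Sample G completions from the behavior policy's diffusion sampler.
    ys = [sampler(old_model, prompt_ids) for _ in range(G)]
    lengths = torch.tensor([float(y.shape[-1]) for y in ys])

    # 2. Rewards and centered, group-relative advantages.
    rewards = torch.tensor([float(reward_fn(prompt_ids, y)) for y in ys])
    advantages = rewards - rewards.mean()

    # 3. SDMC ELBO estimates under the current, behavior, and reference policies.
    b_new = torch.stack([sdmc_elbo(model, prompt_ids, y, mask_id) for y in ys])
    with torch.no_grad():
        b_old = torch.stack([sdmc_elbo(old_model, prompt_ids, y, mask_id) for y in ys])
        b_ref = torch.stack([sdmc_elbo(ref_model, prompt_ids, y, mask_id) for y in ys])

    # 4.-5. Clipped surrogate plus quadratic KL penalty, then one gradient step.
    loss = espo_surrogate(b_new, b_old, advantages, lengths, eps) \
           + beta * quadratic_kl(b_new, b_ref)
    optimizer.zero_grad()
    loss.backward()
    optimizer.step()
    return loss.item()
```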
6. Empirical Results and Benchmark Performance
On mathematical reasoning, planning, and code-generation tasks, ESPO demonstrates consistent, often dramatic, improvements over token-level RL baselines and one-step unmasking methods (diffu-GRPO).
| Method | GSM8K | MATH | Countdown | Sudoku | HumanEval-avg | MBPP-avg |
|---|---|---|---|---|---|---|
| Base | 75.9 | 37.0 | 18.7 | 15.7 | 37.8 | 37.8 |
| +d1 (GRPO) | 78.0 | 37.7 | 33.9 | 22.2 | 37.2 | 36.5 |
| +wd1 | 80.1 | 36.9 | 48.3 | 23.1 | 37.2 | 36.5 |
| +ESPO | 82.0 | 39.5 | 81.0 | 86.0 | 40.1 | 45.4 |
| Δ vs base | +6.1 | +2.5 | +62.3 | +70.3 | +2.3 | +7.6 |
These results show especially large gains in planning (Countdown and Sudoku: +62.3 and +70.3 points, respectively) and consistent improvements in math and coding, with evaluation extending to long generation lengths, confirming scalability (Ou et al., 3 Dec 2025, Rojas et al., 9 Oct 2025). Sequence-level optimization with the ELBO proxy consistently outperforms token-level proxies (mean-field or token-ELBO).
7. Limitations, Extensions, and Future Directions
- The practical efficacy of ESPO is closely tied to the tightness of the ELBO as a likelihood surrogate; a plausible implication is that large variance or bias in the ELBO estimate could disrupt optimization for some distributions, though no such breakdown was observed empirically.
- While ESPO’s convergence is empirically robust, no formal convergence guarantees are established beyond those of standard PPO.
- Future directions include integrating learned value functions, extending to multimodal diffusion LLMs, and exploring cost reductions via distillation or adaptive masking.
- For tasks demanding long-range, sequence-level coherence (e.g., planning, full-program synthesis), ESPO’s atomic sequence-level optimization provides a distinct practical advantage over token-level denoising RL methods.
This framework establishes sequence-level RL based on ELBO surrogates as a new paradigm for RL in diffusion LLMs, combining theoretical justification, stable optimization, and large empirical gains across challenging domains (Ou et al., 3 Dec 2025, Rojas et al., 9 Oct 2025).