
ESPO: ELBO-based Sequence-level Policy Optimization

Updated 10 December 2025
  • The paper introduces ESPO, a reinforcement learning framework that treats entire sequence generation as a single decision using an ELBO surrogate to overcome token-level limitations.
  • It employs importance ratio normalization, PPO-style clipping, and robust quadratic KL regularization to ensure stable and effective policy updates.
  • Empirical results show significant gains in tasks like planning, math, and coding, highlighting ESPO’s scalability and practical advantages.

ELBO-based Sequence-level Policy Optimization (ESPO) is a reinforcement learning (RL) framework designed for fine-tuning diffusion LLMs (DLMs) by treating entire sequence generation as a single atomic decision, using the evidence lower bound (ELBO) as a tractable proxy for the intractable sequence-level likelihood. ESPO addresses key incompatibilities between RL algorithms designed for autoregressive LLMs and the non-autoregressive, iterative denoising characteristic of DLMs, enabling principled and stable policy optimization in domains such as mathematical reasoning, coding, and planning (Ou et al., 3 Dec 2025, Rojas et al., 9 Oct 2025).

1. Formal Objective and Theoretical Foundation

Standard RL policy gradients optimize

$$J(\pi_\theta) = \mathbb{E}_{x \sim D}\;\mathbb{E}_{y \sim \pi_\theta(\cdot\mid x)}\big[R(x, y)\big],$$

where $R(x, y)$ is a reward for completion $y$ given prompt $x$. Token-level RL objectives are directly applicable to autoregressive models due to their explicit token-conditional factorization. However, DLMs lack such a factorization: the sequence likelihood $\pi_\theta(y \mid x)$ is defined implicitly through diffusion-style denoising, making per-token RL inapplicable.

ESPO overcomes this by directly optimizing sequence-level rewards. The policy is updated via an off-policy group-relative (GRPO-style) surrogate, but with the complete output sequence $y$ treated as a single decision:

$$J_{\text{seq}}(\theta) = \mathbb{E}_{x,\,y^{(1:G)} \sim \pi_{\text{old}}} \left[ \frac{1}{G}\sum_{i=1}^G \min\!\left(\rho_{\text{seq}}(y^{(i)}) \cdot \hat{A}^{(i)},\; \operatorname{clip}\!\left(\rho_{\text{seq}}(y^{(i)}),\, 1-\epsilon,\, 1+\epsilon\right)\cdot \hat{A}^{(i)}\right) \right],$$

where the sequence-level importance ratio is

$$\rho_{\text{seq}}(y) = \frac{\pi_\theta(y\mid x)}{\pi_{\text{old}}(y\mid x)}.$$

Direct computation of $\pi_\theta(y \mid x)$ is intractable. Instead, ESPO uses the standard evidence lower bound (ELBO) in a $k$-masked form:

$$\mathcal{L}_\theta(y\mid x) = \mathbb{E}_{l \sim \mathrm{Unif}(1, \ldots, L)}\, \mathbb{E}_{y_l\sim q_l(\cdot\mid y,x)}\left[ \frac{L}{l}\sum_{i=1}^L \mathbf{1}[y_l^i = M]\cdot \log p_\theta(y^i\mid y_l, x)\right],$$

with the guarantee $\mathcal{L}_\theta(y\mid x) \leq \log \pi_\theta(y\mid x)$. Substituting $\exp\!\left(\mathcal{L}_\theta(y\mid x)\right)$ for $\pi_\theta(y\mid x)$ yields a practical, lower-bounding sequence-likelihood proxy.
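
As a concrete illustration, the following minimal Python sketch estimates the $k$-masked ELBO by Monte Carlo. It assumes a hypothetical denoiser interface `log_prob_fn(y, y_masked, x)` returning per-position log-probabilities $\log p_\theta(y^i \mid y_l, x)$; the function name, `MASK_ID`, and sample count are illustrative, not taken from the papers.

```python
import torch

MASK_ID = 0  # hypothetical id of the [MASK] token

def elbo_estimate(y, x, log_prob_fn, num_samples=4):
    """Monte Carlo estimate of the k-masked ELBO L_theta(y | x).

    y:           (L,) LongTensor of completion token ids
    log_prob_fn: callable(y, y_masked, x) -> (L,) tensor of log p_theta(y^i | y_masked, x)
                 (stand-in for the diffusion LM's denoiser; not a real API)
    """
    L = y.shape[0]
    total = torch.zeros(())
    for _ in range(num_samples):
        l = torch.randint(1, L + 1, (1,)).item()   # number of positions to mask
        idx = torch.randperm(L)[:l]                # uniformly random mask pattern
        y_masked = y.clone()
        y_masked[idx] = MASK_ID
        logp = log_prob_fn(y, y_masked, x)         # per-position log-probabilities
        total = total + (L / l) * logp[idx].sum()  # L/l reweighting from the ELBO
    return total / num_samples
```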

Sequence-level RL with ELBO surrogates thus restores formal consistency (avoiding token-level decompositions) and enables non-autoregressive policy optimization.

2. Importance-Ratio Normalization and Clipping

Because the difference $\mathcal{L}_\theta(y\mid x) - \mathcal{L}_{\theta_{\text{old}}}(y\mid x)$ grows with sequence length $L$, the naive sequence ratio

$$\rho_{\text{raw}}(y) = \exp\!\left(\mathcal{L}_\theta(y\mid x) - \mathcal{L}_{\theta_{\text{old}}}(y\mid x)\right)$$

can produce extreme values for long sequences. To remedy this, ESPO normalizes per-token:

$$\rho_{\text{seq}}(y) = \exp\!\left(\frac{1}{L}\left(\mathcal{L}_\theta(y\mid x) - \mathcal{L}_{\theta_{\text{old}}}(y\mid x)\right)\right).$$

This stabilization is critical for practical training. Further, PPO-style clipping is applied to limit $\rho_{\text{seq}}$ to $[1-\epsilon,\, 1+\epsilon]$ during the surrogate-loss calculation, preventing instability from rare large-ratio outliers.

Ablations report that per-token normalization is necessary: without it, ratios explode or vanish, undermining optimization (Ou et al., 3 Dec 2025).
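
A minimal sketch of the normalization and clipping step, assuming per-sequence ELBO estimates (e.g., from a routine like `elbo_estimate` above) and precomputed group-relative advantages; all names and shapes are illustrative:

```python
import torch

def clipped_sequence_surrogate(elbo_new, elbo_old, lengths, advantages, eps=0.2):
    """Length-normalized, clipped sequence-level surrogate for a group of G samples.

    elbo_new:   (G,) ELBO estimates under the current policy pi_theta (requires grad)
    elbo_old:   (G,) ELBO estimates under the behaviour policy pi_old
    lengths:    (G,) sequence lengths L
    advantages: (G,) centered advantages A_hat
    """
    # Per-token normalization keeps the ratio in a numerically sane range
    # even for long sequences; without it exp() can overflow or vanish.
    rho_seq = torch.exp((elbo_new - elbo_old.detach()) / lengths)
    unclipped = rho_seq * advantages
    clipped = torch.clamp(rho_seq, 1.0 - eps, 1.0 + eps) * advantages
    # Pessimistic (min) surrogate, returned as a loss to minimize.
    return -torch.mean(torch.min(unclipped, clipped))
```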

3. Efficient ELBO Estimation and Variance Reduction

Estimating $\mathcal{L}_\theta(y\mid x)$ for each sequence is computationally demanding. The ELBO is defined as an expectation over continuous mask fractions $t \in [0, 1]$ and random masking patterns $y_t \sim \pi_t(\cdot \mid y)$, with

$$\mathcal{L}_{\mathrm{ELBO}}(y\mid x) = \mathbb{E}_{t \sim U[0,1]}\, \mathbb{E}_{y_t \sim \pi_t(\cdot \mid y)} \left[ \frac{1}{t} \sum_{i=1}^L \mathbf{1}[y_t^i = M]\,\log \pi_\theta(y^i \mid y_t, x) \right].$$

Naive double Monte Carlo estimation suffers from severe variance: the variance from sampling the time variable $t$ dominates, requiring $NK \gg 100$ network forward passes per sequence for stable estimates (Rojas et al., 9 Oct 2025).

Semi-deterministic Monte Carlo (SDMC) / Quadrature Scheme: ESPO and its extensions (e.g., Group Diffusion Policy Optimization, GDPO) employ deterministic quadrature for the integral over $t$ and minimal Monte Carlo sampling over masking, so that

$$\int_0^1 g(t)\,dt \approx \sum_{n=1}^N w_n\, \frac{1}{K}\sum_{k=1}^K Z\big(t_n, y_{t_n}^{[k]}\big),$$

with $N = 2$ or $3$ quadrature points and $K = 1$ inner mask sample, leading to dramatically lower estimator variance without additional compute.
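
The quadrature scheme can be sketched as follows, again assuming the hypothetical `log_prob_fn` denoiser interface from the sketch in Section 1; the two-point composite midpoint rule ($t_1 = 0.25$, $t_2 = 0.75$, equal weights) is one possible choice of nodes:

```python
import torch

def sdmc_elbo(y, x, log_prob_fn, t_nodes=(0.25, 0.75), weights=(0.5, 0.5),
              K=1, mask_id=0):
    """Semi-deterministic Monte Carlo ELBO: deterministic quadrature over the
    mask fraction t, with K random mask patterns per node (K = 1 in practice)."""
    L = y.shape[0]
    total = torch.zeros(())
    for t, w in zip(t_nodes, weights):
        inner = torch.zeros(())
        for _ in range(K):
            mask = torch.rand(L) < t                   # mask each position w.p. t
            if not mask.any():                         # guard against an empty mask
                mask[torch.randint(0, L, (1,))] = True
            y_masked = y.clone()
            y_masked[mask] = mask_id
            logp = log_prob_fn(y, y_masked, x)
            inner = inner + logp[mask].sum() / t       # 1/t reweighting from the ELBO
        total = total + w * (inner / K)
    return total
```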

Variance Decomposition and Theoretical Guarantees: The total mean squared error for SDMC decomposes into a Monte Carlo variance term scaling as $O(1/(NK))$ and a quadrature bias term that rapidly decays with $N$ (midpoint/trapezoidal: $O(1/N^4)$; Simpson: $O(1/N^8)$). This ensures practical, low-variance estimates with few network evaluations (Rojas et al., 9 Oct 2025).

| Estimator | Variance Term | Bias² Term |
|---|---|---|
| Riemann (generic) | $O(1/(NK))$ | $O(1/N^2)$ |
| Smooth/Quadrature | $O(1/(N^2 K))$ | $O(1/N^4)$ |

4. Robust KL Regularization

A KL regularizer penalizes deviation of $\pi_\theta$ from a reference $\pi_{\text{ref}}$ to maintain conservative policy updates. While exponential-based KL estimators (e.g., $k_3$) can be unstable due to $\exp(\cdot)$ effects for long sequences, ESPO instead applies the quadratic $k_2$ estimator:

$$\widehat{\mathrm{KL}}_{k_2} = \mathbb{E}_{y \sim \pi_\theta}\left[\tfrac{1}{2}\big(\mathcal{L}_\theta(y\mid x) - \mathcal{L}_{\text{ref}}(y\mid x)\big)^2\right].$$

This provides unbiased gradients for $\mathrm{KL}(\pi_\theta \,\|\, \pi_{\text{ref}})$, is free of exponentials, and remains robust for extended outputs (Ou et al., 3 Dec 2025).
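
A sketch of the penalty, using ELBO estimates in place of exact sequence log-likelihoods (tensor shapes and names are illustrative):

```python
import torch

def kl_k2_penalty(elbo_theta, elbo_ref):
    """Quadratic k2 estimator of KL(pi_theta || pi_ref) over sequences drawn
    from pi_theta. Free of exp(), so it stays finite for long outputs.

    elbo_theta: (G,) ELBO estimates under the current policy (requires grad)
    elbo_ref:   (G,) ELBO estimates under the frozen reference policy
    """
    return 0.5 * (elbo_theta - elbo_ref.detach()).pow(2).mean()
```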

5. Algorithmic Workflow and Practical Implementation

A typical ESPO (or GDPO) RL iteration proceeds as follows:

  • For each prompt $x$, sample $G$ completions $\{y^{(i)}\}$ from the behavior policy via a diffusion sampler.
  • Compute rewards $R(x, y^{(i)})$ and centered advantages $\hat{A}^{(i)}$.
  • For each $y^{(i)}$, estimate $\mathcal{L}_\theta(y^{(i)}\mid x)$ and $\mathcal{L}_{\theta_{\text{old}}}(y^{(i)}\mid x)$ using SDMC quadrature.
  • Compute length-normalized, clipped importance ratios and the robust quadratic KL divergence.
  • Formulate the surrogate loss as the mean of clipped-importance-weighted advantages, add the KL penalty, and take a gradient step.

Key training hyperparameters include the group size $G$ (e.g., 6–16), the number of quadrature points ($N = 2$–$3$), Monte Carlo masks per $t$ ($K = 1$–$2$), the GRPO clipping parameter $\epsilon = 0.2$, and the KL penalty weight $\beta = 0.02$.
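
Putting the pieces together, one ESPO iteration could look like the following sketch; `sample_fn`, `reward_fn`, and `elbo_fn` are illustrative placeholders for the diffusion sampler, task reward, and SDMC ELBO estimator rather than APIs from the papers:

```python
import torch

def espo_iteration(prompts, sample_fn, reward_fn, elbo_fn, optimizer,
                   G=8, eps=0.2, beta=0.02):
    """One ESPO update over a batch of prompts.

    sample_fn(x, G)      -> list of G completions from the frozen behaviour policy
    reward_fn(x, y)      -> scalar task reward R(x, y)
    elbo_fn(y, x, which) -> ELBO estimate under 'new', 'old', or 'ref' parameters
    """
    losses = []
    for x in prompts:
        ys = sample_fn(x, G)
        rewards = torch.tensor([reward_fn(x, y) for y in ys], dtype=torch.float)
        adv = rewards - rewards.mean()                               # centered (group-relative) A_hat

        elbo_new = torch.stack([elbo_fn(y, x, "new") for y in ys])
        elbo_old = torch.stack([elbo_fn(y, x, "old") for y in ys]).detach()
        elbo_ref = torch.stack([elbo_fn(y, x, "ref") for y in ys]).detach()
        lengths = torch.tensor([len(y) for y in ys], dtype=torch.float)

        rho = torch.exp((elbo_new - elbo_old) / lengths)             # length-normalized ratio
        surrogate = torch.min(rho * adv,
                              torch.clamp(rho, 1 - eps, 1 + eps) * adv).mean()
        kl = 0.5 * ((elbo_new - elbo_ref) ** 2).mean()               # quadratic k2 penalty
        losses.append(-(surrogate - beta * kl))

    loss = torch.stack(losses).mean()
    optimizer.zero_grad()
    loss.backward()
    optimizer.step()
    return loss.item()
```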

6. Empirical Results and Benchmark Performance

On arithmetic, planning, and code-completion tasks, ESPO demonstrates consistent, often dramatic, improvements over token-level RL baselines and one-step unmasking methods (diffu-GRPO).

| Method | GSM8K ↑ | MATH ↑ | Countdown ↑ | Sudoku ↑ | HumanEval-avg ↑ | MBPP-avg ↑ |
|---|---|---|---|---|---|---|
| Base | 75.9 | 37.0 | 18.7 | 15.7 | 37.8 | 37.8 |
| +d1 (GRPO) | 78.0 | 37.7 | 33.9 | 22.2 | 37.2 | 36.5 |
| +wd1 | 80.1 | 36.9 | 48.3 | 23.1 | 37.2 | 36.5 |
| +ESPO | 82.0 | 39.5 | 81.0 | 86.0 | 40.1 | 45.4 |
| Δ vs base | +6.1 | +2.5 | +62.3 | +70.3 | +2.3 | +7.6 |

These results show especially large gains in planning (Countdown, Sudoku: +62 to +70 points) and consistent improvements in math and coding, with evaluation at sequence lengths up to $L = 512$, confirming scalability (Ou et al., 3 Dec 2025, Rojas et al., 9 Oct 2025). The sequence-level ELBO objective consistently outperforms token-level proxies (mean-field or token-ELBO).

7. Limitations, Extensions, and Future Directions

  • The practical efficacy of ESPO is closely tied to the tightness of the ELBO as a likelihood surrogate; a plausible implication is that variance or bias outliers in the surrogate could disrupt optimization for some distributions, though no such breakdown was observed empirically.
  • While ESPO’s convergence is empirically robust, no formal convergence guarantees are established beyond those of standard PPO.
  • Future directions include integrating learned value functions, extending to multimodal diffusion LLMs, and exploring cost reductions via distillation or adaptive masking.
  • For tasks demanding long-range, sequence-level coherence (e.g., planning, full-program synthesis), ESPO’s atomic sequence optimization provides a distinct practical advantage over denoising-token RL methods.

This framework establishes sequence-level RL based on ELBO surrogates as a new paradigm for RL in diffusion LLMs, combining theoretical justification, stable optimization, and large empirical gains across challenging domains (Ou et al., 3 Dec 2025, Rojas et al., 9 Oct 2025).
