ELBO estimation for diffusion language models in on-policy RL

Develop efficient and accurate estimators of the evidence lower bound (ELBO) for diffusion large language models that are suitable for on-policy reinforcement learning, minimizing estimator variance while preserving the correctness of the bound and of the resulting gradients.

Background

Applying RL to diffusion LLMs is challenging because these models optimize an ELBO on the data likelihood rather than an exact autoregressive (chain-rule) factorization, so the sequence log-probabilities needed for policy gradients must themselves be estimated. Existing estimators suffer from high variance and computational inefficiency. The authors explicitly flag robust ELBO estimation tailored to on-policy learning as an open problem.
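To make the estimation problem concrete, below is a minimal sketch, not a method from the cited survey, of a Monte-Carlo estimator of the masked-diffusion ELBO in the continuous-time absorbing-noise formulation used by several recent diffusion LLMs. The model interface `model(x_t)` returning per-token logits, the names `mask_id` and `num_mc_samples`, and the stratified sampling of the mask ratio `t` (one common variance-reduction choice) are all illustrative assumptions.

```python
# Sketch only: Monte-Carlo estimate of the masked-diffusion negative ELBO,
# assuming an absorbing ([MASK]) noise process and a model that predicts clean
# tokens from a partially masked sequence. Not the survey's proposed estimator.

import torch
import torch.nn.functional as F

def elbo_nll_estimate(model, x, mask_id, num_mc_samples=4, generator=None):
    """Estimate the per-sequence negative ELBO (an upper bound on NLL).

    x: LongTensor (batch, seq_len) of clean token ids.
    Returns: Tensor (batch,) of Monte-Carlo estimates.
    """
    batch, seq_len = x.shape
    estimates = []
    for k in range(num_mc_samples):
        # Stratified sampling of the mask ratio t in (0, 1]: one draw per
        # stratum lowers variance relative to plain uniform sampling of t.
        u = torch.rand(batch, device=x.device, generator=generator)
        t = ((k + u) / num_mc_samples).clamp_min(1e-3)  # (batch,)

        # Forward (noising) process: mask each token independently with prob t.
        is_masked = (
            torch.rand(batch, seq_len, device=x.device, generator=generator)
            < t[:, None]
        )
        x_t = torch.where(is_masked, torch.full_like(x, mask_id), x)

        # Reverse model predicts the clean tokens at the masked positions.
        logits = model(x_t)                                  # (batch, seq_len, vocab)
        token_nll = F.cross_entropy(
            logits.transpose(1, 2), x, reduction="none"
        )                                                    # (batch, seq_len)

        # Continuous-time masked-diffusion bound: 1/t weighting of masked positions.
        per_seq = (token_nll * is_masked).sum(dim=1) / t
        estimates.append(per_seq)

    return torch.stack(estimates, dim=0).mean(dim=0)
```

In an on-policy setting, an estimate like this would stand in for the sequence log-probability of a sampled completion inside the policy-gradient (or importance-ratio) objective, which is why its variance and bias directly affect training stability; how best to trade off the number of Monte-Carlo samples, the sampling scheme for `t`, and correctness is exactly the open problem described above.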

References

"efficient and accurate ELBO estimation remains an open problem for on-policy learning."

Zhang et al., "A Survey of Reinforcement Learning for Large Reasoning Models," arXiv:2509.08827, 10 Sep 2025, Section 7.6 (RL for Diffusion-based LLMs).