ELBO estimation for diffusion language models in on-policy RL
Develop efficient and accurate estimators of the Evidence Lower Bound (ELBO) for diffusion large language models that are suitable for on-policy reinforcement learning, minimizing variance while preserving correctness.
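To make the variance concern concrete, here is a minimal sketch of two Monte Carlo forms of the ELBO that are commonly used for masked diffusion language models. It assumes a linear masking schedule; neither form is taken from the survey cited below. The first samples a masking ratio t uniformly and weights the masked-token log-likelihood by 1/t. The second samples the number of masked tokens uniformly from 1 to L and weights by L divided by that count. The names `MASK_ID`, `uniform_log_probs`, `elbo_time_sampled`, and `elbo_count_sampled` are hypothetical, and the uniform "model" is a toy stand-in for a real dLLM.

```python
import numpy as np

MASK_ID = -1  # hypothetical id for the mask token (assumption, not from the source)


def uniform_log_probs(x_t, vocab_size):
    """Toy stand-in for a dLLM: uniform log-probabilities at every position."""
    return np.full((x_t.shape[0], vocab_size), -np.log(vocab_size))


def _masked_log_lik(x0, mask, log_prob_fn, vocab_size):
    """Sum of log p(x0[i] | x_t) over the masked positions i."""
    x_t = np.where(mask, MASK_ID, x0)
    log_probs = log_prob_fn(x_t, vocab_size)
    token_lp = log_probs[np.arange(x0.shape[0]), x0]
    return float((token_lp * mask).sum())


def elbo_time_sampled(x0, log_prob_fn, vocab_size, n_samples, rng):
    """Sample t ~ U(0, 1], mask each token independently with prob. t, weight by 1/t."""
    x0 = np.asarray(x0)
    total = 0.0
    for _ in range(n_samples):
        t = 1.0 - rng.uniform()  # lies in (0, 1], so the 1/t weight is finite
        mask = rng.uniform(size=x0.shape) < t
        total += _masked_log_lik(x0, mask, log_prob_fn, vocab_size) / t
    return total / n_samples


def elbo_count_sampled(x0, log_prob_fn, vocab_size, n_samples, rng):
    """Sample the masked count l ~ U{1..L}, mask exactly l tokens, weight by L/l."""
    x0 = np.asarray(x0)
    length = x0.shape[0]
    total = 0.0
    for _ in range(n_samples):
        n_masked = rng.integers(1, length + 1)  # upper bound is exclusive
        idx = rng.choice(length, size=n_masked, replace=False)
        mask = np.zeros(length, dtype=bool)
        mask[idx] = True
        total += _masked_log_lik(x0, mask, log_prob_fn, vocab_size) * length / n_masked
    return total / n_samples
```

With the toy uniform model the count-sampled form returns exactly -L log V for every sample, because the number of masked tokens cancels against the weight. The time-sampled form fluctuates, since the 1/t weight multiplies a random number of masked tokens. In on-policy RL this estimate would typically be recomputed for every freshly sampled completion, so the number of forward passes needed per estimate bears directly on training cost.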
References
"efficient and accurate ELBO estimation remains an open problem for on-policy learning."
Source: A Survey of Reinforcement Learning for Large Reasoning Models (arXiv:2509.08827, Zhang et al., 10 Sep 2025), Section 7.6, RL for Diffusion-based LLMs