Next stage of scaling RL for LLMs

Determine the next stage of scaling reinforcement learning (RL) for large language models (LLMs), specifically assessing whether open-ended RL is a viable and effective direction for continued capability growth.

Background

The survey contrasts RLHF (reinforcement learning from human feedback) and DPO (direct preference optimization), which target alignment, with the emerging trend of RLVR (reinforcement learning with verifiable rewards), which has enabled substantial reasoning improvements. While RLVR has demonstrated success, the authors emphasize uncertainty about how to further scale RL for LLMs and point to open-ended RL as a potentially promising but challenging path. This problem seeks to clarify strategic directions for scaling beyond current practices.
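To make the RLHF-vs-RLVR contrast concrete, the sketch below shows the core idea of a verifiable reward: instead of a learned reward model scoring preferences, an exact automatic check scores correctness. The extraction heuristic (`verifiable_reward`, last-number matching) is a hypothetical illustration, not a method from the survey.

```python
import re

def verifiable_reward(model_output: str, ground_truth: str) -> float:
    """Binary reward from an automatic checker (the core idea of RLVR).

    Hypothetical heuristic: take the last number in the model's output
    as its final answer and compare it against a known ground truth.
    """
    numbers = re.findall(r"-?\d+(?:\.\d+)?", model_output)
    if not numbers:
        return 0.0  # no parseable answer -> no reward
    return 1.0 if numbers[-1] == ground_truth else 0.0

# Unlike RLHF, no learned reward model is needed: the check is exact.
print(verifiable_reward("Reasoning... so the answer is 42", "42"))  # 1.0
print(verifiable_reward("I think it's 41", "42"))                   # 0.0
```

Such checkers scale cheaply for domains with verifiable answers (math, code), which is precisely why the open question is how to scale RL beyond them.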

References

The next stage of scaling RL for LLMs remains an open question, with open-ended RL presenting a particularly challenging and promising direction.

A Survey of Reinforcement Learning for Large Reasoning Models (arXiv:2509.08827, Zhang et al., 10 Sep 2025), Figure rl_evol caption.