Effective scaling of RLVR

Determine effective methods and principles for scaling Reinforcement Learning with Verifiable Rewards (RLVR) to improve the reasoning capabilities of large language models, identifying which scaling axes and training designs yield reliable performance gains.

Background

Reinforcement Learning with Verifiable Rewards (RLVR) has recently driven progress in LLM reasoning, but the community lacks clear guidance on how to scale it effectively. Prior work such as ProRL scaled training by increasing the number of training steps but hit performance plateaus, suggesting that adding steps alone may be insufficient.

The cited paper proposes BroRL, which scales the number of rollouts per prompt, and supports the approach with both theoretical and empirical evidence. Nonetheless, its authors explicitly identify the general question of how best to scale RLVR as open.
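
As a rough intuition for why the rollouts-per-prompt axis can matter, the sketch below computes the chance that a batch of rollouts for one prompt contains at least one verified-correct answer. It is not the paper's analysis. It assumes rollouts are independent and share a fixed per-rollout success probability, and the function name, the probability 0.01, and the rollout counts are all hypothetical values chosen for illustration.

```python
def prob_at_least_one_correct(p: float, n: int) -> float:
    """Chance that n independent rollouts include at least one correct answer.

    p is an assumed per-rollout success probability (hypothetical);
    n is the number of rollouts sampled for a single prompt.
    """
    if not 0.0 <= p <= 1.0:
        raise ValueError("p must lie in [0, 1]")
    if n < 0:
        raise ValueError("n must be non-negative")
    # All n rollouts fail with probability (1 - p) ** n.
    return 1.0 - (1.0 - p) ** n


if __name__ == "__main__":
    # Illustrative rollout counts for a hard prompt with assumed p = 0.01.
    for n in (16, 64, 512):
        print(n, round(prob_at_least_one_correct(0.01, n), 3))
```

Under these assumptions, a prompt that the policy rarely solves almost never produces a positive reward signal with few rollouts, but is much more likely to do so with many. Real rollouts are neither independent nor identically distributed, so this gives intuition only.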

References

Yet, how to effectively scale the RLVR paradigm remains an open question.

— BroRL: Scaling Reinforcement Learning via Broadened Exploration (2510.01180 - Hu et al., 1 Oct 2025) in Section 1 (Introduction)
