Applicability of correctness-based bucketing to open-ended tasks
Determine whether the correctness-based bucketing strategy in ExGRPO, which partitions experiences by their online rollout correctness under Reinforcement Learning with Verifiable Rewards (RLVR), can be applied or adapted to open-ended tasks such as creative writing, where rewards are subjective and dense. Specify any modifications needed to handle non-verifiable reward signals.
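To make the question concrete, the following is a minimal sketch of what bucketing by rollout correctness could look like, together with one possible adaptation for dense rewards. It is not the ExGRPO implementation. The function names, the number of buckets, the equal-width bucket boundaries, and the thresholding rule in `soft_success_rate` are all assumptions introduced here for illustration. Whether such a threshold preserves the useful properties of the original strategy is exactly what remains open.

```python
from collections import defaultdict
from typing import Dict, List, Sequence


def correctness_rate(rewards: Sequence[float]) -> float:
    """Fraction of rollouts judged correct, assuming binary rewards in {0, 1}."""
    if not rewards:
        raise ValueError("need at least one rollout")
    return sum(1 for r in rewards if r == 1) / len(rewards)


def soft_success_rate(rewards: Sequence[float], threshold: float) -> float:
    """Hypothetical adaptation for dense rewards: a rollout counts as a
    'success' when its scalar reward reaches `threshold` (an assumed knob)."""
    if not rewards:
        raise ValueError("need at least one rollout")
    return sum(1 for r in rewards if r >= threshold) / len(rewards)


def bucket_index(score: float, num_buckets: int) -> int:
    """Map a score in [0, 1] to one of `num_buckets` equal-width buckets."""
    if not 0.0 <= score <= 1.0:
        raise ValueError("score must lie in [0, 1]")
    # A score of exactly 1.0 is clamped into the last bucket.
    return min(int(score * num_buckets), num_buckets - 1)


def bucket_by_score(scores: Dict[str, float], num_buckets: int = 4) -> Dict[int, List[str]]:
    """Group prompt ids by the bucket that their score falls into."""
    buckets: Dict[int, List[str]] = defaultdict(list)
    for prompt_id, score in scores.items():
        buckets[bucket_index(score, num_buckets)].append(prompt_id)
    return dict(buckets)


# Verifiable setting: binary rewards from a checker, several rollouts per prompt.
verifiable = {"q1": [1, 1, 1, 1], "q2": [1, 0, 0, 0], "q3": [0, 0, 0, 0]}
verifiable_buckets = bucket_by_score(
    {pid: correctness_rate(rs) for pid, rs in verifiable.items()}
)

# Open-ended setting: dense scores from a (hypothetical) subjective reward model.
dense = {"story1": [0.9, 0.4, 0.7, 0.2]}
dense_buckets = bucket_by_score(
    {pid: soft_success_rate(rs, threshold=0.6) for pid, rs in dense.items()}
)
```

The thresholded variant discards the magnitude information that makes dense rewards dense, which is one reason the adaptation is not obviously sound. Alternatives such as bucketing on the mean or a quantile of the reward would need the same scrutiny.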
References
The applicability of our correctness-based bucketing strategy to more open-ended tasks (e.g., creative writing), where rewards are often subjective and dense, remains an open question.
— ExGRPO: Learning to Reason from Experience
(2510.02245 - Zhan et al., 2 Oct 2025) in Appendix, Section: Limitations