Applicability of correctness-based bucketing to open-ended tasks

Determine whether the correctness-based bucketing strategy in ExGRPO, which partitions experiences by online rollout correctness under Reinforcement Learning with Verifiable Rewards (RLVR), can be applied or adapted to open-ended tasks such as creative writing, where rewards are subjective and dense. Specify any modifications needed to handle non-verifiable reward signals.

Background

ExGRPO prioritizes experiences using a correctness-based bucketing strategy that relies on verifiable outcome rewards, making it naturally suited for tasks like mathematical and general reasoning where answers can be automatically checked.
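
As a rough illustration of what bucketing by rollout correctness can look like, the sketch below groups questions by the fraction of their rollouts that a verifier marked correct. It is not the paper's implementation: the function names, the equal-width bins, and the default of four buckets are assumptions made for this example only.

```python
from collections import defaultdict


def correctness_rate(rewards):
    """Fraction of rollouts whose binary verifiable reward equals 1."""
    if not rewards:
        raise ValueError("need at least one rollout")
    return sum(1 for r in rewards if r == 1) / len(rewards)


def bucket_by_correctness(rollout_rewards, num_buckets=4):
    """Group question ids into equal-width buckets of correctness rate.

    rollout_rewards maps a question id to its list of 0/1 rewards.
    Equal-width binning and num_buckets are illustrative assumptions,
    not details taken from ExGRPO.
    """
    buckets = defaultdict(list)
    for qid, rewards in rollout_rewards.items():
        rate = correctness_rate(rewards)
        # A rate of exactly 1.0 is clamped into the last bucket.
        idx = min(int(rate * num_buckets), num_buckets - 1)
        buckets[idx].append(qid)
    return dict(buckets)


example = {
    "q1": [1, 1, 1, 1],  # always solved
    "q2": [0, 0, 0, 0],  # never solved
    "q3": [1, 0, 0, 0],  # rarely solved
    "q4": [1, 1, 0, 0],  # solved half the time
}
result = bucket_by_correctness(example)
```

The sketch depends on each reward being exactly 0 or 1. A subjective, dense score has no such natural notion of "correct", which is the gap the open question points at.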

The paper evaluates ExGRPO on verifiable benchmarks and highlights that applying the same correctness-driven mechanism to open-ended tasks (e.g., creative writing) is non-trivial because such tasks typically involve subjective, dense rewards rather than binary verifiable outcomes.

The authors explicitly note that whether this strategy transfers to open-ended settings is unresolved, motivating investigation into adaptations that could make correctness-based bucketing applicable beyond verifiable reward regimes.

References

The applicability of our correctness-based bucketing strategy to more open-ended tasks (e.g., creative writing), where rewards are often subjective and dense, remains an open question.

ExGRPO: Learning to Reason from Experience (2510.02245 - Zhan et al., 2 Oct 2025) in Appendix, Section: Limitations