Effectiveness of Self-Proposed Rubric Rewards in Non-Verifiable Domains
Determine whether RLCER (reinforcement learning with chain-of-thought supervision via self-evolving rubrics) is effective for reasoning tasks in non-verifiable domains, where automatic final-answer verifiers are unavailable. Specifically, assess whether self-proposed rubric rewards provide a useful learning signal and yield performance improvements.
References
On the other hand, our method is still quite limited to the RLVR domain, leaving the effectiveness of rewarding with self-proposed rubrics on non-verifiable domains unknown.
— Reinforcing Chain-of-Thought Reasoning with Self-Evolving Rubrics
(2602.10885 - Sheng et al., 11 Feb 2026) in Section: Limitations