
Whether and how RLRR alleviates reward over-optimization

Determine whether and how reinforcement learning from rubrics-based reward (RLRR) mitigates reward over-optimization in large language model post-training, where reward over-optimization refers to a policy exploiting misspecified proxy rewards to achieve high scores while true output quality deteriorates.


Background

The paper studies rubric-based rewards as an approach to post-training LLMs and contrasts them with traditional Bradley–Terry preference-based reward models, which are known to suffer from reward over-optimization. In RLRR, each prompt is associated with explicit criteria and weights, and a verifier assesses whether a response satisfies each criterion, producing a weighted reward.
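To make the weighted rubric reward concrete, here is a minimal sketch in Python. It is an illustration under stated assumptions, not the paper's implementation: the names Criterion, rubric_reward, and verifier are hypothetical, and the verifier is abstracted as a function that returns a satisfaction score in [0, 1] for a given response and criterion.

    from dataclasses import dataclass
    from typing import Callable, List

    @dataclass
    class Criterion:
        description: str   # e.g. "answer cites at least one primary source" (hypothetical example)
        weight: float      # importance weight assigned to this criterion

    def rubric_reward(
        response: str,
        rubric: List[Criterion],
        verifier: Callable[[str, str], float],  # (response, criterion description) -> score in [0, 1]
    ) -> float:
        """Weighted sum of per-criterion verifier judgments, normalized by total weight."""
        total_weight = sum(c.weight for c in rubric)
        if total_weight == 0:
            return 0.0
        score = sum(c.weight * verifier(response, c.description) for c in rubric)
        return score / total_weight

The key design point this sketch highlights is that the reward is decomposed into explicit, per-criterion checks rather than a single scalar preference score, which is what distinguishes RLRR from Bradley–Terry reward modeling.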

Despite growing interest in RLRR, the authors note that it has not yet been established whether rubric-based rewards actually alleviate reward over-optimization in practice. This open question motivates their theoretical and empirical investigation into accuracy in the high-reward region and into methods for constructing rubrics.

References

However, it's still unclear if, and how, RLRR alleviates reward over-optimization.

Zhang et al., "Chasing the Tail: Effective Rubric-based Reward Modeling for Large Language Model Post-Training," arXiv:2509.21500, 25 Sep 2025, Section 2 (Preliminaries), paragraph "Reinforcement learning from rubrics-based reward".