- The paper demonstrates that slight misranking in the high-reward tail dramatically undermines win rate and expected reward in LLM post-training.
- The methodology involves iterative, LLM-driven rubric refinement using off-policy exemplars to effectively differentiate among high-quality responses.
- Empirical results show that rubrics based on great and diverse responses significantly delay reward over-optimization, yielding robust performance improvements.
Rubric-Based Reward Modeling for LLM Post-Training: Theory and Practice
Overview
This paper addresses the persistent challenge of reward over-optimization in reinforcement fine-tuning (RFT) of LLMs. The authors provide a theoretical and empirical analysis demonstrating that the primary driver of over-optimization is reward misspecification in the high-reward (tail) region, i.e., the inability of reward models to reliably distinguish between excellent and merely great responses. To mitigate this, the paper proposes a rubric-based reward modeling framework that leverages off-policy exemplars and iterative rubric refinement, focusing on discriminating among high-quality, diverse responses. The approach is shown to substantially delay and reduce reward over-optimization, yielding more robust post-training improvements.
Figure 1: Chasing the Tail with Rubric-Based Rewards. The schematic illustrates the focus on the high-reward tail and the iterative refinement process.
Theoretical Analysis: High-Reward Tail Dominance
The authors formalize the RFT objective as maximizing expected reward under a KL constraint from the base model, with the reward model r serving as a proxy for the inaccessible ground-truth r∗. They introduce a misspecification mapping f from r∗ to r and derive closed-form expressions for expected reward, win rate, and KL divergence under this mapping. The key theoretical result is that, for Pareto-optimal post-training procedures, the impact of reward misspecification on downstream performance is exponentially concentrated in the high-reward region. Specifically, errors in ranking or scoring the top responses dominate the degradation in win rate and expected reward, while the KL divergence remains invariant to f.
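In standard KL-constrained form, the setup above can be written as follows (a minimal sketch; the budget ε, the data distribution D, and the policy symbols π, π_base are assumed notation, not necessarily the paper's):

```latex
% KL-constrained RFT objective with the proxy reward r (notation assumed):
\max_{\pi} \; \mathbb{E}_{x \sim \mathcal{D},\; y \sim \pi(\cdot\mid x)}\big[ r(x, y) \big]
\quad \text{s.t.} \quad
\mathbb{E}_{x \sim \mathcal{D}}\Big[ \mathrm{KL}\big( \pi(\cdot\mid x) \,\|\, \pi_{\mathrm{base}}(\cdot\mid x) \big) \Big] \le \epsilon,

% with the proxy related to the ground truth r^{*} through a misspecification map f:
r(x, y) \;=\; f\big( r^{*}(x, y) \big).
```

Under this formulation, optimization concentrates probability mass on responses with the highest proxy reward, so whether f preserves or scrambles the ordering of the very best responses is what determines how much true reward the optimized policy captures.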

Figure 2: Win rate with reward misspecification. Performance collapses when the proxy reward is inaccurate in the high-value region, but is robust if the top responses are correctly ranked.
The analysis further shows that if the reward model correctly ranks responses within the top fraction (e.g., the top 10%), near-optimal performance is achievable. Conversely, misranking the high-reward tail leads to a rapid collapse in win rate as optimization pressure increases.
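This intuition can be checked with a toy best-of-n simulation (a hedged sketch, not the paper's experiment): responses receive a latent true reward, the proxy either preserves or scrambles the ordering inside the top tail, and we measure how often the proxy-selected response is also the true best.

```python
import numpy as np

rng = np.random.default_rng(0)

def winrate(n_candidates=64, n_trials=5000, scramble_tail=False, tail_frac=0.10):
    """Toy check: pick the argmax of a proxy reward and ask how often it also has
    the highest true reward. The proxy agrees with the truth everywhere except,
    optionally, inside the top `tail_frac` of responses, where it is random noise."""
    wins = 0
    for _ in range(n_trials):
        true_r = rng.normal(size=n_candidates)          # latent ground-truth rewards
        proxy_r = true_r.copy()
        cutoff = np.quantile(true_r, 1.0 - tail_frac)   # boundary of the high-reward tail
        tail = true_r >= cutoff
        if scramble_tail:
            # misspecification concentrated in the tail: noisy scores for the best responses
            proxy_r[tail] = rng.normal(size=tail.sum()) + cutoff
        chosen = int(np.argmax(proxy_r))                # optimization pressure = best-of-n selection
        wins += int(true_r[chosen] == true_r.max())
    return wins / n_trials

print("tail ranked correctly :", winrate(scramble_tail=False))  # 1.0
print("tail misranked        :", winrate(scramble_tail=True))   # collapses toward ~1/|tail| (≈0.15 here)
```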
Rubric-Based Reward Modeling: Principles and Workflow
Rubric-based reward modeling (RLRR) is proposed as a solution to the scarcity of high-reward training examples and the brittleness of standard reward models to off-policy data. In RLRR, each prompt is associated with a set of explicit, weighted binary criteria (the rubric), and a verifier LLM scores each response against these criteria. The total reward is the weighted average of satisfied criteria.
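As described, a rubric reward reduces to a weighted average of binary checks. A minimal sketch follows; the data structures and the `judge` callback are assumptions standing in for the verifier LLM, not the paper's interface:

```python
from dataclasses import dataclass
from typing import Callable, List

@dataclass
class Criterion:
    text: str      # explicit, checkable statement, e.g. "cites the correct dosage range"
    weight: float  # importance of this criterion within the rubric

def rubric_reward(response: str,
                  rubric: List[Criterion],
                  judge: Callable[[str, str], bool]) -> float:
    """Score a response as the weighted fraction of rubric criteria it satisfies.

    `judge(response, criterion_text)` plays the role of the verifier LLM and
    returns True if the response meets the criterion (a binary check)."""
    total_weight = sum(c.weight for c in rubric)
    satisfied = sum(c.weight for c in rubric if judge(response, c.text))
    return satisfied / total_weight if total_weight > 0 else 0.0
```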
The paper introduces two principles for effective rubric construction:
- Distinguish excellent from great responses: Rubrics must be refined to separate the very best responses from those that are merely good.
- Distinguish among diverse off-policy responses: Rubrics should be iteratively refined using a diverse pool of high-quality responses from multiple strong models.
The authors operationalize these principles via an iterative refinement workflow. At each step, the current rubric is used to score a pool of candidate responses; the top responses are compared, and a proposer LLM is prompted to identify distinguishing features, which are encoded as new or refined rubric criteria. This process is repeated, progressively sharpening the rubric's discriminative power in the high-reward tail.
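In outline, the refinement loop might look like the following sketch; `score` and `propose_criteria` are hypothetical stand-ins for the verifier-based rubric reward (e.g., the one sketched earlier) and the proposer-LLM call:

```python
def refine_rubric(rubric, candidates, score, propose_criteria, n_rounds=3, top_k=2):
    """Iteratively sharpen a rubric on its own high-reward tail (illustrative sketch).

    rubric           -- current list of weighted binary criteria
    candidates       -- diverse, high-quality off-policy responses from strong models
    score(r, rubric) -- rubric reward of response r under the current rubric
    propose_criteria(top_responses, rubric) -- proposer-LLM call returning new or
                                               refined criteria that separate the top responses
    """
    for _ in range(n_rounds):
        # 1. Score the candidate pool with the current rubric.
        ranked = sorted(candidates, key=lambda r: score(r, rubric), reverse=True)
        # 2. Take the top responses, which the current rubric cannot yet tell apart.
        top_responses = ranked[:top_k]
        # 3. Ask the proposer LLM for distinguishing features, encoded as criteria.
        new_criteria = propose_criteria(top_responses, rubric)
        # 4. Fold them into the rubric and repeat, sharpening the high-reward tail.
        rubric = rubric + new_criteria
    return rubric
```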

Figure 3: Single-round Improvement. A proposer LLM analyzes a pair of responses to identify distinguishing features and encodes them as new rubric criteria.
Empirical Evaluation
Experiments are conducted in both generalist and health domains, using Qwen3-8B-Base as the base model and GRPO as the RFT algorithm. Rubrics are constructed and refined using GPT-4.1 as the proposer and verifier. Candidate responses are drawn from a diverse set of strong LLMs, including Gemini 2.5 Pro and others.
Key findings: win rates and domain-specific metrics (e.g., HealthBench) improve monotonically with the quality and diversity of the responses used to refine the rubrics. Rubric-based rewards also show higher agreement with ground-truth judges in the high-reward region, directly linking rubric accuracy in the tail to downstream RL performance.
Qualitative Analysis of Rubric Refinements
Analysis of rubric refinement patterns reveals that great candidate responses drive more sophisticated rubric improvements, such as breaking complex criteria into finer-grained checks, tightening verification standards, and defining explicit scope or constraints. In contrast, merely good responses tend to drive only basic corrections or a broadening of existing criteria. This supports the claim that high-quality, diverse exemplars are essential for eliciting rubrics that can reliably distinguish the best outputs.
Implications and Future Directions
The findings have several important implications:
- Reward model design for LLM alignment should prioritize accuracy in the high-reward tail. Global reward model metrics are insufficient; targeted evaluation and refinement in the tail are critical (one illustrative tail-focused metric is sketched after this list).
- Rubric-based rewards offer a scalable, interpretable, and sample-efficient alternative to preference-based reward models, especially in domains where large-scale high-quality preference data is unavailable.
- Iterative, LLM-driven rubric refinement is effective for leveraging off-policy data without overfitting to superficial artifacts.
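One way to operationalize "targeted evaluation in the tail", sketched below under assumed inputs, is to measure pairwise ranking agreement between the proxy reward and a ground-truth judge restricted to the top slice of responses; this is an illustrative metric, not the paper's exact evaluation.

```python
from itertools import combinations

def tail_agreement(proxy_scores, true_scores, tail_frac=0.10):
    """Pairwise ranking agreement between proxy and ground-truth scores, restricted
    to the responses whose true score falls in the top `tail_frac` (the tail)."""
    n = len(true_scores)
    k = max(2, int(n * tail_frac))
    # indices of the top-k responses by ground truth: the high-reward tail
    tail = sorted(range(n), key=lambda i: true_scores[i], reverse=True)[:k]
    pairs = list(combinations(tail, 2))
    agree = sum(
        (proxy_scores[i] - proxy_scores[j]) * (true_scores[i] - true_scores[j]) > 0
        for i, j in pairs
    )
    return agree / len(pairs)
```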
Potential future directions include optimizing the aggregation of rubric scores (beyond simple weighted averages), automating the selection of candidate pools for refinement, and extending the approach to more open-ended or creative tasks.
Conclusion
This work provides a rigorous theoretical and empirical foundation for rubric-based reward modeling in LLM post-training. By focusing on the high-reward tail and leveraging iterative, LLM-driven rubric refinement with diverse, high-quality responses, the approach effectively mitigates reward over-optimization and yields robust alignment improvements. The results suggest that careful reward model design—centered on discriminating among the best outputs—is essential for scalable and reliable LLM alignment.