This work, "What Makes a Reward Model a Good Teacher? An Optimization Perspective" (Razin et al., 19 Mar 2025 ), examines the properties of reward models (RMs) that contribute to effective Reinforcement Learning from Human Feedback (RLHF) beyond simple pairwise accuracy. It adopts an optimization-centric viewpoint, analyzing how RM characteristics influence the landscape of the RLHF objective function and, consequently, the efficiency of policy gradient optimization used to align LLMs (LMs).
Theoretical Framework: Reward Variance and Optimization Landscape
The standard RLHF objective optimized via policy gradient methods (like PPO) is typically formulated as:

$$\max_{\pi} \; \mathbb{E}_{x \sim \mathcal{D},\, y \sim \pi(\cdot \mid x)}\big[r_{\mathrm{RM}}(x, y)\big] \;-\; \beta\, \mathbb{E}_{x \sim \mathcal{D}}\big[\mathrm{KL}\big(\pi(\cdot \mid x)\,\|\,\pi_{\mathrm{ref}}(\cdot \mid x)\big)\big],$$

where $\pi$ is the policy (LM) being trained, $\pi_{\mathrm{ref}}$ is a reference policy, $r_{\mathrm{RM}}$ is the learned reward model, $\mathcal{D}$ is the prompt distribution, and $\beta$ controls the KL divergence penalty.
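As a concrete (if toy) illustration of this objective, the snippet below estimates it for a single prompt over a three-element output space; the vectors `pi`, `pi_ref`, `r_rm` and the coefficient `beta` are hard-coded stand-ins of our own, not anything from the paper:

```python
import numpy as np

# Toy setup: a single prompt x with a three-element output space.
pi = np.array([0.5, 0.3, 0.2])      # current policy pi(. | x)
pi_ref = np.array([0.4, 0.4, 0.2])  # reference policy pi_ref(. | x)
r_rm = np.array([1.0, 0.2, -0.5])   # reward model scores r_RM(x, y)
beta = 0.1                          # KL penalty coefficient

expected_reward = np.sum(pi * r_rm)        # E_{y ~ pi}[r_RM(x, y)]
kl = np.sum(pi * np.log(pi / pi_ref))      # KL(pi(.|x) || pi_ref(.|x))
objective = expected_reward - beta * kl
print(f"E[r_RM] = {expected_reward:.3f}, KL = {kl:.4f}, objective = {objective:.3f}")
```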
The paper introduces two key properties of the reward model $r_{\mathrm{RM}}$:
- Accuracy: Defined as the probability that the RM correctly ranks pairs of outputs compared to the ground truth preference (induced by the ground truth reward $r_{\mathrm{G}}$), i.e., $\Pr_{y, y'}\big[\operatorname{sign}\big(r_{\mathrm{RM}}(x, y) - r_{\mathrm{RM}}(x, y')\big) = \operatorname{sign}\big(r_{\mathrm{G}}(x, y) - r_{\mathrm{G}}(x, y')\big)\big]$. Accuracy only depends on the relative ordering induced by $r_{\mathrm{RM}}$.
- Reward Variance: Defined as the variance of the rewards assigned by $r_{\mathrm{RM}}$ to outputs sampled from the current policy $\pi$: $\operatorname{Var}_{y \sim \pi(\cdot \mid x)}\big[r_{\mathrm{RM}}(x, y)\big]$. This measures the degree to which the RM separates the rewards of outputs likely to be generated by the policy being optimized.
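To make the quantity concrete, here is a minimal sketch (toy distributions of our own choosing, not from the paper) that computes the on-policy reward variance over a small discrete output space and shows how two RMs with the exact same ranking can induce very different variances:

```python
import numpy as np

def reward_variance(pi: np.ndarray, rewards: np.ndarray) -> float:
    """Variance of r_RM(x, y) for y ~ pi(. | x) over a small discrete output space."""
    mean = np.sum(pi * rewards)
    return float(np.sum(pi * (rewards - mean) ** 2))

pi = np.array([0.5, 0.3, 0.2])        # current policy over three outputs
r_sharp = np.array([1.0, 0.2, -0.5])  # well-separated rewards -> high variance
r_flat = np.array([0.50, 0.49, 0.48]) # identical ranking, compressed rewards -> low variance

print(reward_variance(pi, r_sharp))   # ~0.35
print(reward_variance(pi, r_flat))    # ~6e-05
```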
The central theoretical result (Theorem 3.1) establishes a direct link between reward variance and the geometry of the optimization landscape. It proves that if the reward model induces low reward variance for the current policy $\pi$, then the RLHF objective exhibits a flat landscape around $\pi$, irrespective of the RM's accuracy.
A flat landscape implies that the gradient has a small norm. The policy gradient is approximately proportional to $\mathbb{E}_{y \sim \pi(\cdot \mid x)}\big[\big(r_{\mathrm{RM}}(x, y) - b\big)\,\nabla \log \pi(y \mid x)\big]$, where $b$ is a baseline, often related to the mean reward. When $\operatorname{Var}_{y \sim \pi(\cdot \mid x)}\big[r_{\mathrm{RM}}(x, y)\big]$ is low, the reward values for outputs sampled from $\pi$ are highly concentrated around their mean. This significantly diminishes the magnitude of the $\big(r_{\mathrm{RM}}(x, y) - b\big)$ term, causing the gradient norm to become small. The theorem further extends this to higher-order derivatives, suggesting that the flatness is persistent. Consequently, policy gradient updates become minimal, leading to extremely slow convergence and inefficient optimization. The time required to achieve a certain increase in expected reward is shown to scale inversely with (a power of) the reward variance.
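This mechanism is easy to see in a toy softmax policy over a handful of outputs, where the gradient of the expected reward with respect to the logits has exactly the REINFORCE-with-baseline form $\pi_k\,(r_k - \mathbb{E}[r])$. The sketch below (illustrative only; the reward vectors are hand-picked) shows the gradient norm shrinking by roughly two orders of magnitude when rewards are compressed but the ranking is unchanged:

```python
import numpy as np

def softmax(z: np.ndarray) -> np.ndarray:
    z = z - z.max()
    e = np.exp(z)
    return e / e.sum()

def expected_reward_grad(logits: np.ndarray, rewards: np.ndarray) -> np.ndarray:
    """Exact gradient of E_{y ~ softmax(logits)}[r(y)] w.r.t. the logits:
    grad_k = pi_k * (r_k - E[r]), i.e., REINFORCE with a mean-reward baseline."""
    pi = softmax(logits)
    baseline = np.sum(pi * rewards)
    return pi * (rewards - baseline)

logits = np.zeros(3)                      # uniform initial policy
r_high_var = np.array([1.0, 0.0, -1.0])   # well-separated rewards
r_low_var = np.array([0.51, 0.50, 0.49])  # same ranking, compressed rewards

print(np.linalg.norm(expected_reward_grad(logits, r_high_var)))  # ~0.47
print(np.linalg.norm(expected_reward_grad(logits, r_low_var)))   # ~0.0047 (100x smaller)
```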
The Disconnect Between Accuracy and Optimization Efficiency
A key insight is the relative independence of accuracy and reward variance. Accuracy pertains to the correctness of pairwise comparisons, while variance relates to the magnitude of reward differences for outputs generated by the current policy $\pi$.
Theorem 3.2 formalizes this by demonstrating that it is possible to construct two RMs, $r_{\mathrm{RM}}$ and $r'_{\mathrm{RM}}$, such that:
- $r_{\mathrm{RM}}$ is perfectly accurate (100% agreement with the ground truth reward $r_{\mathrm{G}}$) but induces arbitrarily low reward variance.
- $r'_{\mathrm{RM}}$ has significantly lower accuracy than $r_{\mathrm{RM}}$ but induces substantially higher reward variance.
According to Theorem 3.1, optimizing with the perfectly accurate $r_{\mathrm{RM}}$ would be extremely slow due to the flat landscape caused by low variance. Conversely, optimizing with the less accurate $r'_{\mathrm{RM}}$ could lead to much faster initial progress in maximizing the ground truth reward $r_{\mathrm{G}}$, simply because the optimization process itself is more efficient due to the steeper gradients afforded by higher variance.
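A toy numerical construction in the spirit of Theorem 3.2 (illustrative only; the paper's formal construction differs) is sketched below: `r_accurate` ranks all outputs perfectly but with a tiny spread, while `r_noisy` flips one pair yet spreads rewards widely, so it scores lower on pairwise accuracy but far higher on reward variance:

```python
import numpy as np
from itertools import combinations

def pairwise_accuracy(r_model: np.ndarray, r_true: np.ndarray) -> float:
    """Fraction of output pairs whose ordering under r_model matches r_true."""
    pairs = list(combinations(range(len(r_true)), 2))
    agree = sum(np.sign(r_model[i] - r_model[j]) == np.sign(r_true[i] - r_true[j])
                for i, j in pairs)
    return agree / len(pairs)

def reward_variance(pi: np.ndarray, r: np.ndarray) -> float:
    mean = np.sum(pi * r)
    return float(np.sum(pi * (r - mean) ** 2))

pi = np.full(4, 0.25)                           # current policy over four outputs
r_true = np.array([3.0, 2.0, 1.0, 0.0])         # ground truth reward
r_accurate = np.array([0.03, 0.02, 0.01, 0.0])  # perfect ranking, tiny spread
r_noisy = np.array([2.0, 3.0, 0.0, -3.0])       # one pair flipped, large spread

for name, r in [("accurate/low-var", r_accurate), ("noisy/high-var", r_noisy)]:
    print(name, pairwise_accuracy(r, r_true), reward_variance(pi, r))
# accurate/low-var: accuracy 1.00, variance ~0.000125
# noisy/high-var:   accuracy ~0.83, variance 5.25
```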
This theoretical result provides a formal explanation for empirical observations where deploying RMs with higher accuracy (on benchmark datasets) does not necessarily lead to superior performance of the final LM after RLHF within a fixed training budget. An RM might achieve high accuracy by learning subtle distinctions correctly but fail to assign sufficiently distinct rewards to the types of outputs the policy actually generates, thus failing to provide a strong gradient signal. Conversely, a less accurate RM might provide a clearer, albeit potentially slightly misaligned, gradient that enables faster learning. This highlights a fundamental limitation of evaluating RMs solely based on accuracy metrics.
Policy-Dependence and Contextual RM Evaluation
Theorem 3.3 underscores another critical aspect: reward variance is inherently policy-dependent. Since $\operatorname{Var}_{y \sim \pi(\cdot \mid x)}\big[r_{\mathrm{RM}}(x, y)\big]$ is calculated based on samples from the policy $\pi$, the same reward model can induce different levels of variance when paired with different policies (LMs).
Specifically, an RM $r_{\mathrm{RM}}$ might induce high variance for an initial policy $\pi_1$, leading to efficient optimization. However, the same $r_{\mathrm{RM}}$ might induce low variance for a different initial policy $\pi_2$ if $\pi_2$ concentrates its probability mass on a region of the output space where $r_{\mathrm{RM}}$ assigns very similar rewards. In the latter case, $r_{\mathrm{RM}}$ would be a poor "teacher" for $\pi_2$, resulting in slow optimization, despite potentially being effective for $\pi_1$.
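The sketch below makes this concrete with hand-picked numbers (not from the paper): a single reward vector `r_rm` yields sizable variance under a policy that spreads mass over all outputs, and near-zero variance under a policy concentrated on outputs with nearly identical rewards:

```python
import numpy as np

def reward_variance(pi: np.ndarray, r: np.ndarray) -> float:
    mean = np.sum(pi * r)
    return float(np.sum(pi * (r - mean) ** 2))

r_rm = np.array([1.0, 0.9, 0.95, -2.0])     # nearly flat on the first three outputs

pi_1 = np.array([0.25, 0.25, 0.25, 0.25])   # spreads mass over all outputs
pi_2 = np.array([0.34, 0.33, 0.33, 0.00])   # mass concentrated where r_RM is nearly flat

print(reward_variance(pi_1, r_rm))  # ~1.6  -> informative gradient signal
print(reward_variance(pi_2, r_rm))  # ~0.001 -> essentially flat landscape for this policy
```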
This finding challenges the notion of evaluating RMs in isolation or ranking them universally based on static benchmarks. The effectiveness of an RM appears strongly coupled with the specific LM it is intended to guide. An RM's utility is contextual and depends on its interaction with the policy's output distribution during training. Evaluating RMs "on-policy" (i.e., using outputs generated by the actual LM being trained) is therefore more indicative of their potential effectiveness than "off-policy" evaluation on fixed datasets.
Experimental Corroboration
The paper presents experiments using Pythia and Llama-3.2 models (up to 8B parameters) on datasets such as UltraFeedback and AlpacaFarm, employing policy gradient methods (RLOO and GRPO, variants related to PPO). Key empirical results supporting the theory include:
- Variance Predicts Optimization Rate: A strong positive correlation was observed between the reward variance induced by an RM for the initial policy (i.e., $\operatorname{Var}_{y \sim \pi_{\mathrm{ref}}(\cdot \mid x)}\big[r_{\mathrm{RM}}(x, y)\big]$, where $\pi_{\mathrm{ref}}$ is the initial LM) and the rate of increase in both the proxy reward and, more importantly, the ground truth reward during training.
- Accuracy is Insufficient: An RM engineered to be perfectly accurate but have low variance resulted in significantly slower improvement in ground truth reward compared to less accurate RMs that induced higher variance. This directly demonstrates that maximizing accuracy alone does not guarantee efficient optimization towards the true objective.
- Proxy RM Can Outperform Ground Truth: Perhaps counter-intuitively, experiments showed scenarios, particularly in the early phases of training, where using a proxy RM led to a faster increase in the ground truth reward than using the ground truth reward itself for optimization. This occurred when the proxy RM induced higher variance than the ground truth reward function for the current policy, thereby facilitating more rapid optimization steps, even if the direction was imperfectly aligned.
- Policy-Dependence Confirmed: The relative performance of different RMs (in terms of final ground truth reward achieved) varied depending on the initial LM used for fine-tuning, confirming the policy-dependent nature of RM effectiveness predicted by Theorem 3.3.
- On-Policy Metrics: Evaluations using on-policy metrics (accuracy and variance computed using samples from the training policy $\pi$) showed better correlation with final performance compared to standard off-policy metrics.
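For intuition about what such on-policy evaluation involves, here is a rough sketch (the helper `on_policy_metrics`, `sample_fn`, and the toy reward vectors are stand-ins of our own, not the paper's implementation) that estimates accuracy and reward variance from outputs sampled from the training policy:

```python
import numpy as np

rng = np.random.default_rng(0)

def on_policy_metrics(sample_fn, r_rm, r_true, n_pairs=1000):
    """Estimate on-policy accuracy and reward variance from sampled outputs.
    sample_fn() draws an output index from the current training policy;
    r_rm / r_true map an output index to the RM / ground truth reward."""
    ys = np.array([sample_fn() for _ in range(2 * n_pairs)])
    rm_scores = r_rm[ys]
    diffs = [(rm_scores[2 * i] - rm_scores[2 * i + 1],
              r_true[ys[2 * i]] - r_true[ys[2 * i + 1]]) for i in range(n_pairs)]
    diffs = [(d_rm, d_true) for d_rm, d_true in diffs if d_true != 0]  # drop ties
    accuracy = np.mean([np.sign(d_rm) == np.sign(d_true) for d_rm, d_true in diffs])
    return accuracy, rm_scores.var()

pi = np.array([0.6, 0.3, 0.1])      # training policy over three outputs
r_rm = np.array([0.2, 0.8, -0.1])   # RM scores
r_true = np.array([1.0, 2.0, 0.0])  # ground truth rewards
acc, var = on_policy_metrics(lambda: rng.choice(len(pi), p=pi), r_rm, r_true)
print(f"on-policy accuracy ~ {acc:.2f}, on-policy reward variance ~ {var:.3f}")
```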
Implications for RLHF Practice
The findings carry significant implications for the practical application of RLHF:
- Reward Model Evaluation: Relying solely on accuracy benchmarks (like RewardBench) is insufficient and potentially misleading. Evaluation protocols should incorporate metrics sensitive to the optimization landscape, such as reward variance. Crucially, these metrics should ideally be computed on-policy or relative to the target LM distribution to reflect the actual training dynamics.
- Reward Model Training: Standard RM training objectives focus primarily on maximizing accuracy (e.g., via pairwise logistic loss). The results suggest that incorporating objectives that explicitly encourage higher variance or larger reward margins for outputs likely under the policy distribution might be beneficial. This could involve modifying loss functions or sampling strategies during RM training (see the sketch after this list).
- Monitoring Training Dynamics: Low reward variance can serve as a diagnostic indicator for slow convergence or plateaus during RLHF. Monitoring $\operatorname{Var}_{y \sim \pi(\cdot \mid x)}\big[r_{\mathrm{RM}}(x, y)\big]$ over the course of training could provide valuable insights into optimization bottlenecks.
- Algorithm Selection: The importance of variance is particularly pronounced for policy gradient methods. For methods like Best-of-N sampling, which rely purely on ranking, accuracy remains the primary determinant of RM quality. This highlights that the definition of a "good" RM may depend on the specific alignment algorithm being used.
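Returning to the reward model training point above, one hypothetical way to act on it (a sketch of our own, not a loss proposed in the paper) is to add a term to the standard Bradley-Terry pairwise objective that rewards spread among RM scores of on-policy samples; `variance_aware_rm_loss` and its weight `gamma` are purely illustrative:

```python
import torch
import torch.nn.functional as F

def variance_aware_rm_loss(r_chosen, r_rejected, r_policy_samples, gamma=0.1):
    """Hypothetical sketch: Bradley-Terry pairwise loss plus a term encouraging
    spread among RM scores of outputs sampled from the current policy.
    r_chosen / r_rejected: RM scores for preferred / dispreferred outputs, shape [B].
    r_policy_samples: RM scores for on-policy samples, shape [B, K]."""
    pairwise = -F.logsigmoid(r_chosen - r_rejected).mean()       # accuracy-oriented term
    spread = r_policy_samples.var(dim=-1, unbiased=False).mean() # on-policy reward variance
    return pairwise - gamma * spread                             # encourage higher variance

# Toy usage with random scores standing in for RM outputs.
torch.manual_seed(0)
loss = variance_aware_rm_loss(torch.randn(8), torch.randn(8), torch.randn(8, 4))
print(loss.item())
```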
Conclusion
In conclusion, this paper provides a theoretical and empirical basis for understanding that reward model effectiveness in RLHF extends beyond pairwise accuracy. Reward variance, measuring the separation of rewards for policy-relevant outputs, plays a critical role in shaping the optimization landscape. Low reward variance, irrespective of accuracy, leads to flat landscapes and slow policy gradient optimization. Furthermore, the policy-dependent nature of variance implies that RM evaluation and selection should consider the specific LM being trained. These insights advocate for a shift in RM evaluation towards optimization-aware and context-dependent metrics.