What Makes a Reward Model a Good Teacher? An Optimization Perspective (2503.15477v1)

Published 19 Mar 2025 in cs.LG, cs.AI, cs.CL, and stat.ML

Abstract: The success of Reinforcement Learning from Human Feedback (RLHF) critically depends on the quality of the reward model. While this quality is primarily evaluated through accuracy, it remains unclear whether accuracy fully captures what makes a reward model an effective teacher. We address this question from an optimization perspective. First, we prove that regardless of how accurate a reward model is, if it induces low reward variance, then the RLHF objective suffers from a flat landscape. Consequently, even a perfectly accurate reward model can lead to extremely slow optimization, underperforming less accurate models that induce higher reward variance. We additionally show that a reward model that works well for one LLM can induce low reward variance, and thus a flat objective landscape, for another. These results establish a fundamental limitation of evaluating reward models solely based on accuracy or independently of the LLM they guide. Experiments using models of up to 8B parameters corroborate our theory, demonstrating the interplay between reward variance, accuracy, and reward maximization rate. Overall, our findings highlight that beyond accuracy, a reward model needs to induce sufficient variance for efficient optimization.

This work, "What Makes a Reward Model a Good Teacher? An Optimization Perspective" (Razin et al., 19 Mar 2025 ), examines the properties of reward models (RMs) that contribute to effective Reinforcement Learning from Human Feedback (RLHF) beyond simple pairwise accuracy. It adopts an optimization-centric viewpoint, analyzing how RM characteristics influence the landscape of the RLHF objective function and, consequently, the efficiency of policy gradient optimization used to align LLMs (LMs).

Theoretical Framework: Reward Variance and Optimization Landscape

The standard RLHF objective optimized via policy gradient methods (like PPO) is typically formulated as:

$$J(\theta) = \mathbb{E}_{x \sim \mathcal{D},\, y \sim \pi_\theta(\cdot \mid x)}\!\left[\hat{r}(y)\right] - \beta\, \mathrm{KL}\!\left(\pi_\theta(\cdot \mid x) \,\|\, \pi_{\mathrm{ref}}(\cdot \mid x)\right)$$

where $\pi_\theta$ is the policy (LM) being trained, $\pi_{\mathrm{ref}}$ is a reference policy, $\hat{r}$ is the learned reward model, $\mathcal{D}$ is the prompt distribution, and $\beta$ controls the strength of the KL divergence penalty.
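
As a minimal sketch of how this objective can be estimated in practice, the toy example below uses categorical stand-ins for $\pi_\theta$, $\pi_{\mathrm{ref}}$, and $\hat{r}$ over a single prompt (all values are assumptions for illustration, not from the paper): the expected reward is estimated by Monte Carlo sampling from $\pi_\theta$, and the KL penalty is computed in closed form.

```python
import numpy as np

rng = np.random.default_rng(0)

# Toy setup: a single prompt, 5 possible outputs, categorical policies.
# These stand-ins (pi_theta, pi_ref, reward_hat) are illustrative, not from the paper.
pi_theta   = np.array([0.40, 0.25, 0.15, 0.15, 0.05])  # current policy pi_theta(.|x)
pi_ref     = np.array([0.30, 0.30, 0.20, 0.10, 0.10])  # reference policy pi_ref(.|x)
reward_hat = np.array([1.0, 0.8, 0.5, 0.2, -0.1])      # learned reward r_hat(y) per output

beta = 0.1  # KL penalty coefficient

# Monte Carlo estimate of E_{y ~ pi_theta}[r_hat(y)]
samples = rng.choice(len(pi_theta), size=10_000, p=pi_theta)
expected_reward = reward_hat[samples].mean()

# Exact KL(pi_theta || pi_ref) for the categorical toy policies
kl = np.sum(pi_theta * np.log(pi_theta / pi_ref))

J = expected_reward - beta * kl
print(f"E[r_hat] ~ {expected_reward:.3f}, KL = {kl:.3f}, J(theta) ~ {J:.3f}")
```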

The paper introduces two key properties of the reward model $\hat{r}$:

  1. Accuracy: Defined as the probability that the RM correctly ranks pairs of outputs $(y, y')$ relative to the ground truth preference $r_{\mathrm{gt}}$, i.e., $P\big(\mathrm{sign}(\hat{r}(y) - \hat{r}(y')) = \mathrm{sign}(r_{\mathrm{gt}}(y) - r_{\mathrm{gt}}(y'))\big)$. Accuracy depends only on the relative ordering induced by $\hat{r}$.
  2. Reward Variance: Defined as the variance of the rewards assigned by $\hat{r}$ to outputs $y$ sampled from the current policy $\pi_\theta$: $\mathrm{Var}_{y \sim \pi_\theta}(\hat{r}(y))$. This measures the degree to which the RM separates the rewards of outputs the policy being optimized is likely to generate (a small sketch of both quantities follows this list).
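
To make the distinction concrete, here is a hedged sketch that estimates both quantities from samples; the helper names (`pairwise_accuracy`, `reward_variance`) and the toy ground-truth reward are illustrative assumptions, not the paper's code.

```python
import numpy as np

rng = np.random.default_rng(1)

def pairwise_accuracy(r_hat, r_gt, pairs):
    """Fraction of pairs (y, y') on which r_hat agrees with r_gt's ranking."""
    agree = [
        np.sign(r_hat[y] - r_hat[y2]) == np.sign(r_gt[y] - r_gt[y2])
        for y, y2 in pairs
    ]
    return float(np.mean(agree))

def reward_variance(r_hat, pi_theta, num_samples=10_000):
    """Var_{y ~ pi_theta}(r_hat(y)), estimated from policy samples."""
    ys = rng.choice(len(pi_theta), size=num_samples, p=pi_theta)
    return float(np.var(r_hat[ys]))

# Toy values (illustrative, not from the paper).
r_gt     = np.array([2.0, 1.5, 1.0, 0.5, 0.0])        # ground-truth reward
r_hat    = np.array([0.20, 0.19, 0.18, 0.17, 0.16])   # accurate but nearly flat RM
pi_theta = np.array([0.40, 0.25, 0.15, 0.15, 0.05])   # current policy

pairs = [(i, j) for i in range(5) for j in range(i + 1, 5)]
print("accuracy:", pairwise_accuracy(r_hat, r_gt, pairs))   # 1.0: perfect ranking
print("variance:", reward_variance(r_hat, pi_theta))        # tiny: near-flat rewards
```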

The central theoretical result (Theorem 3.1) establishes a direct link between reward variance and the geometry of the optimization landscape. It proves that if the reward model $\hat{r}$ induces low reward variance $\mathrm{Var}_{y \sim \pi_\theta}(\hat{r}(y))$ for the current policy $\pi_\theta$, then the RLHF objective $J(\theta)$ exhibits a flat landscape around $\theta$, irrespective of the RM's accuracy.

A flat landscape implies that the gradient $\nabla J(\theta)$ has a small norm. The policy gradient is approximately proportional to $\mathbb{E}_{y \sim \pi_\theta}\big[(\hat{r}(y) - b)\,\nabla_\theta \log \pi_\theta(y)\big]$, where $b$ is a baseline, often related to the mean reward. When $\mathrm{Var}_{y \sim \pi_\theta}(\hat{r}(y))$ is low, the reward values $\hat{r}(y)$ for outputs $y$ sampled from $\pi_\theta$ are highly concentrated around their mean. This significantly diminishes the magnitude of the $(\hat{r}(y) - b)$ term, causing the gradient norm $\|\nabla J(\theta)\|$ to become small. The theorem further extends this to higher-order derivatives, suggesting that the flatness is persistent. Consequently, policy gradient updates become minimal, leading to extremely slow convergence and inefficient optimization. The time required to achieve a given increase in expected reward is shown to scale inversely with (a power of) the reward variance.
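
The shrinking gradient can be seen concretely in a toy softmax policy, for which the exact gradient of the expected-reward term with respect to logit $k$ is $\pi_\theta(k)\,(\hat{r}(k) - \mathbb{E}[\hat{r}])$. The sketch below is illustrative only; the logits and reward values are made-up assumptions, and rescaling the spread of $\hat{r}$ around its mean (i.e., lowering reward variance) shrinks the gradient norm in lockstep.

```python
import numpy as np

def softmax(z):
    z = z - z.max()
    e = np.exp(z)
    return e / e.sum()

def policy_grad_norm(logits, r_hat):
    """Exact gradient of E_{y ~ pi_theta}[r_hat(y)] w.r.t. softmax logits.
    For a categorical softmax policy: d/d_logit_k = pi_k * (r_hat_k - E[r_hat])."""
    pi = softmax(logits)
    baseline = pi @ r_hat
    return np.linalg.norm(pi * (r_hat - baseline))

logits = np.array([1.0, 0.5, 0.0, -0.5, -1.0])   # toy policy (hypothetical)
r_base = np.array([1.0, 0.8, 0.5, 0.2, -0.1])    # toy reward model (hypothetical)

# Shrinking the spread of r_hat (lower reward variance) shrinks the gradient norm.
for scale in [1.0, 0.1, 0.01]:
    r_scaled = r_base.mean() + scale * (r_base - r_base.mean())
    pi = softmax(logits)
    var = pi @ (r_scaled - pi @ r_scaled) ** 2
    print(f"scale={scale:5.2f}  Var={var:.2e}  ||grad||={policy_grad_norm(logits, r_scaled):.2e}")
```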

The Disconnect Between Accuracy and Optimization Efficiency

A key insight is the relative independence of accuracy and reward variance. Accuracy pertains to the correctness of pairwise comparisons, while variance relates to the magnitude of reward differences among outputs generated by the current policy $\pi_\theta$.

Theorem 3.2 formalizes this by demonstrating that it is possible to construct two RMs, $\hat{r}_1$ and $\hat{r}_2$, such that:

  • $\hat{r}_1$ is perfectly accurate ($100\%$ agreement with $r_{\mathrm{gt}}$) but induces arbitrarily low reward variance.
  • $\hat{r}_2$ has significantly lower accuracy than $\hat{r}_1$ but induces substantially higher reward variance.

According to Theorem 3.1, optimizing with the perfectly accurate $\hat{r}_1$ would be extremely slow due to the flat landscape caused by low variance. Conversely, optimizing with the less accurate $\hat{r}_2$ can lead to much faster initial progress in maximizing the ground truth reward $r_{\mathrm{gt}}$, simply because the optimization process itself is more efficient, thanks to the steeper gradients afforded by higher variance.
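
The flavor of this construction can be reproduced with toy numbers (the specific rewards below are illustrative assumptions, not the paper's construction): an RM that preserves the ground-truth ordering exactly but is nearly constant, versus an RM that misranks two pairs yet separates rewards strongly.

```python
import numpy as np

r_gt     = np.array([2.0, 1.5, 1.0, 0.5, 0.0])        # toy ground-truth ranking
pi_theta = np.array([0.40, 0.25, 0.15, 0.15, 0.05])   # toy current policy

# r1: perfectly accurate (same ordering as r_gt) but with vanishing spread.
r1 = np.array([1.0004, 1.0003, 1.0002, 1.0001, 1.0000])
# r2: swaps two pairs (lower accuracy) but separates rewards strongly.
r2 = np.array([1.5, 2.0, 0.0, 1.0, -1.0])

def accuracy(r_hat):
    pairs = [(i, j) for i in range(5) for j in range(i + 1, 5)]
    return np.mean([np.sign(r_hat[i] - r_hat[j]) == np.sign(r_gt[i] - r_gt[j])
                    for i, j in pairs])

def variance(r_hat):
    mean = pi_theta @ r_hat
    return pi_theta @ (r_hat - mean) ** 2

for name, r in [("r1 (accurate, flat)", r1), ("r2 (less accurate, spread)", r2)]:
    print(f"{name}: accuracy={accuracy(r):.2f}, variance={variance(r):.2e}")
```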

This theoretical result provides a formal explanation for empirical observations where deploying RMs with higher accuracy (on benchmark datasets) does not necessarily lead to superior performance of the final LM after RLHF within a fixed training budget. An RM might achieve high accuracy by learning subtle distinctions correctly but fail to assign sufficiently distinct rewards to the types of outputs the policy actually generates, thus failing to provide a strong gradient signal. Conversely, a less accurate RM might provide a clearer, albeit potentially slightly misaligned, gradient that enables faster learning. This highlights a fundamental limitation of evaluating RMs solely based on accuracy metrics.

Policy-Dependence and Contextual RM Evaluation

Theorem 3.3 underscores another critical aspect: reward variance is inherently policy-dependent. Since $\mathrm{Var}_{y \sim \pi_\theta}(\hat{r}(y))$ is computed over samples from the policy $\pi_\theta$, the same reward model $\hat{r}$ can induce different levels of variance when paired with different policies (LMs).

Specifically, an RM $\hat{r}$ might induce high variance for an initial policy $\pi_{\theta_1}$, leading to efficient optimization. However, the same $\hat{r}$ might induce low variance for a different initial policy $\pi_{\theta_2}$ if $\pi_{\theta_2}$ concentrates its probability mass on a region of the output space where $\hat{r}$ assigns very similar rewards. In the latter case, $\hat{r}$ would be a poor "teacher" for $\pi_{\theta_2}$, resulting in slow optimization, despite potentially being effective for $\pi_{\theta_1}$.
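
A small numerical illustration of this policy dependence, with made-up values: the same RM separates the outputs one policy spreads its mass over, but is nearly constant on the outputs a second policy concentrates on.

```python
import numpy as np

# One reward model, two initial policies (all values are illustrative assumptions).
r_hat = np.array([1.0, 0.9, 0.1, 0.05, 0.0])

# pi_1 spreads mass across outputs that r_hat separates well.
pi_1 = np.array([0.30, 0.20, 0.20, 0.15, 0.15])
# pi_2 concentrates on outputs 2-4, where r_hat is nearly constant.
pi_2 = np.array([0.00, 0.00, 0.40, 0.34, 0.26])

def induced_variance(pi, r):
    mean = pi @ r
    return pi @ (r - mean) ** 2

print("Var under pi_1:", induced_variance(pi_1, r_hat))  # relatively large
print("Var under pi_2:", induced_variance(pi_2, r_hat))  # near zero -> flat landscape
```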

This finding challenges the notion of evaluating RMs in isolation or ranking them universally based on static benchmarks. The effectiveness of an RM appears strongly coupled with the specific LM it is intended to guide. An RM's utility is contextual and depends on its interaction with the policy's output distribution during training. Evaluating RMs "on-policy" (i.e., using outputs generated by the actual LM being trained) is therefore more indicative of their potential effectiveness than "off-policy" evaluation on fixed datasets.

Experimental Corroboration

The paper presents experiments using Pythia and Llama-3.2 models (up to 8B parameters) on datasets such as UltraFeedback and AlpacaFarm, using policy gradient methods (RLOO/GRPO, variants related to PPO). Key empirical results supporting the theory include:

  • Variance Predicts Optimization Rate: A strong positive correlation was observed between the reward variance induced by an RM for the initial policy, $\mathrm{Var}_{y \sim \pi_{\mathrm{ref}}}(\hat{r}(y))$, and the rate of increase in both the proxy reward $\mathbb{E}[\hat{r}(y)]$ and, more importantly, the ground truth reward $\mathbb{E}[r_{\mathrm{gt}}(y)]$ during training.
  • Accuracy is Insufficient: An RM engineered to be perfectly accurate but have low variance resulted in significantly slower improvement in ground truth reward compared to less accurate RMs that induced higher variance. This directly demonstrates that maximizing accuracy alone does not guarantee efficient optimization towards the true objective.
  • Proxy RM Can Outperform Ground Truth: Perhaps counter-intuitively, experiments showed scenarios, particularly in the early phases of training, where using a proxy RM $\hat{r}$ led to a faster increase in the ground truth reward $\mathbb{E}[r_{\mathrm{gt}}(y)]$ than using the ground truth reward $r_{\mathrm{gt}}$ itself for optimization. This occurred when the proxy RM induced higher variance than the ground truth reward function for the current policy, thereby enabling more rapid optimization steps, even if the direction was imperfectly aligned.
  • Policy-Dependence Confirmed: The relative performance of different RMs (in terms of final ground truth reward achieved) varied depending on the initial LM used for fine-tuning, confirming the policy-dependent nature of RM effectiveness predicted by Theorem 3.3.
  • On-Policy Metrics: Evaluations using on-policy metrics (accuracy and variance computed from samples drawn from the training policy $\pi_\theta$) correlated better with final performance than standard off-policy metrics.

Implications for RLHF Practice

The findings carry significant implications for the practical application of RLHF:

  1. Reward Model Evaluation: Relying solely on accuracy benchmarks (like RewardBench) is insufficient and potentially misleading. Evaluation protocols should incorporate metrics sensitive to the optimization landscape, such as reward variance. Crucially, these metrics should ideally be computed on-policy or relative to the target LM distribution to reflect the actual training dynamics.
  2. Reward Model Training: Standard RM training objectives focus primarily on maximizing accuracy (e.g., via pairwise logistic loss). The results suggest that incorporating objectives that explicitly encourage higher variance or larger reward margins for outputs likely under the policy distribution might be beneficial. This could involve modifying loss functions or sampling strategies during RM training.
  3. Monitoring Training Dynamics: Low reward variance can serve as a diagnostic indicator for slow convergence or plateaus during RLHF. Monitoring $\mathrm{Var}_{y \sim \pi_\theta}(\hat{r}(y))$ over the course of training can provide valuable insight into optimization bottlenecks (see the sketch after this list).
  4. Algorithm Selection: The importance of variance is particularly pronounced for policy gradient methods. For methods like Best-of-N sampling, which rely purely on ranking, accuracy remains the primary determinant of RM quality. This highlights that the definition of a "good" RM may depend on the specific alignment algorithm being used.
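
As a practical diagnostic for point 3, reward variance can be tracked on-policy during training. The sketch below assumes hypothetical `sample_responses` and `reward_model` interfaces (not from any particular RLHF library) and simply flags prompts whose induced variance is near zero.

```python
import numpy as np

def monitor_reward_variance(policy, reward_model, prompts, sample_responses,
                            num_samples=8, low_variance_threshold=1e-3):
    """Estimate Var_{y ~ pi_theta}(r_hat(y)) per prompt and flag near-flat prompts.

    `policy`, `reward_model`, and `sample_responses` are hypothetical interfaces:
      sample_responses(policy, prompt, n) -> list of n sampled responses,
      reward_model(prompt, response)      -> scalar reward.
    """
    per_prompt_var = []
    for prompt in prompts:
        responses = sample_responses(policy, prompt, num_samples)
        rewards = np.array([reward_model(prompt, y) for y in responses])
        per_prompt_var.append(rewards.var())

    per_prompt_var = np.array(per_prompt_var)
    return {
        "mean_reward_variance": float(per_prompt_var.mean()),
        # A high fraction of near-zero-variance prompts signals a flat objective
        # and likely slow policy-gradient progress on those prompts.
        "flat_prompt_fraction": float((per_prompt_var < low_variance_threshold).mean()),
    }
```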

Conclusion

In conclusion, this paper provides a theoretical and empirical basis for understanding that reward model effectiveness in RLHF extends beyond pairwise accuracy. Reward variance, measuring the separation of rewards for policy-relevant outputs, plays a critical role in shaping the optimization landscape. Low reward variance, irrespective of accuracy, leads to flat landscapes and slow policy gradient optimization. Furthermore, the policy-dependent nature of variance implies that RM evaluation and selection should consider the specific LM being trained. These insights advocate for a shift in RM evaluation towards optimization-aware and context-dependent metrics.

Authors (6)
  1. Noam Razin (15 papers)
  2. Zixuan Wang (82 papers)
  3. Hubert Strauss (2 papers)
  4. Stanley Wei (4 papers)
  5. Jason D. Lee (151 papers)
  6. Sanjeev Arora (93 papers)