- The paper demonstrates that implicit reward models accurately fit training data but fail to generalize under token-level distribution shifts.
- It employs rigorous theoretical analysis and controlled experiments to reveal how reliance on token-level cues undermines semantic generalization.
- The findings stress the importance of using explicit reward models for robust language model alignment and stable reinforcement learning.
Generalization Failures of Implicit Reward Models in LLM Alignment
This paper provides a comprehensive theoretical and empirical analysis of the generalization properties of implicit reward models (IM-RMs) versus explicit reward models (EX-RMs) in the context of large language model (LLM) alignment. The central claim is that, despite their architectural similarity and identical training data, IM-RMs exhibit a pronounced generalization gap compared to EX-RMs, particularly under token-level distribution shifts. The authors attribute this gap to a fundamental difference in how these models rely on token-level cues versus semantic representations.
Explicit vs. Implicit Reward Models
The distinction between EX-RMs and IM-RMs is operational rather than architectural. Both are initialized from the same LM and trained on the same preference data using the same loss (typically Bradley-Terry). The difference lies in reward computation:
- EX-RM: Applies a learned linear head to the hidden representation of the prompt-response pair.
- IM-RM: Defines the reward as the (scaled) log-likelihood ratio of the response under the current LM versus a reference LM.
This difference, though seemingly minor, leads to substantial divergence in generalization behavior.
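To make the operational difference concrete, here is a minimal sketch of the two reward computations in PyTorch. The interface is assumed, not taken from the paper: `ex_rm_reward`, `im_rm_reward`, `reward_head`, `beta`, and the tensor shapes are illustrative.

```python
import torch
import torch.nn.functional as F

def ex_rm_reward(hidden_last_token: torch.Tensor, reward_head: torch.nn.Linear) -> torch.Tensor:
    """EX-RM: a learned linear head applied to the hidden representation of the
    prompt-response pair (here, the last token's last-layer state, shape (batch, dim))."""
    return reward_head(hidden_last_token).squeeze(-1)  # (batch,)

def im_rm_reward(logits: torch.Tensor, ref_logits: torch.Tensor,
                 response_ids: torch.Tensor, response_mask: torch.Tensor,
                 beta: float = 0.1) -> torch.Tensor:
    """IM-RM: beta-scaled log-likelihood ratio of the response under the trained LM
    versus a frozen reference LM (DPO-style implicit reward).

    Assumes `logits` and `ref_logits` (batch, seq, vocab) are already shifted so that
    position t scores `response_ids[:, t]`, and `response_mask` is a float mask that
    is 1 on response tokens and 0 on prompt/padding tokens."""
    logp = F.log_softmax(logits, dim=-1).gather(-1, response_ids.unsqueeze(-1)).squeeze(-1)
    ref_logp = F.log_softmax(ref_logits, dim=-1).gather(-1, response_ids.unsqueeze(-1)).squeeze(-1)
    return beta * ((logp - ref_logp) * response_mask).sum(dim=-1)  # (batch,)
```

The only learned object that differs is the reward head for the EX-RM; the IM-RM reads its reward off the LM's own token probabilities, which is exactly where sensitivity to surface tokens enters.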
Theoretical Analysis
The authors rigorously analyze the learning dynamics of both reward model types under gradient-based training, assuming fixed hidden representations. For EX-RMs, reward updates depend solely on the similarity of hidden representations, which are known to encode semantic information. In contrast, IM-RMs' reward updates are sensitive to the specific tokens in the response, not just their semantic content. The analysis shows that, for IM-RMs, increasing the reward for a response may not increase (and can even decrease) the reward for a semantically similar response with different surface tokens.
A key theoretical result is that IM-RMs can fit the training data perfectly but fail to generalize to responses with unseen tokens, achieving only chance-level accuracy on such examples. EX-RMs, by contrast, can generalize to unseen tokens if the hidden representations are well-structured, as is typical in modern LMs.
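In symbols (notation mine, reconstructed from the definitions above rather than copied from the paper): both parameterizations minimize the same Bradley-Terry loss and differ only in how the reward is computed.

```latex
\mathcal{L}(\theta) = -\,\mathbb{E}_{(x,\,y^{+},\,y^{-})}\!\left[\log \sigma\!\left(r_{\theta}(x, y^{+}) - r_{\theta}(x, y^{-})\right)\right],
\qquad
r_{\mathrm{EX}}(x, y) = \mathbf{w}^{\top}\mathbf{h}(x, y),
\qquad
r_{\mathrm{IM}}(x, y) = \beta \log \frac{\pi_{\theta}(y \mid x)}{\pi_{\mathrm{ref}}(y \mid x)}.
```

Under the fixed-representation assumption, a gradient step on the EX-RM head w for a single preference pair changes the reward of any response y' in proportion to the inner product of h(x, y') with h(x, y+) - h(x, y-), so responses with similar hidden representations move together; no analogous guarantee holds for the IM-RM, whose update depends on which tokens y' contains.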
Empirical Results
The empirical section substantiates the theoretical claims with controlled and real-world experiments:
- Controlled Setting: On a synthetic Hamiltonian cycle verification task, IM-RMs learn to verify correct responses without being able to generate them, refuting the hypothesis that the IM-RM generalization gap stems from a generation-verification tradeoff.
- Token-Level Shift: When trained on a set of responses and evaluated on paraphrased versions, EX-RMs maintain perfect accuracy, while IM-RMs' accuracy drops to near zero.
- Real-World Datasets: Across UltraFeedback, RewardBench, and RewardMATH, IM-RMs consistently underperform EX-RMs under token-level shifts (paraphrasing, translation), but perform comparably or better under domain shifts (e.g., from general chat to code/math).
Notably, EX-RMs also induce a higher reward margin, which is beneficial for downstream reinforcement learning optimization.
Strong Claims and Contradictions
- IM-RMs' Generalization Gap Is Not Due to Generation Difficulty: The paper provides both theoretical and empirical evidence that IM-RMs do not need to learn to generate correct responses to act as effective verifiers.
- IM-RMs Are Inherently Sensitive to Token-Level Cues: The analysis and experiments demonstrate that IM-RMs' reliance on token-level statistics, rather than semantic representations, is the root cause of their brittleness to surface-level shifts.
- Minor Design Choices Have Major Impact: The results highlight that the choice between EX-RM and IM-RM, though operationally subtle, has significant consequences for generalization and robustness.
Practical Implications
For practitioners designing reward models for LM alignment, the findings have immediate consequences:
- Prefer EX-RMs for Robustness: When robustness to paraphrasing, translation, or other token-level shifts is required, EX-RMs are preferable due to their reliance on semantic representations.
- IM-RMs May Be Acceptable for Domain Shifts: In scenarios where the primary concern is domain shift (e.g., from general chat to code), IM-RMs may perform comparably or even better.
- Reward Margin Matters: EX-RMs' higher reward margin can facilitate more stable and effective reinforcement learning, since a small margin (low reward variance across responses) can produce vanishing policy gradients.
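For reference, the reward margin in question is the expected reward gap on preference pairs (notation mine):

```latex
\mathrm{margin} = \mathbb{E}_{(x,\,y^{+},\,y^{-})}\left[r(x, y^{+}) - r(x, y^{-})\right].
```

When this gap is small, the reward signal driving downstream policy-gradient updates is correspondingly weak.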
Implementation Considerations
- Training: Both EX-RMs and IM-RMs can be implemented using standard LM architectures with minimal code changes. For EX-RMs, add a linear reward head; for IM-RMs, use the (scaled) log-likelihood ratio of the response under the trained LM versus the reference LM as the reward (a minimal training and evaluation sketch follows this list).
- Evaluation: To assess generalization, include paraphrased or translated responses in the test set, not just in-domain or domain-shifted data (see the pairwise-accuracy helper in the sketch below).
- Scaling: The observed trends hold across model sizes (1B–8B parameters) and across multiple LM families (Llama, Qwen, Gemma, Pythia).
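A minimal, self-contained training and evaluation sketch under the assumptions above, reusing the reward functions from the earlier snippet; the Bradley-Terry loss and pairwise-accuracy metric are standard, but the surrounding names are illustrative.

```python
import torch
import torch.nn.functional as F

def bradley_terry_loss(chosen_rewards: torch.Tensor, rejected_rewards: torch.Tensor) -> torch.Tensor:
    """Preference loss shared by EX-RMs and IM-RMs: -log sigma(r_chosen - r_rejected)."""
    return -F.logsigmoid(chosen_rewards - rejected_rewards).mean()

@torch.no_grad()
def pairwise_accuracy(chosen_rewards: torch.Tensor, rejected_rewards: torch.Tensor) -> float:
    """Fraction of pairs where the chosen response outscores the rejected one.
    Computing this on paraphrased or translated test pairs probes token-level shift."""
    return (chosen_rewards > rejected_rewards).float().mean().item()

# Illustrative usage, assuming `reward_fn` maps a batch of (prompt, response) pairs to
# scalar rewards via either ex_rm_reward or im_rm_reward from the earlier sketch:
#   loss = bradley_terry_loss(reward_fn(prompts, chosen), reward_fn(prompts, rejected))
#   loss.backward(); optimizer.step(); optimizer.zero_grad()
#   acc = pairwise_accuracy(reward_fn(prompts, chosen_paraphrased),
#                           reward_fn(prompts, rejected_paraphrased))
```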
Limitations and Future Directions
The theoretical analysis assumes fixed hidden representations and, in some cases, single-token responses. While the empirical results suggest the conclusions extend to more realistic settings, further work could relax these assumptions. Additionally, the paper focuses on accuracy as the primary metric; future work could explore other reward model properties, such as calibration, reward hacking resistance, and impact on downstream RLHF.
Broader Implications and Future Developments
This work underscores the importance of understanding the implicit biases introduced by reward model parameterization. As LMs are increasingly deployed in safety-critical and user-facing applications, the robustness of reward models to distribution shifts becomes paramount. The findings suggest that even subtle design choices can have outsized effects on alignment outcomes. Future research may explore hybrid reward models, regularization strategies to mitigate token-level overfitting in IM-RMs, or new architectures that combine the strengths of both approaches.
In summary, this paper provides a rigorous and actionable analysis of why implicit reward models are less robust than explicit ones, offering clear guidance for both researchers and practitioners in the design and evaluation of reward models for LLM alignment.