Why is Your Language Model a Poor Implicit Reward Model? (2507.07981v1)

Published 10 Jul 2025 in cs.CL, cs.AI, cs.LG, and stat.ML

Abstract: Reward models are key to LLM post-training and inference pipelines. Conveniently, recent work showed that every LLM defines an implicit reward model (IM-RM), without requiring any architectural changes. However, such IM-RMs tend to generalize worse, especially out-of-distribution, compared to explicit reward models (EX-RMs) that apply a dedicated linear head over the hidden representations of an LLM. The existence of a generalization gap is puzzling, as EX-RMs and IM-RMs are nearly identical. They can be trained using the same data, loss function, and LLM, and differ only in how the reward is computed. Towards a fundamental understanding of the implicit biases underlying different reward model types, we investigate the root cause of this gap. Our main finding, backed by theory and experiments, is that IM-RMs rely more heavily on superficial token-level cues. Consequently, they often generalize worse than EX-RMs under token-level distribution shifts, as well as in-distribution. Furthermore, we provide evidence against alternative hypotheses for the generalization gap. Most notably, we challenge the intuitive claim that IM-RMs struggle in tasks where generation is harder than verification because they can operate both as a verifier and a generator. Taken together, our results highlight that seemingly minor design choices can substantially impact the generalization behavior of reward models.

Summary

  • The paper demonstrates that implicit reward models (IM-RMs) underperform explicit reward models (EX-RMs) due to their reliance on token-level cues rather than semantic similarity.
  • It employs rigorous theoretical proofs and controlled experiments, revealing how IM-RMs fail to generalize to paraphrased and translated responses while EX-RMs maintain accuracy.
  • The findings highlight the importance of model design choices for robust LM alignment, guiding practitioners in selecting reward models for effective RL-based fine-tuning.

Analysis of "Why is Your Language Model a Poor Implicit Reward Model?" (2507.07981)

This paper provides a rigorous theoretical and empirical investigation into the generalization properties of two prevalent reward model types used in language model (LM) post-training: explicit reward models (EX-RMs) and implicit reward models (IM-RMs). The central claim is that, despite their architectural similarity, IM-RMs generalize substantially worse than EX-RMs, particularly under token-level distribution shifts. The authors attribute this gap to a fundamental difference in how these models leverage token-level cues versus semantic representations.

Problem Setting and Motivation

Reward models are essential for aligning LMs with human preferences, both in reinforcement learning from human feedback (RLHF) and in direct preference optimization (DPO) pipelines. Two main approaches exist:

  • EX-RMs: Apply a trainable linear head to the hidden representation of a prompt-response pair.
  • IM-RMs: Use the log-likelihood of the response under the LM (possibly normalized by a reference model) as the reward.

Both models can be trained on the same data and with the same loss, differing only in how the reward is computed. Prior empirical work has observed that IM-RMs tend to generalize worse, especially out-of-distribution, but the underlying cause was not well understood.
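To make the distinction concrete, here is a minimal PyTorch-style sketch of the two reward computations over the same underlying LM. It assumes a HuggingFace-style model interface (`.logits`, `.hidden_states`) and right-padded inputs; the function names, the `response_mask`, and the `beta` coefficient are illustrative assumptions, not the authors' code.

```python
import torch
import torch.nn.functional as F

def ex_rm_reward(lm, linear_head, input_ids, attention_mask):
    """EX-RM: a trainable linear head over the hidden state of the last token."""
    out = lm(input_ids=input_ids, attention_mask=attention_mask,
             output_hidden_states=True)
    last_hidden = out.hidden_states[-1]                      # (batch, seq, dim)
    last_idx = attention_mask.sum(dim=1) - 1                 # last non-pad position
    h = last_hidden[torch.arange(last_hidden.size(0)), last_idx]
    return linear_head(h).squeeze(-1)                        # one scalar per pair

def im_rm_reward(lm, ref_lm, input_ids, attention_mask, response_mask, beta=0.1):
    """IM-RM: beta * (response log-likelihood under lm minus under a frozen ref_lm)."""
    def response_logprob(model):
        logits = model(input_ids=input_ids, attention_mask=attention_mask).logits
        logp = F.log_softmax(logits[:, :-1], dim=-1)          # predict tokens 1..T-1
        tok_logp = logp.gather(-1, input_ids[:, 1:].unsqueeze(-1)).squeeze(-1)
        return (tok_logp * response_mask[:, 1:]).sum(dim=1)   # sum over response tokens
    return beta * (response_logprob(lm) - response_logprob(ref_lm))
```

Either reward can then be plugged into the same Bradley-Terry preference loss on (chosen, rejected) pairs, which is precisely why the observed generalization gap is surprising.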

Theoretical Contributions

The authors systematically analyze the learning dynamics of EX-RMs and IM-RMs, focusing on how a gradient update on a training example affects the reward assigned to unseen prompt-response pairs. Their key theoretical findings are:

  • EX-RMs: The reward update depends on the similarity of hidden representations. If the hidden representations encode semantic similarity, EX-RMs can generalize to paraphrases and other surface-level variations.
  • IM-RMs: The reward update is sensitive to the specific tokens in the response, not just their semantic content. This means that IM-RMs can fail to generalize to paraphrases or translations, even if the underlying meaning is preserved.

A formal result demonstrates that, under fixed hidden representations and single-token responses, IM-RMs cannot generalize to unseen tokens, while EX-RMs can generalize if the hidden representations are well-structured.
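Schematically (the notation below is ours, not reproduced from the paper), the two reward parameterizations and the first-order effect of a single Bradley-Terry gradient step on a training pair (x, y+, y-), with the hidden representations held fixed, can be written as:

```latex
% h(x,y): fixed hidden representation; \pi_\theta: trained LM; \pi_{ref}: frozen reference; \beta > 0.
r^{\mathrm{EX}}_{w}(x, y) = w^{\top} h(x, y),
\qquad
r^{\mathrm{IM}}_{\theta}(x, y)
  = \beta \log \frac{\pi_{\theta}(y \mid x)}{\pi_{\mathrm{ref}}(y \mid x)}
  = \beta \sum_{t=1}^{|y|} \log \frac{\pi_{\theta}(y_t \mid x, y_{<t})}{\pi_{\mathrm{ref}}(y_t \mid x, y_{<t})}

% One gradient step on the Bradley-Terry loss for (x, y^+, y^-) changes the
% EX-RM reward of an unseen pair (x', y') in proportion to representation similarity:
\Delta r^{\mathrm{EX}}(x', y')
  \;\propto\; \big\langle\, h(x', y'),\; h(x, y^{+}) - h(x, y^{-}) \,\big\rangle
```

The EX-RM update thus transfers through similarity of hidden representations, whereas the IM-RM update flows through per-token softmax probabilities, tying its effect on unseen pairs to the identity of the response tokens rather than their meaning.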

The paper also refutes the hypothesis that IM-RMs underperform because they must learn to generate correct responses (not just verify them). Through both theory and a Hamiltonian cycle verification experiment, the authors show that IM-RMs can act as verifiers without being able to generate correct responses.
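To illustrate the verification-generation asymmetry invoked here (the snippet below is our illustration, not the paper's experimental setup): verifying a candidate Hamiltonian cycle is a linear-time pass over the proposed vertex sequence, even though producing such a cycle is NP-hard in general.

```python
def is_hamiltonian_cycle(adj: dict[int, set[int]], cycle: list[int]) -> bool:
    """Check that `cycle` visits every vertex exactly once and that each
    consecutive pair (including last -> first) is an edge of the graph."""
    n = len(adj)
    if len(cycle) != n or set(cycle) != set(adj):
        return False                                  # must cover every vertex once
    return all(cycle[(i + 1) % n] in adj[cycle[i]] for i in range(n))

# A 4-cycle graph: 0-1-2-3-0
adj = {0: {1, 3}, 1: {0, 2}, 2: {1, 3}, 3: {0, 2}}
print(is_hamiltonian_cycle(adj, [0, 1, 2, 3]))        # True
print(is_hamiltonian_cycle(adj, [0, 2, 1, 3]))        # False: 0-2 is not an edge
```

The paper's point is that being able to run such a check does not require the model to be able to produce a valid cycle itself.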

Empirical Results

The empirical section is comprehensive, spanning both controlled and real-world settings:

  • Controlled Experiments: On a Persona dataset, EX-RMs achieve perfect accuracy on both original and paraphrased responses, while IM-RMs' accuracy drops to near zero on paraphrased responses, despite perfect in-distribution performance.
  • Real-World Experiments: Across multiple LMs (1B–8B scale) and datasets (UltraFeedback, RewardMATH, RewardBench), IM-RMs consistently underperform EX-RMs under token-level shifts (paraphrasing, translation), but perform comparably or better under domain shifts (e.g., from chat to code/math).
  • Reward Margin: EX-RMs induce a higher absolute reward margin, which is beneficial for downstream RL optimization.

The results are robust across model scales, datasets, and hyperparameters. The authors also show that alternative explanations (e.g., dependence on intermediate token representations or reference distributions) do not account for the observed generalization gap.
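For reference, the two evaluation quantities used in this section reduce to simple statistics over per-pair rewards. A minimal sketch, assuming rewards have already been computed for each (chosen, rejected) pair; the array names and the particular margin definition are illustrative:

```python
import numpy as np

def pairwise_accuracy(r_chosen: np.ndarray, r_rejected: np.ndarray) -> float:
    """Fraction of preference pairs where the chosen response receives the higher reward."""
    return float(np.mean(r_chosen > r_rejected))

def reward_margin(r_chosen: np.ndarray, r_rejected: np.ndarray) -> float:
    """Mean reward gap between chosen and rejected responses (one common definition)."""
    return float(np.mean(r_chosen - r_rejected))
```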

Implications

The findings have several important implications for the design and deployment of reward models in LM alignment:

  • Robustness to Surface Variation: For applications where robustness to paraphrasing, translation, or other token-level shifts is critical, EX-RMs are preferable.
  • Reward Model Selection: The choice between EX-RM and IM-RM should be informed by the expected distribution shifts at deployment time. IM-RMs may be acceptable or even advantageous under certain domain shifts, but are brittle to token-level changes.
  • Optimization Landscape: The higher reward margin of EX-RMs improves the optimization landscape for RL-based fine-tuning, potentially leading to more stable and effective policy updates.
  • Design Choices Matter: Seemingly minor architectural decisions (e.g., how the reward is computed) can have significant effects on generalization and robustness.

Future Directions

The paper suggests several avenues for further research:

  • Beyond Fixed Representations: While the theory assumes fixed hidden representations, empirical results indicate the conclusions hold when all parameters are trained. A deeper theoretical understanding in the fully end-to-end setting is warranted.
  • Other Reward Model Types: The analysis could be extended to generative reward models and models that provide feedback on intermediate steps.
  • Broader Evaluation Metrics: Accuracy is not the only relevant metric for reward model quality; future work should consider other criteria such as calibration, robustness to adversarial inputs, and impact on downstream RL performance.
  • Cases Favoring IM-RMs: Investigating scenarios where IM-RMs may outperform EX-RMs, particularly under domain shifts, could yield insights into their potential advantages.

Conclusion

This work provides a clear theoretical and empirical account of why IM-RMs are less robust than EX-RMs to token-level distribution shifts. The results underscore the importance of aligning reward model architecture with the intended deployment scenario and highlight the need for careful evaluation of generalization properties in reward modeling for LLM alignment. The analysis and methodology set a strong foundation for future research on the implicit biases and robustness of reward models in large-scale language systems.
