Learning from Rewards: A Comprehensive Review
This paper provides a systematic overview of an emerging paradigm for LLMs, learning from rewards, and emphasizes its significance in post-training and test-time scaling. The trajectory toward more capable agents has shifted from pre-training on massive static corpora to continually refining model behavior through feedback signals gathered during interaction. The authors categorize strategies under this paradigm across three stages of model development: training, inference, and post-inference. This delineation highlights the distinct methodologies used at each stage to align LLMs with human preferences and to strengthen their performance on tasks requiring deep reasoning.
Training with Rewards
Integrating reward signals has reshaped the training phase of LLMs: reinforcement learning methods such as Proximal Policy Optimization (PPO), along with RL-free preference optimization methods such as Direct Preference Optimization (DPO), are used to fine-tune model parameters against preference and reward signals (a minimal DPO sketch follows the list below). The paper outlines several reward sources:
- Human Feedback: Reinforcement learning from human feedback (RLHF) remains a cornerstone technique, training a reward model on human preference data to provide scalar rewards that align outputs with desired attributes such as helpfulness and harmlessness.
- Automated Feedback: This includes self-evaluation by LLMs and assessments from trained external models, reducing reliance on costly human annotations and improving scalability.
- Rule-based and Process Rewards: Models such as DeepSeek-R1 illustrate that verifiable, rule-based outcome rewards can elicit strong reasoning behavior, while process rewards score intermediate reasoning steps for finer-grained supervision, yielding notable gains on complex tasks.
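For concreteness, the following is a minimal sketch of the DPO objective mentioned above, assuming per-sequence log-probabilities under the policy and a frozen reference model have already been computed; the function and tensor names are illustrative, not taken from the paper.

```python
# Minimal DPO loss sketch (assumes summed per-sequence log-probs for the
# chosen/rejected responses under the policy and a frozen reference model).
import torch
import torch.nn.functional as F

def dpo_loss(policy_chosen_logps: torch.Tensor,
             policy_rejected_logps: torch.Tensor,
             ref_chosen_logps: torch.Tensor,
             ref_rejected_logps: torch.Tensor,
             beta: float = 0.1) -> torch.Tensor:
    """Direct Preference Optimization loss over a batch of preference pairs."""
    # Log-ratio of policy to reference for the preferred and dispreferred responses.
    chosen_logratio = policy_chosen_logps - ref_chosen_logps
    rejected_logratio = policy_rejected_logps - ref_rejected_logps
    # Implicit reward margin; beta controls how far the policy may drift from the reference.
    margin = beta * (chosen_logratio - rejected_logratio)
    # Maximize the probability that the chosen response is ranked above the rejected one.
    return -F.logsigmoid(margin).mean()
```

The key design choice DPO makes is to optimize this preference objective directly, avoiding the separate reward model and on-policy rollouts that PPO-style RLHF requires.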
Inference with Rewards
At the inference stage, the paper highlights strategies for adapting outputs without altering model parameters:
- Generate-then-Rank: The model samples multiple candidate responses and a reward model scores or ranks them, returning the best candidate; response quality scales with the additional inference compute spent on sampling (see the best-of-N sketch after this list).
- Reward-Guided Decoding: Reward signals are applied at the token or step level during generation to steer decoding toward higher-reward continuations, an approach that has proven especially useful for complex reasoning tasks.
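As a concrete illustration of the generate-then-rank strategy, here is a minimal best-of-N sketch; `generate` and `reward_model` are hypothetical callables standing in for a sampling-capable LLM and a trained scalar reward model, not APIs described in the paper.

```python
# Minimal generate-then-rank (best-of-N) selection at inference time.
from typing import Callable, List

def best_of_n(prompt: str,
              generate: Callable[[str], str],
              reward_model: Callable[[str, str], float],
              n: int = 8) -> str:
    """Sample n candidate responses and return the one the reward model scores highest."""
    candidates: List[str] = [generate(prompt) for _ in range(n)]
    scores = [reward_model(prompt, c) for c in candidates]
    # Pair each score with its candidate and keep the highest-scoring response.
    return max(zip(scores, candidates), key=lambda pair: pair[0])[1]
```

Reward-guided decoding pushes the same idea inside generation, scoring partial continuations at each token or step rather than only complete responses.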
Post-Inference with Rewards
The post-inference stage focuses on refining model outputs:
- Self-Correction: LLMs iteratively refine their outputs using self-generated critiques, improving accuracy and coherence without external supervision (a minimal refinement loop is sketched after this list).
- External Feedback Mechanisms: Critiques from more capable models, external knowledge bases, and tools drive iterative refinement, particularly in specialized domains requiring high factual reliability and reduced hallucination.
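To make the self-correction loop concrete, the sketch below shows one plausible critique-and-revise cycle; the `llm` callable, the prompt wording, and the stopping check are illustrative assumptions, not the paper's method.

```python
# Minimal post-inference self-correction loop: critique a draft, then revise it.
from typing import Callable

def self_correct(prompt: str,
                 llm: Callable[[str], str],
                 max_rounds: int = 3) -> str:
    """Iteratively refine an answer using the model's own critiques."""
    draft = llm(prompt)
    for _ in range(max_rounds):
        critique = llm(
            f"Critique this answer for factual or logical errors:\n"
            f"Question: {prompt}\nAnswer: {draft}"
        )
        # Naive stopping check for the sketch; real systems use a verifier or reward model.
        if "no issues" in critique.lower():
            break
        draft = llm(
            f"Question: {prompt}\nDraft: {draft}\nCritique: {critique}\n"
            f"Revise the draft to address the critique."
        )
    return draft
```

External-feedback variants replace the self-critique step with signals from a stronger model, a retrieval system, or a tool such as a code executor.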
Benchmarking and Applications
Given the centrality of reward models in this paradigm, robust benchmarks are indispensable. The paper delineates benchmarks based on annotation protocols and task diversity, providing the infrastructure for evaluating reward model efficacy across tasks like coding, reasoning, and multimodal interactions. Such efforts underscore the broad applicability of learning-from-rewards strategies, extending from math problem-solving and code generation to generalist AI applications where dynamic environmental interactions are crucial.
Challenges and Future Directions
The paper identifies challenges such as reward model interpretability, the risk of reward hacking, and the necessity for grounded real-world rewards. It advocates for future research into developing generalist reward models capable of cross-task generalization and continual learning, addressing catastrophic forgetting and evolving preferences.
Conclusion
Overall, this survey serves as a guiding compendium for researchers, illuminating the nuanced methodologies and broad implications of learning-from-rewards strategies in advancing LLM capabilities. By meticulously cataloging existing literature and projecting future trends, it foregrounds the transformative potential of reward-centric AI systems.