Rewarding Progress: Scaling Automated Process Verifiers for LLM Reasoning

Published 10 Oct 2024 in cs.LG and cs.CL | (2410.08146v1)

Abstract: A promising approach for improving reasoning in LLMs is to use process reward models (PRMs). PRMs provide feedback at each step of a multi-step reasoning trace, potentially improving credit assignment over outcome reward models (ORMs) that only provide feedback at the final step. However, collecting dense, per-step human labels is not scalable, and training PRMs from automatically-labeled data has thus far led to limited gains. To improve a base policy by running search against a PRM or using it as dense rewards for reinforcement learning (RL), we ask: "How should we design process rewards?". Our key insight is that, to be effective, the process reward for a step should measure progress: a change in the likelihood of producing a correct response in the future, before and after taking the step, corresponding to the notion of step-level advantages in RL. Crucially, this progress should be measured under a prover policy distinct from the base policy. We theoretically characterize the set of good provers and our results show that optimizing process rewards from such provers improves exploration during test-time search and online RL. In fact, our characterization shows that weak prover policies can substantially improve a stronger base policy, which we also observe empirically. We validate our claims by training process advantage verifiers (PAVs) to predict progress under such provers, and show that compared to ORMs, test-time search against PAVs is $>8\%$ more accurate, and $1.5-5\times$ more compute-efficient. Online RL with dense rewards from PAVs enables one of the first results with $5-6\times$ gain in sample efficiency, and $>6\%$ gain in accuracy, over ORMs.

Summary

  • The paper argues that effective process rewards should measure progress: the change in the likelihood of reaching a correct final answer before and after a step, i.e., a step-level advantage, rather than relying on outcome reward models (ORMs) that score only the final answer.
  • This progress is measured under a prover policy distinct from the base policy; process advantage verifiers (PAVs) trained to predict it make test-time search more than 8% more accurate and 1.5 to 5 times more compute-efficient than ORMs.
  • Online RL with dense rewards from PAVs, evaluated on 2B, 9B, and 27B models, improves sample efficiency by 5-6 times and accuracy by more than 6% over ORMs.

Overview of "Rewarding Progress: Scaling Automated Process Verifiers for LLM Reasoning"

The paper "Rewarding Progress: Scaling Automated Process Verifiers for LLM Reasoning" explores the role of process reward models (PRMs) in enhancing LLM reasoning capabilities. The authors focus on how PRMs can provide step-level feedback in multi-step reasoning processes, contrasting them with outcome reward models (ORMs) that deliver feedback only at the conclusion of the reasoning trace.

Key Insights

The primary motivation is to design process rewards that measure progress: the change in the likelihood of eventually producing a correct response, before and after taking a step. This corresponds to the notion of step-level advantages in reinforcement learning (RL). Crucially, the authors argue that this progress should be measured under a prover policy distinct from the base policy, which is what makes the resulting rewards useful for exploration in test-time search and online RL.
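
Concretely, if $V^{\mu}(s_{1:h-1})$ denotes the probability that a prover policy $\mu$ reaches a correct final answer when rolled out from the partial trace $s_{1:h-1}$, and $Q^{\mu}(s_{1:h-1}, a_h)$ denotes the same probability after appending a candidate step $a_h$, then the progress of that step is its advantage under the prover (the notation here is illustrative; the paper's own symbols may differ):

$$A^{\mu}(s_{1:h-1}, a_h) = Q^{\mu}(s_{1:h-1}, a_h) - V^{\mu}(s_{1:h-1}),$$

i.e., the change in the likelihood of a correct final answer before versus after taking the step.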

Theoretical Contributions

The paper provides a theoretical characterization of effective prover policies, showing that even weak prover policies can improve a stronger base policy. This is quantified through the notion of step-level advantages under the prover, which the authors show are more effective than $Q$-values at driving exploration during test-time search and online RL.
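
A toy numerical sketch (hypothetical numbers, not taken from the paper) makes the distinction concrete: a step appended to an already-promising prefix can carry a high $Q$-value yet near-zero advantage, while a step that rescues a difficult prefix has a lower $Q$-value but a large advantage, and only the latter signals progress.

```python
# Hypothetical numbers illustrating Q-values vs. advantages under a prover.
# V_before: prover's success probability from the prefix alone.
# Q_after:  prover's success probability after appending the candidate step.
candidates = {
    "step on easy prefix": {"V_before": 0.70, "Q_after": 0.72},
    "step on hard prefix": {"V_before": 0.20, "Q_after": 0.55},
}

for name, v in candidates.items():
    advantage = v["Q_after"] - v["V_before"]
    print(f"{name}: Q = {v['Q_after']:.2f}, advantage = {advantage:+.2f}")

# Ranking by Q prefers the easy-prefix step (0.72 > 0.55);
# ranking by advantage prefers the hard-prefix step (+0.35 > +0.02),
# i.e., the step that actually made progress.
```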

Empirical Results

The authors validate their claims by training process advantage verifiers (PAVs) to predict progress under such provers. Test-time search against PAVs is more than 8% more accurate and 1.5 to 5 times more compute-efficient than search against ORMs, and online RL with dense rewards from PAVs yields a 5-6 times gain in sample efficiency and over 6% higher accuracy.
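
As a minimal sketch of how such a verifier could be plugged into test-time search, the following step-level beam search keeps the candidate continuations the verifier scores highest. The helpers `generate_steps(prefix, k)` (proposes next steps from the base policy) and `pav_score(prefix, step)` (returns the verifier's predicted progress) are hypothetical placeholders, not the paper's API.

```python
def beam_search_with_verifier(problem, generate_steps, pav_score,
                              is_complete, beam_width=4, expansions=4,
                              max_steps=10):
    """Step-level beam search guided by a process verifier.

    Expands each partial trace with candidate next steps and keeps the
    `beam_width` candidates with the highest verifier scores."""
    beams = [problem]  # each beam is a partial reasoning trace (a string)
    for _ in range(max_steps):
        scored = []
        for prefix in beams:
            if is_complete(prefix):
                scored.append((float("inf"), prefix))  # keep finished traces
                continue
            for step in generate_steps(prefix, expansions):
                scored.append((pav_score(prefix, step), prefix + "\n" + step))
        scored.sort(key=lambda item: item[0], reverse=True)
        beams = [trace for _, trace in scored[:beam_width]]
        if all(is_complete(trace) for trace in beams):
            break
    return beams[0]
```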

Methodology

The data collection strategy samples rollouts from a prover policy and estimates $Q$-values at each prefix of a reasoning trace; these estimates supervise process reward learning. The methodology is validated with 2B, 9B, and 27B parameter LLMs, showing consistent improvements on reasoning tasks.
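
A sketch of the kind of Monte Carlo estimate this implies is shown below. It assumes a hypothetical `prover.complete(prefix)` that rolls the prover policy out to a final answer and an `is_correct(solution, answer)` checker; both are placeholders rather than the paper's interface.

```python
def estimate_q_value(prefix, answer, prover, is_correct, n_rollouts=8):
    """Monte Carlo estimate of Q(prefix): the fraction of prover
    completions of this prefix that reach the correct final answer."""
    hits = sum(is_correct(prover.complete(prefix), answer) for _ in range(n_rollouts))
    return hits / n_rollouts


def label_trace(problem, steps, answer, prover, is_correct, n_rollouts=8):
    """Label each step of a sampled trace with its estimated progress,
    i.e., the change in Q before vs. after the step."""
    labels, prefix = [], problem
    q_prev = estimate_q_value(prefix, answer, prover, is_correct, n_rollouts)
    for step in steps:
        prefix = prefix + "\n" + step
        q_curr = estimate_q_value(prefix, answer, prover, is_correct, n_rollouts)
        labels.append(q_curr - q_prev)  # progress made by this step
        q_prev = q_curr
    return labels
```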

Implications and Future Directions

This research has notable implications for optimizing reasoning tasks where step-level exploration is crucial. By measuring progress under prover policies distinct from the base policy, rewarding progress encourages broader exploration during both training and test-time search. Future research could further refine the dynamic selection of prover policies to improve adaptability during training.

In conclusion, this work presents a compelling argument for rethinking how process rewards are designed for LLMs, advocating progress-based rewards as a means to improve both task performance and computational efficiency. These insights could shape how models are trained and evaluated on complex reasoning tasks going forward.
