
Free Process Rewards without Process Labels (2412.01981v1)

Published 2 Dec 2024 in cs.LG and cs.CL

Abstract: Unlike its counterpart, the outcome reward model (ORM), which evaluates entire responses, a process reward model (PRM) scores a reasoning trajectory step by step, providing denser and more fine-grained rewards. However, training a PRM requires labels annotated at every intermediate step, presenting significant challenges for both manual and automatic data collection. This paper aims to address this challenge. Both theoretically and empirically, we show that an implicit PRM can be obtained at no additional cost, by simply training an ORM on the cheaper response-level labels. The only assumption is to parameterize the outcome reward as the log-likelihood ratio of the policy and reference models, which can be optimized regardless of the specific choice of loss objectives. In experiments, we instantiate our implicit PRMs with various objectives and evaluate their performance on MATH. We show that our implicit PRM outperforms a strong MCTS-based baseline à la Math-Shepherd using less than 1/38 of the training data. Its performance can be further improved with majority voting. We further find that scaling up instructions and responses benefits our implicit PRM, and the latter brings a larger gain. In particular, we find that our implicit PRM, when instantiated with the cross-entropy (CE) loss, is more data-efficient and can keep improving generation models even when trained with only one response per instruction, a setup that suffers from extreme data scarcity and imbalance. Furthermore, instructions should be relevant to downstream tasks, while the diversity of responses does not bring gains. Surprisingly, training on extra Math-Shepherd step labels brings no further improvements to our implicit PRM trained on only outcome data. We hope that our work will encourage a rethinking of PRM training approaches and contribute to making training PRMs more accessible.

Authors (9)
  1. Lifan Yuan (22 papers)
  2. Wendi Li (11 papers)
  3. Huayu Chen (19 papers)
  4. Ganqu Cui (39 papers)
  5. Ning Ding (122 papers)
  6. Kaiyan Zhang (33 papers)
  7. Bowen Zhou (141 papers)
  8. Zhiyuan Liu (433 papers)
  9. Hao Peng (291 papers)
Citations (1)

Summary

Implicit Process Reward Models: A Cost-Effective Approach

The paper presents a novel approach to training Process Reward Models (PRMs) efficiently by deriving them implicitly from Outcome Reward Models (ORMs) trained with a particular reward parameterization. This sidesteps the central difficulty of conventional PRM training: annotating every intermediate reasoning step, which is labor-intensive whether done manually or automatically.

Core Contributions

  1. Implicit PRM Derivation: The authors parameterize the outcome reward as the log-likelihood ratio of the policy and reference models. Under this parameterization, an ORM trained only on response-level labels yields a PRM implicitly: step-level rewards can be read off the trained model without any step annotations, so the PRM comes at no additional labeling cost (see the sketch after this list).
  2. Empirical Evaluation: The paper provides empirical evidence that these implicit PRMs achieve strong performance without the overhead of traditional PRM training. On MATH, implicit PRMs instantiated with various training objectives consistently outperform competitive baselines, including Math-Shepherd and AutoPSV.
  3. Training Efficiency: The approach reduces data collection and training overhead by a factor of 38.8 compared to methods like Math-Shepherd, which rely on Monte Carlo Tree Search (MCTS) for step-level data annotation, while maintaining strong best-of-N sampling performance on mathematical reasoning tasks.
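
Below is a minimal sketch of how step-level rewards could be read off such a model, assuming Hugging Face Transformers checkpoints for the policy and reference models; the coefficient beta, the per-token notion of a "step", and the helper names are illustrative assumptions rather than the authors' exact implementation.

```python
# Minimal sketch of implicit per-step rewards: the outcome reward is parameterized as
# beta * log(pi_theta(y|x) / pi_ref(y|x)), so the cumulative log-likelihood ratio up to
# position t acts as a Q-value and per-step rewards are its increments.
# Model choice, beta, and per-token "steps" are illustrative assumptions.
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

def response_token_logprobs(model, input_ids, prompt_len):
    """Log-probability of each response token under `model` (response starts at prompt_len)."""
    with torch.no_grad():
        logits = model(input_ids).logits                 # (1, seq_len, vocab)
    logprobs = torch.log_softmax(logits[:, :-1], dim=-1) # position i predicts token i+1
    targets = input_ids[:, 1:]
    per_token = logprobs.gather(-1, targets.unsqueeze(-1)).squeeze(-1)
    return per_token[0, prompt_len - 1:]                 # keep only response positions

def implicit_step_rewards(prompt, response, policy, ref, tokenizer, beta=0.05):
    """Per-position process rewards and cumulative Q-values for one response."""
    prompt_len = tokenizer(prompt, return_tensors="pt").input_ids.shape[1]
    input_ids = tokenizer(prompt + response, return_tensors="pt").input_ids
    # Assumes the prompt tokenization is a prefix of the joint tokenization.
    log_ratio = (response_token_logprobs(policy, input_ids, prompt_len)
                 - response_token_logprobs(ref, input_ids, prompt_len))
    step_rewards = beta * log_ratio            # reward contributed by each position
    q_values = torch.cumsum(step_rewards, 0)   # Q after each position; last entry is the outcome reward
    return step_rewards, q_values
```

In practice one would read the cumulative sum at reasoning-step boundaries (e.g., line breaks) rather than at every token; the final cumulative value recovers the response-level outcome reward that the ORM was trained on.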

Technical Insights and Implications

  • Efficient Reward Parameterization: The log-likelihood-ratio reward aligns with direct preference optimization (DPO) and its variants, so the Q-function is learned automatically during ORM training regardless of the specific loss objective. This streamlines reward learning both theoretically and in practice.
  • Scalability and Resource Usage: Scaling the number of responses per instruction benefits PRM performance more than scaling the number of instructions. Notably, the cross-entropy (CE) instantiation remains effective even with a single response per instruction, showing robustness to extreme data scarcity and imbalance (a sketch of this objective follows the list).
  • Advanced Sampling Techniques: Combining the implicit PRM with majority voting improves best-of-N selection further, showing that simple aggregation on top of a better reward model compounds its gains.
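
As a rough illustration of that cross-entropy instantiation, the sketch below fits the implicit reward beta * log(pi_theta(y|x) / pi_ref(y|x)) to binary response-level correctness labels; the function name, shapes, and beta value are assumptions for illustration, not the paper's exact recipe.

```python
# Sketch of a response-level cross-entropy (CE) objective: the implicit reward
# beta * (log pi_theta(y|x) - log pi_ref(y|x)) is fit to a binary correctness label,
# so only outcome labels are needed. Names and beta are illustrative assumptions.
import torch
import torch.nn.functional as F

def ce_orm_loss(policy_logp, ref_logp, labels, beta=0.05):
    """policy_logp / ref_logp: summed response log-likelihoods, shape (batch,).
    labels: 1.0 if the response's final answer is correct, else 0.0 (float, shape (batch,)).
    The reference log-likelihoods are assumed precomputed without gradients."""
    rewards = beta * (policy_logp - ref_logp)   # implicit outcome reward per response
    # Binary cross-entropy on sigmoid(reward) against the correctness label.
    return F.binary_cross_entropy_with_logits(rewards, labels)
```

Because each training example needs only one sampled response and its correctness label, this objective matches the data-scarce, imbalanced setting highlighted above.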

Future Directions

The findings suggest a shift in how reward models can be conceptualized and trained. By removing the dependency on step-level annotations, the approach opens the door to scalable reward-model training in domains beyond mathematics, such as other forms of structured reasoning or complex decision-making. Future work could combine implicit PRMs with refined annotation methodologies, or extend the idea to domains where labeling intermediate processes is similarly prohibitive.

Ultimately, this paper takes a meaningful step toward reducing the computational and annotation cost of training reward models while preserving accuracy, and it should encourage further work on reward modeling for complex reasoning tasks.
