The Lessons of Developing Process Reward Models in Mathematical Reasoning (2501.07301v1)

Published 13 Jan 2025 in cs.CL, cs.AI, and cs.LG

Abstract: Process Reward Models (PRMs) emerge as a promising approach for process supervision in mathematical reasoning of LLMs, which aim to identify and mitigate intermediate errors in the reasoning processes. However, the development of effective PRMs faces significant challenges, particularly in data annotation and evaluation methodologies. In this paper, through extensive experiments, we demonstrate that commonly used Monte Carlo (MC) estimation-based data synthesis for PRMs typically yields inferior performance and generalization compared to LLM-as-a-judge and human annotation methods. MC estimation relies on completion models to evaluate current-step correctness, leading to inaccurate step verification. Furthermore, we identify potential biases in conventional Best-of-N (BoN) evaluation strategies for PRMs: (1) The unreliable policy models generate responses with correct answers but flawed processes, leading to a misalignment between the evaluation criteria of BoN and the PRM objectives of process verification. (2) The tolerance of PRMs of such responses leads to inflated BoN scores. (3) Existing PRMs have a significant proportion of minimum scores concentrated on the final answer steps, revealing the shift from process to outcome-based assessment in BoN Optimized PRMs. To address these challenges, we develop a consensus filtering mechanism that effectively integrates MC estimation with LLM-as-a-judge and advocates a more comprehensive evaluation framework that combines response-level and step-level metrics. Based on the mechanisms, we significantly improve both model performance and data efficiency in the BoN evaluation and the step-wise error identification task. Finally, we release a new state-of-the-art PRM that outperforms existing open-source alternatives and provides practical guidelines for future research in building process supervision models.

Authors (9)
  1. Zhenru Zhang (13 papers)
  2. Chujie Zheng (35 papers)
  3. Yangzhen Wu (2 papers)
  4. Beichen Zhang (27 papers)
  5. Runji Lin (18 papers)
  6. Bowen Yu (89 papers)
  7. Dayiheng Liu (75 papers)
  8. Jingren Zhou (198 papers)
  9. Junyang Lin (99 papers)

Summary

The paper addresses the challenges in developing effective Process Reward Models (PRMs) for mathematical reasoning in LLMs. It identifies limitations in current data annotation and evaluation methodologies, particularly those involving Monte Carlo (MC) estimation and Best-of-N (BoN) evaluation strategies.

The authors demonstrate that MC estimation-based data synthesis for PRMs yields lower performance and generalization than LLM-as-a-judge and human annotation. They attribute this to MC estimation's reliance on completion models to evaluate step correctness, which introduces noise when a completion happens to reach the correct answer from an incorrect step, or fails to reach it from a correct one.

The paper also identifies potential biases in BoN evaluation strategies for PRMs:

  • Unreliable policy models generate responses with correct answers but flawed processes, leading to a misalignment between the evaluation criteria of BoN and the PRM objectives of process verification.
  • Tolerance of PRMs for such responses leads to inflated BoN scores.
  • Existing PRMs exhibit a significant proportion of minimum scores concentrated on the final answer steps, revealing a shift from process-based to outcome-based assessment in BoN-optimized PRMs (a simple diagnostic for this is sketched below).
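
One rough way to quantify the third bias, assuming access to per-step PRM scores for a set of scored responses, is to measure how often the minimum step score lands on the final (answer) step. The sketch below is illustrative only and is not taken from the paper.

```python
from typing import List

def min_score_on_final_step_rate(step_scores_per_response: List[List[float]]) -> float:
    """Fraction of responses whose lowest PRM step score falls on the final (answer)
    step, a rough proxy for the outcome-oriented bias described above."""
    hits = sum(
        scores.index(min(scores)) == len(scores) - 1
        for scores in step_scores_per_response
    )
    return hits / len(step_scores_per_response)
```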

To mitigate these issues, the authors develop a consensus filtering mechanism that integrates MC estimation with LLM-as-a-judge. This approach retains instances only when both LLM-as-a-judge and MC estimation agree on the error locations in the solution. Additionally, they advocate for a more comprehensive evaluation framework that combines response-level and step-level metrics. The consensus filtering mechanism improves both model performance and data efficiency in BoN evaluation and step-wise error identification tasks. The authors release a new PRM that outperforms existing open-source alternatives and provides guidelines for future research in building process supervision models.

The paper's key contributions include:

  • Identification of limitations in MC estimation-based data construction for PRMs.
  • Revelation of bias in using response-level BoN evaluation alone for PRMs.
  • Proposal of a consensus filtering mechanism that integrates MC estimation with LLM-as-a-judge.

The authors conducted preliminary trials to train PRMs via MC estimation-based reasoning step annotation. They found that PRMs trained via MC estimation did not show noticeable advantages over those trained on human-annotated data and lagged behind in identifying erroneous reasoning steps.

In the training data synthesis, the authors followed an MC estimation approach similar to Math-Shepherd to construct the PRM training data. They collected a large-scale dataset of approximately 500,000 queries with gold answers and generated 6-8 diverse responses using the Qwen2-Math-Instruct and Qwen2.5-Math-Instruct series models. These responses were split into individual steps using the delimiter "\n\n". Step correctness was assessed by running eight independent completions starting from each step with the Qwen2.5-Math-Instruct series, estimating step labels from the empirical probability of each step yielding the correct final answer.
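
The following is a minimal sketch of this Math-Shepherd-style MC estimation; `complete` and `extract_answer` are hypothetical stand-ins for the actual Qwen2.5-Math-Instruct sampling and answer-parsing pipeline.

```python
from typing import Callable, List

def mc_estimate_step_labels(
    question: str,
    steps: List[str],                      # response split on "\n\n"
    gold_answer: str,
    complete: Callable[[str], str],        # hypothetical: sample one completion given a prefix
    extract_answer: Callable[[str], str],  # hypothetical: parse the final answer from a completion
    num_completions: int = 8,
) -> List[float]:
    """Estimate, for each step, the empirical probability that completions
    starting from that step reach the gold final answer (soft label in [0, 1])."""
    labels: List[float] = []
    for i in range(len(steps)):
        prefix = question + "\n\n" + "\n\n".join(steps[: i + 1])
        hits = sum(
            extract_answer(complete(prefix)) == gold_answer
            for _ in range(num_completions)
        )
        labels.append(hits / num_completions)
    return labels
```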

The trained PRMs were initialized from the supervised fine-tuned Qwen2.5-Math-7B/72B-Instruct models, replacing the original language modeling head with a scalar-value head consisting of two linear layers. PRMs were trained with either hard labels or soft labels. For hard labels, a step was treated as correct if any of the eight completions yielded the correct final answer, and incorrect otherwise. For soft labels, the value (between 0 and 1) was the proportion of completions leading to the correct final answer. Cross-entropy loss and mean squared error loss were computed on the last token of each step, for the binary classification task with hard labels and the regression task with soft labels, respectively. All steps following those labeled as incorrect (label 0) were discarded.
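
A minimal sketch of how the hard/soft labels and step-level losses described above could be computed, assuming the MC success rates and the scalar-head outputs at each step's last token are already available; names and shapes are illustrative, not the authors' implementation.

```python
import torch
import torch.nn.functional as F

def make_step_labels(mc_probs):
    """mc_probs[i]: fraction of the eight completions from step i that reached the
    gold answer. Hard label: 1 if any completion succeeded, else 0. Steps after
    the first hard-0 step are dropped, keeping the first incorrect step itself."""
    soft = torch.tensor(mc_probs, dtype=torch.float)
    hard = (soft > 0).float()
    zero_idx = (hard == 0).nonzero(as_tuple=True)[0]
    keep = int(zero_idx[0]) + 1 if len(zero_idx) > 0 else len(hard)
    return hard[:keep], soft[:keep]

def step_losses(step_logits: torch.Tensor, hard: torch.Tensor, soft: torch.Tensor):
    """step_logits: scalar-head outputs taken at the last token of each kept step."""
    ce = F.binary_cross_entropy_with_logits(step_logits, hard)   # hard labels: classification
    mse = F.mse_loss(torch.sigmoid(step_logits), soft)           # soft labels: regression
    return ce, mse
```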

The trained PRMs were evaluated from two aspects: their utility in improving downstream task performance and their ability to identify specific erroneous steps in reasoning processes. Consistent with previous work, the authors employed the BoN sampling strategy for evaluation, selecting the highest-scored response from N candidates according to a PRM; the metric is denoted "prm@N". Following previous work, eight responses (i.e., N=8) were sampled from Qwen2.5-Math-7B-Instruct across multiple mathematical benchmarks. Each candidate response was scored as the product of the individual scores of its steps. The result of majority voting among the eight samples (maj@8) was reported as a baseline, and pass@8 as the upper bound. Additionally, evaluation was performed on ProcessBench to measure the models' capability to identify erroneous steps in mathematical reasoning.
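
A sketch of the BoN scoring just described, with the product-of-step-scores response score, prm@N selection, and the maj@N baseline; answer extraction from each response is assumed to happen upstream.

```python
import math
from collections import Counter
from typing import List

def prm_response_score(step_scores: List[float]) -> float:
    """Response-level score: product of the individual step scores."""
    return math.prod(step_scores)

def best_of_n(responses: List[str], step_scores_per_response: List[List[float]]) -> str:
    """prm@N: return the candidate response whose product of step scores is highest."""
    scores = [prm_response_score(s) for s in step_scores_per_response]
    return responses[scores.index(max(scores))]

def majority_vote(final_answers: List[str]) -> str:
    """maj@N baseline: most frequent extracted final answer among the N samples."""
    return Counter(final_answers).most_common(1)[0][0]
```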

The experimental results demonstrated that on the Best-of-8 evaluation, none of the PRMs achieved prm@8 scores superior to maj@8. Furthermore, on the ProcessBench, the PRMs trained with MC estimation exhibited significantly lower erroneous step localization capabilities compared to the PRM trained on human-annotated data.

The paper discusses critical lessons learned during PRM training, focusing on the limitations of MC estimation and the bias in using BoN as the sole evaluation metric.

MC estimation was found to incorporate value-model principles into PRM training, potentially limiting performance and generalization. A comparison of data construction approaches (MC estimation, LLM-as-a-judge, and human annotation) showed that human annotation achieved the best performance with the least data, followed by LLM-as-a-judge, while MC estimation performed the worst despite having the largest dataset.

The authors proposed a consensus filtering mechanism that integrates LLM-as-a-judge with MC estimation, retaining instances only when both methods agree on error locations. This approach demonstrated more efficient data utilization and surpassed existing open-source PRMs in BoN evaluation. Training with hard labels was found to outperform training with soft labels after data filtering.

The paper also discusses biases in BoN sampling for PRM performance evaluation. Unreliable policy models can generate responses with correct answers but flawed processes, leading to a misalignment between BoN evaluation criteria and PRM objectives. Limited process verification capability in PRMs can lead to BoN score inflation. Optimization solely focused on BoN evaluation can cause PRMs to shift from process-based to outcome-oriented assessment.

The authors trained PRMs using a data construction procedure comprising data expansion and data filtering. The expansion phase followed MC estimation, using hard labels. The filtering phase used an LLM (Qwen2.5-Instruct-72B) to verify the reasoning process step by step. A consensus filtering mechanism was implemented, filtering out instances where there was a discrepancy between the LLM-annotated and MC-estimated process labels. For the training task, cross-entropy loss was used on the tokens at the end of each step to train the binary classification task based on hard labels.
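
A minimal sketch of the consensus filtering rule, under the assumption that each instance carries hard step labels from both MC estimation and the LLM judge, and that "error location" means the index of the first step labeled incorrect.

```python
from typing import List, Optional

def first_error(hard_labels: List[int]) -> Optional[int]:
    """Index of the first step labeled incorrect (0), or None if every step is correct."""
    for i, label in enumerate(hard_labels):
        if label == 0:
            return i
    return None

def keep_instance(mc_labels: List[int], judge_labels: List[int]) -> bool:
    """Consensus filter: keep an instance only when MC estimation and the LLM judge
    agree on the location of the first error (including agreeing there is none)."""
    return first_error(mc_labels) == first_error(judge_labels)

# Hypothetical usage over a dataset of annotated instances:
# filtered = [ex for ex in dataset if keep_instance(ex["mc_labels"], ex["judge_labels"])]
```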

To validate the effectiveness of the trained PRMs, the authors conducted a response-level BoN evaluation and a step-level process error identification task on ProcessBench. In the rm@8 evaluation, both Outcome Reward Models (ORMs) and PRMs were evaluated. For ORMs, Qwen2.5-Math-RM-72B was used, which assigns a single score to each complete response; for PRMs, the product of the step scores was computed as the final response score.

The evaluation with the policy model Qwen2.5-Math-7B-Instruct showed that Qwen2.5-Math-PRM-7B achieved superior performance compared to other PRMs of equivalent model scale, and Qwen2.5-Math-PRM-72B performed slightly better overall than Qwen2.5-Math-RM-72B. On ProcessBench, Qwen2.5-Math-PRM-7B outperformed all open-source models.

The paper concludes by summarizing the investigation of PRMs, highlighting the limitations of MC estimation and vanilla BoN evaluation, and proposing a consensus filtering strategy and a comprehensive evaluation framework. The experiments demonstrated that the proposed strategy significantly improves both data efficiency and model performance.
