- The paper introduces ProcessBench, a benchmark of 3,400 expert-annotated step-by-step solutions for evaluating language models' ability to identify the earliest erroneous step in mathematical reasoning.
- Evaluation on ProcessBench reveals that existing process reward models (PRMs) struggle to identify process errors, particularly on challenging problems, exposing limitations in current PRM training methodologies.
- Critic models, i.e., general-purpose language models prompted to critique solutions, outperform existing PRMs at identifying reasoning errors, suggesting a promising direction for scalable model oversight through prompt engineering.
An Overview of ProcessBench: Evaluating Error Identification in Mathematical Reasoning
The paper, "PROCESS BENCH: Identifying Process Errors in Mathematical Reasoning," presents a benchmark aimed at assessing the capability of LLMs to identify errors in the reasoning process of solving complex math problems. With the increasing deployment of LLMs in tasks requiring sophisticated reasoning, the ability to detect errors is crucial for reliable and scalable oversight.
Benchmark Design and Objectives
ProcessBench focuses on math problems drawn predominantly from competition- and Olympiad-level contests, offering a challenging testbed of 3,400 test cases. Each test case pairs a problem with a step-by-step solution in which human experts have marked the earliest step containing an error, or confirmed that no step is erroneous. The benchmark thus evaluates models' ability to detect faulty reasoning steps rather than merely check the correctness of final answers.
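For concreteness, a single test case can be pictured as a record that pairs the problem and its segmented solution with the annotated error location. The sketch below is illustrative only; the field names and the convention of using -1 for an error-free solution are assumptions, not the benchmark's exact schema.

```python
# Illustrative sketch of a ProcessBench-style test case (field names are assumptions).
# Convention assumed here: label = -1 means every step is correct; otherwise it is
# the zero-based index of the earliest erroneous step.
example_case = {
    "problem": "Compute the sum of the first 10 positive odd integers.",
    "steps": [
        "Step 1: The first 10 positive odd integers are 1, 3, 5, ..., 19.",
        "Step 2: Their sum equals 10^2 = 100.",
        "Step 3: Therefore the answer is 100.",
    ],
    "label": -1,  # all steps are correct in this toy example
}
```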
Key objectives in the design of ProcessBench are:
- Problem Difficulty and Solution Diversity: By including advanced math problems and diverse solution paths, the benchmark ensures robust evaluation of LLMs.
- Scale and Annotation Quality: 3,400 expert-annotated test cases make the benchmark large and reliable enough for meaningful, large-scale assessment.
- Evaluation Simplicity: The task reduces to identifying the earliest erroneous step, or confirming that none exists, which makes the benchmark applicable across model types, including process reward models (PRMs) and critic models; a scoring sketch follows this list.
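Under that convention, scoring a model's predictions is straightforward. The following sketch assumes each prediction is the index of the claimed earliest error (or -1 for "no error") and reports accuracy on erroneous cases, accuracy on fully correct cases, and their harmonic mean; the exact aggregation used in the paper may differ.

```python
def score_predictions(predictions, labels):
    """Score earliest-error predictions against annotated labels.

    predictions, labels: lists of ints; -1 means "all steps correct",
    otherwise the index of the earliest erroneous step.
    Returns (accuracy on erroneous cases, accuracy on correct cases, harmonic mean).
    """
    err_hits = err_total = ok_hits = ok_total = 0
    for pred, gold in zip(predictions, labels):
        if gold == -1:
            ok_total += 1
            ok_hits += pred == gold
        else:
            err_total += 1
            err_hits += pred == gold
    acc_err = err_hits / err_total if err_total else 0.0
    acc_ok = ok_hits / ok_total if ok_total else 0.0
    f1 = (2 * acc_err * acc_ok / (acc_err + acc_ok)) if (acc_err + acc_ok) else 0.0
    return acc_err, acc_ok, f1
```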
Evaluation of Model Performance
The paper evaluates two types of models: PRMs and critic models. PRMs assign scores to intermediate reasoning steps and are trained on step-level labels, which are often synthesized automatically, whereas critic models are general-purpose LLMs prompted to critique a complete solution and point out where it goes wrong.
Process Reward Models (PRMs)
Results show that existing PRMs underperform, particularly on problems more challenging than those in the GSM8K and MATH datasets. This underperformance points to two concerns about current PRM methodologies: training labels synthesized from the empirical probability of reaching the correct final answer generalize poorly, and the models tend to overlook process errors whenever the final answer happens to be correct. Notably, a PRM fine-tuned on the human-annotated PRM800K dataset performed markedly better, suggesting a potential path forward for improving these models.
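To evaluate a step-scoring PRM on this task, its per-step scores must be mapped to a single earliest-error prediction. A minimal sketch of one common convention, flagging the first step whose score drops below a threshold, is shown below; `prm_score_steps` is a hypothetical scorer, not an API from the paper.

```python
from typing import Callable, List

def locate_first_error(
    problem: str,
    steps: List[str],
    prm_score_steps: Callable[[str, List[str]], List[float]],
    threshold: float = 0.5,
) -> int:
    """Convert per-step PRM scores into an earliest-error prediction.

    Returns the index of the first step scored below `threshold`,
    or -1 if every step clears it (i.e., the solution is judged correct).
    """
    scores = prm_score_steps(problem, steps)  # hypothetical scorer: one score per step
    for i, score in enumerate(scores):
        if score < threshold:
            return i
    return -1
```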
Critic Models
Critic models perform better, owing to their ability to analyze a solution holistically and reason about where it breaks down, and their critique ability improves with model size. The open-source QwQ-32B-Preview notably rivals the proprietary GPT-4o, though it still trails o1-mini. Critic models thus demonstrate promising capabilities for advancing error identification in mathematical reasoning.
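In practice, a critic model is just a general-purpose LLM prompted to walk through the solution and report the earliest mistake. The sketch below illustrates one way such a prompt could be assembled and its reply parsed; the wording of the template and the `chat` callable are assumptions, not the prompt used in the paper.

```python
import re
from typing import Callable, List

# Hypothetical critique prompt; the paper's actual prompt may differ.
CRITIQUE_TEMPLATE = """You are given a math problem and a step-by-step solution.
Review each step in order and decide whether it is correct.
If every step is correct, answer with: Earliest error: -1
Otherwise answer with: Earliest error: <index of the first incorrect step>

Problem:
{problem}

Solution steps:
{steps}
"""

def critique_solution(problem: str, steps: List[str], chat: Callable[[str], str]) -> int:
    """Ask a critic model for the earliest erroneous step; -1 means all correct."""
    prompt = CRITIQUE_TEMPLATE.format(
        problem=problem,
        steps="\n".join(f"[{i}] {s}" for i, s in enumerate(steps)),
    )
    reply = chat(prompt)  # hypothetical LLM call returning the model's text reply
    match = re.search(r"Earliest error:\s*(-?\d+)", reply)
    return int(match.group(1)) if match else -1
```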
Implications and Future Directions
The presented results underscore the need for stronger PRMs, in particular through more robust methods for synthesizing step-level training data. Moreover, the success of general-purpose models used as critics, achieved through prompt engineering alone, indicates a viable route to error identification in reasoning tasks without extensive domain-specific training.
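For context, the data-synthesis approach being criticized typically labels a step by the fraction of sampled continuations from that step that reach the correct final answer. A hedged sketch of that estimator follows, with hypothetical `sample_completions` and `check_answer` helpers; because a flawed step can still lead to the right answer, such labels can generalize poorly.

```python
from typing import Callable, List

def estimate_step_label(
    problem: str,
    prefix_steps: List[str],
    sample_completions: Callable[[str, List[str]], str],
    check_answer: Callable[[str], bool],
    n_rollouts: int = 8,
) -> float:
    """Monte-Carlo proxy for a step's correctness, as used by common PRM data
    synthesis: the fraction of rollouts continued from this prefix that reach
    the correct final answer. This is a proxy, not a judgment of the step itself.
    """
    hits = 0
    for _ in range(n_rollouts):
        completion = sample_completions(problem, prefix_steps)  # hypothetical sampler
        hits += check_answer(completion)  # hypothetical final-answer checker
    return hits / n_rollouts
```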
The implications of ProcessBench extend to future work on automated assessment of reasoning processes, and the benchmark provides a foundation for scalable LLM oversight. Continuing research will likely focus on better understanding reasoning errors and on strengthening LLMs' intrinsic capacity to critique and refine their own outputs. Bridging the critique-capability gap between open-source and proprietary models also remains an open challenge.
Conclusion
This paper introduces ProcessBench as a comprehensive tool for evaluating LLMs' ability to discern errors in complex mathematical reasoning. The benchmark's rigorous design sets the stage for further innovations in reasoning assessment and scalable model oversight, driving closer alignment of LLM capabilities with human-level reasoning standards.