ProcessBench: Benchmark for Math Reasoning
- ProcessBench is a process-level benchmark that evaluates LLMs' ability to detect and localize errors in multi-step mathematical reasoning with expert-annotated solution traces.
- It comprises 3,400 test cases across domains like GSM8K, MATH, OlympiadBench, and Omni-MATH, each providing detailed stepwise annotations to mark the first occurrence of an error.
- The benchmark facilitates the development of process reward and critic models by employing metrics such as error localization accuracy and F1 score, enhancing scalable oversight and automated correction of LLM reasoning.
ProcessBench is a process-level benchmark designed to evaluate the ability of models, particularly LLMs, to detect and localize errors in multi-step mathematical reasoning traces. Each benchmark instance consists of a math problem, a proposed chain-of-thought (CoT) solution decomposed into ordered steps, and an expert-annotated index indicating the earliest occurrence of an error or confirmation that all steps are correct. The benchmark focuses on competition- and Olympiad-level math problems, offering a challenging and fine-grained test for error localization mechanisms critical for scalable oversight, reward modeling, and automated correction of LLM reasoning (Zheng et al., 2024, Zhong et al., 16 Feb 2025, Rahman et al., 2 Dec 2025).
1. Dataset Structure, Scope, and Task Definition
ProcessBench comprises 3,400 test cases distributed among four canonical mathematical reasoning domains:
- GSM8K: Grade-school arithmetic word problems (400 cases)
- MATH: High-school to college-level competition math (1,000)
- OlympiadBench: Olympiad-style proof challenges (1,000)
- Omni-MATH: A universal Olympiad-level mix (1,000)
Each test case includes:
- A problem statement.
- A step-by-step candidate solution $S = (s_0, \dots, s_{n-1})$, sourced from diverse LLMs to ensure a broad error profile.
- An annotation: either the index of the first erroneous step ($i \in \{0, \dots, n-1\}$) or $-1$ if the solution is flawless.
Error categories cover arithmetic, algebraic, logical, conceptual, and completeness faults. Solutions that reach the correct final answer are still inspected for process errors; over 50% of "correct" OlympiadBench/Omni-MATH solutions contain missteps in earlier reasoning (Zheng et al., 2024, Rahman et al., 2 Dec 2025).
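The structure of a test case described above can be sketched as a small record type. This is only an illustration: the class and field names are hypothetical rather than the official dataset schema, and the sketch assumes 0-based step indices with -1 as the "no error" sentinel.

```python
from dataclasses import dataclass
from typing import List


@dataclass(frozen=True)
class ProcessBenchCase:
    """One test case. Field names are illustrative, not the official schema."""
    problem: str
    steps: List[str]            # ordered solution steps
    label: int                  # assumed: 0-based first-error index, or -1 if none
    final_answer_correct: bool  # a correct answer can still hide a process error

    def __post_init__(self) -> None:
        # The label must be the sentinel -1 or point at an existing step.
        if not (-1 <= self.label < len(self.steps)):
            raise ValueError("label must be -1 or a valid step index")

    @property
    def has_process_error(self) -> bool:
        return self.label != -1


case = ProcessBenchCase(
    problem="What is 12 * 3 + 4?",
    steps=["12 * 3 = 36", "36 + 4 = 41", "The answer is 41."],
    label=1,  # the second step (index 1) contains the first error
    final_answer_correct=False,
)
```

Keeping `final_answer_correct` separate from `label` reflects the point made above: a solution can reach the right answer while still containing an earlier misstep.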
2. Annotation Protocol and Statistical Profile
Annotation is performed by teams of doctoral-level mathematicians. Each solution is independently labeled by three experts, with additional reviewers as needed to reach a consensus (up to five). On the hardest subsets, ~50% of cases require more than three annotators due to labeling ambiguity.
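A minimal sketch of such a consensus loop follows. The text above only states that three experts label each solution and that up to five may be consulted; the stopping rule used here (a label is accepted once three annotators agree on it) is an assumption for illustration, as are the function and constant names.

```python
from collections import Counter
from typing import List, Optional

NEEDED_AGREEMENT = 3  # assumed: three matching labels settle a case
MAX_ANNOTATORS = 5    # from the text: at most five annotators are consulted


def consensus_label(labels_in_order: List[int]) -> Optional[int]:
    """Return the agreed first-error index, or None if no consensus is reached.

    labels_in_order holds one label per annotator, in the order consulted.
    """
    counts: Counter = Counter()
    for label in labels_in_order[:MAX_ANNOTATORS]:
        counts[label] += 1
        if counts[label] >= NEEDED_AGREEMENT:
            return label
    return None


print(consensus_label([2, 2, 2]))        # unanimous after three annotators
print(consensus_label([2, 3, 2, 2]))     # a fourth annotator breaks the tie
print(consensus_label([1, 2, 3, 4, 0]))  # no agreement within five annotators
```

Under this assumed rule, cases where the first three annotators disagree are exactly the ones that need extra reviewers.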
Statistical summary:
- Each subset is balanced on final-answer correctness.
- Average steps per solution are higher on the harder subsets: GSM8K (5.2), MATH (6.5), OlympiadBench (8.8), Omni-MATH (8.0).
- Process errors concentrate early in solutions.
- Inter-annotator agreement is high for GSM8K/MATH, but more difficult problems exhibit greater subjectivity (Zheng et al., 2024, Zhang et al., 26 Mar 2026).
3. Evaluation Methodology and Metrics
The central evaluation task is: Given a problem and solution trace, predict the earliest erroneous step or confirm complete correctness.
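- $\mathrm{Acc}_{\text{err}}$: Accuracy on examples with an error (proportion correctly localizing the first error).
- $\mathrm{Acc}_{\text{correct}}$: Accuracy on fully correct solutions (proportion correctly confirming validity).
- Harmonic mean ($F_1$) score, the primary metric: $F_1 = \dfrac{2 \cdot \mathrm{Acc}_{\text{err}} \cdot \mathrm{Acc}_{\text{correct}}}{\mathrm{Acc}_{\text{err}} + \mathrm{Acc}_{\text{correct}}}$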
- Protocol: Models are provided with exactly one reasoning trace per problem and must output either the error index or $-1$ (no error). No Best-of-N or ensemble decoding is used for the official ProcessBench evaluation (Zhong et al., 16 Feb 2025, Zheng et al., 2024).
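The metrics above can be computed as in the following sketch. The function name and the dictionary keys are hypothetical, and the code assumes that -1 encodes "no error" in both gold labels and predictions.

```python
from typing import Dict, Sequence


def processbench_scores(gold: Sequence[int], pred: Sequence[int]) -> Dict[str, float]:
    """Accuracy on erroneous and on correct solutions, plus their harmonic mean."""
    if len(gold) != len(pred):
        raise ValueError("gold and pred must have the same length")

    # Split exact-match hits by whether the gold solution contains an error.
    err_hits = [p == g for g, p in zip(gold, pred) if g != -1]
    ok_hits = [p == g for g, p in zip(gold, pred) if g == -1]

    acc_err = sum(err_hits) / len(err_hits) if err_hits else 0.0
    acc_correct = sum(ok_hits) / len(ok_hits) if ok_hits else 0.0

    denom = acc_err + acc_correct
    f1 = 2 * acc_err * acc_correct / denom if denom else 0.0
    return {"acc_err": acc_err, "acc_correct": acc_correct, "f1": f1}


gold = [2, -1, 0, -1]
pred = [2, -1, 1, 3]  # one of two errors localized, one of two correct traces confirmed
scores = processbench_scores(gold, pred)
print(scores)  # acc_err = 0.5, acc_correct = 0.5, f1 = 0.5
```

Because the harmonic mean is dominated by the smaller term, a model that always flags an error or always answers -1 scores 0 on one accuracy and therefore 0 on F1.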
4. Model Classes and Benchmarking Results
ProcessBench catalyzes the development of process reward models (PRMs), critic models, and verification architectures.
PRMs are trained to assign correctness probabilities at each step, using either human-annotated labels (e.g., PRM800K), automatic annotations (MC rollouts, compressed or denoised), or weak/unsupervised supervision.
Critic models (fine-tuned or prompted LLMs) output the error index using explicit stepwise critique.
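To be scored on ProcessBench, a PRM's per-step probabilities must be turned into a single first-error prediction. One simple way to do this, sketched below, is to return the first step whose score falls below a threshold. The function name and the default threshold of 0.5 are assumptions for illustration; in practice the threshold would be tuned per model.

```python
from typing import Sequence


def first_error_from_scores(step_scores: Sequence[float], threshold: float = 0.5) -> int:
    """Map per-step correctness probabilities to a first-error index.

    Returns the 0-based index of the first step scored below `threshold`,
    or -1 if every step clears it (the solution is judged fully correct).
    """
    for i, score in enumerate(step_scores):
        if score < threshold:
            return i
    return -1


print(first_error_from_scores([0.9, 0.8, 0.3, 0.7]))  # 2
print(first_error_from_scores([0.9, 0.95]))           # -1
```

A critic model skips this conversion because it emits the index directly.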
Selected results, as reported in Zheng et al., 2024; Zhong et al., 16 Feb 2025; Rahman et al., 2 Dec 2025; and Xu et al., 20 May 2025 (all values are F1 scores):
| Model | GSM8K | MATH | OlympiadBench | Omni-MATH | Avg. F1 |
|---|---|---|---|---|---|
| GPT-4o (critic) | 61.9 | 53.9 | 48.3 | 44.6 | 61.9 |
| Qwen2.5-Math-7B-PRM (System 1) | 39.4 | 52.2 | 39.4 | 33.1 | — |
| Math-Shepherd-PRM-7B | 47.9 | 29.5 | 24.8 | 23.8 | 31.5 |
| PRM800K-fine-tuned (Qwen2.5-Math-7B) | 68.2 | 62.6 | 50.7 | 44.3 | 56.5 |
| Skywork-PRM-7B | 64.1 | 43.2 | 16.2 | 17.9 | 42.1 |
| SPC (7B, round 2) | — | — | — | — | 77.7 |
| R-PRM-7B-DPO | — | — | — | — | 70.4 |
| SPARK PRM-CoT (14B) | — | — | — | — | 65.7 |
| ActPRM-X (7B, active learning) | 82.7 | 82.0 | 72.0 | 67.3 | 76.0 |
Note: Some newer results report only the overall average F1 or metric variants (e.g., recall).
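As a quick consistency check, the "Avg. F1" column can be recomputed from the per-subset scores. The sketch below assumes the average is the unweighted arithmetic mean over the four subsets, which matches the two rows used here; the variable names are illustrative.

```python
from statistics import mean

# Per-subset F1 scores copied from the table above:
# GSM8K, MATH, OlympiadBench, Omni-MATH.
rows = {
    "Math-Shepherd-PRM-7B": [47.9, 29.5, 24.8, 23.8],
    "ActPRM-X (7B, active learning)": [82.7, 82.0, 72.0, 67.3],
}

# Assumed: Avg. F1 is the unweighted mean of the four subset scores.
avg_f1 = {name: round(mean(scores), 1) for name, scores in rows.items()}

for name, value in avg_f1.items():
    print(f"{name}: {value}")
# Math-Shepherd-PRM-7B: 31.5
# ActPRM-X (7B, active learning): 76.0
```

An unweighted mean gives the 400-case GSM8K subset the same influence as each 1,000-case subset.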
Emerging approaches such as self-play (SPC), active learning (ActPRM), compression (SCOPE), denoising (SCAN), unsupervised reward modeling (uPRM), and generative PRMs with code verification (GenPRM) have surpassed early baselines, especially on harder subsets (Chen et al., 27 Apr 2025, Xu et al., 20 May 2025, Sun et al., 4 Jun 2025, Rahman et al., 2 Dec 2025, Zhao et al., 1 Apr 2025). Ablations attribute the main error-detection gains to multi-trajectory scaling, step-specific supervision, and explicit rationale/code synthesis.
5. Error Taxonomy, Complexity, and Representative Cases
ProcessBench annotates only the first error for causal reasons: every downstream step builds on the earliest incorrect step and can only be judged relative to it.
Error typology (implicit in annotation):
- Arithmetic: miscalculations, sign errors
- Algebraic/Manipulation: faulty distribution, substitution, or symbol handling
- Conceptual: misapplied definitions or theorems
- Logical: unjustified inferences, missing or spurious case analysis
- Completeness: omitted justifications, ignored domain constraints
All such mistakes are mapped to a binary “incorrect” label at the step level, with no partial credit.
Difficulty gradient: As problems progress from GSM8K to Omni-MATH, solutions have more steps, higher process-error rates (e.g., 3.5% process errors on correct GSM8K answers vs. 51.8% on Omni-MATH), and annotation becomes more ambiguous (Zheng et al., 2024).
Representative case:
- If a compound interest calculation miscomputes an intermediate value as 1.41907 (true value ≈ 1.4148), the first error is flagged at the corresponding step, and subsequent steps that depend on this misstep are not independently scored (Rahman et al., 2 Dec 2025).
6. Impact, Limitations, and Future Directions
ProcessBench is a widely used testbed for evaluating fine-grained, step-level reasoning verification in LLMs. Its influence spans several areas:
- Oversight and Iteration: Enables automated detection/correction regimes and RL-based refinement that move beyond binary (final answer) reward signals.
- Reward Modeling: Guides the design of step-level, reference-free, and computation-efficient PRMs, supporting robust RL pipelines (Rahman et al., 2 Dec 2025, Ding et al., 20 Sep 2025, Xu et al., 20 May 2025).
- Generalization Stress-Test: Exposes generalization failures in models trained from purely outcome-based signals, particularly on Olympiad-level chains (Zheng et al., 2024, Zhong et al., 16 Feb 2025).
- Process-Aware Metrics: Establishes the F1-based diagnostic for stepwise judgment, which is now a de facto standard measure of process supervision efficacy (Zheng et al., 2024, Rahman et al., 2 Dec 2025).
Limitations:
- Annotation subjectivity and noise persist on the most advanced problems, even among experts.
- The existing protocol requires structured, stepwise solutions, a format that real student work and unconstrained LLM outputs do not always follow.
- Evaluation is single-trace and does not reward finding multiple error types in a single pass.
Future research directions include expanding the domain scope (e.g., to code, logic, or multimodal reasoning), integrating more nuanced error taxonomies, automating or denoising annotation pipelines (e.g., via compression, MC, or self-play), and studying latent-trajectory geometry in multi-step reasoning (Yuan, 20 Apr 2026, Gadetsky et al., 11 May 2026, Gao et al., 14 Apr 2026).
7. Comparison with Related Benchmarks
ProcessBench differs from other multi-step reasoning benchmarks in its explicit focus on process-level supervision and error localization, rather than only final-answer verification or rigid procedural imitation (cf. ProcBench for step-following (Fujisawa et al., 2024)). Its design principles (expert verification, multi-domain scope, and step-index labels) provide granular diagnostic signals that inform the design and hardening of both discriminative and generative verification architectures.
References:
- (Zheng et al., 2024) ProcessBench: Identifying Process Errors in Mathematical Reasoning
- (Zhong et al., 16 Feb 2025) Dyve: Thinking Fast and Slow for Dynamic Process Verification
- (Rahman et al., 2 Dec 2025) SPARK: Stepwise Process-Aware Rewards for Reference-Free Reinforcement Learning
- (Gadetsky et al., 11 May 2026) Unsupervised Process Reward Models
- (Xu et al., 20 May 2025) SCOPE: Compress Mathematical Reasoning Steps for Efficient Automated Process Annotation
- (Ding et al., 20 Sep 2025) SCAN: Self-Denoising Monte Carlo Annotation for Robust Process Reward Learning
- (Zhao et al., 1 Apr 2025) GenPRM: Scaling Test-Time Compute of Process Reward Models via Generative Reasoning
- (Chen et al., 27 Apr 2025) SPC: Evolving Self-Play Critic via Adversarial Games for LLM Reasoning
- (Yuan, 20 Apr 2026) Semantic Step Prediction: Multi-Step Latent Forecasting in LLM Reasoning Trajectories via Step Sampling