
ProcessBench: Benchmark for Math Reasoning

Updated 30 May 2026
  • ProcessBench is a process-level benchmark that evaluates LLMs' ability to detect and localize errors in multi-step mathematical reasoning with expert-annotated solution traces.
  • It comprises 3,400 test cases drawn from four sources (GSM8K, MATH, OlympiadBench, and Omni-MATH), each with stepwise annotations marking the first occurrence of an error.
  • The benchmark facilitates the development of process reward and critic models by employing metrics such as error localization accuracy and F1 score, enhancing scalable oversight and automated correction of LLM reasoning.

ProcessBench is a process-level benchmark designed to evaluate the ability of models, particularly LLMs, to detect and localize errors in multi-step mathematical reasoning traces. Each benchmark instance consists of a math problem, a proposed chain-of-thought (CoT) solution decomposed into ordered steps, and an expert-annotated index indicating the earliest occurrence of an error or confirmation that all steps are correct. The benchmark focuses on competition- and Olympiad-level math problems, offering a challenging and fine-grained test for error localization mechanisms critical for scalable oversight, reward modeling, and automated correction of LLM reasoning (Zheng et al., 2024, Zhong et al., 16 Feb 2025, Rahman et al., 2 Dec 2025).

1. Dataset Structure, Scope, and Task Definition

ProcessBench comprises 3,400 test cases distributed among four canonical mathematical reasoning domains:

  • GSM8K: Grade-school arithmetic word problems (400 cases)
  • MATH: High-school to college-level competition math (1,000)
  • OlympiadBench: Olympiad-style proof challenges (1,000)
  • Omni-MATH: A universal Olympiad-level mix (1,000)

Each test case includes:

  • A problem statement.
  • A step-by-step candidate solution $S = [s_0, s_1, \ldots, s_{n-1}]$, sourced from diverse LLMs, ensuring a broad error profile.
  • An annotation: either the index of the first erroneous step ($t^* \geq 0$) or $-1$ if the solution is flawless.
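The instance layout described above can be sketched as a small record type. This is a minimal illustration, not the benchmark's released schema: the class name `ProcessBenchInstance`, its field names, and the toy problem are all hypothetical.

```python
from dataclasses import dataclass
from typing import List


@dataclass(frozen=True)
class ProcessBenchInstance:
    """Hypothetical container for one test case (field names are assumptions)."""
    problem: str        # the problem statement
    steps: List[str]    # candidate solution, steps s_0 .. s_{n-1}
    label: int          # index t* of the first erroneous step, or -1 if none

    def __post_init__(self) -> None:
        # The label must be -1 (flawless) or a valid 0-based step index.
        if not (-1 <= self.label < len(self.steps)):
            raise ValueError(f"label {self.label} out of range for {len(self.steps)} steps")

    @property
    def is_fully_correct(self) -> bool:
        return self.label == -1


# Toy example (invented for illustration): the second step contains the first error.
example = ProcessBenchInstance(
    problem="Compute 3 * (4 + 5).",
    steps=["4 + 5 = 9", "3 * 9 = 28", "So the answer is 28."],
    label=1,
)
```

Only the first error is stored, so the last step in the toy example carries no label of its own even though it repeats the wrong value.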

Error categories cover arithmetic, algebraic, logical, conceptual, and completeness faults. Solutions with a correct final answer are still inspected for process errors; over 50% of "correct" OlympiadBench/Omni-MATH solutions contain missteps in earlier reasoning (Zheng et al., 2024, Rahman et al., 2 Dec 2025).

2. Annotation Protocol and Statistical Profile

Annotation is performed by teams of doctoral-level mathematicians. Each solution is independently labeled by three experts, with additional reviewers brought in as needed (up to five) to reach a consensus. On the hardest subsets, ~50% of cases require more than three annotators due to labeling ambiguity.

Statistical summary:

  • Each subset is balanced on final-answer correctness.
  • Average steps per solution increase with difficulty: GSM8K (5.2), MATH (6.5), OlympiadBench (8.8), Omni-MATH (8.0).
  • Process errors concentrate early in solutions.
  • Inter-annotator agreement is high for GSM8K/MATH ($\kappa \sim 0.85$), but more difficult problems exhibit greater subjectivity (Zheng et al., 2024, Zhang et al., 26 Mar 2026).

3. Evaluation Methodology and Metrics

The central evaluation task is: Given a problem and solution trace, predict the earliest erroneous step or confirm complete correctness.

  • $\text{Acc}_\text{err}$: Accuracy on examples with an error (proportion correctly localizing the first error).
  • $\text{Acc}_\text{corr}$: Accuracy on fully correct solutions (proportion correctly confirming validity).
  • Harmonic mean $F_1$ score, the primary metric:

$$F_1 = \frac{2 \cdot \text{Acc}_\text{err} \cdot \text{Acc}_\text{corr}}{\text{Acc}_\text{err} + \text{Acc}_\text{corr}}$$
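The three metrics can be computed directly from gold and predicted first-error indices. The sketch below assumes the $-1$ convention for fully correct solutions; the function name `processbench_scores` and the toy labels are illustrative, not part of any official evaluation script.

```python
from typing import Dict, Sequence


def processbench_scores(gold: Sequence[int], pred: Sequence[int]) -> Dict[str, float]:
    """Compute Acc_err, Acc_corr, and their harmonic mean (F1).

    gold/pred hold the first-error index per example, or -1 for "no error".
    """
    if len(gold) != len(pred):
        raise ValueError("gold and pred must have the same length")

    # Split examples by whether the gold trace contains an error.
    err_hits = [g == p for g, p in zip(gold, pred) if g != -1]
    corr_hits = [g == p for g, p in zip(gold, pred) if g == -1]

    acc_err = sum(err_hits) / len(err_hits) if err_hits else 0.0
    acc_corr = sum(corr_hits) / len(corr_hits) if corr_hits else 0.0

    denom = acc_err + acc_corr
    f1 = 2 * acc_err * acc_corr / denom if denom > 0 else 0.0
    return {"acc_err": acc_err, "acc_corr": acc_corr, "f1": f1}


# Toy labels (invented): three erroneous traces, two fully correct ones.
gold = [2, 0, -1, -1, 3]
pred = [2, 1, -1, 0, 3]
scores = processbench_scores(gold, pred)
```

In this toy case two of three errors are localized and one of two correct traces is confirmed, so $F_1 = 2 \cdot \tfrac{2}{3} \cdot \tfrac{1}{2} / (\tfrac{2}{3} + \tfrac{1}{2}) = \tfrac{4}{7}$. Because the mean is harmonic, a model that flags every trace as erroneous (or accepts every trace) scores 0 regardless of its accuracy on the other class.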

4. Model Classes and Benchmarking Results

ProcessBench catalyzes the development of process reward models (PRMs), critic models, and verification architectures.

PRMs are trained to assign correctness probabilities at each step, using either human-annotated labels (e.g., PRM800K), automatic annotations (MC rollouts, compressed or denoised), or weak/unsupervised supervision.
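One plausible way to turn per-step correctness probabilities into a first-error prediction is to report the first step whose score falls below a threshold. This is a sketch under that assumption: the function name and the default threshold of 0.5 are hypothetical, and individual PRM papers may use a different decision rule.

```python
from typing import Sequence


def first_error_from_step_scores(step_scores: Sequence[float], threshold: float = 0.5) -> int:
    """Return the index of the first step scored below `threshold`, or -1 if none.

    step_scores[i] is assumed to be a PRM's probability that step i is correct.
    """
    for index, score in enumerate(step_scores):
        if score < threshold:
            return index
    return -1


# Invented scores: step 2 is the first to drop below the threshold.
predicted = first_error_from_step_scores([0.93, 0.88, 0.31, 0.72])
```

The returned value uses the same convention as the annotations (a 0-based index or $-1$), so it can be compared against the gold label without further conversion.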

Critic models (fine-tuned or prompted LLMs) output the error index using explicit stepwise critique.

Selected results (Zheng et al., 2024, Zhong et al., 16 Feb 2025, Rahman et al., 2 Dec 2025, Xu et al., 20 May 2025):

Model                                  GSM8K   MATH   OlympiadBench   Omni-MATH   Avg. F1
GPT-4o (critic)                        61.9    53.9   48.3            44.6        61.9
Qwen2.5-Math-7B-PRM (System 1)         39.4    52.2   39.4            33.1        -
Math-Shepherd-PRM-7B                   47.9    29.5   24.8            23.8        31.5
PRM800K-fine-tuned (Qwen2.5-Math-7B)   68.2    62.6   50.7            44.3        56.5
Skywork-PRM-7B                         64.1    43.2   16.2            17.9        42.1
SPC (7B, round 2)                      -       -      -               -           77.7
R-PRM-7B-DPO                           -       -      -               -           70.4
SPARK PRM-CoT (14B)                    -       -      -               -           65.7
ActPRM-X (7B, active learning)         82.7    82.0   72.0            67.3        76.0

Note: Some newer results report only the overall average F1 or metric variants (e.g., recall).

Emerging approaches such as self-play (SPC), active learning (ActPRM), compression (SCOPE), denoising (SCAN), unsupervised reward modeling (uPRM), and generative PRMs with code verification (GenPRM) have surpassed early baselines, especially on harder subsets (Chen et al., 27 Apr 2025, Xu et al., 20 May 2025, Xu et al., 20 May 2025, Sun et al., 4 Jun 2025, Rahman et al., 2 Dec 2025, Zhao et al., 1 Apr 2025). Ablations show key error-detection gains from multi-trajectory scaling, step-specific supervision, and explicit rationale/code synthesis.

5. Error Taxonomy, Complexity, and Representative Cases

ProcessBench annotates only the first error because of causal dependence: once a step is incorrect, all downstream steps build on it and can only be judged relative to that earliest mistake.

Error typology (implicit in annotation):

  • Arithmetic: miscalculations, sign errors
  • Algebraic/Manipulation: faulty distribution, substitution, or symbol handling
  • Conceptual: misapplied definitions or theorems
  • Logical: unjustified inferences, missing or spurious case analysis
  • Completeness: omitted justifications, ignored domain constraints

All such mistakes are mapped to a binary “incorrect” label at the step level, with no partial credit.

Difficulty gradient: As problems progress from GSM8K to Omni-MATH, solutions have more steps, higher process-error rates (e.g., 3.5% process errors on correct GSM8K answers vs. 51.8% on Omni-MATH), and annotation becomes more ambiguous (Zheng et al., 2024).

Representative case:

  • If a compound interest calculation miscomputes $(1.0175)^{20}$ as 1.41907 (true value $\approx 1.4148$), the first error is flagged at the corresponding step, and all subsequent steps that depend on this misstep are not independently scored (Rahman et al., 2 Dec 2025).
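The arithmetic in this case is easy to check numerically. The snippet below recomputes the power and compares it with the claimed value; the helper name and the relative tolerance of 1e-4 are arbitrary choices made for illustration.

```python
import math


def step_value_matches(claimed: float, actual: float, rel_tol: float = 1e-4) -> bool:
    """Return True when a claimed intermediate value agrees with the recomputed one."""
    return math.isclose(claimed, actual, rel_tol=rel_tol)


actual_growth = 1.0175 ** 20   # twenty compounding periods at 1.75% each
claimed_growth = 1.41907       # value stated in the erroneous step

claimed_is_correct = step_value_matches(claimed_growth, actual_growth)
rounded_is_correct = step_value_matches(1.4148, actual_growth)
```

Here `claimed_is_correct` is false while the rounded reference value 1.4148 passes, which is why the step containing 1.41907 receives the first-error label.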

6. Impact, Limitations, and Future Directions

ProcessBench is the dominant testbed for evaluating fine-grained, step-level reasoning verification in LLMs.

Limitations:

  • Annotation subjectivity and noise persist on the most advanced problems, even among experts.
  • The existing protocol requires structured, stepwise solutions not always matched by real student work or unconstrained LLM outputs.
  • Evaluation is single-trace and does not reward finding multiple error types in a single pass.

Future research directions include expanding the domain scope (e.g., to code, logic, or multimodal reasoning), integrating more nuanced error taxonomies, automating or denoising annotation pipelines (e.g., via compression, MC, or self-play), and studying latent-trajectory geometry in multi-step reasoning (Yuan, 20 Apr 2026, Gadetsky et al., 11 May 2026, Gao et al., 14 Apr 2026).

ProcessBench differs from other multi-step reasoning benchmarks in its explicit focus on process-level supervision and error localization, rather than only final-answer verification or rigid procedural imitation (cf. ProcBench for step-following (Fujisawa et al., 2024)). Its design principles (expert verification, multi-domain scope, and step-index labels) provide granular diagnostic signals that inform the design and hardening of both discriminative and generative verification architectures.
