
PathFinder-PRM: Hierarchical Error Reward Model

Updated 11 December 2025
  • PathFinder-PRM is a hierarchical model that explicitly detects math and consistency errors at each reasoning step to improve error localization.
  • It uses a two-stage inference process where error detection informs scalar reward estimation for guiding solution generation.
  • Empirical results show state-of-the-art performance with increased data efficiency and enhanced reward-guided search in math reasoning tasks.

PathFinder-PRM is a hierarchical, error-aware process reward model designed to enhance fine-grained error detection and reward estimation in multi-step mathematical reasoning tasks. Unlike traditional Process Reward Models (PRMs), which evaluate each intermediate reasoning step with a single correctness score, PathFinder-PRM explicitly diagnoses math and consistency errors at each step before integrating these signals to compute a scalar reward. This architecture yields improved error localization, data efficiency, and end-to-end mathematical solution quality, as evidenced by state-of-the-art results on benchmarks such as PRMBench and ProcessBench (Pala et al., 26 May 2025).

1. Hierarchical Error-Aware Model Architecture

At the core of PathFinder-PRM is a sequential two-subtask decomposition per reasoning step $s_t$ in a solution chain $S = \{s_1, s_2, \dots, s_T\}$ for a math problem $Q$:

  • Subtask 1 (Error Detection): Predicts two binary labels: $M_t \in \{0,1\}$ indicating whether $s_t$ is math-error-free, and $C_t \in \{0,1\}$ indicating whether $s_t$ is consistency-error-free. These are conditioned on $(Q, S_{1:t-1}, s_t)$ and given by

$$(M_t, C_t) = \text{PRM}_{\text{error}}(Q, S_{1:t-1}, s_t)$$

  • Subtask 2 (Reward Estimation): Predicts a scalar reward $R_t \in [0,1]$ for $s_t$, conditioned explicitly on $(Q, S_{1:t-1}, s_t, M_t, C_t)$:

$$R_t = \text{PRM}_{\text{reward}}(Q, S_{1:t-1}, s_t, M_t, C_t)$$

Stepwise error typing is performed by explicitly discriminating between:

  • Math errors: arithmetic mistakes, misapplied formulas, or invalid algebraic manipulations.
  • Consistency errors: logical contradictions with previous steps or with problem constraints.

Both subtasks are implemented with a shared LLM backbone (Qwen2.5-Math-7B-Instruct) using special tokens to represent positive and negative labels. Inference involves two masked forward passes to prevent label leakage between the two subtasks during autoregressive decoding.
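
To make the label-token interface concrete, a single annotated step might be serialized as below; the special token names (<+>, <->) and field layout are illustrative assumptions, not the released training format.

example_step = {
    # Hypothetical serialization of one annotated reasoning step.
    "input": "Q: Solve 2x + 3 = 11.\nStep 1: Subtract 3 from both sides: 2x = 8.",
    "pass_1_targets": "Math: <+> Consistency: <+>",   # Subtask 1: error-detection labels
    "pass_2_targets": "Correctness: <+>",             # Subtask 2: scalar reward label
}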

2. Three-Dimensional Supervision and Dataset Construction

Each reasoning step is annotated with a three-dimensional label vector $c_t = (c^{\text{math}}_t, c^{\text{consist}}_t, c^{\text{correct}}_t) \in \{0,1\}^3$, where:

  • $c^{\text{math}}_t$: math-error-free ($1$) or not ($0$)
  • $c^{\text{consist}}_t$: consistency-error-free ($1$) or not ($0$)
  • $c^{\text{correct}}_t$: both error-free and optimally helpful ($1$), otherwise $0$

Label sources include:

  • PRM800K: Human-annotated “correctness” labels ($l_t \in \{-1, 0, 1\}$), mapped to the three-dimensional labels. For $l_t = -1$ (“erroneous”), the fine-grained error type is inferred using a distilled “judge” model (DeepSeek-R1-Distill-Qwen-32B) and prompt-based relabeling.
  • RLHFlow Mistral PRM: Automated Monte Carlo $\pm$ labels, with 55K sampled trajectories annotated using the same judge model and filtered for consistency with the base MC labels.

The final assembled dataset consists of ~345K trajectories from PRM800K and ~55K from RLHFlow Mistral, totaling approximately 400K reasoning trajectories.
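
As an illustration of this relabeling, the sketch below shows one plausible mapping from PRM800K's per-step label $l_t$ to the three-dimensional supervision; the handling of $l_t = 0$ and the judge-model interface are assumptions rather than the exact pipeline.

def to_three_dim(l_t, judge=None, step_context=None):
    # Map a PRM800K step label l_t in {-1, 0, 1} to (math, consist, correct).
    if l_t == 1:        # correct and helpful step
        return {"math": 1, "consist": 1, "correct": 1}
    if l_t == 0:        # assumed: error-free but not optimally helpful
        return {"math": 1, "consist": 1, "correct": 0}
    # l_t == -1: erroneous; a judge model (DeepSeek-R1-Distill-Qwen-32B in the paper)
    # decides which error type(s) apply. Here `judge` is a stub interface.
    verdict = judge(step_context) if judge else {"math": 0, "consist": 0}
    return {"math": verdict["math"], "consist": verdict["consist"], "correct": 0}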

3. Model Objective Functions and Training Protocol

Letting $p^{\text{math}}_t = \Pr[M_t = 1 \mid \cdots]$, $p^{\text{consist}}_t = \Pr[C_t = 1 \mid \cdots]$, and $p^r_t = \Pr[R_t = 1 \mid \cdots, M_t, C_t]$, the losses are defined as follows:

  • Error-type Detection Loss (per step):

$$\mathcal{L}_{\text{error}, t} = -\left[ c^{\text{math}}_t \log p^{\text{math}}_t + (1 - c^{\text{math}}_t) \log (1 - p^{\text{math}}_t) \right] - \left[ c^{\text{consist}}_t \log p^{\text{consist}}_t + (1 - c^{\text{consist}}_t) \log (1 - p^{\text{consist}}_t) \right]$$

  • Reward Estimation Loss (per step):

$$\mathcal{L}_{\text{reward}, t} = -\left[ c^{\text{correct}}_t \log p^r_t + (1 - c^{\text{correct}}_t) \log (1 - p^r_t) \right]$$

  • Total Loss:

$$\mathcal{L} = \lambda_1 \sum_t \mathcal{L}_{\text{error}, t} + \lambda_2 \sum_t \mathcal{L}_{\text{reward}, t}$$

with equal weighting ($\lambda_1 = \lambda_2 = 1$).
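
A minimal PyTorch sketch of the combined objective, assuming the two forward passes already yield per-step probabilities $p^{\text{math}}_t$, $p^{\text{consist}}_t$, and $p^r_t$ as float tensors (this interface is an assumption for illustration):

import torch.nn.functional as F

def pathfinder_loss(p_math, p_consist, p_reward, c_math, c_consist, c_correct,
                    lam1=1.0, lam2=1.0):
    # All arguments are float tensors of shape (num_steps,); labels c_* are 0/1.
    # Per-step binary cross-entropy terms, summed over steps t as in the total loss.
    loss_error = (F.binary_cross_entropy(p_math, c_math, reduction="sum")
                  + F.binary_cross_entropy(p_consist, c_consist, reduction="sum"))
    loss_reward = F.binary_cross_entropy(p_reward, c_correct, reduction="sum")
    return lam1 * loss_error + lam2 * loss_reward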

  • PRMScore (evaluation metric): The average of positive and negative F1-scores across multiple fine-grained error categories. For $D$ denoting the set of error categories and $F1^+_d$, $F1^-_d$ the F1-scores for the presence/absence of $d$,

$$\text{PRMScore} = \frac{1}{|D|} \sum_{d \in D} \frac{F1^+_d + F1^-_d}{2}$$
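
A small sketch of this metric using scikit-learn's f1_score, assuming binary ground-truth and predicted labels are available per category (the data layout is an assumption):

from sklearn.metrics import f1_score

def prm_score(per_category):
    # per_category: dict mapping each error category d to (y_true, y_pred) binary lists.
    scores = []
    for y_true, y_pred in per_category.values():
        f1_pos = f1_score(y_true, y_pred, pos_label=1)   # F1 treating label 1 as positive
        f1_neg = f1_score(y_true, y_pred, pos_label=0)   # F1 treating label 0 as positive
        scores.append((f1_pos + f1_neg) / 2)
    return sum(scores) / len(scores)                     # average over categories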

4. Two-Pass Stepwise Inference and Reward-Guided Generation

At inference, per step $s_t$:

  1. First pass: Predict the Math and Consistency labels, with both masked to prevent leakage.
  2. Second pass: Predict the Correctness label, conditioned on the filled-in Math and Consistency predictions.

Pseudocode specifying this protocol:

Algorithm 1: Two-Pass Inference for PathFinder-PRM
Input: P = (Q, S_{1:t-1}, s_t), model θ
1. Input_1 ← P ∥ "Math: <mask> Consistency: <mask>"
2. logits_1 ← θ.forward(Input_1)
3. pred_math ← arg max logits_1 at Math-mask
   pred_consist ← arg max logits_1 at Consistency-mask
4. Input_2 ← P ∥ "Math: pred_math Consistency: pred_consist Correctness: <mask>"
5. logits_2 ← θ.forward(Input_2)
6. reward_prob ← softmax(logits_2) at Correctness-mask
7. R_t ← reward_prob(+ token)
Output: (pred_math, pred_consist, R_t)
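
For concreteness, the following is a minimal PyTorch sketch of the same two-pass protocol. It assumes a Hugging Face-style causal LM whose vocabulary has been extended with <mask>, <+>, and <-> label tokens, and that each masked label is read from the logits one position before the mask; these details are assumptions, not the released implementation.

import torch

def two_pass_inference(model, tok, prompt, mask="<mask>", pos="<+>", neg="<->"):
    # prompt is the concatenation P = (Q, S_{1:t-1}, s_t) as plain text.
    mask_id = tok.convert_tokens_to_ids(mask)
    pos_id, neg_id = tok.convert_tokens_to_ids([pos, neg])

    def label_logits(text):
        ids = tok(text, return_tensors="pt").input_ids
        with torch.no_grad():
            logits = model(ids).logits[0]                       # (seq_len, vocab)
        # In a causal LM, the token filling position i is predicted at position i-1.
        positions = (ids[0] == mask_id).nonzero(as_tuple=True)[0] - 1
        return logits[positions][:, [pos_id, neg_id]]           # (n_masks, 2)

    # Pass 1: both error labels masked; predict Math and Consistency jointly.
    l1 = label_logits(f"{prompt} Math: {mask} Consistency: {mask}")
    pred_math = pos if l1[0, 0] > l1[0, 1] else neg
    pred_consist = pos if l1[1, 0] > l1[1, 1] else neg

    # Pass 2: condition on the filled-in error labels; predict Correctness.
    l2 = label_logits(f"{prompt} Math: {pred_math} Consistency: {pred_consist} "
                      f"Correctness: {mask}")
    reward = torch.softmax(l2[0], dim=-1)[0].item()             # P(<+>) as scalar R_t
    return pred_math, pred_consist, reward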

PathFinder-PRM can be used for reward-guided solution generation as follows (a minimal sketch follows the list):

  1. At each step, sample $K$ candidate next steps from the policy model.
  2. Each candidate is scored using $R_t$ from PathFinder-PRM.
  3. The candidate with the highest $R_t$ is selected to extend the partial solution.
  4. Steps are repeated iteratively until a complete solution is generated.
  5. This process is repeated $N$ times (e.g., $N = 8$) to compute pass@N (“prm@8”) metrics.
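
A hedged sketch of this search loop is given below; propose_steps (the policy model's step sampler) and score_step (returning PathFinder-PRM's $R_t$ for a candidate step) are assumed interfaces, and the stopping condition is purely illustrative.

def reward_guided_search(question, propose_steps, score_step, k=8, max_steps=20):
    # Greedy step-level search guided by the PRM's scalar reward R_t.
    steps = []
    for _ in range(max_steps):
        candidates = propose_steps(question, steps, k=k)                 # K candidate next steps
        rewards = [score_step(question, steps, cand) for cand in candidates]
        best = candidates[rewards.index(max(rewards))]                   # keep the highest-reward step
        steps.append(best)
        if "final answer" in best.lower():                               # illustrative stop condition
            break
    return steps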

5. Empirical Results

PathFinder-PRM yields state-of-the-art fine-grained error detection and reward-guided reasoning with high data efficiency:

Model                               Dataset Size   PRMScore   prm@8
Qwen2.5-Math-7B-PRM (prior SOTA)    ~1.5M          65.5       46.8
PathFinder-PRM#1-7B                 ~0.4M          67.7       48.3
PathFinder-PRM#1-7B-PRM800K         ~0.345M        65.0       46.9
  • On PRMBench, PathFinder-PRM#1-7B achieves PRMScore = 67.7, an improvement of +2.2 over the prior best, with approximately one-third the data.
  • On ProcessBench (first-error detection F1), PathFinder-PRM#1-7B yields Avg. F1 = 69.5 (mixed-data variant), approaching the unsupervised model’s score with roughly one-third of the training data.
  • Reward-guided greedy search (“prm@8”) reaches 48.3, +1.5 points over the strongest baseline.

Ablation studies indicate that removing either the separate error categories or the two-stage subtask design causes a drop of up to 2.8 PRMScore points or 2.5 prm@8 points, demonstrating the benefit of hierarchical supervision and error decoupling.

6. Architectural Insights, Limitations, and Future Directions

The decoupled error detection framework yields several advantages:

  • Richer supervision: The use of separate detection heads for math and consistency errors prior to reward estimation provides a more informative training signal, yielding stronger representations of diverse reasoning failure modes.
  • Reduced signal confusion: Disentangling error detection from reward estimation clarifies each task; conflating them would force the model to address two objectives with a single label.

However, limitations remain:

  • Model scale: Experiments are limited to 7B-parameter backbones. Scaling up may further improve performance.
  • Dataset diversity: Incremental improvements saturate with further RLHFlow Mistral traces beyond 50K. Enhanced data curation could provide additional gains.
  • Error taxonomy: Only math and consistency errors are distinguished. More granular or hierarchical error categories—such as “redundancy” or “circular logic”—remain unexplored.
  • Cross-domain generalization: The methodological advances have yet to be tested on other domains such as code generation, logic puzzles, or scientific explanations.

A plausible implication is that hierarchical, error-aware reward modeling may generalize to monitoring structured reasoning in broader domains, with fine-grained supervision proving crucial for robust, interpretable process feedback.

7. Context and Significance

PathFinder-PRM’s principal innovation lies in its explicit two-stage, error-type-supervised decomposition. This approach demonstrated higher data efficiency and error localization compared to previous monolithic PRM architectures, establishing benchmarks for both fine-grained error detection and stepwise reward-guided solution generation with substantially less supervision (Pala et al., 26 May 2025). It suggests an effective paradigm for aligning LLM-generated multi-step reasoning with human verifiability and reliability, particularly where differentiating error modalities or achieving maximal data utility is critical.
