PathFinder-PRM: Hierarchical Error Reward Model
- PathFinder-PRM is a hierarchical model that explicitly detects math and consistency errors at each reasoning step to improve error localization.
- It uses a two-stage inference process where error detection informs scalar reward estimation for guiding solution generation.
- Empirical results show state-of-the-art performance with increased data efficiency and enhanced reward-guided search in math reasoning tasks.
PathFinder-PRM is a hierarchical, error-aware process reward model designed to enhance fine-grained error detection and reward estimation in multi-step mathematical reasoning tasks. Unlike traditional Process Reward Models (PRMs), which evaluate each intermediate reasoning step with a single correctness score, PathFinder-PRM explicitly diagnoses math and consistency errors at each step before integrating these signals to compute a scalar reward. This architecture yields improved error localization, data efficiency, and end-to-end mathematical solution quality, as evidenced by state-of-the-art results on benchmarks such as PRMBench and ProcessBench (Pala et al., 26 May 2025).
1. Hierarchical Error-Aware Model Architecture
At the core of PathFinder-PRM is a sequential two-subtask decomposition per reasoning step $s_t$ in a solution chain $S = (s_1, \ldots, s_T)$ for a math problem $Q$:
- Subtask 1 (Error Detection): Predicts two binary labels, $\hat{y}_t^{\text{math}}$ indicating whether $s_t$ is math-error-free and $\hat{y}_t^{\text{cons}}$ indicating whether $s_t$ is consistency-error-free. Both are conditioned on the problem and the partial solution: $(\hat{y}_t^{\text{math}}, \hat{y}_t^{\text{cons}}) = f_\theta(Q, S_{1:t-1}, s_t)$.
- Subtask 2 (Reward Estimation): Predicts a scalar reward $R_t \in [0, 1]$ for $s_t$, conditioned explicitly on the predicted error labels: $R_t = f_\theta(Q, S_{1:t-1}, s_t, \hat{y}_t^{\text{math}}, \hat{y}_t^{\text{cons}})$.
Stepwise error typing is performed by explicitly discriminating between:
- Math errors: arithmetic mistakes, misapplied formulas, or invalid algebraic manipulations.
- Consistency errors: logical contradictions with previous steps or with problem constraints.
Both subtasks are implemented using a shared LLM backbone (Qwen2.5-Math-7B-Instruct) with special tokens to represent positive and negative labels. Inference involves two masked forward passes to avert autoregressive leakage between subtasks.
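As an illustration of how the hierarchical labels attach to a step, the sketch below serializes one annotated step into a training string with Math, Consistency, and Correctness slots filled by positive/negative label tokens. The template and the token strings `<+>` / `<->` are assumptions for illustration only; the paper specifies that special tokens encode the two label values, not their exact form.

```python
# Hypothetical serialization of one annotated step for hierarchical supervision.
# The "Math/Consistency/Correctness" template and the label tokens <+> / <-> are
# assumptions, not the paper's exact format.
POS, NEG = "<+>", "<->"

def serialize_step(question: str, prior_steps: str, step: str,
                   math_ok: bool, cons_ok: bool, correct: bool) -> str:
    label = lambda ok: POS if ok else NEG
    return (
        f"{question}\n{prior_steps}\n{step}\n"
        f"Math: {label(math_ok)} Consistency: {label(cons_ok)} "
        f"Correctness: {label(correct)}"
    )

# Example: a step that is math-error-free but contradicts an earlier step.
print(serialize_step("Q: ...", "Step 1: ...", "Step 2: ...",
                     math_ok=True, cons_ok=False, correct=False))
```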
2. Three-Dimensional Supervision and Dataset Construction
Each reasoning step $s_t$ is annotated with a three-dimensional label vector $(y_t^{\text{math}}, y_t^{\text{cons}}, y_t^{\text{corr}})$, where:
- $y_t^{\text{math}}$: math-error-free ($1$) or not ($0$)
- $y_t^{\text{cons}}$: consistency-error-free ($1$) or not ($0$)
- $y_t^{\text{corr}}$: both error-free and optimally helpful ($1$), otherwise $0$
Label sources include:
- PRM800K: Human-annotated “correctness” labels ($-1$, $0$, $+1$), mapped to three-dimensional labels. For steps rated $-1$ (“erroneous”), the fine-grained error type is inferred using a distilled “judge” model (DeepSeek-R1-Distill-Qwen-32B) and prompt-based relabeling.
- RLHFlow Mistral PRM: Automated Monte Carlo labels, with 55K sampled trajectories annotated using the same judge model and filtered for consistency with base MC labels.
The final assembled dataset consists of 345K trajectories from PRM800K and 55K from RLHFlow Mistral, totaling approximately 400K reasoning trajectories.
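To make the construction concrete, the following sketch shows one plausible mapping from a PRM800K step rating (plus, for erroneous steps, the judge model's error typing) to the three-dimensional label vector. The exact mapping rules are an assumption consistent with the label definitions above, not the paper's published procedure.

```python
# Hypothetical mapping from PRM800K ratings (-1, 0, +1), plus judge output for
# erroneous steps, to (math-error-free, consistency-error-free, correct) labels.
# The precise rules are an assumption based on the label definitions above.
def to_three_dim_label(prm800k_rating: int, judge_error_type: str | None = None):
    if prm800k_rating == 1:       # correct and optimally helpful step
        return (1, 1, 1)
    if prm800k_rating == 0:       # valid but not optimally helpful
        return (1, 1, 0)
    # rating == -1: erroneous; the judge model decides which error type applies
    math_ok = 0 if judge_error_type in ("math", "both") else 1
    cons_ok = 0 if judge_error_type in ("consistency", "both") else 1
    return (math_ok, cons_ok, 0)
```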
3. Model Objective Functions and Training Protocol
Letting $\hat{y}_t^{\text{math}}$, $\hat{y}_t^{\text{cons}}$, and $R_t$ denote the model's predictions for step $s_t$, and $y_t^{\text{math}}$, $y_t^{\text{cons}}$, $y_t^{\text{corr}}$ the corresponding gold labels, the losses are defined as follows:
- Error-type Detection Loss (per step): $\mathcal{L}_{\text{err}} = \mathrm{CE}(\hat{y}_t^{\text{math}}, y_t^{\text{math}}) + \mathrm{CE}(\hat{y}_t^{\text{cons}}, y_t^{\text{cons}})$
- Reward Estimation Loss (per step): $\mathcal{L}_{\text{rew}} = \mathrm{CE}(R_t, y_t^{\text{corr}})$
- Total Loss: $\mathcal{L} = \lambda_{\text{err}}\,\mathcal{L}_{\text{err}} + \lambda_{\text{rew}}\,\mathcal{L}_{\text{rew}}$, with equal weighting ($\lambda_{\text{err}} = \lambda_{\text{rew}} = 1$).
- PRMScore (evaluation metric): The average of positive and negative F1-scores across multiple fine-grained error categories. For $C$ denoting the set of error categories and $F1_c^{+}$, $F1_c^{-}$ the F1-scores for presence and absence of category $c$, $\text{PRMScore} = \frac{1}{|C|} \sum_{c \in C} \frac{F1_c^{+} + F1_c^{-}}{2}$. A sketch of both the loss and this metric follows.
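Below is a minimal sketch of these objectives and of the PRMScore computation. It assumes the label-slot logits have already been gathered and restricted to the two label tokens; the helper names, slot-extraction mechanics, and category dictionaries are illustrative rather than the paper's implementation.

```python
import torch
import torch.nn.functional as F
from sklearn.metrics import f1_score

# --- Training objective -----------------------------------------------------
# Hypothetical per-step loss, assuming logits were gathered at the Math,
# Consistency, and Correctness label slots and restricted to the two label
# tokens (column 0 = positive token, column 1 = negative token).
def pathfinder_step_loss(math_logits, cons_logits, corr_logits,
                         y_math, y_cons, y_corr,
                         lambda_err=1.0, lambda_rew=1.0):
    # Map a binary "error-free / correct" label to the index of its label token.
    to_target = lambda y: torch.tensor([0 if y == 1 else 1])
    loss_err = (F.cross_entropy(math_logits, to_target(y_math)) +
                F.cross_entropy(cons_logits, to_target(y_cons)))   # L_err
    loss_rew = F.cross_entropy(corr_logits, to_target(y_corr))     # L_rew
    return lambda_err * loss_err + lambda_rew * loss_rew           # L_total

# --- Evaluation metric ------------------------------------------------------
# PRMScore as described above: for each error category, average the F1 for the
# positive class and the F1 for the negative class, then average over categories.
def prm_score(golds_by_category: dict, preds_by_category: dict) -> float:
    per_category = []
    for cat, golds in golds_by_category.items():
        preds = preds_by_category[cat]
        f1_pos = f1_score(golds, preds, pos_label=1)
        f1_neg = f1_score(golds, preds, pos_label=0)
        per_category.append(0.5 * (f1_pos + f1_neg))
    # Reported on a 0-100 scale, matching the results table below.
    return 100.0 * sum(per_category) / len(per_category)
```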
4. Inference and Reward-Guided Search
Two-Pass Stepwise Inference
At inference, for each step $s_t$:
- First pass: Predict $\hat{y}_t^{\text{math}}$ and $\hat{y}_t^{\text{cons}}$ with both label slots masked to prevent leakage.
- Second pass: Predict $R_t$, conditioned on the filled-in Math and Consistency predictions.
Pseudocode specifying this protocol:
```
Algorithm 1: Two-Pass Inference for PathFinder-PRM
Input: P = (Q, S_{1:t-1}, s_t), model θ
1. Input_1 ← P ∥ "Math: <mask> Consistency: <mask>"
2. logits_1 ← θ.forward(Input_1)
3. pred_math    ← arg max logits_1 at Math-mask
   pred_consist ← arg max logits_1 at Consistency-mask
4. Input_2 ← P ∥ "Math: pred_math Consistency: pred_consist Correctness: <mask>"
5. logits_2 ← θ.forward(Input_2)
6. reward_prob ← softmax(logits_2) at Correctness-mask
7. R_t ← reward_prob(+ token)
Output: (pred_math, pred_consist, R_t)
```
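The Python sketch below makes the two-pass protocol concrete using the Hugging Face `transformers` API. The mask and label token strings, the prompt template, and the logit-extraction convention are assumptions for illustration; the released checkpoint may use different special tokens and formatting.

```python
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

# Assumed names: we load the backbone here; a PathFinder-PRM fine-tune with the
# special tokens "<mask>", "<+>", "<->" added to the vocabulary is assumed.
MODEL_NAME = "Qwen/Qwen2.5-Math-7B-Instruct"
tokenizer = AutoTokenizer.from_pretrained(MODEL_NAME)
model = AutoModelForCausalLM.from_pretrained(MODEL_NAME, torch_dtype=torch.bfloat16)
model.eval()

MASK, POS, NEG = "<mask>", "<+>", "<->"   # assumed special label tokens
MASK_ID = tokenizer.convert_tokens_to_ids(MASK)
POS_ID = tokenizer.convert_tokens_to_ids(POS)
NEG_ID = tokenizer.convert_tokens_to_ids(NEG)

def _slot_logits(prompt: str):
    """Next-token logits at each <mask> slot (causal-LM convention: the
    prediction for position i is read from the logits at position i-1)."""
    ids = tokenizer(prompt, return_tensors="pt").input_ids
    with torch.no_grad():
        logits = model(ids).logits[0]
    positions = (ids[0] == MASK_ID).nonzero(as_tuple=True)[0]
    return [logits[p - 1] for p in positions]

def score_step(question: str, prior_steps: str, step: str):
    """Two masked forward passes: error labels first, then the reward R_t."""
    prefix = f"{question}\n{prior_steps}\n{step}\n"
    # Pass 1: both error-label slots masked, so neither prediction leaks into the other.
    l_math, l_cons = _slot_logits(prefix + f"Math: {MASK} Consistency: {MASK}")
    pred_math = POS if l_math[POS_ID] > l_math[NEG_ID] else NEG
    pred_cons = POS if l_cons[POS_ID] > l_cons[NEG_ID] else NEG
    # Pass 2: condition on the predicted labels; read the Correctness slot.
    (l_corr,) = _slot_logits(
        prefix + f"Math: {pred_math} Consistency: {pred_cons} Correctness: {MASK}"
    )
    probs = torch.softmax(l_corr[[POS_ID, NEG_ID]], dim=-1)
    return pred_math, pred_cons, probs[0].item()  # R_t = P(positive label token)
```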
Reward-Guided Greedy Search
PathFinder-PRM can be used for reward-guided solution generation as follows:
- At each step, sample candidate next steps from the policy model.
- Each candidate is scored with the reward $R_t$ from PathFinder-PRM.
- The candidate with the highest $R_t$ is selected to extend the partial solution.
- This selection is repeated step by step until a complete solution is generated.
- The process is repeated $N$ times (e.g., $N = 8$) to compute pass@$N$-style ("prm@8") metrics; a sketch of the greedy loop follows.
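Below is a minimal sketch of the reward-guided greedy loop, assuming a step-level sampler for the policy model (`sample_candidate_steps`) and the `score_step` routine sketched earlier; both interfaces and the stopping heuristic are hypothetical.

```python
# Hypothetical reward-guided greedy search driven by PathFinder-PRM scores.
# `sample_candidate_steps` stands in for the policy model's step-level sampler;
# `score_step` is the PRM scoring routine sketched above.
def reward_guided_greedy_search(question, sample_candidate_steps, score_step,
                                n_candidates=8, max_steps=32):
    """Greedily extend a solution, keeping the candidate step with the highest R_t."""
    solution_steps = []
    for _ in range(max_steps):
        prior = "\n".join(solution_steps)
        candidates = sample_candidate_steps(question, prior, n=n_candidates)
        # Score each candidate step with the PRM and keep the best one.
        scored = [(score_step(question, prior, c)[2], c) for c in candidates]
        best_reward, best_step = max(scored, key=lambda x: x[0])
        solution_steps.append(best_step)
        if "\\boxed" in best_step or "Final answer" in best_step:  # assumed stop rule
            break
    return solution_steps
```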
5. Empirical Results
PathFinder-PRM yields state-of-the-art fine-grained error detection and reward-guided reasoning with high data efficiency:
| Model | Dataset Size | PRMScore | prm@8 |
|---|---|---|---|
| Qwen2.5-Math-PRM-7B (prior SOTA) | ~1.5M | 65.5 | 46.8 |
| PathFinder-PRM#1-7B | ~0.4M | 67.7 | 48.3 |
| PathFinder-PRM#1-7B-PRM800K | ~0.345M | 65.0 | 46.9 |
- On PRMBench, PathFinder-PRM#1-7B achieves PRMScore = 67.7, an improvement of +2.2 over the prior best, with approximately one-third the data.
- On ProcessBench (first-error detection F1), PathFinder-PRM#1-7B trained on the mixed dataset yields an average F1 of 69.5, approaching the score of the unsupervised prior model while using roughly one-third the training data.
- Reward-guided greedy search (“prm@8”) reaches 48.3, +1.5 points over the strongest baseline.
Ablation studies indicate that removing either the separate error categories or the two-stage subtask design each causes a drop of up to 2.8 PRMScore or 2.5 prm@8 points, evidencing the benefit of hierarchical supervision and error decoupling.
6. Architectural Insights, Limitations, and Future Directions
The decoupled error detection framework yields several advantages:
- Richer supervision: The use of separate detection heads for math and consistency errors prior to reward estimation provides a more informative training signal, yielding stronger representations of diverse reasoning failure modes.
- Reduced signal confusion: Disentangling error detection from reward estimation clarifies each task; conflating them would force the model to address two objectives with a single label.
However, limitations remain:
- Model scale: Experiments are limited to 7B-parameter backbones. Scaling up may further improve performance.
- Dataset diversity: Incremental improvements saturate with further RLHFlow Mistral traces beyond 50K. Enhanced data curation could provide additional gains.
- Error taxonomy: Only math and consistency errors are distinguished. More granular or hierarchical error categories—such as “redundancy” or “circular logic”—remain unexplored.
- Cross-domain generalization: The methodological advances have yet to be tested on other domains such as code generation, logic puzzles, or scientific explanations.
A plausible implication is that hierarchical, error-aware reward modeling may generalize to monitoring structured reasoning in broader domains, with fine-grained supervision proving crucial for robust, interpretable process feedback.
7. Context and Significance
PathFinder-PRM’s principal innovation lies in its explicit two-stage, error-type-supervised decomposition. Compared with previous monolithic PRM architectures, this approach demonstrates greater data efficiency and more precise error localization, establishing benchmarks for both fine-grained error detection and stepwise reward-guided solution generation with substantially less supervision (Pala et al., 26 May 2025). It suggests an effective paradigm for aligning LLM-generated multi-step reasoning with human verifiability and reliability, particularly where differentiating error modalities or maximizing data utility is critical.