Process Reward Models (PRMs)

Updated 24 June 2025

Process reward models (PRMs) are specialized models that assign feedback to each intermediate step of a multi-step reasoning task performed by an LLM. Unlike outcome reward models (ORMs), which assign a reward only to the final result, PRMs offer granular supervision by predicting the likelihood that a partial reasoning trajectory will yield a correct final answer. Because PRMs guide inference-time scaling algorithms such as best-of-N sampling and reward-guided search, the accuracy and reliability of their predicted probabilities are crucial. Yet state-of-the-art PRMs have been observed to systematically overestimate the probability of success, resulting in inefficient compute allocation and unreliable uncertainty estimates, particularly on hard or out-of-distribution problems.

1. Quantile Regression-Based Calibration for PRMs

The paper introduces a method for calibrating PRMs using quantile regression. The goal is to ensure that the predicted probability of success for each reasoning prefix (i.e., partial solution) better aligns with the true empirical likelihood of eventual task success.

  • Calibration Procedure:
    • For each query and partial reasoning trajectory, the empirical probability of reaching a correct answer, $\tilde{p}^{(i,t)}$, is estimated by Monte Carlo rollouts, i.e., by sampling continuations and measuring the fraction that yield the correct answer (see the rollout sketch after this list).
    • The PRM is then fine-tuned using quantile regression to output multiple quantiles (e.g., 10th, 50th, 90th) of the predicted success probability.
    • The quantile regression loss is:

    $$\mathsf{wQL}(\hat{r}, \tilde{p}) = \frac{1}{N_q} \sum_{n=1}^{N_q} \left[ \beta_n \cdot \max\left(0, \tilde{p} - \hat{r}^{(\beta_n)}\right) + (1 - \beta_n) \cdot \max\left(0, \hat{r}^{(\beta_n)} - \tilde{p}\right) \right]$$

    where $\hat{r}^{(\beta_n)}$ is the model's predicted $\beta_n$-quantile of the success probability and $N_q$ is the number of quantile levels (a runnable sketch of this loss also follows the list).

  • Rationale and Significance:

    • Quantile regression allows the calibrated PRM to provide conservative lower bounds (e.g., the $\beta = 0.1$ quantile), which can be used to avoid overestimating the likelihood of success. This is crucial for compute allocation and reliability in downstream applications.
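
The rollout estimator above is simple enough to sketch directly. In the snippet below, `sample_continuation` and `is_correct` are hypothetical stand-ins for the LLM sampler and the answer checker; the paper does not prescribe them at this level of detail.

```python
def mc_success_estimate(prefix, sample_continuation, is_correct, n_rollouts=16):
    """Monte Carlo estimate of p~ for a reasoning prefix: the fraction of
    sampled continuations that reach a correct final answer."""
    hits = sum(int(is_correct(sample_continuation(prefix)))
               for _ in range(n_rollouts))
    return hits / n_rollouts

# Toy stand-ins so the sketch runs end to end (not the paper's actual setup)
import random
toy_sampler = lambda prefix: random.random()  # pretend "continuation"
toy_checker = lambda c: c > 0.4               # pretend answer verification
print(mc_success_estimate("2+2=", toy_sampler, toy_checker))
```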

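The weighted quantile loss itself is the standard pinball loss averaged over the quantile levels. A minimal PyTorch sketch, assuming the PRM exposes one output head per quantile level:

```python
import torch

def weighted_quantile_loss(pred_quantiles, target, betas):
    """wQL: pinball loss averaged over the N_q quantile levels.

    pred_quantiles: (batch, N_q) predicted quantiles r^(beta_n)
    target:         (batch,)     Monte Carlo estimates p~ of success
    betas:          (N_q,)       quantile levels, e.g. [0.1, 0.5, 0.9]
    """
    diff = target.unsqueeze(-1) - pred_quantiles  # p~ - r^(beta_n)
    # beta*max(0, d) + (1 - beta)*max(0, -d) == max(beta*d, (beta - 1)*d)
    return torch.maximum(betas * diff, (betas - 1.0) * diff).mean()

# Example: three quantile heads on a batch of two prefixes
betas = torch.tensor([0.1, 0.5, 0.9])
preds = torch.tensor([[0.2, 0.5, 0.8],
                      [0.1, 0.3, 0.6]])
targets = torch.tensor([0.7, 0.2])
print(weighted_quantile_loss(preds, targets, betas))
```

Collapsing the two hinge terms of the loss into a single `torch.maximum` is an equivalent, vectorized form of the formula above.
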
2. Instance-Adaptive Scaling (IAS) Framework

The paper introduces instance-adaptive scaling: a principled, uncertainty-aware approach to dynamically allocating inference-time compute for each query.

  • Framework Overview:
    • Given a calibrated lower-bound probability of success $p$ for a given reasoning prefix, the minimum number of independent samples needed to achieve a target coverage probability $C$ (i.e., confidence that at least one trajectory succeeds) is:

    $$N_{\mathrm{IAS}}(p, C) = \frac{\log(1 - C)}{\log(1 - p)}$$

    This function is used to allocate compute per instance: queries predicted to be easy (large $p$) require fewer samples; hard queries trigger more extensive search (a small numeric example follows this list).

  • Integration with Search Algorithms:

    • For best-of-N strategies, the number of samples per query is set to $N = \lceil N_{\mathrm{IAS}}(p, C) \rceil$.
    • For beam search and its variants, the IAS mechanism dynamically expands prefixes based on their stepwise lower-bound probabilities, using the corresponding formulas from the paper (e.g., for beam size and prefix selection).
  • Conformal Guarantee:
    • If the lower quantile is properly calibrated, the IAS framework provides a statistical guarantee that the coverage probability is at least $C$ for each query.
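
The allocation rule follows directly from the independence assumption: with per-sample success probability at least $p$, the failure probability of $N$ independent samples is at most $(1-p)^N$. A minimal sketch of the resulting budget computation (the function name and example values are illustrative):

```python
import math

def n_ias(p_lower, coverage):
    """Minimum independent samples so that, if each trial succeeds with
    probability at least p_lower, P(at least one success) >= coverage."""
    if p_lower <= 0.0:
        raise ValueError("need a positive lower bound on success probability")
    if p_lower >= 1.0:
        return 1
    return math.ceil(math.log(1.0 - coverage) / math.log(1.0 - p_lower))

# Easy vs. hard queries at a 95% target coverage
print(n_ias(0.8, 0.95))   # -> 2 samples suffice
print(n_ias(0.05, 0.95))  # -> 59 samples needed
```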

3. Empirical Results and Validation

Experiments are conducted on diverse mathematical reasoning benchmarks, including MATH500 (in-domain) and AIME 24-25 (out-of-distribution), using a variety of LLMs and reward models.

  • Calibration Metrics:
    • Calibration is evaluated using the Brier score, the positive Brier score, expected calibration error (ECE), and average calibration error (ACE); a minimal ECE sketch follows this list.
    • Quantile regression calibration achieves lower calibration errors than standard methods such as temperature scaling, isotonic regression, or histogram binning, especially on out-of-distribution test sets.
  • Instance-Adaptive Scaling Efficiency:
    • Compute Reduction: IAS approaches achieve equivalent or better final answer accuracy with 3–4× less compute compared to fixed-budget best-of-N or beam search. In harder or more OOD tasks, savings may be even greater.
    • Adaptive Budgeting: Allocation of more samples to difficult queries results in less wasted compute on easy cases and improved resource utilization overall.
    • Necessity of Calibration: Uncalibrated PRMs lead to substantial underestimation of the required search budget and consequent accuracy losses, confirming that quantile regression calibration is necessary for practical IAS deployment.
  • Robustness: The calibration+IAS pipeline is shown to be robust across various PRM architectures and instruction-tuned LLM families.
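
For reference, ECE can be computed with a standard binning estimator. A minimal NumPy sketch for binary success prediction (the bin count is a free choice; 10 is used here):

```python
import numpy as np

def expected_calibration_error(probs, outcomes, n_bins=10):
    """Bin predictions by confidence, then average the gap between mean
    predicted probability and empirical success rate, weighted by bin size."""
    probs, outcomes = np.asarray(probs), np.asarray(outcomes)
    edges = np.linspace(0.0, 1.0, n_bins + 1)
    ece = 0.0
    for i, (lo, hi) in enumerate(zip(edges[:-1], edges[1:])):
        # first bin includes its left edge; later bins are half-open (lo, hi]
        mask = (probs >= lo) & (probs <= hi) if i == 0 else (probs > lo) & (probs <= hi)
        if mask.any():
            ece += mask.mean() * abs(probs[mask].mean() - outcomes[mask].mean())
    return ece

# Toy example: overconfident predictions yield a large gap
print(expected_calibration_error([0.9, 0.9, 0.8, 0.3], [1, 0, 1, 0]))
```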

4. Implications for PRM Design and Inference

  • Principled Uncertainty Quantification: Calibrated PRMs now provide actionable, instance-specific success probabilities and confidence bounds, moving beyond relative solution ranking to genuinely trustworthy supervision.
  • Compute-Efficient Inference: Adaptive scaling guided by calibrated uncertainty aligns compute investment with problem difficulty and user-desired confidence levels.
  • Interpretable Predictions: Lower confidence bounds from quantile regression enable PRMs to know “what they don’t know,” reducing overconfidence and supporting safer deployment in critical reasoning tasks.
  • Extensibility: The calibration strategy can be readily integrated with other process supervision objectives or RL-based LLM training, and the instance-adaptive scaling mechanism is immediately compatible with both best-of-N and tree/beam search-based inference.

5. Future Directions and Broader Impact

  • Generalization to Other Domains: The quantile regression and IAS strategies are not tied to mathematics alone and may be adapted to other domains involving step-wise reasoning or process supervision, such as code generation or autonomous agent planning.
  • Continual Learning and Lightweight Calibration: Future research may develop online or continual learning methods for PRM calibration, or more computationally efficient calibration pipelines tailored for large-scale deployment.
  • Integration with Model Routing: PRM uncertainty estimates could inform dynamic model selection or hierarchical compute allocation—e.g., triggering more powerful models or fallback strategies for uncertain queries.
  • Cost-Accuracy Tradeoff Metrics: The methodology supports principled development of metrics for cost-aware model selection, relevant for both research and real-world applications.

In summary, the quantile regression-based calibration and instance-adaptive scaling framework for PRMs establishes a new standard for uncertainty-aware, compute-efficient process supervision in LLMs. Experimental evidence demonstrates its value for both in-domain and out-of-domain reasoning, with immediate applicability to large-scale and cost-sensitive deployed AI systems.