Process Reward Models (PRMs)

Last updated: June 14, 2025

Process Reward Models (PRMs) deliver step-wise feedback during multi-step reasoning, improving credit assignment, sample efficiency, and accuracy for LLMs far beyond traditional outcome-only reward models (ORMs). However, their effectiveness depends critically on data annotation quality, theoretical reward design, generalizability, and computational strategies for scalable learning and deployment. Below, these aspects are distilled and synthesized from the latest research (Setlur et al., 10 Oct 2024).


Why Use Process Reward Models?

Traditional ORMs judge only the final outcome:

  • Sparse feedback: Ineffective for long chains or multi-step tasks
  • Slow credit assignment: Hard to recognize which part of the reasoning caused success or failure
  • Sample/computational inefficiency: Difficult to improve exploration or fix error-prone steps

Process Reward Models (PRMs) address these issues by providing step-level supervision. For a given multi-step reasoning trace

y = (a_1, a_2, \dots, a_H)

PRMs score each step $a_h$ in context, aiming to expose and reward progress, not just end results.


Standard Value-Based PRMs (and Their Limitations)

Value-based PRMs assign rewards using the value function:

Q^\pi(s_h, a_h) = \mathbb{E}_{a_{h+1:H} \sim \pi}\left[ R\left( (a_1, \dots, a_H),\ y^*_x \right) \right]

where $s_h$ is the partial solution up to $a_{h-1}$, and $R$ is an indicator of final correctness.

Key drawbacks:

  • Mixes state & action value: Blurs distinction between being in a promising state vs. taking a promising action.
  • Limited exploration: Favors high-probability but possibly suboptimal traces; poor at discovering low-frequency but important strategies.
  • Data-inefficient rewards: Collecting per-step labels (especially from humans) is unscalable for complex domains.

Process Rewards as Measured Progress: The "Advantage" Perspective

The paper proposes that process rewards should measure progress at each step, specifically the increase in the probability of eventual success, analogous to advantage functions in RL:

A^\mu(s_h, a_h) = Q^\mu(s_h, a_h) - V^\mu(s_h)

where $V^\mu(s_h) = \mathbb{E}_{a_h \sim \mu}[Q^\mu(s_h, a_h)]$, and $\mu$ is a "prover" policy (not necessarily the base policy being improved).

Benefits:

  • Action-separating credit: Measures change in solution likelihood, clearly attributing progress to an action versus general context.
  • Facilitates exploration: Stepwise advantage rewards encourage exploration, as making progress is directly incentivized, not just maintaining probable trajectories.
  • Dense signal for learning/search: Provides a reward at every step, so test-time search, RL, and policy improvement become more efficient.

Why Use a Distinct "Prover" Policy for Advantages?

  • Complementary exploration: If $\mu$ is too strong (always succeeds), all actions appear good; if it is too weak, the signal is uninformative.
  • Optimal provers are "intermediate": "Best-of-K" sampling from the base policy, with $K$ not too large, often works well (a sketch follows this list).
  • Empirical evidence: Weak provers can help strong policies via advantage rewards, and vice versa; diversity helps.
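
To make the best-of-K prover concrete, here is a minimal sketch of estimating a step's success probability under such a prover: roll out $K$ base-policy continuations and count the attempt as a success if any of them reaches a correct answer. The base_policy.complete and is_correct helpers are hypothetical and mirror the interface used in the implementation example later in this article.

def best_of_k_q_value(base_policy, s_h, a_h, k=4, n_samples=16):
    # Q^{BoK(pi)}(s_h, a_h): probability that the best of K base-policy rollouts
    # from the prefix (s_h, a_h) reaches a correct final answer
    successes = 0
    for _ in range(n_samples):
        rollouts = [base_policy.complete(s_h + [a_h]) for _ in range(k)]
        if any(is_correct(r) for r in rollouts):  # best-of-K succeeds if any rollout does
            successes += 1
    return successes / n_samples

Keeping k small keeps the prover "intermediate": with very large k nearly every prefix can be rescued by some rollout, so the resulting advantages flatten out.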

Mathematical Formulation in RL

The reinforcement learning objective under this framework becomes

\ell^{\pi'}_{\mathrm{PAV-RL}}(\pi) = \ell_{\mathrm{ORM-RL}}(\pi) + \alpha \sum_{h=1}^H \mathbb{E}_{s_h, a_h}\left[ A^\mu(s_h, a_h) \right]

  • $\ell_{\mathrm{ORM-RL}}$ is the standard final-outcome RL objective
  • $\alpha$ tunes the contribution of process advantages to policy updates

Corresponding policy gradient:

\nabla_\pi \ell^{\pi'}_{\mathrm{PAV-RL}} \Big|_{\pi'=\pi} = \sum_{h=1}^H \nabla_\pi \log \pi(a_h \mid s_h) \cdot \Big( Q^\pi(s_h, a_h) + \alpha A^\mu(s_h, a_h) \Big)

This combines classic value learning with explicit process progress signals.
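
As a rough illustration of how this combined signal could enter a policy-gradient update, the sketch below (a PyTorch example with illustrative tensor names, not code from the paper) weights each step's log-probability by $Q^\pi(s_h, a_h) + \alpha A^\mu(s_h, a_h)$, assuming both quantities have already been estimated per step.

import torch

def pav_rl_surrogate_loss(step_logprobs, q_values, advantages, alpha=1.0):
    # step_logprobs: log pi(a_h | s_h) for h = 1..H (requires grad)
    # q_values:      estimated Q^pi(s_h, a_h) per step, treated as constants
    # advantages:    PAV-estimated A^mu(s_h, a_h) per step, treated as constants
    step_weights = q_values + alpha * advantages
    # Minimizing this surrogate recovers the policy gradient shown above
    return -(step_logprobs * step_weights.detach()).sum()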


Theoretical and Empirical Characterization of Good Provers

Theory: Improvement is maximized when $A^\mu$ distinguishes between base-policy actions (high variance) and is not negatively aligned with the base policy's own advantages.

  • Best-of-K as prover:
    • For small $K$, advantages are informative; for large $K$, all actions look equally good.
  • Alignment/misalignment: A prover that is too different from the base policy ($A^\mu$ and $A^\pi$ misaligned) can be unhelpful, but even weak provers can provide useful signal if they distinguish the base policy's actions sufficiently.

Implementation in Practice

Data Generation & Training

  • Automated Advantage Computation: Compute $Q^\mu$ and $V^\mu$ via sampled continuations (under $\mu$) of each prefix.
  • Process Advantage Verifiers (PAVs): Train a PRM to predict $A^\mu(s_h, a_h)$ given $(s_h, a_h)$; a minimal training sketch follows this list.
  • Supervision sources: Instead of per-step human labels, use automated sampling and advantage computation for scale.
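
As a minimal sketch of the training step, a PAV can be fit as a simple regressor. Here pav_model is assumed to be any module that maps an encoded prefix-step pair to a scalar, and loader is assumed to yield (features, advantage) pairs produced by the sampling procedure above; both names are illustrative.

import torch
import torch.nn as nn

def train_pav(pav_model, loader, epochs=1, lr=1e-5):
    # Regress the PAV's scalar output onto automatically computed A^mu(s_h, a_h) targets
    opt = torch.optim.AdamW(pav_model.parameters(), lr=lr)
    loss_fn = nn.MSELoss()
    for _ in range(epochs):
        for features, advantage in loader:
            pred = pav_model(features).squeeze(-1)
            loss = loss_fn(pred, advantage)
            opt.zero_grad()
            loss.backward()
            opt.step()
    return pav_model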

Using PAVs in LLM Systems

  • Test-time search: Use PAVs as dense stepwise verifiers for search heuristics (e.g., beam search, MCTS). Prune or expand paths based on step advantage, not just endpoint scores (a beam-search sketch follows this list).
  • Online RL: Use PAVs as dense reward sources, substantially boosting RL sample efficiency and exploration.
  • No need for human-intensive annotation: Scaling up is possible via automated sampling and prover policy design.
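
The beam-search use case could look roughly like the sketch below, where partial solutions are expanded step by step and ranked by predicted step advantage rather than only by a final-answer score. The policy and pav interfaces (propose_step, is_terminal, score) are hypothetical placeholders, not an established API.

def pav_beam_search(policy, pav, problem, beam_width=4, expansions=8, max_steps=10):
    beams = [[]]  # each beam is a partial list of reasoning steps (a_1, ..., a_h)
    for _ in range(max_steps):
        candidates = []
        for prefix in beams:
            for _ in range(expansions):
                step = policy.propose_step(problem, prefix)  # sample a candidate next step
                score = pav.score(problem, prefix, step)     # predicted A^mu(s_h, a_h)
                candidates.append((score, prefix + [step]))
        # keep the highest-advantage partial solutions; prune the rest
        candidates.sort(key=lambda c: c[0], reverse=True)
        beams = [prefix for _, prefix in candidates[:beam_width]]
        if all(policy.is_terminal(problem, b) for b in beams):
            break
    return beams[0]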

Empirical Results

| Task | PAV (Advantage) | ORM Only (Final) |
|---|---|---|
| Test-time search accuracy | +8% | reference |
| Compute efficiency | 1.5–5× | reference |
| RL sample efficiency | 5–6× | reference |
| RL accuracy gain | >6% | reference |
  • Exploration improves: PAV RL policies solve more hard problems, especially with limited compute or small $N$ in sampling.
  • Search is more robust: PAV-driven search covers more unique solutions and finds higher-accuracy answers with less compute.

Practical Recommendations

  1. For PRM developers: Train using process advantages under a prover distinct from the base policy; do not blindly follow the value function under the base policy.
  2. Automated advantage computation: Leverage automated (sample-based or value-function) computation for scalable per-step rewards.
  3. Tune prover strength: Use an intermediate, not maximal, $K$ in best-of-K provers for optimal signal.
  4. Integrate into RL and search: Use process advantage rewards directly in the reward or heuristic function for both training and inference.
  5. Empirically validate: Ablate with different provers and policy alignments to optimize performance in your target domain.

Example: Implementing a Process Advantage Verifier

def compute_q_value(prover, s_h, a_h, n_samples=32):
    # Q^mu(s_h, a_h): mean final correctness of sampled continuations under the prover
    return sum(
        is_correct(prover.complete(s_h + [a_h]))
        for _ in range(n_samples)
    ) / n_samples

def compute_advantage(prover, s_h, a_h, n_samples=32):
    # A^mu(s_h, a_h) = Q^mu(s_h, a_h) - V^mu(s_h)
    q_value = compute_q_value(prover, s_h, a_h, n_samples)
    # V^mu(s_h): average Q-value over candidate next actions at s_h under the prover policy
    candidate_actions = get_possible_actions(s_h)
    v_value = sum(
        compute_q_value(prover, s_h, a, n_samples)
        for a in candidate_actions
    ) / len(candidate_actions)
    return q_value - v_value
Train a PRM to regress $A^\mu(s_h, a_h)$ on $(s_h, a_h)$. At deployment, use PAV scores for ranking or reinforcing candidate steps.


Summary Table

| Model | Reward Granularity | Reward Signal | Credit Assignment | Efficiency | Empirical Gain |
|---|---|---|---|---|---|
| ORM | Outcome-only | Final correctness | Weak (sparse) | Low | Baseline |
| Value-function PRM | Step-wise | Base policy value | Mixed | Modest | +1–2% |
| PAV | Step-wise (progress) | Prover advantage | Strong (explores) | High | +6–8%, 5×+ |

Conclusion

Process advantage verifiers (PAVs) and process reward models (PRMs) that measure stepwise progress under a distinct prover policy are a powerful, scalable approach to improving credit assignment, exploration, and sample efficiency for LLM reasoning and RL. They enable more accurate, faster learning and superior test-time search, with major practical gains and strong theoretical justification, all while reducing dependence on expensive, dense human supervision. This paradigm should be considered foundational for next-generation LLM reasoning and decision-making systems.