Unified Process–Outcome Reward Model

Updated 3 July 2026

Unified process–outcome reward models are frameworks that integrate dense process-level feedback with global outcome signals to improve multi-step reasoning.
They mitigate reward sparsity and misaligned credit assignment by balancing intermediate process rewards with final answer correctness.
They employ hybrid architectures and training algorithms, such as transformer-based RLHF and normalization techniques, to enhance overall performance and generalization.

A unified process–outcome reward model (PORM) integrates fine-grained process-level and coarse-grained outcome-level supervision, providing both dense feedback at intermediate reasoning steps and global signals tied to final correctness. This approach aims to overcome the reward sparsity and misaligned credit assignment that limit conventional outcome-only reward models (ORMs) and to mitigate the noise or reward-hacking pitfalls seen with process reward models (PRMs) when used in isolation. By formalizing and algorithmically combining these two reward sources, unified PORMs enable stable and efficient reinforcement learning, error localization, and robust alignment for complex multi-step reasoning in LLMs, mathematical agents, code LMs, and retrieval-augmented generation systems.

1. Formal Definitions and Core Components

Let $x$ denote the input (problem, prompt, or environment), $y$ the model's final output, and $\tau = (s_1, \ldots, s_T, y)$ the complete reasoning trajectory, with $s_t$ denoting the $t$-th intermediate state. Unified process–outcome reward models assign rewards as follows (Zheng et al., 9 Oct 2025):

Outcome Reward $R_o(x, y)$ : Assesses only the final answer, e.g.,

$r_o = R_o(x, y) = \sigma(f_{o, \phi}(x, y)) \in (0, 1)$

for a learned verifier, or as a hard label:

$R_o^{\mathrm{exact}}(y, y^*) = \begin{cases} 1, & y = y^* \\ 0, & \text{otherwise} \end{cases}$

Process Reward $R_p(\tau)$ : Aggregates scores for each intermediate step:

$R_p(\tau) = \sum_{t=1}^T r_p(x, s_{1:t})$

where the per-step score $r_p(x, s_{1:t})$ is typically produced by a separate PRM.

Unified Reward: The canonical unified form is a weighted sum:

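$R_u(x, \tau) = \alpha\, R_p(\tau) + \beta\, R_o(x, y)$

with $\alpha, \beta \geq 0$ weighting the process and outcome terms. Weights can be set by cross-validation, meta-learned, or adaptively determined per example (Zheng et al., 9 Oct 2025).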

This definition encompasses both hard- and soft-labeled supervision and supports a spectrum of aggregation strategies (sum, min, step-wise gating) over process rewards, as required by application domain or training setup (Ye et al., 3 Sep 2025, Groeneveld et al., 27 Oct 2025).
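
The weighted-sum form and the aggregation choices described above can be illustrated with a minimal Python sketch. The function names, the weight names `alpha` and `beta`, and their default values are illustrative assumptions rather than notation from the cited works; the per-step scores are assumed to come from some separately trained PRM.

```python
from typing import Sequence


def exact_outcome_reward(y: str, y_star: str) -> float:
    """Hard-label outcome reward: 1.0 if the final answer matches, else 0.0."""
    return 1.0 if y == y_star else 0.0


def aggregate_process_reward(step_scores: Sequence[float], how: str = "sum") -> float:
    """Aggregate per-step PRM scores with one of the strategies named in the text."""
    if not step_scores:
        return 0.0
    if how == "sum":
        return float(sum(step_scores))
    if how == "min":
        # A trajectory is only as good as its weakest step.
        return float(min(step_scores))
    raise ValueError(f"unknown aggregation: {how}")


def unified_reward(
    step_scores: Sequence[float],
    y: str,
    y_star: str,
    alpha: float = 0.5,  # assumed weight on the process term
    beta: float = 1.0,   # assumed weight on the outcome term
    how: str = "sum",
) -> float:
    """Weighted sum of the aggregated process reward and the outcome reward."""
    process = aggregate_process_reward(step_scores, how)
    outcome = exact_outcome_reward(y, y_star)
    return alpha * process + beta * outcome
```

For example, `unified_reward([0.5, 0.25], "42", "42")` combines a summed process score of 0.75 with a correct outcome. Switching to `how="min"` penalizes any single weak step more strongly, which may suit domains where one invalid step invalidates the whole derivation.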

2. Theoretical Rationale and Credit Assignment

The unification of process and outcome rewards addresses several deficiencies inherent to their isolated use:

Reward sparsity: ORMs provide delayed, sparse signals that hinder exploration and do not drive improvement of intermediate reasoning. PRMs (e.g., via marginal information gain, entropy, or MC estimation) offer dense, stepwise feedback (Wang et al., 1 Feb 2026, Zhang et al., 16 Oct 2025).
Credit assignment: Unified models enable precise temporal attribution. For example, Conditional Reward Modeling (CRM) reframes the joint probability over steps and outcome, enforcing that stepwise rewards reflect their contribution to the final result (Zhang et al., 30 Sep 2025). CRM defines a per-step shaped reward such that the total reward over a trajectory aligns with the log-conditional probability of the correct outcome given the trajectory.

Mitigating reward hacking: Reward normalization schemes, decoupled advantage normalization (Tan et al., 27 Mar 2026), and mechanisms such as the PRocess cOnsistency Filter (PROF) (Ye et al., 3 Sep 2025) or Principle-based Reward Normalization (ReNorm) (Xu et al., 29 Sep 2025) are employed to prevent policies from exploiting artifacts in dense process rewards (e.g., verbosity, repeated steps) while still utilizing their guidance.
Learnable credit assignment: LCA models treat the process-reward learning as a Multiple Instance Learning (MIL) problem, using softmax-weighted-sum (SWS) pooling to ensure correct credit flow under outcome-only supervision (Jia et al., 26 Jun 2026).
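
One way to see how dense stepwise rewards can remain tied to the final outcome is a telescoping (potential-based) shaping scheme. The sketch below is an assumption-laden illustration and is not claimed to be the exact definition used by CRM or LCA: it supposes that some estimator supplies `success_probs[t]`, the estimated probability of a correct outcome after step $t$ (index 0 being the estimate before any step), and rewards each step by the change in log-probability.

```python
import math
from typing import List, Sequence


def shaped_step_rewards(success_probs: Sequence[float]) -> List[float]:
    """Per-step reward = change in log P(correct outcome) caused by that step.

    success_probs[0] is the estimate before any reasoning step;
    success_probs[t] is the estimate after step t. All values must be in (0, 1].
    """
    if any(p <= 0.0 or p > 1.0 for p in success_probs):
        raise ValueError("probabilities must lie in (0, 1]")
    logs = [math.log(p) for p in success_probs]
    # Differences of consecutive log-probabilities: the sum telescopes.
    return [logs[t] - logs[t - 1] for t in range(1, len(logs))]
```

Because the differences telescope, the step rewards sum to the final log-probability minus the initial one, so a step that lowers the chance of a correct outcome receives a negative reward while the total stays anchored to the outcome estimate.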

3. Model Architectures and Training Algorithms

Unified PORMs can be implemented atop various backbone architectures, including decoder-only transformers and value-head–augmented variants (Groeneveld et al., 27 Oct 2025):

Regression or Value Heads: Conditioned on input $x$ and prefix $s_{1:t}$, a regression head outputs a scalar score $r_p(x, s_{1:t})$ computed from $h_t$, where $h_t$ is the final hidden state at step $t$ (Groeneveld et al., 27 Oct 2025). Fine-tuning is then performed with binary cross-entropy on process and/or outcome labels.
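RLHF and Joint Policy Objectives: Trajectories sampled from $\pi_\theta(\cdot \mid x)$ are optimized with PPO-style losses, using unified reward for advantage computation:

$L^{\mathrm{PPO}}(\theta) = -\mathbb{E}_t\left[\min\left(\rho_t(\theta)\,\hat{A}_t,\ \mathrm{clip}(\rho_t(\theta), 1-\epsilon, 1+\epsilon)\,\hat{A}_t\right)\right]$

where $\rho_t(\theta)$ is the probability ratio between the current and sampling policies, the advantage $\hat{A}_t$ is based on cumulative unified reward, and the value function $V_\psi$ is co-trained (Zheng et al., 9 Oct 2025, Tan et al., 27 Mar 2026).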
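
A minimal NumPy sketch of this step is given below, assuming the standard clipped PPO surrogate and a simple return-minus-baseline advantage. The names `unified_advantages`, `alpha`, `beta`, `gamma`, and `clip_eps` are illustrative assumptions; the cited works may use different estimators (for example GAE). At least one step per trajectory is assumed.

```python
import numpy as np


def unified_advantages(step_rewards, outcome_reward, values,
                       alpha=0.5, beta=1.0, gamma=1.0):
    """Advantage = discounted return of the unified reward minus a value baseline.

    The process reward is paid at every step (scaled by alpha); the outcome
    reward is paid once, at the final step (scaled by beta).
    """
    r = alpha * np.asarray(step_rewards, dtype=float)  # new array, safe to edit
    r[-1] += beta * float(outcome_reward)
    returns = np.zeros_like(r)
    running = 0.0
    for t in reversed(range(len(r))):
        running = r[t] + gamma * running
        returns[t] = running
    return returns - np.asarray(values, dtype=float)


def ppo_clip_loss(logp_new, logp_old, advantages, clip_eps=0.2):
    """Standard clipped PPO surrogate, returned as a loss to minimize."""
    ratio = np.exp(np.asarray(logp_new, dtype=float) - np.asarray(logp_old, dtype=float))
    adv = np.asarray(advantages, dtype=float)
    unclipped = ratio * adv
    clipped = np.clip(ratio, 1.0 - clip_eps, 1.0 + clip_eps) * adv
    return float(-np.mean(np.minimum(unclipped, clipped)))
```

In a real training loop the log-probabilities and values would come from the policy and value networks and the loss would be differentiated by an autodiff framework; plain NumPy is used here only to make the arithmetic explicit.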

Decoupled Normalization and Advantage Composition: PAPO (Tan et al., 27 Mar 2026) and PRPO (Ding et al., 12 Jan 2026) demonstrate effective formulations:

Name	Normalization Scope	Signal Flow
$A_{\mathrm{out}}$	All responses	Encodes final outcome, anchors main gradient
$A_{\mathrm{proc}}$	Only correct responses	Differentiates quality within correct sequences

Advantage signals are then summed: $A = A_{\mathrm{out}} + A_{\mathrm{proc}}$.
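
The decoupled scheme in the table can be sketched as follows for a group of responses sampled for one prompt. This is a simplified illustration under stated assumptions, not the reference implementation of PAPO or PRPO: `a_out` and `a_proc` are illustrative names for the outcome and process advantage components, outcomes are assumed to be binary, and `eps` is an assumed stabilizer.

```python
import numpy as np


def decoupled_advantages(outcome, process, eps=1e-8):
    """Normalize outcome rewards over all responses and process rewards
    only among correct responses, then sum the two advantage signals."""
    outcome = np.asarray(outcome, dtype=float)
    process = np.asarray(process, dtype=float)

    # Outcome advantage: standardized across every response in the group.
    a_out = (outcome - outcome.mean()) / (outcome.std() + eps)

    # Process advantage: standardized only within the correct subset, so it
    # ranks correct responses by quality and is zero for incorrect ones.
    a_proc = np.zeros_like(process)
    correct = outcome > 0.5
    if correct.sum() > 1:
        p = process[correct]
        a_proc[correct] = (p - p.mean()) / (p.std() + eps)

    return a_out + a_proc, a_out, a_proc
```

Because incorrect responses receive no process advantage in this sketch, a high process score cannot compensate for a wrong final answer, which is the property that makes this composition resistant to reward hacking on the dense signal.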

Sample/Trajectory Selection: PROF culls noisy or misaligned samples prior to RL updates, keeping only those with process-outcome consistency (Ye et al., 3 Sep 2025).
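
A consistency filter of this kind might look like the sketch below. The selection rule is an assumption made for illustration and may differ from the published PROF procedure: among correct responses it keeps those with the highest process scores, and among incorrect responses those with the lowest, so that the retained samples are the ones where process and outcome signals agree. The `keep_fraction` parameter is hypothetical.

```python
import math
from typing import List, Sequence, Tuple


def consistency_filter(
    samples: Sequence[Tuple[bool, float]],  # (is_correct, mean process score)
    keep_fraction: float = 0.5,             # hypothetical retention ratio
) -> List[int]:
    """Return indices of samples whose process score agrees with the outcome."""
    correct = [i for i, (ok, _) in enumerate(samples) if ok]
    incorrect = [i for i, (ok, _) in enumerate(samples) if not ok]

    # Correct answers: prefer high process scores (sound reasoning).
    correct.sort(key=lambda i: samples[i][1], reverse=True)
    # Incorrect answers: prefer low process scores (flaw was detected).
    incorrect.sort(key=lambda i: samples[i][1])

    k_correct = math.ceil(keep_fraction * len(correct))
    k_incorrect = math.ceil(keep_fraction * len(incorrect))
    return sorted(correct[:k_correct] + incorrect[:k_incorrect])
```

Filtering each group separately keeps the ratio of correct to incorrect samples roughly unchanged, so the outcome signal seen by the RL update is not skewed by the filter itself.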

4. Process Reward Data Generation and Aggregation

High-quality process reward data remains a core challenge. Unified frameworks employ:

Automated annotation via uncertainty: Entropy and perplexity deltas localize error steps for efficient labeling, reducing human annotation cost (Han et al., 3 Aug 2025).
Tool-grounded step verification: Hybrid schemes such as GroundedPRM (Zhang et al., 16 Oct 2025) extend PRM supervision by using Monte Carlo Tree Search (MCTS) for tree-structured exploration of reasoning paths, while external engines (e.g., Wolfram Alpha, SymPy) verify the correctness of each step.
Aggregated output selection: To robustly combine process and outcome information at inference, hybrid voting schemes (Hybrid Majority Reward/HMR, Weighted Reward-Frequency/WRF) merge majority final-answer frequency with maximum or mean process rewards per candidate (Han et al., 3 Aug 2025).
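
The idea of merging answer frequency with process rewards at inference can be sketched as below. The scoring formula is an illustrative assumption, not the published definition of HMR or WRF: each distinct final answer is scored by a convex combination of its vote share and the mean process reward of the candidates that produced it, with a hypothetical mixing weight `lam`.

```python
from collections import defaultdict
from typing import Dict, List, Sequence, Tuple


def reward_frequency_vote(
    candidates: Sequence[Tuple[str, float]],  # (final answer, process reward)
    lam: float = 0.5,                         # hypothetical frequency weight
) -> Tuple[str, Dict[str, float]]:
    """Pick the answer maximizing lam * vote share + (1 - lam) * mean reward."""
    if not candidates:
        raise ValueError("need at least one candidate")

    by_answer: Dict[str, List[float]] = defaultdict(list)
    for answer, reward in candidates:
        by_answer[answer].append(reward)

    n = len(candidates)
    scores = {
        answer: lam * len(rewards) / n + (1.0 - lam) * sum(rewards) / len(rewards)
        for answer, rewards in by_answer.items()
    }
    best = max(scores, key=scores.get)
    return best, scores
```

Setting `lam=1.0` reduces this sketch to plain majority voting, while `lam=0.0` ranks answers by process reward alone; intermediate values let a well-reasoned minority answer override a poorly supported majority.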

5. Practical Pipelines, Benchmarks, and Empirical Results

Typical unified PORM pipelines consist of:

Data Collection: Assemble both ORM (final answer) and PRM (step-level) labels. Strategies include human annotation, automated verification, and uncertainty-driven selection (Zheng et al., 9 Oct 2025, Han et al., 3 Aug 2025, Zhang et al., 16 Oct 2025).
Model Component Training: Separately train outcome and process reward models, or jointly fine-tune with conditional or multi-task objectives.
Policy Learning: Reinforce the LLM policy with the unified reward, leveraging PPO, DPO, or critic-free methods such as PRPO (Ding et al., 12 Jan 2026). Additional curation/evaluation steps may filter, normalize, or reweight rewards.
Inference and Output Aggregation: Use learned reward models for best-of-$N$ sampling, reranking, beam search, or hybrid voting (Jia et al., 26 Jun 2026, Han et al., 3 Aug 2025).
Benchmarks: Evaluation is standardized on datasets such as MATH, GSM8K, ProcessBench, APPS (code), Natural Questions, and various agentic or multi-modal tasks, using both final accuracy and error localization metrics (Zheng et al., 9 Oct 2025, Zhang et al., 16 Oct 2025, Wang et al., 1 Feb 2026).

Empirically, unified approaches achieve consistent 2–6% gains over pure ORM RL baselines across mathematical, coding, and agentic tasks, with large relative improvements (up to 28%) reported in transfer and out-of-domain generalization (Tan et al., 27 Mar 2026, Xu et al., 29 Sep 2025). Process-level supervision accelerates learning, increases sample efficiency, and improves both intermediate reasoning quality and global correctness.

6. Failure Modes, Limitations, and Open Questions

Reward hacking and collapse: Direct blending or unfiltered use of PRMs can induce entropy collapse or verbosity exploitation (Ye et al., 3 Sep 2025). Decoupled normalization, reward centering (e.g., PPR+ReNorm), and strict gating mitigate these failures.
Data/model scale trade-offs: Automated strategies such as uncertainty-based construction (UnPRM) approach the F1 of much larger, human-labeled PRMs but still show gaps in rare or adversarial cases (Han et al., 3 Aug 2025). Fidelity-aware, tool-grounded PRMs (GroundedPRM) achieve substantial performance with far less supervision (Zhang et al., 16 Oct 2025).
Credit assignment ambiguity: Multiple Instance Learning and conditional modeling (CRM, LCA) provide algorithmic solutions but may require careful hyperparameter selection and can be limited by lack of verifiable step boundaries in complex tasks (Jia et al., 26 Jun 2026, Zhang et al., 30 Sep 2025).
Adaptive weighting and generalization: The optimal weighting of the process and outcome terms remains task- and model-dependent, with ongoing research into meta-learning, trust-region scheduling, and adaptive combinations to maximize sample efficiency and robustness (Zheng et al., 9 Oct 2025, Wang et al., 1 Feb 2026).
Extension to non-verifiable and open-ended domains: Principles-based PRMs and normalized hybridization (PPR, ReNorm) extend to search and interaction tasks lacking ground-truth outcomes, but further work is needed for full scalability and automation (Xu et al., 29 Sep 2025).

7. Outlook and Research Directions

Unified process–outcome reward modeling has become a leading paradigm for aligning LLM reasoning in multi-step, long-horizon, or tool-augmented settings. Open challenges include further automating high-fidelity process supervision, generalizing reward aggregation techniques to novel domains (e.g., planning, vision-language), refining adaptive credit assignment, and scaling unification to extreme sequence lengths. Directions such as hierarchical reward design, graph-based step credit, and universal process reward models (AURORA, VersaPRM) offer promising pathways (Zheng et al., 9 Oct 2025). Advances in MIL-based learning, tree-guided self-verification, and RL pipeline engineering are expected to underpin further progress in robust, interpretable, and high-performing LLM alignment via unified process–outcome reward frameworks.