
Outcome-Supervised Reward Models

Updated 4 October 2025
  • Outcome-supervised Reward Models (ORMs) are learned models that assign a single scalar reward based solely on the final output, offering an efficient, outcome-level supervision mechanism.
  • ORMs are trained with binary cross-entropy or pairwise preference losses and are used to re-rank candidate responses in domains such as code generation, text-to-SQL, and robotic manipulation.
  • Despite their label efficiency, ORMs may miss fine-grained process errors, prompting hybrid strategies that integrate step-wise supervision for enhanced reasoning transparency.

Outcome-supervised Reward Models (ORMs) are a class of learned models that assign scalar values, interpreted as rewards or utility scores, to the outputs of generative models, with the critical property that these scores are determined solely by the observed outcome or final result, regardless of the intermediate steps that produced it. ORM signals are typically "sparse," providing a single assessment for an entire response or trajectory, as opposed to the dense, stepwise supervision yielded by process reward models (PRMs). In domains such as mathematical reasoning, code generation, logic, SQL synthesis, tool-oriented language modeling, and robotic manipulation, ORMs enable end-to-end supervision at minimal annotation cost, underpin scalable learning and inference pipelines, and offer favorable label efficiency. However, these advantages come at the potential expense of fine-grained reasoning quality, since a pure ORM cannot distinguish transparent, trustworthy step-wise logic from accident-prone or "lucky" solution traces that reach correct endpoints via incorrect reasoning.

1. Fundamental Definition and Mechanisms

ORMs are trained and applied using labels tied exclusively to the correctness of the final output $y$ produced by a policy $\pi$ for a given input $x$. For most practical implementations:

  • The model receives input–output pairs $(x, y)$, where $y$ is a full sequence (reasoning trace, code completion, tool-call script, etc.).
  • The reward model $r_\theta(x, y)$ is trained to predict whether $y$ is correct, often using binary cross-entropy or preference-based objectives:

$$L_{\mathrm{ORM}} = \mathbb{E}_{(x,y) \sim \mathcal{D}} \left[ \mathrm{Loss}\big(\mathrm{Correct}(x,y),\, r_\theta(x, y)\big) \right]$$

  • During inference or RL, the ORM assigns a single solution-level reward (e.g., at the last token) that is used for re-ranking in best-of-$k$ pipelines or for providing a candidate reward to policy-gradient updates (Uesato et al., 2022).

This “reward by outcome” allows for dramatic reductions in annotation scope because verifying a final answer (e.g., matching a ground-truth integer, SQL execution result, or test-case pass) is often automatic and requires no stepwise judgment, in contrast to PRMs that necessitate costly intermediate labeling.
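
To make this concrete, here is a minimal sketch (our illustration, not code from the cited papers) of outcome-level supervision: an automatic verifier compares a candidate's final answer to a ground truth to produce a binary label, and a scalar reward head is trained with binary cross-entropy, matching the loss above. The encoder producing pooled representations is omitted; `ORMHead`, `outcome_label`, and the random features are hypothetical placeholders.

```python
# Minimal sketch of ORM training with outcome-only labels (illustrative, not
# from the cited papers). The encoder is omitted; random features stand in
# for pooled hidden states of (prompt, candidate) pairs.
import torch
import torch.nn as nn
import torch.nn.functional as F

class ORMHead(nn.Module):
    """Scalar reward head over a pooled sequence representation."""
    def __init__(self, hidden_dim: int):
        super().__init__()
        self.score = nn.Linear(hidden_dim, 1)

    def forward(self, pooled: torch.Tensor) -> torch.Tensor:
        # pooled: (batch, hidden_dim), e.g. the last-token hidden state
        return self.score(pooled).squeeze(-1)  # (batch,) reward logits

def outcome_label(candidate_answer: str, ground_truth: str) -> float:
    """Automatic outcome check: 1.0 if the final answer matches, else 0.0."""
    return float(candidate_answer.strip() == ground_truth.strip())

def orm_bce_loss(head: ORMHead, pooled: torch.Tensor, labels: torch.Tensor) -> torch.Tensor:
    """L_ORM: binary cross-entropy between Correct(x, y) and r_theta(x, y)."""
    return F.binary_cross_entropy_with_logits(head(pooled), labels)

# Toy usage: four candidate solutions to a problem whose answer is "42".
head = ORMHead(hidden_dim=16)
pooled = torch.randn(4, 16)
labels = torch.tensor([outcome_label(a, "42") for a in ["42", "41", "42", "7"]])
loss = orm_bce_loss(head, pooled, labels)
loss.backward()
```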

2. ORM Training Objectives, Supervision, and Implementation

ORMs are usually implemented as discriminative classifiers (e.g., transformer-based networks or smaller MLPs over final representations) that take the concatenation of a problem prompt and a complete candidate trace as input, producing a scalar reward score. Training regimes span:

  • Binary classification using cross-entropy (label “1” for correct, “0” for incorrect).
  • Pairwise preference losses, such as Bradley–Terry:

$$L(\theta; Y_n) = \frac{1}{|Y_+|\,|Y_-|} \sum_{y_+ \in Y_+} \sum_{y_- \in Y_-} \log\left[1 + \exp\left(r_\theta(y_-) - r_\theta(y_+)\right)\right]$$

where $Y_+$ and $Y_-$ are the sets of correct and incorrect solutions for a given problem instance.

  • Energy-based models (EBMs), which treat the output scalar as a (negative) energy $E_\theta(y)$ (lower is better), admitting ranking and probabilistic selection through the Boltzmann distribution (Jiang et al., 21 May 2025).

A salient design is to use outcome labels only at training time ("response-level" supervision) yet apply the ORM for differentiable scoring of full candidate outputs during RL, rejection sampling, or best-of-$N$ inference (Pan et al., 2023, Gao et al., 19 Oct 2024, Tritto et al., 1 Sep 2025).
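
As an illustration of the objectives above, the following sketch implements the pairwise Bradley–Terry loss over correct and incorrect candidate sets, plus a Boltzmann-style sampling rule of the kind used when the score is read as a negative energy. It is a simplified, assumption-laden example rather than code from the cited papers.

```python
# Illustrative sketch (not from the cited papers) of pairwise ORM training
# and Boltzmann-style candidate selection.
import torch
import torch.nn.functional as F

def bradley_terry_loss(scores_pos: torch.Tensor, scores_neg: torch.Tensor) -> torch.Tensor:
    """Average -log sigmoid(r(y+) - r(y-)) over all (y+, y-) pairs."""
    diff = scores_pos.unsqueeze(1) - scores_neg.unsqueeze(0)  # (|Y+|, |Y-|)
    return -F.logsigmoid(diff).mean()

def boltzmann_select(scores: torch.Tensor, temperature: float = 1.0) -> int:
    """Sample a candidate index with probability proportional to exp(score / T)."""
    probs = torch.softmax(scores / temperature, dim=0)
    return int(torch.multinomial(probs, num_samples=1))

# Toy usage: three correct and two incorrect candidates for one problem.
scores_pos = torch.tensor([1.2, 0.8, 0.5])   # ORM scores for Y+
scores_neg = torch.tensor([-0.3, 0.1])       # ORM scores for Y-
loss = bradley_terry_loss(scores_pos, scores_neg)
choice = boltzmann_select(torch.cat([scores_pos, scores_neg]))
```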

3. Label Efficiency, Verification, and Trade-offs

The principal strength of ORM supervision is its label efficiency: verifying correctness at the outcome level requires a single check per response, enabling rapid data collection and far cheaper annotation than PRM/stepwise supervision (Uesato et al., 2022, Zhou et al., 10 Jun 2025). In code, robotics, and SQL, correctness is often externally verifiable. As a consequence:

  • Typical data collection scales to millions of samples, supporting training at foundation-model scale (Agarwal et al., 15 Sep 2025, Tritto et al., 1 Sep 2025).
  • In code or tool use, ORMs support scalable verification pipelines. For example, in a generate-prune-then-rank design, fast weak verifiers (syntax checks, lightweight tests) first filter out clearly incorrect candidates and the ORM then ranks the survivors; the pipeline runs an order of magnitude faster than full test-suite evaluation with only a modest accuracy trade-off (Orlanski et al., 11 Jun 2025). A minimal sketch of this pipeline appears after this list.
  • In text-to-SQL, ORM-based reranking raises execution accuracy by 2–4% over execution-based or frequency-based baselines, especially as the candidate set scales (Tritto et al., 1 Sep 2025).
  • In logical reasoning, ORM reranking (with innovative echo-mode data) robustly outperforms majority voting in test-time scaling for multi-candidate verification (Thatikonda et al., 27 Aug 2025).
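
The generate-prune-then-rank pattern referenced above can be sketched as follows; the function signatures are illustrative assumptions, not the APIs of the cited systems.

```python
# Sketch of a generate-prune-then-rank pipeline (interfaces are hypothetical).
from typing import Callable, List

def generate_prune_rank(
    prompt: str,
    generate: Callable[[str, int], List[str]],    # policy: (prompt, k) -> candidates
    weak_verifiers: List[Callable[[str], bool]],  # cheap checks: syntax, smoke tests
    orm_score: Callable[[str, str], float],       # ORM: (prompt, candidate) -> scalar
    k: int = 16,
) -> str:
    candidates = generate(prompt, k)
    # Prune: drop candidates that fail any fast, weak verifier.
    survivors = [c for c in candidates if all(v(c) for v in weak_verifiers)]
    pool = survivors or candidates  # fall back if pruning removes everything
    # Rank: return the best-of-k survivor according to the ORM.
    return max(pool, key=lambda c: orm_score(prompt, c))
```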

The principal limitation is that outcome supervision provides no explicit feedback on step-wise reasoning quality: models may achieve correct final answers via non-transparent or flawed intermediate steps (the “trace error” or “reasoning error” problem) (Uesato et al., 2022, Ye et al., 3 Sep 2025).

4. Limitations, Hybridization, and Recent Extensions

Pure ORMs are limited by their inability to detect or penalize internally inconsistent reasoning if the end result is correct. Several extensions and hybrid approaches have emerged:

  • Implicit PRMs from ORMs: By parameterizing the outcome reward as a log-likelihood ratio between a policy and a reference model, token-level Q-values can be computed without step annotation (a minimal sketch of this log-ratio construction appears below). The advantage is that even models trained only with outcome-level labels can produce process-aligned stepwise value functions, improving data efficiency by more than 38× compared to MCTS-annotated PRMs (Yuan et al., 2 Dec 2024).
  • Intra-trajectory consistency: Regularizing adjacent process rewards via next-token generation probabilities helps propagate outcome supervision across steps, improving generalization and process alignment for DPO-trained policies (Zhou et al., 10 Jun 2025).
  • Hybrid curation: Strategies such as the PROF filter harmonize process (PRM) and outcome (ORM) rewards by selecting rollouts where process rewards and outcome align, thus both maintaining verification and promoting process coherency while avoiding direct blending and reward hacking (Ye et al., 3 Sep 2025).
  • Dual-consistency PRMs: Addressing the “granularity mismatch” in inference-time use (e.g., reward-guided search needing stepwise rewards from solution-level models), methods such as SP-PRM enforce score and preference consistency across partial and full outputs—extending ORMs to yield process-aligned feedback using only truncated outputs and reference model guidance (Xie et al., 14 Jun 2025).

These methods indicate that even when outcome-level labels are the only supervision available, ORM training strategies can be adapted to yield effective proxy process rewards, closing much of the empirical gap in reasoning trace quality without incurring annotation cost.
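
For concreteness, the implicit-PRM construction mentioned above can be sketched as reading per-token process rewards off the scaled log-likelihood ratio between the outcome-trained policy and a frozen reference model. This is our paraphrase of the idea under simplifying assumptions (a single sequence, random logits standing in for real model outputs), not the authors' implementation.

```python
# Sketch of implicit token-level rewards from a policy/reference log-ratio
# (simplified illustration of the implicit-PRM idea).
import torch
import torch.nn.functional as F

def implicit_process_rewards(
    policy_logits: torch.Tensor,  # (seq_len, vocab) from the outcome-trained policy
    ref_logits: torch.Tensor,     # (seq_len, vocab) from the frozen reference model
    token_ids: torch.Tensor,      # (seq_len,) generated token ids
    beta: float = 0.1,
) -> torch.Tensor:
    """Per-token rewards as beta * (log pi_theta - log pi_ref) at the sampled tokens."""
    logp_policy = F.log_softmax(policy_logits, dim=-1)
    logp_ref = F.log_softmax(ref_logits, dim=-1)
    idx = token_ids.unsqueeze(-1)
    per_token = (logp_policy.gather(-1, idx) - logp_ref.gather(-1, idx)).squeeze(-1)
    return beta * per_token  # cumulative sums act as step-level value estimates

# Toy usage with random logits in place of real model outputs.
seq_len, vocab = 8, 32
rewards = implicit_process_rewards(
    torch.randn(seq_len, vocab), torch.randn(seq_len, vocab),
    torch.randint(0, vocab, (seq_len,)),
)
outcome_level = rewards.sum()  # collapses back to a sequence-level score
```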

5. Applications Across Domains

ORMs have been successfully integrated into a spectrum of reasoning-intensive tasks:

  • Mathematical reasoning: reranking, RLHF, RL (Uesato et al., 2022, Pan et al., 2023, Jiang et al., 21 May 2025, Wang et al., 16 Mar 2025)
  • Code generation: verification, reranking (Yu et al., 19 Dec 2024, Orlanski et al., 11 Jun 2025)
  • Logical reasoning: candidate selection, test-time scaling (Thatikonda et al., 27 Aug 2025)
  • Text-to-SQL: semantic reranking (Tritto et al., 1 Sep 2025)
  • Tool use: reward-guided filtering, data-efficient fine-tuning (Agarwal et al., 15 Sep 2025)
  • Robotics: supervised reward inference from demonstrations (Schwarzer et al., 25 Feb 2025)

In mathematical reasoning, ORM-based RL (SFT+ORM-RL, few-shot+ORM-RL) reduces final-answer error rates and dramatically lowers reasoning error among correct answers (e.g., from 14% to 3.4%) (Uesato et al., 2022). In code verification, ORM ranking enables throughput increases of 9–12× while maintaining high pass rates relative to full test-suite evaluation (Orlanski et al., 11 Jun 2025). In tool-calling LLMs, domain-specific ORM training boosts small-model accuracy by up to 25% over generalist reward models (Agarwal et al., 15 Sep 2025). In text-to-SQL, ORM reranking outperforms both execution-based and majority-vote heuristics, especially when scaling the number of candidates (Tritto et al., 1 Sep 2025).

6. Comparative Evaluation: ORMs Versus PRMs and Emerging Alternatives

ORMs provide coarse, outcome-only supervision, which is highly effective for guiding policies toward correct final outputs, especially in domains with easily checkable outcomes. However, they are limited in incentivizing fine-grained reasoning quality and can produce “spurious” correct answers via flawed intermediate logic, an issue that can lead to noisy policy updates and poor reasoning transparency (Uesato et al., 2022, Ye et al., 3 Sep 2025).

PRMs, in contrast, offer step-level credit assignment at the cost of annotation—enabling better process control but at risk of reward hacking and instability unless carefully regularized or filtered. Token-level PRMs and Q-function-based reward models (Q-RM) offer finer granularity and improved sample efficiency, with recent work showing faster convergence and higher final accuracy in RL settings compared to ORM (Chen et al., 29 May 2025).

Hybrid schemes and consistency-enforcing frameworks (PROF, SP-PRM, implicit PRM) attempt to blend the strengths of both: preserving the annotation economy and verification strength of ORMs while inducing process-awareness, either through careful sample curation, regularization, or learned value decomposition (Yuan et al., 2 Dec 2024, Ye et al., 3 Sep 2025, Xie et al., 14 Jun 2025).
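
As a simplified illustration of the agreement-based curation described above (in the spirit of the PROF-style filter, not the paper's exact procedure), one can keep only rollouts whose aggregated process scores point the same way as the outcome label, rather than blending the two rewards directly; the threshold and aggregation rule below are assumptions.

```python
# Simplified sketch of agreement-based rollout filtering (illustrative only).
from typing import List, Tuple

def filter_consistent_rollouts(
    rollouts: List[Tuple[List[float], bool]],  # (step-wise process scores, outcome correct?)
    threshold: float = 0.5,
) -> List[Tuple[List[float], bool]]:
    kept = []
    for step_scores, is_correct in rollouts:
        mean_process = sum(step_scores) / len(step_scores)
        process_says_good = mean_process >= threshold
        # Keep only rollouts where process and outcome signals agree.
        if process_says_good == is_correct:
            kept.append((step_scores, is_correct))
    return kept
```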

7. Future Directions and Open Questions

Several open directions emerge for ORM research and deployment:

  • Unified process–outcome frameworks: Techniques such as outcome-refining process supervision for code, PRMs built via generative process evaluation, and dual-consistency process modeling suggest potential for ORM-derived process rewards that are both cheap and efficacious (Yu et al., 19 Dec 2024, Zhou et al., 26 Sep 2025).
  • Robustness and anti-reward-hacking: Further strategies are needed to address exploitation of ORM or PRM scoring by models generating verbose, redundant, or misleading traces for score maximization.
  • Scalability in high-throughput domains: ORM-pruned ranking has proven effective for code and SQL, but scaling to even larger candidate pools and integrating dynamic oracles remains an active area (Orlanski et al., 11 Jun 2025, Agarwal et al., 15 Sep 2025).
  • Transfer to new modalities and domains: The data efficiency and process-emulation properties of ORMs support exploration in cross-domain, visual, or multi-modal RLHF, robotic control, and beyond (Schwarzer et al., 25 Feb 2025).
  • Process reward introspection: Exploiting ORM-derived implicit Q-functions, intra-trajectory regularization, and generative process supervision can yield more interpretable and trustworthy LLMs, further supporting RLHF and broader AI alignment objectives (Yuan et al., 2 Dec 2024, Zhou et al., 10 Jun 2025, Zhou et al., 26 Sep 2025).

In conclusion, Outcome-supervised Reward Models are a foundational tool for scalable, label-efficient supervision in generative modeling and reinforcement learning for language, code, logic, robotics, and tool-augmented tasks. Continued advances in hybridization with process-awareness and sophisticated sample curation extend their impact for transparent, robust, and performant AI systems.
