Outcome-Supervised Reward Models
- Outcome-Supervised Reward Models (ORMs) are defined as functions that assign a scalar score to full LLM output trajectories based solely on final, verifiable outcomes.
- ORMs use training objectives such as binary classification and pairwise preference losses, avoiding the need for dense, stepwise annotations and improving data efficiency.
- They are applied for robust reranking and evaluation in domains such as reasoning, code generation, SQL synthesis, and multi-agent tasks, leading to improved performance.
Outcome-Supervised Reward Model (ORM) refers to a family of reward models that assign scalar scores to complete output trajectories of LLMs, using only supervision derived from final, verifiable outcomes—such as correctness of an answer, passing of test cases, or semantic agreement with ground truth—rather than intermediate process-level annotations. ORMs serve as sequence-level or trajectory-level validators across numerous domains, providing efficient, robust, and data-efficient alternatives to process-supervised reward models (PRMs), particularly in settings where dense stepwise labels are impractical to obtain. The ORM paradigm underpins key advances in mathematical reasoning, logical inference, code generation, SQL synthesis, multimodal reasoning, and computer-using agent evaluation.
1. Formal Definition and Core Training Objectives
An Outcome-Supervised Reward Model defines a scalar function (often denoted $r_\phi$ or $R$) mapping a prompt–response pair $(x, y)$ to a real-valued score or reward:
$$r_\phi : (x, y) \mapsto r_\phi(x, y) \in \mathbb{R}$$
In most RLHF, alignment, or search-based learning pipelines, $r_\phi$ is trained solely on outcome supervision, such as human preference comparisons over full responses, binary correctness labels, or execution success/failure.
Common training objectives include:
- Binary Cross-Entropy (CE)/Classification: The ORM predicts the probability $p_\phi(x, y) = \sigma(r_\phi(x, y))$ that $y$ is "correct." The CE loss is minimized over binary-labeled examples $(x, y, c)$ with $c \in \{0, 1\}$:
$$\mathcal{L}_{\text{CE}} = -\mathbb{E}_{(x, y, c)}\big[c \log p_\phi(x, y) + (1 - c)\log\big(1 - p_\phi(x, y)\big)\big]$$
- Bradley–Terry/Pairwise Preference: On preference data $(x, y^{+}, y^{-})$, where $y^{+}$ is the preferred full response, minimize
$$\mathcal{L}_{\text{BT}} = -\mathbb{E}_{(x, y^{+}, y^{-})}\big[\log \sigma\big(r_\phi(x, y^{+}) - r_\phi(x, y^{-})\big)\big]$$
- Outcome-Only Reinforcement Learning: Reward assigned only at the trajectory end, e.g., $r(x, y) = \mathbb{1}[\text{final answer of } y \text{ is correct}]$ (Gao et al., 19 Oct 2024). A minimal code sketch of the first two objectives follows this list.
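To make the first two objectives concrete, the following is a minimal PyTorch sketch of the CE and Bradley–Terry losses operating on precomputed ORM scores; tensor names and shapes are illustrative assumptions, not any cited implementation.

```python
import torch
import torch.nn.functional as F

def orm_bce_loss(scores: torch.Tensor, labels: torch.Tensor) -> torch.Tensor:
    """Binary cross-entropy on outcome labels.

    scores: raw ORM logits r_phi(x, y), shape (batch,)
    labels: 1.0 if the full response is correct, else 0.0, shape (batch,)
    """
    return F.binary_cross_entropy_with_logits(scores, labels)

def orm_bradley_terry_loss(chosen_scores: torch.Tensor,
                           rejected_scores: torch.Tensor) -> torch.Tensor:
    """Bradley-Terry pairwise preference loss.

    chosen_scores / rejected_scores: ORM scores of the preferred and
    dispreferred full responses to the same prompt, shape (batch,)
    """
    return -F.logsigmoid(chosen_scores - rejected_scores).mean()

# Toy usage with random scores standing in for ORM outputs.
if __name__ == "__main__":
    scores = torch.randn(8)
    labels = torch.randint(0, 2, (8,)).float()
    print(orm_bce_loss(scores, labels).item())
    print(orm_bradley_terry_loss(torch.randn(8), torch.randn(8)).item())
```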
Some advanced parameterizations (notably implicit PRMs) express $r_\phi(x, y)$ as a log-likelihood ratio between policy and reference models, $r_\phi(x, y) = \beta \log \frac{\pi_\phi(y \mid x)}{\pi_{\text{ref}}(y \mid x)}$, enabling process rewards to be recovered without stepwise annotation (Yuan et al., 2 Dec 2024).
2. Architecture, Supervision, and Inference
Model Architecture
ORMs are typically realized as:
- Classifier Heads atop pretrained transformers, acting on whole response sequences. For code, math, and SQL, the architecture involves a transformer encoder whose final state is fed to a classification or regression head (Pan et al., 2023, Thatikonda et al., 27 Aug 2025, Tritto et al., 1 Sep 2025); a minimal sketch of this design follows this list.
- Verifier Models that map execution traces, chains-of-thought, multi-modal inputs, or agent trajectories to binary or continuous outcome scores (Yu et al., 19 Dec 2024, Chen et al., 10 Nov 2025, Lin et al., 21 Oct 2025).
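As an illustration of the classifier-head design, the following minimal sketch pools the final non-padding hidden state of a pretrained backbone and maps it to a single logit. The backbone choice (gpt2) and pooling strategy are illustrative assumptions, not the configurations used in the cited works.

```python
import torch
import torch.nn as nn
from transformers import AutoModel, AutoTokenizer

class OutcomeRewardModel(nn.Module):
    """Scalar ORM: pretrained transformer backbone plus a one-logit score head."""

    def __init__(self, backbone_name: str = "gpt2"):  # backbone choice is illustrative
        super().__init__()
        self.backbone = AutoModel.from_pretrained(backbone_name)
        self.score_head = nn.Linear(self.backbone.config.hidden_size, 1)

    def forward(self, input_ids, attention_mask):
        hidden = self.backbone(input_ids=input_ids,
                               attention_mask=attention_mask).last_hidden_state
        # Score each sequence from the hidden state of its last non-padding token.
        last_idx = attention_mask.sum(dim=1) - 1
        pooled = hidden[torch.arange(hidden.size(0)), last_idx]
        return self.score_head(pooled).squeeze(-1)  # one scalar per (prompt, response)

tokenizer = AutoTokenizer.from_pretrained("gpt2")
tokenizer.pad_token = tokenizer.eos_token  # gpt2 has no pad token by default
model = OutcomeRewardModel("gpt2")
batch = tokenizer(["Q: 2+2? A: 4", "Q: 2+2? A: 5"], return_tensors="pt", padding=True)
print(model(**batch))  # untrained scores; outcome supervision would fit this head
```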
Supervision Data
- Outcome Labels: Correct/incorrect solutions (Pan et al., 2023), passing/failing testcases in code (Yu et al., 19 Dec 2024), execution-equivalent SQL queries (Tritto et al., 1 Sep 2025), or binary agent success (Lin et al., 21 Oct 2025).
- Preference Pairs: Human or surrogate feedback indicating which full response is preferred (Xie et al., 14 Jun 2025, Zhou et al., 10 Jun 2025).
Inference
- Best-of-N Reranking: Given $N$ candidates $\{y_i\}_{i=1}^{N}$ for a prompt $x$, select $y^{\star} = \arg\max_i r_\phi(x, y_i)$ (Thatikonda et al., 27 Aug 2025, Jiang et al., 21 May 2025); a minimal sketch follows this list.
- Verifier Filtering: ORM is used as a filter in self-evolving or iterative reflection (Chen et al., 10 Nov 2025).
- Search Integration: ORM scores are blended or applied at the leaf nodes of search procedures (MCTS, beam search, etc.) (Yazdani et al., 28 Oct 2025, Yu et al., 19 Dec 2024).
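A minimal sketch of ORM-based best-of-N reranking, assuming a hypothetical `score(prompt, candidate)` callable (for example, the model sketched in the architecture subsection); candidate sampling and batching details vary across the cited systems.

```python
from typing import Callable, Sequence

def best_of_n(prompt: str,
              candidates: Sequence[str],
              score: Callable[[str, str], float]) -> str:
    """Return the candidate with the highest ORM score for the given prompt."""
    return max(candidates, key=lambda y: score(prompt, y))

# Toy usage with a stand-in scorer that simply prefers shorter answers.
if __name__ == "__main__":
    dummy_score = lambda x, y: -len(y)
    print(best_of_n("Q: 2+2?", ["The answer is 4.", "4"], dummy_score))
```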
3. Strengths, Limitations, and Empirical Behavior
Strengths
- Data Efficiency: ORMs require only response-level labels, drastically reducing annotation overhead compared to PRMs (Yuan et al., 2 Dec 2024).
- Robustness: Shown to be less susceptible to reward hacking and degenerate shortcut behavior than finer-grained process reward signals (Gao et al., 19 Oct 2024, Ye et al., 3 Sep 2025).
- Domain Generality: Trainable across text, code, SQL, vision–language, and agent domains with minimal adaptation (Tritto et al., 1 Sep 2025, Lin et al., 21 Oct 2025).
- Test-time Scaling: ORMs benefit from increased candidate pool size at inference; ORM-based best-of-N selection often outperforms majority voting and execution-based reranking, particularly on complex queries or logical inference tasks (Thatikonda et al., 27 Aug 2025, Tritto et al., 1 Sep 2025).
Limitations
- Coarse-Grained Signal: ORMs provide no guidance on how an answer was derived, rewarding only the final outcome (Ye et al., 3 Sep 2025, Xie et al., 14 Jun 2025).
- Process Blindness: Sound processes yielding wrong answers are penalized; flawed processes accidentally producing correct outcomes are rewarded equally (Ye et al., 3 Sep 2025).
- Inconsistency for Partial Sequences: ORMs trained on outcomes lack score and preference consistency for prefixes, making them suboptimal for reward-guided search (RGS) or stepwise guidance (Xie et al., 14 Jun 2025, Zhou et al., 10 Jun 2025).
- Domain Constraints: Some ORM flavors depend on executable or verifiable environments (e.g., code), which limits their transfer to purely abstract or interpretive tasks where outcomes cannot be automatically checked (Yu et al., 19 Dec 2024).
4. Methodological Extensions and Variants
Outcome-Refining Architectures
- ORPS (Outcome-Refining Process Supervision): Unifies outcome and process signals via a tree-structured search with execution-based feedback and self-critique, integrating runtime correctness/efficiency at each reasoning step (Yu et al., 19 Dec 2024).
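The following is only a schematic sketch of such an outcome-refining loop, not the ORPS implementation: candidates are executed, outcome and self-critique signals are blended, and the top-scoring candidates seed the next refinement round. All helper names (`generate_candidates`, `execute_tests`, `critique`) and the 0.5/0.5 blend are hypothetical placeholders.

```python
def outcome_refining_search(problem, generate_candidates, execute_tests,
                            critique, rounds: int = 3, beam: int = 4):
    """Schematic beam-style refinement loop with execution-based outcome feedback.

    generate_candidates(problem, parent) -> list[str]   # hypothetical LLM call
    execute_tests(program) -> float     # fraction of tests passed (outcome signal)
    critique(problem, program) -> float # self-critique score in [0, 1]
    """
    frontier = [None]  # None = generate from scratch, no parent program yet
    best_program, best_score = None, float("-inf")
    for _ in range(rounds):
        scored = []
        for parent in frontier:
            for program in generate_candidates(problem, parent):
                # Blend execution outcome with self-critique (weights are assumptions).
                score = 0.5 * execute_tests(program) + 0.5 * critique(problem, program)
                scored.append((score, program))
        scored.sort(key=lambda t: t[0], reverse=True)
        if scored and scored[0][0] > best_score:
            best_score, best_program = scored[0]
        frontier = [p for _, p in scored[:beam]]  # top-k candidates seed the next round
    return best_program, best_score
```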
Implicit and Dual-Consistency Process Models
- Implicit PRM via Likelihood Ratio: By parameterizing $r_\phi(x, y) = \beta \log \frac{\pi_\phi(y \mid x)}{\pi_{\text{ref}}(y \mid x)}$, one recovers stepwise process rewards via token-level log-ratios, sidestepping explicit step labeling (Yuan et al., 2 Dec 2024); see the sketch after this list.
- SP-PRM and Score/Preference Consistency: PRMs that extend ORMs to partial prefixes with enforced agreement under score and surrogate preference constraints, improving guidance for RGS (Xie et al., 14 Jun 2025).
- Intra-Trajectory Consistency (ICRM): Regularizes ORM training by enforcing prefix-wise reward consistency proportional to next-token generation probabilities, leveraging Bayesian decomposition of trajectory outcome (Zhou et al., 10 Jun 2025).
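Under the log-likelihood-ratio parameterization in the first item above, prefix-level process rewards can be read off as scaled cumulative differences of policy and reference token log-probabilities. The snippet below is a minimal sketch assuming aligned per-token log-probabilities are already available; it is not the exact procedure of the cited work.

```python
import torch

def implicit_process_rewards(policy_logprobs: torch.Tensor,
                             ref_logprobs: torch.Tensor,
                             beta: float = 0.1) -> torch.Tensor:
    """Per-token implicit rewards r_t = beta * (log pi_phi - log pi_ref).

    policy_logprobs, ref_logprobs: log-probabilities of the generated tokens
    under the trained ORM policy and the frozen reference, shape (seq_len,).
    Cumulative sums give prefix-level (process) rewards; the final entry
    recovers the trajectory-level outcome reward.
    """
    token_rewards = beta * (policy_logprobs - ref_logprobs)
    return token_rewards.cumsum(dim=-1)

# Toy usage with random negatives standing in for token log-probabilities.
if __name__ == "__main__":
    pol = -torch.rand(12)   # stand-in log-probs under the policy
    ref = -torch.rand(12)   # stand-in log-probs under the frozen reference
    print(implicit_process_rewards(pol, ref))
```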
Energy-Based and Margin-Based ORM Learning
- Energy-Based ORM (EORM): Treats outcome label logits as negative energies, enabling margin-based or Bradley–Terry ranking across candidate CoT trajectories (Jiang et al., 21 May 2025).
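A minimal sketch of one margin-based variant, assuming lower energy indicates a better chain of thought; the hinge form and margin value are illustrative assumptions rather than the EORM training recipe.

```python
import torch
import torch.nn.functional as F

def energy_margin_loss(pos_energy: torch.Tensor,
                       neg_energy: torch.Tensor,
                       margin: float = 1.0) -> torch.Tensor:
    """Hinge loss pushing correct trajectories to lower energy than incorrect ones.

    pos_energy: energies of correct CoT candidates, shape (batch,)
    neg_energy: energies of incorrect CoT candidates, shape (batch,)
    """
    return F.relu(margin + pos_energy - neg_energy).mean()

# At inference, the candidate with the lowest energy is selected (argmin).
```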
Multi-Agent and Multimodal Extensions
- MASPRM + ORM: In multi-agent systems, process reward models guide search at each step, while ORM is reserved for terminal outcome scoring; end-to-end value mixing is controlled via a hyperparameter (Yazdani et al., 28 Oct 2025).
- Vision-Language ORM: For computer-using agents, ORMs are implemented by prompt-based classification with large vision–LLMs; strict ensemble voting increases reward reliability in high-stakes RL (Lin et al., 21 Oct 2025).
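A minimal sketch of strict ensemble voting over outcome judgments, assuming each judge wraps a prompted vision–language model that returns a boolean verdict; requiring unanimous agreement trades recall for the precision needed in high-stakes RL.

```python
from typing import Callable, Sequence

Judge = Callable[[dict], bool]  # a prompted VLM wrapped to return True/False

def strict_ensemble_reward(trajectory: dict, judges: Sequence[Judge]) -> float:
    """Reward 1.0 only if every judge labels the agent trajectory successful."""
    return 1.0 if all(judge(trajectory) for judge in judges) else 0.0

# Toy usage with stand-in judges.
if __name__ == "__main__":
    judges = [lambda t: True, lambda t: t.get("task_done", False)]
    print(strict_ensemble_reward({"task_done": True}, judges))   # 1.0
    print(strict_ensemble_reward({"task_done": False}, judges))  # 0.0
```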
5. Empirical Results Across Domains
| Domain/Task | Key Findings | Reference |
|---|---|---|
| Mathematical Reasoning | ORMs yield large gains on complex benchmarks (MATH: +18% (Pan et al., 2023)) but smaller gains on simple ones; the energy-based ORM (EORM) reaches state-of-the-art accuracy with few samples. | (Pan et al., 2023, Jiang et al., 21 May 2025) |
| Code Generation | ORPS (tree-structured ORM unifying stepwise reward and outcomes) yields +26.9% correctness, +42.2% code efficiency over CoT/outcome-only. | (Yu et al., 19 Dec 2024) |
| SQL Synthesis | ORM-based Best-of-N outperforms execution-only reranking or majority voting by 2–5% execution accuracy on BIRD/SPIDER. | (Tritto et al., 1 Sep 2025) |
| Deductive Logical Reasoning | Echo-augmented ORMs boost test-time accuracy 5–15pp over majority vote, especially as sample size increases. | (Thatikonda et al., 27 Aug 2025) |
| RL Training (Math) | ORM offers no benefit beyond sparse success reward during RL; PRMs require careful regularization to avoid reward hacking. | (Gao et al., 19 Oct 2024) |
| Multimodal Mathematical | ORM with step-indexed error diagnosis improves verifier accuracy by 4.5pp, yielding better iterative fine-tuning with reflection. | (Chen et al., 10 Nov 2025) |
| Multi-Agent Reasoning | Adding ORM to MASPRM-guided search yields further +2–4pp accuracy in multiagent math tasks. | (Yazdani et al., 28 Oct 2025) |
| Computer-Using Agents | ORM ensembles achieve 89.8% precision and 93.3% negative predictive value (NPV); general-purpose VLMs outperform specialized computer-using-agent models. | (Lin et al., 21 Oct 2025) |
Further findings include the marginal utility of response diversity, the relative value of scaling instructions versus responses in the training data, and cost/accuracy tradeoffs introduced by the reference model at inference (Yuan et al., 2 Dec 2024).
6. Practical Implications and Design Guidelines
- Supervision Efficiency: ORM-based strategies should be the default when per-step supervision is cost-prohibitive. Implicit process evaluation can be recovered from appropriate parameterizations.
- Inference-Time Alignment: ORMs enable reliable, compute-efficient candidate selection through reranking and ensemble methods, with evidence of superior performance in high-difficulty or multi-agent scenarios.
- Search Guidance: For effective reward-guided search, ORM signals must be distilled into consistent process models, e.g., via SP-PRM or intra-trajectory regularization, to avoid granularity mismatch.
- Reward Hacking Mitigation: ORMs exhibit robust resistance to reward hacking; the main source of vulnerability lies in finer-grained but noisy PRMs.
- Ensemble and Prompt Design: In high-stakes evaluation, strict ensemble voting and prompt-template diversity enhance reliability of ORM predictions, especially for vision-language and agent tasks.
7. Limitations, Open Challenges, and Future Directions
While ORMs substantially improve upon majority voting, execution-based heuristics, and brute-force sampling, notable challenges persist:
- Process–Outcome Disconnect: ORMs provide little process-level feedback, necessitating auxiliary methods for fine-grained reasoning improvement (Ye et al., 3 Sep 2025, Xie et al., 14 Jun 2025).
- Score Consistency: Addressing the inconsistency of ORM scores over partial sequences remains fundamental for their use in decoding and RGS; this motivates ongoing development of dual-consistency and regularization frameworks.
- Generalization: While ORM-based methods are robust in math, code, logic, and SQL, further validation in open-domain generation and settings with ambiguous ground-truth is needed.
- Human Alignment: Surrogate preference-based extensions offer promising directions; future work is poised to leverage human-in-the-loop data for partial-sequence rewards, richer consistency criteria, and multi-objective alignment (Xie et al., 14 Jun 2025).
In summary, Outcome-Supervised Reward Models provide a principled, data-efficient foundation for reward modeling in LLM-based reasoning, generation, and agentic tasks. When carefully integrated with process-level consistency mechanisms or domain-aligned architectures, ORMs offer a high-precision, low-overhead alternative to conventional process-intermediate supervision, and are central to many current state-of-the-art systems.