Outcome-Based Reward Mechanism
- Outcome-Based Reward Mechanism is a method where rewards are defined solely by the final result of a task, bypassing intermediate evaluations.
- It is widely applied in reinforcement learning and large language model reasoning using binary terminal rewards, deterministic verifiers, and dense score formulations.
- Its main challenge is sparse signal credit assignment, which has spurred research into hybrid methods incorporating sub-goal verification and process supervision.
An outcome-based reward mechanism is a reward specification in which supervision, utility, or payment is determined from the realized result of a complete trajectory, task, or public outcome rather than from direct evaluation of intermediate steps. In contemporary LLM, MLLM, and agentic RL literature, this appears as binary terminal rewards such as $1$ for success and $0$ for failure, deterministic verifiers for final-answer correctness, trajectory-level success predictors over full computer-using traces, and scalar rewards that score whether a generated outcome matches the target class (Liu et al., 28 Sep 2025, Ye et al., 3 Sep 2025, Lin et al., 21 Oct 2025, Hou et al., 22 Mar 2026). Its central attraction is that the reward is often easy to verify at the endpoint, whereas stepwise supervision is expensive or unavailable; its central difficulty is credit assignment, because intermediate reasoning, search, or tool-use behavior is only indirectly constrained.
1. Formal definitions and mathematical forms
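In agentic RL, the canonical formulation is episodic and binary. Under GRPO-style analysis, the reward of a trajectory $\tau$ is defined, in standard notation, as
$$R(\tau) = \mathbb{1}[\tau \text{ succeeds}] \in \{0, 1\},$$
with the associated value functions
$$V^{\pi}(s) = \mathbb{E}_{\pi}[R(\tau) \mid s], \qquad Q^{\pi}(s, a) = \mathbb{E}_{\pi}[R(\tau) \mid s, a],$$
and advantage
$$A^{\pi}(s, a) = Q^{\pi}(s, a) - V^{\pi}(s).$$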
This is the simplest form of outcome-only supervision: all gradient information is mediated through the final success event (Liu et al., 28 Sep 2025).
In LLM reasoning, the same mechanism is often written as a verifier over a prompt–response pair. One formulation treats an Outcome Reward Model as a deterministic verifier over a prompt $x$ and response $y$,
$$r(x, y) \in \{0, 1\},$$
returning $1$ if the final answer is correct and $0$ otherwise; GRPO then standardizes these trajectory-level rewards within a rollout group to form advantages for policy optimization (Ye et al., 3 Sep 2025). A closely related formulation in deductive reasoning uses a binary classifier over a problem and a candidate chain-of-thought, with the classifier's correctness prediction serving as the reward, enabling test-time reranking of multiple sampled traces (Thatikonda et al., 27 Aug 2025).
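The group standardization mentioned above can be sketched in a few lines. This is a minimal illustration, not any paper's implementation: the function names, the string-equality verifier, the use of the population standard deviation, and the `eps` stabilizer are all assumptions made for the example.

```python
from statistics import mean, pstdev

def binary_outcome_reward(predicted_answer, reference_answer):
    # Hypothetical verifier: 1.0 if the final answers match exactly, else 0.0.
    return 1.0 if predicted_answer.strip() == reference_answer.strip() else 0.0

def group_standardized_advantages(rewards, eps=1e-6):
    # Standardize trajectory-level rewards within one rollout group.
    # Assumption: population std plus a small eps to avoid division by zero.
    mu = mean(rewards)
    sigma = pstdev(rewards)
    return [(r - mu) / (sigma + eps) for r in rewards]

# Four rollouts for the same prompt; only the final answer is checked.
rollouts = ["42", "41", "42 ", "7"]
rewards = [binary_outcome_reward(a, "42") for a in rollouts]
advantages = group_standardized_advantages(rewards)
```

Note that a group in which every rollout succeeds (or every rollout fails) yields all-zero advantages, so such groups contribute no gradient signal under this scheme.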
Outcome-based reward mechanisms also appear as trajectory-level evaluators rather than policy-training signals. In computer-using agents, an ORM takes a full trajectory and predicts either a binary task-success label or a success probability that is thresholded into a label (Lin et al., 21 Oct 2025). In tool-calling, the reward model is often preference-based: ToolRM defines a scalar score over a state and a candidate function-calling sequence, and trains with a Bradley–Terry objective over preferred and dispreferred final calls (Agarwal et al., 15 Sep 2025).
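For readers unfamiliar with the Bradley–Terry objective, the per-pair loss is the negative log-probability that the preferred candidate outranks the dispreferred one, $-\log \sigma(r_{+} - r_{-})$. The sketch below shows only this generic loss on two scalar scores; the function name is hypothetical and nothing here reflects ToolRM's actual training code.

```python
import math

def bradley_terry_loss(score_preferred, score_dispreferred):
    # Negative log-likelihood that the preferred call is ranked higher.
    margin = score_preferred - score_dispreferred
    # -log(sigmoid(margin)) == softplus(-margin), written in a stable form.
    return max(-margin, 0.0) + math.log1p(math.exp(-abs(margin)))

# A correct ranking with a clear margin gives a small loss;
# the reversed ranking gives a large one.
loss_good = bradley_terry_loss(2.0, 0.0)
loss_bad = bradley_terry_loss(0.0, 2.0)
```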
A broader variant replaces binary correctness with a differentiable scalar outcome score. RLVC, for generative zero-shot learning, freezes a classifier, applies it to the synthesized feature, and sets the outcome-based reward to a scalar score of how well the classifier's prediction matches the target class. Here the reward still depends only on the final synthesized feature’s class correctness, but it is no longer restricted to $\{0, 1\}$ (Hou et al., 22 Mar 2026).
2. Role in policy optimization and test-time selection
Outcome-based reward mechanisms are a standard foundation for RL with verifiable rewards in mathematical reasoning. OREAL models the LLM as a stochastic policy in a deterministic MDP with zero reward at every nonterminal step and a binary terminal reward
$$r(\tau) \in \{0, 1\}$$
indicating whether the final answer is correct. Within a KL-regularized objective, the paper proves that behavior cloning on positive trajectories from best-of-$N$ sampling is sufficient to learn the KL-regularized optimal policy in binary feedback environments, and then adds reward shaping for negative samples and a token-level reward model for long trajectories. Reported results include $94.0$ pass@1 on MATH-500 for a $7$B model and $95.0$ pass@1 for OREAL-32B (Lyu et al., 10 Feb 2025).
The same reward class is widely used for test-time scaling rather than online RL. In deductive logical reasoning, ORMs are trained as binary classifiers over complete chain-of-thought traces and then used to rerank $N$ sampled candidates. The training pipeline combines standard CoT sampling with Echo-CoT negatives, where the model is deliberately prompted toward incorrect reasoning and filtered to keep nontrivial flawed traces. For GPT-4o best-of-8 selection, the paper reports improvements over a majority-vote baseline on ProverQA-Hard and JustLogic with both ORM-CoT and ORM-EcCoT, and reports FOLIO results with ORM-EcCoT (Thatikonda et al., 27 Aug 2025).
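The difference between majority voting and ORM-based reranking can be made concrete with a toy example. The candidate records, the `score` field standing in for an ORM's output, and both function names are illustrative assumptions; a real ORM would score the full chain-of-thought text.

```python
from collections import Counter

def majority_vote(candidates):
    # Pick the most frequent final answer, ignoring the reasoning traces.
    answers = [c["answer"] for c in candidates]
    return Counter(answers).most_common(1)[0][0]

def best_of_n(candidates, orm_score):
    # Pick the answer of the single trace the ORM scores highest.
    return max(candidates, key=orm_score)["answer"]

# Hypothetical sampled traces with precomputed ORM scores.
candidates = [
    {"trace": "flawed chain A", "answer": "False", "score": 0.20},
    {"trace": "flawed chain B", "answer": "False", "score": 0.35},
    {"trace": "sound chain C", "answer": "True", "score": 0.90},
]

voted = majority_vote(candidates)
reranked = best_of_n(candidates, orm_score=lambda c: c["score"])
```

In this toy case the two flawed traces outvote the sound one, while the reranker recovers it, which is the situation in which an ORM can beat a majority-vote baseline.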
In search-augmented agents, outcome reward is often one stage in a larger optimization schedule. DeSA defines a binary exact-match reward
$$r_{\mathrm{EM}} = \mathbb{1}[\hat{a} = a^{*}]$$
for the final answer $\hat{a}$ against the reference $a^{*}$, but postpones its use until after a search-skill stage driven by a retrieval-recall-based reward. The paper compares Search-R1 and DeSA on deficiency rate, recall, and average EM for Qwen2.5-3B-Instruct, and reports a relative EM gain for DeSA on Qwen2.5-7B-Instruct (Wang et al., 6 Oct 2025). This suggests that outcome-based reward remains effective for answer optimization, but not necessarily for shaping intermediate search behavior.
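A binary exact-match reward of this kind is usually computed after light answer normalization. The sketch below assumes a common QA-style normalization (lowercasing, stripping punctuation and English articles); the cited paper's exact normalization is not given here, so treat these rules and function names as assumptions.

```python
import re
import string

def normalize_answer(text):
    # Assumed normalization: lowercase, drop punctuation and articles,
    # collapse whitespace.
    text = text.lower()
    text = "".join(ch for ch in text if ch not in string.punctuation)
    text = re.sub(r"\b(a|an|the)\b", " ", text)
    return " ".join(text.split())

def exact_match_reward(prediction, gold_answers):
    # 1.0 if the normalized prediction equals any normalized gold answer.
    pred = normalize_answer(prediction)
    return 1.0 if any(pred == normalize_answer(g) for g in gold_answers) else 0.0

reward_hit = exact_match_reward("The Eiffel Tower.", ["Eiffel Tower"])
reward_miss = exact_match_reward("Paris", ["London"])
```

The reward inspects only the final answer string, so it carries no information about whether the agent searched, repeated queries, or issued invalid searches along the way.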
3. Credit assignment, sparsity, and recurrent failure modes
The defining weakness of outcome-based reward mechanisms is sparsity. SGVR characterizes traditional outcome-based reward as “extremely sparse,” noting three specific pathologies: models can “guess” a correct final number despite flawed intermediate logic, small arithmetic slips can zero out an otherwise valid proof, and no feedback is provided on which deductive steps are wrong (Chen et al., 8 Jan 2026). PROF makes the same point in RLVR for mathematical reasoning: two trajectories, one logically sound and one flawed, receive the same reward if their final answers coincide. In the reported Qwen2.5-Math training runs, a share of the correct trajectories filtered out by PROF were judged by a strong LLM judge to contain fundamentally flawed reasoning (Ye et al., 3 Sep 2025).
Multiple-choice multimodal RL exposes a related phenomenon: unfaithful trajectories that guess the correct option after a faulty chain of thought receive the same reward as genuine reasoning. SCS formalizes this by observing that, given a truly correct CoT and an incorrect one, outcome reward alone cannot distinguish them whenever both terminate in the correct option. The paper reports that outcome-reward training yields only modest accuracy gains in multiple-choice mode compared with open-ended mode, and truncation-resampling experiments show that many prefixes do not stably determine the final option (Wang et al., 13 Nov 2025).
In tool-using and search-augmented agents, the same coarse supervision leads to systematic behavior deficits. DeSA identifies three recurrent deficiencies under exact-match-only training: No Search, Duplicate Queries, and Invalid Searches. On both Qwen2.5-3B and Qwen2.5-7B, trajectories exhibiting any deficiency have markedly lower retrieval recall and EM than clean trajectories, and the paper also reports, for each model, the share of recall failures that contain at least one deficiency (Wang et al., 6 Oct 2025).
A notable controversy concerns “reward miscalibration” in GRPO. One line of argument claims that outcome-based rewards wrongly reinforce flawed middle steps. The counteranalysis in “Rethinking Reward Miscalibration of GRPO in Agentic RL” derives the expected advantage of an action under a binary outcome reward and concludes that flawed actions should receive negative expected advantage. The paper then identifies gradient coupling, not the reward definition itself, as the key issue: when prompts and action spaces are highly similar, cross-sample gradient interference can still strengthen flawed actions. To mitigate this, it adds a binary classification head and optimizes the policy objective together with an auxiliary classification loss, thereby separating embeddings of good and bad actions (Liu et al., 28 Sep 2025).
4. Outcome-linked dense rewards and hybridization with process supervision
A major research direction retains outcome-based supervision but augments, reshapes, or decomposes it into denser signals. The abstract of LeTS describes a framework for retrieval-augmented generation that “hybridizes stepwise process reward and outcome-based reward to current RL methods for RAG,” using a process-level reward module to mitigate the unawareness of intermediate reasoning steps “without additional annotation.” The abstract further states that extensive experiments demonstrate generalization and inference efficiency across various RAG benchmarks (Zhang et al., 23 May 2025).
SGVR replaces a single binary terminal reward with dense sub-goal verification. For a problem with $K$ numeric sub-goals, the instance-level Skeleton Rate is
$$\mathrm{SR} = \frac{1}{K} \sum_{k=1}^{K} \mathbb{1}[\text{sub-goal } k \text{ is verified correct}],$$
and the trajectory reward is the fraction of correctly verified sub-goals. The paper reports gains on geometric reasoning, general mathematics, and general reasoning, with in-domain SR, SC, and CR all improving (Chen et al., 8 Jan 2026).
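The contrast between a single terminal check and a fraction-of-sub-goals reward can be sketched as follows. The dictionary representation of sub-goals, the tolerance-based numeric check, and the sub-goal names are assumptions for illustration, not SGVR's actual verifier.

```python
def subgoal_reward(predicted, reference, tol=1e-6):
    # Fraction of reference sub-goals whose predicted value matches numerically.
    if not reference:
        raise ValueError("at least one sub-goal is required")
    correct = sum(
        1
        for name, value in reference.items()
        if name in predicted and abs(predicted[name] - value) <= tol
    )
    return correct / len(reference)

# Hypothetical geometry problem with four numeric sub-goals.
reference = {"angle_ABC": 60.0, "side_AC": 5.0, "area": 10.0, "final": 12.5}
# One arithmetic slip in the last step: a binary outcome reward would be 0.
predicted = {"angle_ABC": 60.0, "side_AC": 5.0, "area": 10.0, "final": 13.5}

dense = subgoal_reward(predicted, reference)
binary = 1.0 if abs(predicted["final"] - reference["final"]) <= 1e-6 else 0.0
```

Here the dense reward still credits the three verified intermediate results, whereas the terminal-only reward discards them.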
Outcome-supervised process reward modeling addresses the same problem from the opposite direction: keep only final-answer labels, but infer stepwise responsibility. LCA formulates the reasoning chain as a Multiple Instance Learning problem whose bag label is the correctness of the final answer, adopts the “Weakest Link Assignment” principle, and introduces Softmax-Weighted-Sum pooling to approximate min-style aggregation while keeping every prefix active in training. The paper proves Bayes consistency under a mild assumption and reports ProcessBench F1 gains over the best outcome-supervised PRM, together with the highest average best-of-$N$ accuracy across three generators on MATH-500 (Jia et al., 26 Jun 2026).
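One plausible way to approximate min-style aggregation with a softmax-weighted sum is to weight each step score by a softmax over its negated value, so that the lowest score dominates while every step keeps a nonzero weight. The exact pooling used by LCA may differ; the function name, the temperature parameter, and the weighting rule below are assumptions.

```python
import math

def softmin_weighted_sum(step_scores, temperature=0.1):
    # Softmax over negated scores: lower scores get larger weights.
    lowest = min(step_scores)
    weights = [math.exp(-(s - lowest) / temperature) for s in step_scores]
    total = sum(weights)
    # Weighted average; approaches min() as temperature -> 0
    # and the plain mean as temperature -> infinity.
    return sum(w * s for w, s in zip(weights, step_scores)) / total

scores = [0.9, 0.2, 0.8]  # hypothetical per-prefix correctness scores
sharp = softmin_weighted_sum(scores, temperature=0.05)
flat = softmin_weighted_sum(scores, temperature=1e6)
```

Unlike a hard `min`, this pooled value has a nonzero gradient with respect to every step score, which is what keeps all prefixes active during training.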
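CRM links process reward to outcome probability through a latent first-error variable. With conditional step error probability
$$h_t = P(\text{first error at step } t \mid \text{no error before step } t),$$
the survival probability is
$$S_t = \prod_{k=1}^{t} (1 - h_k),$$
and the shaped step reward is defined from these quantities. Because the survival probability after the last step is the model's probability of an error-free trajectory, the dense reward is aligned with the model’s probability of a correct final answer. CRM is reported to outperform PRM, PQM, and IPRM in Best-of-$N$, beam search, and RL, while remaining more robust to reward hacking (Zhang et al., 30 Sep 2025).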
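The link between per-step error probabilities and an outcome probability can be illustrated with a generic survival computation. This is a sketch under stated assumptions, not CRM's published reward: the names are hypothetical, and the log-survival increment used as a step reward is one natural choice that telescopes to the log-probability of an error-free trajectory.

```python
import math

def survival_probabilities(step_error_probs):
    # S_t = prod_{k <= t} (1 - h_k): probability of no error through step t.
    survival, running = [], 1.0
    for h in step_error_probs:
        running *= 1.0 - h
        survival.append(running)
    return survival

def log_survival_increments(step_error_probs):
    # Assumed shaped reward: log S_t - log S_{t-1} = log(1 - h_t).
    # Requires every h_t < 1.
    return [math.log(1.0 - h) for h in step_error_probs]

hazards = [0.1, 0.2, 0.5]  # hypothetical conditional step error probabilities
survival = survival_probabilities(hazards)
step_rewards = log_survival_increments(hazards)
```

Summing the step rewards recovers the log of the final survival probability, so the dense signal and the trajectory-level outcome probability cannot drift apart.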
Several further methods use outcome reward as an anchor while constraining how process reward enters training. PRPO constructs a sparse outcome reward plus length penalty, converts PRM segment scores into normalized token-level advantages, and then aligns the process distribution with outcome advantages by a location-parameter shift; on MATH500, this improves Qwen2.5-Math-1.5B over GRPO with eight rollouts and no value network (Ding et al., 12 Jan 2026). PROF does not blend PRM and ORM directly; instead it filters rollouts whose average process value is inconsistent with terminal correctness, reporting accuracy improvements over blending approaches as well as gains on intermediate step-value evaluations (Ye et al., 3 Sep 2025). BARS provides a theoretical counterpart, converting sparse outcome-based rewards into procedure-based signals and proving $\epsilon$-accuracy within a bounded number of iterations together with a dynamic regret bound (Chitra, 14 Apr 2025).
5. Domain-specific instantiations
Outcome-based reward mechanisms are now specialized for distinct action spaces and evaluators. In computer-using agents, CUARewardBench evaluates ORMs and PRMs on human-annotated trajectories spanning multiple software categories and policy models. ORM precision and NPV are the primary metrics. The best single-model ORM is GLM-4.5V-106B under the “sewsm” prompt, while the Unanimous Prompt Ensemble using Qwen2.5VL-32B and GLM-4.5V-106B variants trades coverage for reliability by predicting only under unanimous agreement (Lin et al., 21 Oct 2025).
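Precision, negative predictive value (NPV), and a unanimity rule are simple to state in code. The sketch below uses the standard metric definitions; the abstention convention (`None` when judges disagree) and the function names are assumptions, not the benchmark's reference implementation.

```python
def unanimous_vote(votes):
    # Return the shared verdict if all judges agree, otherwise abstain (None).
    first = votes[0]
    return first if all(v == first for v in votes) else None

def precision_and_npv(predictions, labels):
    # Abstentions (None) are excluded from both metrics.
    tp = sum(1 for p, y in zip(predictions, labels) if p is True and y)
    fp = sum(1 for p, y in zip(predictions, labels) if p is True and not y)
    tn = sum(1 for p, y in zip(predictions, labels) if p is False and not y)
    fn = sum(1 for p, y in zip(predictions, labels) if p is False and y)
    precision = tp / (tp + fp) if tp + fp else None
    npv = tn / (tn + fn) if tn + fn else None
    return precision, npv

# Hypothetical success verdicts from two judge configurations on four trajectories.
judge_votes = [[True, True], [True, False], [False, False], [True, True]]
labels = [True, True, False, False]

predictions = [unanimous_vote(v) for v in judge_votes]
precision, npv = precision_and_npv(predictions, labels)
```

Abstaining on disagreement removes the second trajectory from scoring entirely, which is how a unanimity rule can raise precision and NPV at the cost of coverage.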
In function-calling, ToolRM argues that natural-language reward models miss tool-specific correctness signals such as function selection and parameter values. It introduces FC-RewardBench, a benchmark of single-turn examples, and trains pairwise outcome reward models on synthesized triples. Pairwise accuracies are reported for ToolRM-1.5B, ToolRM-7B, and ToolRM-14B. Across seven out-of-domain benchmarks, the models improve average downstream task performance and support reward-guided data filtering (Agarwal et al., 15 Sep 2025).
In generative zero-shot learning, RLVC treats the feature generator as a policy and uses the classifier-derived reward to encourage synthesis of task-relevant features. The method introduces class-wise visual prototypes, a prototype-distillation loss, alternating updates between adversarial and RL objectives, and a cold-start strategy that delays RL until after an initial adversarial phase. On three prevalent ZSL benchmarks, RLVC is reported to achieve state-of-the-art results (Hou et al., 22 Mar 2026).
Remote sensing provides a hybrid but still outcome-centered example. RS-HyRe-R1 combines a structural activation reward, a task-specific perception correctness reward for REC, OVD, and VQA, and a path-evolution reward that penalizes repetitive reasoning among valid rollouts. The perception correctness component is outcome-level: it depends on IoU thresholds for REC, F$_1$-style matching for OVD, and exact normalized answer match for VQA. The method reports [email protected] and [email protected] on REC, [email protected] and mAP@[0.5:0.95] on OVD, and Pass@1 on VQA, together with zero-shot gains over the second-best model on VQA, OVD, and REC (Zhou et al., 19 Apr 2026).
6. Incentive mechanisms beyond RL
Outcome-based reward mechanisms also arise in mechanism design, where payment is conditioned on realized outputs rather than directly observed effort. In Peer Output Agreement, each worker $i$ is matched with a random peer $j$ and receives a bonus if their reported outputs agree, yielding a utility equal to the expected bonus minus the worker's cost of effort. Under symmetric threshold strategies, in which a worker exerts effort only when the cost falls below a common threshold, the marginal worker is indifferent between exerting effort and not.
The paper derives the optimal reward level when the requester knows the cost distribution and proposes sequential mechanisms for learning the distribution when it is unknown (Liu et al., 2016).
RewardRating uses a stock-market-like rating mechanism in which users buy “coins” associated with ratings and receive rewards based on future minting events. A reviewer's total payoff depends on the number of coins invested in a given rating and on the coins minted afterwards. Because the profit from each new coin is redistributed exactly among existing stakeholders, the mechanism is strictly budget-balanced; because the distance-weight function decreases with the distance between ratings, truthful and quality-aligned ratings are favored over extreme fake ratings (Vakilinia et al., 2021).
Mechanism design for LLM fine-tuning provides an even more abstract outcome rule. Given reported reward models and weights for the participating groups, the provider selects the fine-tuned model that maximizes the resulting affine social welfare, then charges an affine-maximizer payment equal to the externality imposed on the others’ affine social welfare. The resulting mechanism is dominant-strategy incentive compatible and individually rational, and remains approximately DSIC under bounded perturbations of the reported reward models (Sun et al., 2024). This suggests that outcome-based reward mechanisms are not limited to RL reward shaping; they also define incentive-compatible payoff rules for collective model selection.
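The affine-maximizer rule is easiest to see on a finite set of candidate outcomes. The toy below implements the textbook weighted-VCG form (no outcome-specific boost terms): choose the outcome with the highest weighted reported value, and charge each agent the weighted welfare loss it imposes on the others, divided by its own weight. The discrete setting, the valuations, and the function name are assumptions; the cited paper works with fine-tuned models rather than a small outcome list.

```python
def affine_maximizer(valuations, weights):
    # valuations[i][k]: agent i's reported value for candidate outcome k.
    n_outcomes = len(valuations[0])

    def welfare(k, exclude=None):
        # Weighted welfare of outcome k, optionally leaving one agent out.
        return sum(
            w * v[k]
            for i, (w, v) in enumerate(zip(weights, valuations))
            if i != exclude
        )

    chosen = max(range(n_outcomes), key=welfare)
    payments = []
    for i, w in enumerate(weights):
        # Externality: best welfare the others could get without agent i,
        # minus what they actually get at the chosen outcome.
        best_without_i = max(welfare(k, exclude=i) for k in range(n_outcomes))
        payments.append((best_without_i - welfare(chosen, exclude=i)) / w)
    return chosen, payments

# Three hypothetical groups, two candidate outcomes, equal weights.
valuations = [[3.0, 0.0], [0.0, 2.0], [0.0, 2.0]]
weights = [1.0, 1.0, 1.0]
chosen, payments = affine_maximizer(valuations, weights)
```

In this example the second outcome wins, the first group pays nothing because it did not change the result, and each of the other two groups pays 1.0 while valuing the outcome at 2.0, so no participant ends up with negative utility.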
Across these formulations, the recurrent pattern is stable: outcome-based reward offers a reliable supervisory primitive whenever endpoint verification is available, but it is systematically too coarse to resolve temporal causality, intermediate correctness, or strategic behavior on its own. Much of the recent literature therefore preserves the endpoint signal while adding sub-goal verification, learnable credit assignment, consistency filtering, auxiliary classification, or incentive payments that better align local updates with global outcomes (Jia et al., 26 Jun 2026, Zhang et al., 30 Sep 2025, Liu et al., 28 Sep 2025).