Outcome-Grounded Advantage Reshaping (OAR)
- OAR is a mechanism that assigns token-level credit based on outcome sensitivity, effectively redistributing sequence rewards.
- It introduces two methods, OAR-P and OAR-G, which compute token importance via perturbation and gradient-based techniques to reduce gradient variance.
- OAR integrates with standard RL objectives, delivering consistent performance gains on mathematical reasoning benchmarks while preserving computational efficiency.
Outcome-Grounded Advantage Reshaping (OAR) is a fine-grained credit assignment mechanism in reinforcement learning for mathematical reasoning, formalized to redistribute training signal at the token level in accordance with the outcome sensitivity of individual reasoning steps. OAR augments sequence-level optimization approaches—such as Group Relative Policy Optimization (GRPO) and KL-regularized RL under binary feedback—with principled, outcome-conditioned weighting that enables more efficient policy learning and lower gradient variance, particularly in settings where only sparse or binary trajectory-level rewards are available (Lyu et al., 10 Feb 2025, Li et al., 12 Jan 2026).
1. Formal Motivation and Core Definitions
Standard RL and critic-free optimization methods rely heavily on sequence-level rewards propagated uniformly to all tokens or actions in a trajectory, leading to poor credit assignment and high gradient variance when reward signals are sparse or partial. OAR addresses this by distributing the sequence advantage to each token through a token-wise importance weight $w_t$, which quantitatively reflects the token's influence on the model's final outcome distribution.
For a trajectory $y_i = (y_{i,1}, \dots, y_{i,T})$ sampled from $\pi_\theta$ with sparse reward $R_i$, the sequence-level advantage in GRPO is:

$$A_i = \frac{R_i - \mu_R}{\sigma_R},$$

where $\mu_R$ and $\sigma_R$ are the mean and standard deviation over a group of $G$ sampled responses for a given prompt. In OAR, each token's advantage is

$$A_{i,t} = w_t \, A_i,$$

with the constraint $\sum_{t=1}^{T} w_t = T$ (the sequence length), ensuring total gradient mass preservation (Li et al., 12 Jan 2026).
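The redistribution above admits a direct implementation. The following NumPy sketch (function names are illustrative, not from the paper) computes group-relative advantages and spreads each one over tokens while enforcing the $\sum_t w_t = T$ constraint:

```python
import numpy as np

def grpo_sequence_advantages(rewards):
    """Group-relative advantage: standardize rewards within a sampled group."""
    rewards = np.asarray(rewards, dtype=float)
    mu, sigma = rewards.mean(), rewards.std()
    return (rewards - mu) / (sigma + 1e-8)  # epsilon guards degenerate groups

def reshape_token_advantages(seq_advantage, weights):
    """Redistribute a sequence-level advantage over tokens.

    `weights` is renormalized to sum to the sequence length T, so the
    total advantage mass T * A_i of uniform GRPO is preserved.
    """
    weights = np.asarray(weights, dtype=float)
    T = len(weights)
    weights = T * weights / weights.sum()  # enforce sum(w) == T
    return seq_advantage * weights
```

Uniform GRPO is recovered by passing all-ones weights; any non-uniform weight vector shifts the same total mass toward high-importance tokens.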
2. Instantiations of Token Importance: OAR-P and OAR-G
OAR provides two practical instantiations for token-level importance computation:
- OAR-P (Perturbation-Based Attribution): For token $y_t$ in a reasoning chain $y = (y_1, \dots, y_T)$, the causal importance is estimated by masking $y_t$ and measuring the KL divergence between the original and perturbed final-answer distributions:

$$s_t = D_{\mathrm{KL}}\big(p(a \mid x, y) \,\|\, p(a \mid x, y_{\setminus t})\big),$$

where $p(a \mid x, y)$ and $p(a \mid x, y_{\setminus t})$ are the final-answer distributions conditioned on the full chain and on the chain with $y_t$ masked, respectively (Li et al., 12 Jan 2026).
- OAR-G (Gradient-Based Proxy): OAR-G injects Gaussian noise into token embeddings and uses the Gradient × Input attribution:

$$s_t = \Big| \nabla_{e_t} D_{\mathrm{KL}}\big(p(a \mid x, y) \,\|\, \tilde{p}(a \mid x, y)\big) \cdot e_t \Big|,$$

where $p$ and $\tilde{p}$ are the answer distributions before and after noise injection, and $e_t$ is the embedding of token $y_t$ (Li et al., 12 Jan 2026).
Both methods produce raw importance scores $s_t$, which are normalized across positions:

$$\tilde{s}_t = \frac{s_t}{\sum_{t'=1}^{T} s_{t'}}.$$
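The perturbation variant is the easier of the two to illustrate in isolation. This NumPy sketch scores each token by the KL shift its masking induces in a toy answer distribution; `answer_dist_fn` is a hypothetical stand-in for a model's answer head, and the `<mask>` convention is illustrative:

```python
import numpy as np

def kl_divergence(p, q, eps=1e-12):
    """KL(p || q) between two discrete answer distributions."""
    p, q = np.asarray(p, float), np.asarray(q, float)
    return float(np.sum(p * (np.log(p + eps) - np.log(q + eps))))

def perturbation_importance(answer_dist_fn, tokens):
    """OAR-P-style scores: KL shift in the final-answer distribution
    when each token is masked in turn, normalized across positions.

    `answer_dist_fn(tokens)` may be any callable mapping a token
    sequence to a probability vector over answers.
    """
    base = answer_dist_fn(tokens)
    scores = []
    for t in range(len(tokens)):
        masked = tokens[:t] + ["<mask>"] + tokens[t + 1:]
        scores.append(kl_divergence(base, answer_dist_fn(masked)))
    s = np.asarray(scores)
    return s / (s.sum() + 1e-12)  # normalize across positions
```

With a toy model whose answer depends on a single pivotal token, the score mass concentrates on that token, as intended.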
3. Bi-Level Advantage Reshaping and Gating Mechanism
Normalized token importance $\tilde{s}_t$ is input to a bi-level gating function $g(\cdot)$, which ensures both suppression of low-impact tokens and boosting of pivotal steps, with continuity at threshold $\tau$:

$$g(\tilde{s}_t) = \begin{cases} \alpha\,\tilde{s}_t, & \tilde{s}_t < \tau,\\ \beta\,\tilde{s}_t + (\alpha - \beta)\,\tau, & \tilde{s}_t \ge \tau, \end{cases} \qquad 0 < \alpha < 1 < \beta.$$

Renormalization yields final token weights:

$$w_t = \frac{T \cdot g(\tilde{s}_t)}{\sum_{t'=1}^{T} g(\tilde{s}_{t'})},$$

so that $\sum_{t=1}^{T} w_t = T$ as required.
Suppressing non-informative tokens reduces gradient variance, while boosting influential steps concentrates learning where most impactful (Li et al., 12 Jan 2026).
4. Integration with RL Objectives and GRPO
In GRPO and KL-regularized RL, OAR replaces the scalar sequence-level advantage $A_i$ in PPO-style surrogates with the token-level $A_{i,t} = w_t A_i$, so that updates are distributed in proportion to local outcome relevance:

$$J(\theta) = \mathbb{E}\left[ \frac{1}{G}\sum_{i=1}^{G} \frac{1}{T_i} \sum_{t=1}^{T_i} \min\Big( r_{i,t}(\theta)\, A_{i,t},\; \mathrm{clip}\big(r_{i,t}(\theta),\, 1-\epsilon,\, 1+\epsilon\big)\, A_{i,t} \Big) \right],$$

with

$$r_{i,t}(\theta) = \frac{\pi_\theta(y_{i,t} \mid x, y_{i,<t})}{\pi_{\theta_{\mathrm{old}}}(y_{i,t} \mid x, y_{i,<t})}.$$
This framework generalizes both outcome-centric RL (e.g., OREAL’s KL-regularized approach under binary feedback (Lyu et al., 10 Feb 2025)) and critic-free GRPO, making OAR broadly compatible.
Pseudocode for integration includes sampling groups of continuations, computing rewards and importance weights, reshaping advantages via OAR, and applying gradient updates (Li et al., 12 Jan 2026).
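A minimal Python sketch of that loop, with the model, reward, and importance computations abstracted behind hypothetical callables (none of these names come from the paper), might look like:

```python
import numpy as np

def oar_grpo_step(sample_fn, reward_fn, importance_fn, update_fn,
                  prompt, group_size=4):
    """One OAR-augmented GRPO step with the model behind stand-in callables:
      sample_fn(prompt, n)  -> list of token sequences
      reward_fn(seq)        -> scalar outcome reward
      importance_fn(seq)    -> per-token weights summing to len(seq)
      update_fn(seq, advs)  -> apply the policy-gradient update
    """
    group = sample_fn(prompt, group_size)
    rewards = np.array([reward_fn(y) for y in group], float)
    # group-relative (critic-free) sequence advantages
    adv = (rewards - rewards.mean()) / (rewards.std() + 1e-8)
    for y, a in zip(group, adv):
        w = np.asarray(importance_fn(y), float)  # OAR token weights
        update_fn(y, a * w)                      # token-level advantages
    return adv
```

Swapping `importance_fn` between a perturbation-based and a gradient-based scorer switches between OAR-P and OAR-G without touching the rest of the loop.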
5. Reward-Shaping for Negative Trajectories
Sparse binary reward environments require principled reward-shaping for negative samples to maintain gradient consistency:
- Negative-side Reward Reshaping: In OREAL (Lyu et al., 10 Feb 2025), the reward for negative samples is set to $-\frac{p}{1-p}$ (where $p$ is the empirical pass rate), so that the expected shaped reward is zero, or via a leave-one-out centered reward

$$\tilde{R}_i = R_i - \frac{1}{G-1} \sum_{j \ne i} R_j.$$
These adjustments restore proper credit assignment and ensure theoretical correctness in the presence of only outcome-level feedback.
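As an illustration, one shaping consistent with this description chooses the negative reward so the expected shaped reward under the empirical pass rate is zero (assuming a positive reward of 1); the leave-one-out variant centers each reward against the rest of its group. Both are sketches, not the exact OREAL implementation:

```python
def reshape_negative_reward(pass_rate):
    """Zero-mean shaping under binary feedback: positives keep reward 1,
    negatives receive -p / (1 - p) so the expected shaped reward vanishes.
    """
    assert 0.0 < pass_rate < 1.0, "needs at least one success and one failure"
    return -pass_rate / (1.0 - pass_rate)

def leave_one_out_rewards(rewards):
    """Center each reward against the mean of the other group members."""
    total, n = sum(rewards), len(rewards)
    return [r - (total - r) / (n - 1) for r in rewards]
```

Either variant removes the constant offset that would otherwise push gradients uniformly on all-negative groups.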
6. Theoretical Properties and Complexity Considerations
OAR possesses several key theoretical guarantees:
- Variance Reduction: For tokens with zero outcome influence (and hence $w_t \to 0$ after gating), the corresponding gradient variance decreases to $w_t^2$ times its value under GRPO, with $w_t \ll 1$.
- Update Magnitude Preservation: The total “advantage mass” per sequence is fixed, avoiding instability from adaptive scaling.
- Complexity Tradeoffs: OAR-G adds one backward pass per sequence (≈1.4× the cost of GRPO), while OAR-P adds one forward pass per masked token (≈4.2× cost in batch settings). Empirically, OAR-G achieves most of OAR-P's benefits at far lower cost.
7. Empirical Results in Mathematical Reasoning
Extensive experiments on mathematical reasoning datasets—including AIME25, AIME24, AMC23, MATH500, and GSM8K—demonstrate OAR’s efficacy. Key results (Pass@1/Pass@32) on the Qwen2.5-7B-Base model are presented below (Li et al., 12 Jan 2026):
| Method | AIME25 | AIME24 | AMC23 | MATH500 | GSM8K | Avg Pass@1 |
|---|---|---|---|---|---|---|
| GRPO | 9.4 | 13.5 | 59.8 | 75.8 | 90.5 | 51.3 |
| GRPO + OAR-G | 11.8 | 14.8 | 61.4 | 78.7 | 91.4 | 53.2 |
| GRPO + OAR-P | 12.2 | 15.2 | 61.9 | 78.4 | 92.0 | 53.7 |
OAR-G consistently outperforms baseline GRPO (+1.9 points average Pass@1), with OAR-P providing an upper bound (a further +0.5 points). OAR accelerates reward convergence while maintaining policy entropy, avoiding premature collapse. Computational overhead is moderate: OAR-G requires only ~40% more time per token than GRPO.
8. Significance and Outlook
OAR provides a theoretically principled and practically efficient mechanism for fine-grained credit assignment in both critic-free and KL-regularized RL frameworks, resolving core limitations of sparse outcome-based training in long reasoning chains. Its successful application in mathematical reasoning tasks demonstrates consistent performance gains and suggests potential adoption in other domains characterized by sparse, delayed rewards and intricate credit structures (Lyu et al., 10 Feb 2025, Li et al., 12 Jan 2026).