Outcome-Grounded Advantage Reshaping (OAR)
- OAR is a mechanism that assigns token-level credit based on outcome sensitivity, effectively redistributing sequence rewards.
- It introduces two methods, OAR-P and OAR-G, which compute token importance via perturbation and gradient-based techniques to reduce gradient variance.
- OAR integrates with standard RL objectives, delivering consistent performance gains on mathematical reasoning benchmarks while preserving computational efficiency.
Outcome-Grounded Advantage Reshaping (OAR) is a fine-grained credit assignment mechanism in reinforcement learning for mathematical reasoning, formalized to redistribute training signal at the token level in accordance with the outcome sensitivity of individual reasoning steps. OAR augments sequence-level optimization approaches—such as Group Relative Policy Optimization (GRPO) and KL-regularized RL under binary feedback—with principled, outcome-conditioned weighting that enables more efficient policy learning and lower gradient variance, particularly in settings where only sparse or binary trajectory-level rewards are available (Lyu et al., 10 Feb 2025, Li et al., 12 Jan 2026).
1. Formal Motivation and Core Definitions
Standard RL and critic-free optimization methods rely heavily on sequence-level rewards propagated uniformly to all tokens or actions in a trajectory, leading to poor credit assignment and high gradient variance when reward signals are sparse or partial. OAR addresses this by distributing the sequence advantage to each token through a token-wise importance weight $w_t$, which quantitatively reflects the token's influence on the model's final outcome distribution.
For a trajectory $y_i = (y_{i,1}, \dots, y_{i,T})$ sampled from $\pi_\theta$ with sparse reward $R_i$, the sequence-level advantage in GRPO is:

$$A_i = \frac{R_i - \mu_R}{\sigma_R},$$

where $\mu_R$ and $\sigma_R$ are the mean and standard deviation over a group of $G$ sampled responses for a given prompt. In OAR, each token's advantage is

$$A_{i,t} = w_t \, A_i,$$

with the constraint $\sum_{t=1}^{T} w_t = T$ (the sequence length), ensuring total gradient mass preservation (Li et al., 12 Jan 2026).
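The redistribution above admits a direct implementation. The following NumPy sketch (function names are illustrative, not from the paper) computes group-relative advantages and spreads each one over tokens while enforcing the $\sum_t w_t = T$ constraint:

```python
import numpy as np

def grpo_sequence_advantages(rewards):
    """Group-relative advantage: standardize rewards within a sampled group."""
    rewards = np.asarray(rewards, dtype=float)
    mu, sigma = rewards.mean(), rewards.std()
    return (rewards - mu) / (sigma + 1e-8)  # epsilon guards degenerate groups

def reshape_token_advantages(seq_advantage, weights):
    """Redistribute a sequence-level advantage over tokens.

    `weights` is renormalized to sum to the sequence length T, so the
    total advantage mass T * A_i of uniform GRPO is preserved.
    """
    weights = np.asarray(weights, dtype=float)
    T = len(weights)
    weights = T * weights / weights.sum()  # enforce sum(w) == T
    return seq_advantage * weights
```

Uniform GRPO is recovered by passing all-ones weights; any non-uniform weight vector shifts the same total mass toward high-importance tokens.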
2. Instantiations of Token Importance: OAR-P and OAR-G
OAR provides two practical instantiations for token-level importance computation:
- OAR-P (Perturbation-Based Attribution): For token $y_t$ in a reasoning chain $y = (y_1, \dots, y_T)$, the causal importance is estimated by masking $y_t$ and measuring the KL divergence between the original and perturbed final-answer distributions:

$$s_t = D_{\mathrm{KL}}\big(p(a \mid x, y) \,\|\, p(a \mid x, y_{\setminus t})\big),$$

where $p(a \mid x, y)$ and $p(a \mid x, y_{\setminus t})$ are the final-answer distributions conditioned on the full chain and on the chain with $y_t$ masked, respectively (Li et al., 12 Jan 2026).
- OAR-G (Gradient-Based Proxy): OAR-G injects Gaussian noise into token embeddings and uses the Gradient × Input attribution:

$$s_t = \Big| \nabla_{e_t} D_{\mathrm{KL}}\big(p(a \mid x, y) \,\|\, \tilde{p}(a \mid x, y)\big) \cdot e_t \Big|,$$

where $p$ and $\tilde{p}$ are the answer distributions before and after noise injection, and $e_t$ is the embedding of token $y_t$ (Li et al., 12 Jan 2026).
Both methods produce raw importance scores $s_t$, which are normalized across positions:

$$\tilde{s}_t = \frac{s_t}{\sum_{t'=1}^{T} s_{t'}}.$$
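The perturbation variant is the easier of the two to illustrate in isolation. This NumPy sketch scores each token by the KL shift its masking induces in a toy answer distribution; `answer_dist_fn` is a hypothetical stand-in for a model's answer head, and the `<mask>` convention is illustrative:

```python
import numpy as np

def kl_divergence(p, q, eps=1e-12):
    """KL(p || q) between two discrete answer distributions."""
    p, q = np.asarray(p, float), np.asarray(q, float)
    return float(np.sum(p * (np.log(p + eps) - np.log(q + eps))))

def perturbation_importance(answer_dist_fn, tokens):
    """OAR-P-style scores: KL shift in the final-answer distribution
    when each token is masked in turn, normalized across positions.

    `answer_dist_fn(tokens)` may be any callable mapping a token
    sequence to a probability vector over answers.
    """
    base = answer_dist_fn(tokens)
    scores = []
    for t in range(len(tokens)):
        masked = tokens[:t] + ["<mask>"] + tokens[t + 1:]
        scores.append(kl_divergence(base, answer_dist_fn(masked)))
    s = np.asarray(scores)
    return s / (s.sum() + 1e-12)  # normalize across positions
```

With a toy model whose answer depends on a single pivotal token, the score mass concentrates on that token, as intended.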
3. Bi-Level Advantage Reshaping and Gating Mechanism
Normalized token importance $\tilde{s}_t$ is input to a bi-level gating function $g(\cdot)$, which ensures both suppression of low-impact tokens and boosting of pivotal steps, with continuity at threshold $\tau$:

$$g(\tilde{s}_t) = \begin{cases} \alpha\,\tilde{s}_t, & \tilde{s}_t < \tau,\\ \beta\,\tilde{s}_t + (\alpha - \beta)\,\tau, & \tilde{s}_t \ge \tau, \end{cases} \qquad 0 < \alpha < 1 < \beta.$$

Renormalization yields final token weights:

$$w_t = \frac{T \cdot g(\tilde{s}_t)}{\sum_{t'=1}^{T} g(\tilde{s}_{t'})},$$

so that $\sum_{t=1}^{T} w_t = T$ as required.
Suppressing non-informative tokens reduces gradient variance, while boosting influential steps concentrates learning where most impactful (Li et al., 12 Jan 2026).
4. Integration with RL Objectives and GRPO
In GRPO and KL-regularized RL, OAR replaces the scalar sequence-level advantage $A_i$ in PPO-style surrogates with the token-level $A_{i,t} = w_t A_i$, so that updates are distributed in proportion to local outcome relevance:

$$J(\theta) = \mathbb{E}\left[ \frac{1}{G}\sum_{i=1}^{G} \frac{1}{T_i} \sum_{t=1}^{T_i} \min\Big( r_{i,t}(\theta)\, A_{i,t},\; \mathrm{clip}\big(r_{i,t}(\theta),\, 1-\epsilon,\, 1+\epsilon\big)\, A_{i,t} \Big) \right],$$

with

$$r_{i,t}(\theta) = \frac{\pi_\theta(y_{i,t} \mid x, y_{i,<t})}{\pi_{\theta_{\mathrm{old}}}(y_{i,t} \mid x, y_{i,<t})}.$$
This framework generalizes both outcome-centric RL (e.g., OREAL’s KL-regularized approach under binary feedback (Lyu et al., 10 Feb 2025)) and critic-free GRPO, making OAR broadly compatible.
Pseudocode for integration includes sampling groups of continuations, computing rewards and importance weights, reshaping advantages via OAR, and applying gradient updates (Li et al., 12 Jan 2026).
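A minimal Python sketch of that loop, with the model, reward, and importance computations abstracted behind hypothetical callables (none of these names come from the paper), might look like:

```python
import numpy as np

def oar_grpo_step(sample_fn, reward_fn, importance_fn, update_fn,
                  prompt, group_size=4):
    """One OAR-augmented GRPO step with the model behind stand-in callables:
      sample_fn(prompt, n)  -> list of token sequences
      reward_fn(seq)        -> scalar outcome reward
      importance_fn(seq)    -> per-token weights summing to len(seq)
      update_fn(seq, advs)  -> apply the policy-gradient update
    """
    group = sample_fn(prompt, group_size)
    rewards = np.array([reward_fn(y) for y in group], float)
    # group-relative (critic-free) sequence advantages
    adv = (rewards - rewards.mean()) / (rewards.std() + 1e-8)
    for y, a in zip(group, adv):
        w = np.asarray(importance_fn(y), float)  # OAR token weights
        update_fn(y, a * w)                      # token-level advantages
    return adv
```

Swapping `importance_fn` between a perturbation-based and a gradient-based scorer switches between OAR-P and OAR-G without touching the rest of the loop.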
5. Reward-Shaping for Negative Trajectories
Sparse binary reward environments require principled reward-shaping for negative samples to maintain gradient consistency:
- Negative-side Reward Reshaping: In OREAL (Lyu et al., 10 Feb 2025), the reward for negative samples is set to $-\frac{p}{1-p}$ (where $p$ is the empirical pass rate), so that the expected shaped reward is zero, or via a leave-one-out centered reward

$$\tilde{R}_i = R_i - \frac{1}{G-1} \sum_{j \ne i} R_j.$$
These adjustments restore proper credit assignment and ensure theoretical correctness in the presence of only outcome-level feedback.
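As an illustration, one shaping consistent with this description chooses the negative reward so the expected shaped reward under the empirical pass rate is zero (assuming a positive reward of 1); the leave-one-out variant centers each reward against the rest of its group. Both are sketches, not the exact OREAL implementation:

```python
def reshape_negative_reward(pass_rate):
    """Zero-mean shaping under binary feedback: positives keep reward 1,
    negatives receive -p / (1 - p) so the expected shaped reward vanishes.
    """
    assert 0.0 < pass_rate < 1.0, "needs at least one success and one failure"
    return -pass_rate / (1.0 - pass_rate)

def leave_one_out_rewards(rewards):
    """Center each reward against the mean of the other group members."""
    total, n = sum(rewards), len(rewards)
    return [r - (total - r) / (n - 1) for r in rewards]
```

Either variant removes the constant offset that would otherwise push gradients uniformly on all-negative groups.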
6. Theoretical Properties and Complexity Considerations
OAR possesses several key theoretical guarantees:
- Variance Reduction: For tokens with zero outcome influence (and hence $w_t \to 0$ after gating), the corresponding gradient variance decreases to $w_t^2$ times its value under GRPO, with $w_t \ll 1$.
- Update Magnitude Preservation: The total “advantage mass” per sequence is fixed, avoiding instability from adaptive scaling.
- Complexity Tradeoffs: OAR-G adds one backward pass per sequence (≈1.4× the cost of GRPO), while OAR-P adds one forward pass per masked token (≈4.2× cost in batch settings). Empirically, OAR-G achieves most of OAR-P's benefits at far lower cost.
7. Empirical Results in Mathematical Reasoning
Extensive experiments on mathematical reasoning datasets—including AIME25, AIME24, AMC23, MATH500, and GSM8K—demonstrate OAR’s efficacy. Key results (Pass@1/Pass@32) on the Qwen2.5-7B-Base model are presented below (Li et al., 12 Jan 2026):
| Method | AIME25 | AIME24 | AMC23 | MATH500 | GSM8K | Avg Pass@1 |
|---|---|---|---|---|---|---|
| GRPO | 9.4 | 13.5 | 59.8 | 75.8 | 90.5 | 51.3 |
| GRPO + OAR-G | 11.8 | 14.8 | 61.4 | 78.7 | 91.4 | 53.2 |
| GRPO + OAR-P | 12.2 | 15.2 | 61.9 | 78.4 | 92.0 | 53.7 |
OAR-G consistently outperforms baseline GRPO (+1.9 points average Pass@1), with OAR-P providing an upper bound (a further +0.5 points). OAR accelerates reward convergence while maintaining policy entropy, avoiding premature collapse. Computational overhead is moderate: OAR-G requires only ~40% more time per token than GRPO.
8. Significance and Outlook
OAR provides a theoretically principled and practically efficient mechanism for fine-grained credit assignment in both critic-free and KL-regularized RL frameworks, resolving core limitations of sparse outcome-based training in long reasoning chains. Its successful application in mathematical reasoning tasks demonstrates consistent performance gains and suggests potential adoption in other domains characterized by sparse, delayed rewards and intricate credit structures (Lyu et al., 10 Feb 2025, Li et al., 12 Jan 2026).