
Outcome-Grounded Advantage Reshaping (OAR)

Updated 19 January 2026
  • OAR is a mechanism that assigns token-level credit based on outcome sensitivity, effectively redistributing sequence rewards.
  • It introduces two methods, OAR-P and OAR-G, which compute token importance via perturbation and gradient-based techniques to reduce gradient variance.
  • OAR integrates with RL objectives such as GRPO, delivering consistent performance gains on mathematical reasoning benchmarks at modest computational overhead.

Outcome-Grounded Advantage Reshaping (OAR) is a fine-grained credit assignment mechanism in reinforcement learning for mathematical reasoning, formalized to redistribute training signal at the token level in accordance with the outcome sensitivity of individual reasoning steps. OAR augments sequence-level optimization approaches—such as Group Relative Policy Optimization (GRPO) and KL-regularized RL under binary feedback—with principled, outcome-conditioned weighting that enables more efficient policy learning and lower gradient variance, particularly in settings where only sparse or binary trajectory-level rewards are available (Lyu et al., 10 Feb 2025, Li et al., 12 Jan 2026).

1. Formal Motivation and Core Definitions

Standard RL and critic-free optimization methods rely heavily on sequence-level rewards propagated uniformly to all tokens or actions in a trajectory, leading to poor credit assignment and high gradient variance when reward signals are sparse or partial. OAR addresses this by distributing the sequence advantage $A_{\text{seq}}$ to each token $t$ through a token-wise importance weight $\tilde\omega_t$, which quantitatively reflects the token’s influence on the model’s final outcome distribution.

For a trajectory $y = (y_1, \ldots, y_T)$ sampled from $\pi_\theta$ and a sparse reward $r(y)$, the sequence-level advantage in GRPO is:

$$A_{\text{seq}} = \frac{r(y) - \mu_r}{\sigma_r}$$

where $\mu_r$ and $\sigma_r$ are the mean and standard deviation of the rewards over a group of sampled responses for a given prompt. In OAR, each token’s advantage is

$$A^{\text{OAR}}_t = A_{\text{seq}} \cdot \tilde\omega_t$$

with the constraint $\sum_t \tilde\omega_t = T$ (the sequence length), ensuring total gradient mass preservation (Li et al., 12 Jan 2026).
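The group-normalized advantage and its token-level redistribution can be sketched as follows; a minimal numpy illustration, where the function names and the `1e-8` stabilizer are my own choices, not from the paper:

```python
import numpy as np

def grpo_sequence_advantage(rewards):
    """Group-normalized sequence advantage: A_seq = (r - mu_r) / sigma_r,
    computed over a group of responses sampled for the same prompt."""
    r = np.asarray(rewards, dtype=float)
    return (r - r.mean()) / (r.std() + 1e-8)  # small eps avoids divide-by-zero

def oar_token_advantages(a_seq, omega_tilde):
    """Redistribute one scalar sequence advantage across tokens.
    If omega_tilde sums to T (the sequence length), the total
    advantage mass A_seq * T is preserved."""
    return a_seq * np.asarray(omega_tilde, dtype=float)
```

For example, with group rewards `[1, 0, 0, 1]` the two successful responses receive advantage ≈ +1 and the failures ≈ −1; uniform weights `omega_tilde = [1, 1, ..., 1]` recover vanilla GRPO exactly.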

2. Instantiations of Token Importance: OAR-P and OAR-G

OAR provides two practical instantiations for token-level importance computation:

  • OAR-P (Perturbation-Based Attribution): For token $t$ in a reasoning chain $y$, the causal importance $I_t^{\mathrm{pert}}$ is estimated by masking $y_t$ and measuring the KL divergence between the original and perturbed final-answer distributions:

$$I_t^{\mathrm{pert}} = D_{\mathrm{KL}}\left(P_{\text{final}} \,\middle\|\, P_{\text{final}}^{(t)}\right)$$

where $P_{\text{final}} = \pi_\theta(\cdot \mid x, y)$ and $P_{\text{final}}^{(t)} = \pi_\theta(\cdot \mid x, \tilde y^{(t)})$ (Li et al., 12 Jan 2026).

  • OAR-G (Gradient-Based Proxy): OAR-G injects Gaussian noise into token embeddings and uses the Gradient × Input attribution:

$$g_t = \nabla_{e_t} D_{\mathrm{KL}}(P_0 \,\|\, P_\epsilon), \qquad I_t^{\mathrm{grad}} = \left|\langle g_t, e_t \rangle\right|$$

where $P_0$ and $P_\epsilon$ are the answer distributions before and after noise injection (Li et al., 12 Jan 2026).
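The perturbation-based attribution (OAR-P) can be illustrated with a toy sketch: `answer_dist` below stands in for the model's final-answer distribution $\pi_\theta(\cdot \mid x, y)$ and is purely hypothetical; the paper's actual masking scheme may differ in detail.

```python
import numpy as np

def kl_divergence(p, q, eps=1e-12):
    """KL(p || q) for discrete distributions given as sequences."""
    p = np.asarray(p, dtype=float) + eps
    q = np.asarray(q, dtype=float) + eps
    p = p / p.sum()
    q = q / q.sum()
    return float(np.sum(p * np.log(p / q)))

def perturbation_importance(answer_dist, tokens, mask="<mask>"):
    """OAR-P: a token's importance is the KL shift in the final-answer
    distribution when that single token is masked out of the chain."""
    p_full = answer_dist(tokens)
    scores = []
    for t in range(len(tokens)):
        perturbed = tokens[:t] + [mask] + tokens[t + 1:]
        scores.append(kl_divergence(p_full, answer_dist(perturbed)))
    return scores
```

A token whose removal leaves the answer distribution unchanged scores ≈ 0; a pivotal token (e.g. the one carrying an intermediate result) produces a large KL shift, matching the intended suppression/boosting behavior.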

Both methods produce raw importance scores, which are normalized across positions:

$$\bar I_t = \log(1 + I_t), \qquad \hat I_t = \frac{\bar I_t - \min_j \bar I_j}{\max_j \bar I_j - \min_j \bar I_j + \epsilon}$$
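The two-step normalization above (log damping followed by min-max scaling) is direct to implement; a small sketch, with the `eps` value chosen arbitrarily:

```python
import numpy as np

def normalize_importance(raw_scores, eps=1e-8):
    """Log-damp raw importance scores, then min-max normalize to [0, 1)."""
    bar = np.log1p(np.asarray(raw_scores, dtype=float))  # log(1 + I_t)
    return (bar - bar.min()) / (bar.max() - bar.min() + eps)
```

The `log1p` step compresses the heavy tail of raw KL scores so a single dominant token does not flatten all other weights after min-max scaling.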

3. Bi-Level Advantage Reshaping and Gating Mechanism

Normalized token importance $\hat I_t$ is input to a bi-level gating function $\omega(\hat I_t)$, which ensures both suppression of low-impact tokens and boosting of pivotal steps, with continuity at the threshold $\tau$:

$$\omega(\hat I_t) = \begin{cases} \dfrac{\hat I_t}{\tau+\epsilon}, & \hat I_t < \tau \\[4pt] 1 + \beta \dfrac{\hat I_t-\tau}{1-\tau+\epsilon}, & \hat I_t \geq \tau \end{cases}$$

Renormalization yields final token weights:

$$\tilde\omega_t = \omega(\hat I_t) \cdot \frac{T}{\sum_j \omega(\hat I_j)}$$

Suppressing non-informative tokens reduces gradient variance, while boosting influential steps concentrates learning where it is most impactful (Li et al., 12 Jan 2026).
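The gate and its renormalization can be sketched as below; default values for $\tau$ and $\beta$ are illustrative, not taken from the paper:

```python
import numpy as np

def bilevel_gate(i_hat, tau=0.5, beta=1.0, eps=1e-8):
    """Piecewise gate: linearly suppress tokens below tau, linearly boost
    tokens above it. Both branches meet at omega = 1 when i_hat = tau."""
    i_hat = np.asarray(i_hat, dtype=float)
    low = i_hat / (tau + eps)                              # i_hat < tau
    high = 1.0 + beta * (i_hat - tau) / (1.0 - tau + eps)  # i_hat >= tau
    return np.where(i_hat < tau, low, high)

def renormalize_weights(omega):
    """Rescale gate outputs so the final weights sum to T, preserving the
    sequence's total advantage mass."""
    omega = np.asarray(omega, dtype=float)
    return omega * len(omega) / omega.sum()
```

With `tau=0.5, beta=1.0`, a token at $\hat I_t = 0.25$ is halved, one at $\hat I_t = \tau$ passes through unchanged, and one at $\hat I_t = 1.0$ is doubled, before renormalization restores $\sum_t \tilde\omega_t = T$.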

4. Integration with RL Objectives and GRPO

In GRPO and KL-regularized RL, OAR replaces the scalar sequence-level advantage in PPO-style surrogates with $A_t^{\mathrm{OAR}}$, so that updates are distributed in proportion to local outcome relevance:

$$\mathcal{L}_{\text{OAR}} = \mathbb{E}_{i,t}\left[\min\left(\rho_t^{(i)} A_t^{\mathrm{OAR}},\ \mathrm{clip}\left(\rho_t^{(i)},\, 1-\epsilon,\, 1+\epsilon\right) A_t^{\mathrm{OAR}}\right)\right]$$

with

$$\rho_t^{(i)} = \frac{\pi_\theta\left(y_t^{(i)} \mid \cdots\right)}{\pi_{\theta_{\text{old}}}\left(y_t^{(i)} \mid \cdots\right)}$$

This framework generalizes both outcome-centric RL (e.g., OREAL’s KL-regularized approach under binary feedback (Lyu et al., 10 Feb 2025)) and critic-free GRPO, making OAR broadly compatible.

Pseudocode for integration includes sampling groups of continuations, computing rewards and importance weights, reshaping advantages via OAR, and applying gradient updates (Li et al., 12 Jan 2026).
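A minimal sketch of the clipped surrogate with token-level OAR advantages, assuming per-token log-probabilities are already available (function name and `clip_eps` default are mine, not the paper's):

```python
import numpy as np

def oar_ppo_surrogate(logp_new, logp_old, a_oar, clip_eps=0.2):
    """PPO-style clipped objective with OAR-reshaped token advantages:
    mean over tokens of min(rho * A, clip(rho, 1-eps, 1+eps) * A)."""
    ratio = np.exp(np.asarray(logp_new, float) - np.asarray(logp_old, float))
    a = np.asarray(a_oar, dtype=float)
    unclipped = ratio * a
    clipped = np.clip(ratio, 1.0 - clip_eps, 1.0 + clip_eps) * a
    return float(np.mean(np.minimum(unclipped, clipped)))
```

Only the advantage term changes relative to vanilla GRPO: the same scalar $A_{\text{seq}}$ for every token is replaced by the reshaped per-token values, so existing GRPO training loops can adopt OAR with no change to the ratio or clipping logic.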

5. Reward-Shaping for Negative Trajectories

Sparse binary reward environments require principled reward-shaping for negative samples to maintain gradient consistency:

  • Negative-side Reward Reshaping: In OREAL (Lyu et al., 10 Feb 2025), the reward for negative samples is set to $\Delta r(\tau^-) = 1 - p$ (where $p$ is the empirical pass rate), or via a leave-one-out centered reward

$$R_{\text{RLOO}}(\tau_i) = r(\tau_i) - \frac{1}{N-1} \sum_{j \neq i} r(\tau_j)$$

These adjustments restore proper credit assignment and ensure theoretical correctness in the presence of only outcome-level feedback.
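The leave-one-out baseline above is a one-liner; a small sketch:

```python
import numpy as np

def rloo_rewards(rewards):
    """Leave-one-out centering: R_i = r_i - mean of the other N-1 rewards.
    The centered rewards always sum to zero across the group."""
    r = np.asarray(rewards, dtype=float)
    n = len(r)
    return r - (r.sum() - r) / (n - 1)  # (sum - r_i)/(n-1) = mean of others
```

For binary rewards `[1, 0, 0, 1]`, successes receive +2/3 and failures −2/3, so negative trajectories contribute a genuine (negative) learning signal rather than zero gradient.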

6. Theoretical Properties and Complexity Considerations

OAR possesses several key theoretical guarantees:

  • Variance Reduction: For tokens with zero outcome influence ($I_k = 0$), the corresponding gradient variance decreases to $\epsilon^2$ times its value under GRPO, with $\epsilon \ll 1$.
  • Update Magnitude Preservation: The total “advantage mass” per sequence is fixed, avoiding instability from adaptive scaling.
  • Complexity Tradeoffs: OAR-G adds one backward pass per sequence (~1.4× the cost of GRPO), while OAR-P adds $O(T)$ forward passes (~4.2× cost in batch settings). Empirically, OAR-G achieves most of OAR-P’s benefits at far lower cost.

7. Empirical Results in Mathematical Reasoning

Extensive experiments on mathematical reasoning datasets—including AIME25, AIME24, AMC23, MATH500, and GSM8K—demonstrate OAR’s efficacy. Key results (Pass@1/Pass@32) on the Qwen2.5-7B-Base model are presented below (Li et al., 12 Jan 2026):

| Method | AIME25 | AIME24 | AMC23 | MATH500 | GSM8K | Avg Pass@1 |
|---|---|---|---|---|---|---|
| GRPO | 9.4 | 13.5 | 59.8 | 75.8 | 90.5 | 51.3 |
| GRPO + OAR-G | 11.8 | 14.8 | 61.4 | 78.7 | 91.4 | 53.2 |
| GRPO + OAR-P | 12.2 | 15.2 | 61.9 | 78.4 | 92.0 | 53.7 |

OAR-G consistently outperforms baseline GRPO (+1.9 points on average Pass@1 in the table above), with OAR-P providing an upper bound (a further +0.5 points). OAR accelerates reward convergence while maintaining policy entropy, avoiding premature collapse. Computational overhead is moderate: OAR-G requires only ~40% more time per token than GRPO.

8. Significance and Outlook

OAR provides a theoretically principled and practically efficient mechanism for fine-grained credit assignment in both critic-free and KL-regularized RL frameworks, resolving core limitations of sparse outcome-based training in long reasoning chains. Its successful application in mathematical reasoning tasks demonstrates consistent performance gains and suggests potential adoption in other domains characterized by sparse, delayed rewards and intricate credit structures (Lyu et al., 10 Feb 2025, Li et al., 12 Jan 2026).
