OAR-P: Outcome-grounded Advantage Reshaping

Updated 19 January 2026
  • OAR-P is a fine-grained credit assignment mechanism in reinforcement learning that uses token-level counterfactual perturbations to assess each token's contribution.
  • It replaces uniform advantage allocation with outcome-sensitive attributions, reducing premature entropy collapse and enhancing learning dynamics.
  • Empirical results demonstrate improved Pass@k scores on mathematical reasoning benchmarks, validating its high-fidelity token attribution approach.

OAR-P (Outcome-grounded Advantage Reshaping – Perturbation) is a fine-grained credit assignment mechanism in reinforcement learning, designed for optimizing mathematical reasoning chains in LLMs. Unlike conventional GRPO (Group Relative Policy Optimization), which broadcasts a uniform advantage to every token in a trajectory, OAR-P redistributes advantage signals at the token level, reflecting each token's estimated causal influence on the final answer. This influence is measured as outcome sensitivity under counterfactual token perturbations. The resulting high-fidelity attribution signal leads to more effective learning and improved performance on reasoning benchmarks.

1. Motivation for Fine-Grained Credit Assignment in Reasoning

Standard GRPO applies group-level rewards uniformly across all sequence tokens, failing to differentiate crucial reasoning steps ("logical pivots") from syntactic or irrelevant tokens. This coarse-grained strategy not only slows learning dynamics but can also trigger premature entropy collapse of the policy. In long, heterogeneous chains typical of mathematical reasoning, the majority of tokens are not pivotal for correctness. OAR-P addresses this challenge by reallocating advantage such that each token's update magnitude is proportional to its outcome sensitivity—effectively targeting the chain's cross-token credit assignment issue (Li et al., 12 Jan 2026).

2. Formal Definition and Mathematics of OAR-P

Let $\tau = (y_1, \dots, y_T)$ be a sampled trajectory. GRPO computes the normalized sequence advantage:

A(\tau) = \frac{r(\tau) - \bar{r}}{\sqrt{\tfrac{1}{G}\sum_k (r_k - \bar{r})^2}}

where $\bar{r} = G^{-1}\sum_k r_k$ is the group mean reward.
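As a concrete illustration, the group normalization above can be computed in a few lines (a minimal NumPy sketch, not code from the paper):

```python
import numpy as np

def group_advantage(rewards):
    """GRPO sequence advantage: z-score each trajectory's reward
    against the statistics of its sampling group of size G."""
    r = np.asarray(rewards, dtype=float)
    r_bar = r.mean()                       # group mean reward
    std = np.sqrt(np.mean((r - r_bar) ** 2))
    return (r - r_bar) / std               # assumes a non-degenerate group

print(group_advantage([0.0, 1.0, 0.0, 1.0]))  # → [-1.  1. -1.  1.]
```

Degenerate groups (all rewards equal) make the denominator zero; implementations typically add a small ε or skip such groups.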

OAR-P constructs token-level advantages:

A_{\text{OAR-P}}(\tau)_t = A(\tau)\,\tilde{\omega}_t

Here $\tilde{\omega}_t$ is the renormalized importance weight for token $t$ (see Section 4), ensuring that the total "advantage mass" remains constant across the sequence.

3. Counterfactual Token Perturbation Attribution Mechanism

For each token $y_t$ in a trajectory, OAR-P quantifies influence via a counterfactual perturbation:

  • Compute the factual answer distribution $P = \pi_\theta(\cdot \mid x, y_{1:T})$.
  • Generate a perturbed trajectory $\tilde{y}^{(t)} = (y_1, \dots, y_{t-1}, [\text{PAD}], y_{t+1}, \dots, y_T)$ and compute its answer distribution $P^{(t)} = \pi_\theta(\cdot \mid x, \tilde{y}^{(t)})$.
  • The raw importance for $y_t$ is:

I_t^{\text{pert}} = D_{\mathrm{KL}}\!\left(P \,\Vert\, P^{(t)}\right)

Alternatively, any scalar probe function $f$ can be used:

\Delta_t = f(y \mid \tau) - f(y \mid \tau_{\neg t})

where $f$ could be, for example, the log-likelihood of the correct answer span under the model.
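The perturb-and-compare loop can be sketched as follows (a toy NumPy illustration; `answer_logits_fn` is a hypothetical stand-in for a masked forward pass that returns the model's answer logits):

```python
import numpy as np

def softmax(z):
    z = z - z.max()
    e = np.exp(z)
    return e / e.sum()

def kl(p, q, eps=1e-12):
    """KL divergence D_KL(p || q) between two discrete distributions."""
    return float(np.sum(p * (np.log(p + eps) - np.log(q + eps))))

def perturbation_importance(answer_logits_fn, tokens, pad_id=0):
    """I_t^pert = KL(P || P^(t)), where P^(t) is the answer distribution
    after replacing reasoning token t with [PAD]."""
    P = softmax(answer_logits_fn(tokens))
    scores = []
    for t in range(len(tokens)):
        perturbed = list(tokens)
        perturbed[t] = pad_id  # counterfactual mask
        scores.append(kl(P, softmax(answer_logits_fn(perturbed))))
    return np.array(scores)

# Toy stand-in: the "answer logits" depend on the reasoning tokens,
# so masking a high-impact token shifts the distribution more.
toy_logits = lambda toks: np.array([float(sum(toks)), -float(sum(toks)), 0.0])
scores = perturbation_importance(toy_logits, [5, 1])
```

In the real setting each call to `answer_logits_fn` is one masked forward pass, which is the source of the O(T) overhead discussed in Section 5.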

4. Integration with Conservative Bi-Level Advantage Reshaping

The outcome sensitivities $I_t$ are normalized:

\bar{I}_t = \log(1 + I_t), \qquad \hat{I}_t = \frac{\bar{I}_t - \min_j \bar{I}_j}{\max_j \bar{I}_j - \min_j \bar{I}_j + \epsilon}

Tokens are then weighted via a bi-level function (with threshold $\tau$ and boost coefficient $\beta$):

\omega(\hat{I}_t) = \begin{cases} \dfrac{\hat{I}_t}{\tau+\epsilon}, & \hat{I}_t < \tau \quad \text{(noise suppression)} \\ 1+\beta\,\dfrac{\hat{I}_t-\tau}{1-\tau+\epsilon}, & \hat{I}_t \ge \tau \quad \text{(signal boosting)} \end{cases}

Final normalization preserves total advantage mass:

\tilde{\omega}_t = \omega(\hat{I}_t)\cdot \frac{T}{\sum_j \omega(\hat{I}_j)}, \qquad \sum_t \tilde{\omega}_t = T
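Putting the steps of this section together, the reshaping pipeline can be sketched as follows (a minimal NumPy illustration using the default hyperparameters from Section 7, τ = 0.4 and β = 2.0):

```python
import numpy as np

def reshape_weights(I, tau=0.4, beta=2.0, eps=1e-6):
    """Conservative bi-level reshaping: log-compress raw sensitivities,
    min-max normalize, suppress tokens below tau, boost tokens above it,
    then rescale so the weights sum to T (advantage mass preserved)."""
    I_bar = np.log1p(np.asarray(I, dtype=float))
    I_hat = (I_bar - I_bar.min()) / (I_bar.max() - I_bar.min() + eps)
    omega = np.where(
        I_hat < tau,
        I_hat / (tau + eps),                           # noise suppression
        1.0 + beta * (I_hat - tau) / (1 - tau + eps),  # signal boosting
    )
    return omega * len(I) / omega.sum()

w = reshape_weights([0.01, 0.02, 1.5, 4.0])
```

The token-level advantage from Section 2 is then simply $A(\tau)\,\tilde{\omega}_t$.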

5. Training Workflow and Computational Complexity

OAR-P is integrated per GRPO update as follows:

  1. Sample $G$ trajectories and obtain rewards.
  2. Compute $A(\tau)$ per trajectory.
  3. For each trajectory:
    • Compute the factual distribution $P$ and, for each token $t$, the perturbed distribution $P^{(t)}$.
    • Compute $I_t^{\text{pert}}$, normalize, and derive $\tilde{\omega}_t$.
    • Set the token-level advantages $A_{\text{OAR-P}}(\tau)_t$.
  4. Run PPO-style policy gradient updates using these per-token advantages:

\mathcal{L}_{\text{OAR-P}} = \mathbb{E}_{i,t}\left[\min\left(\rho^{(i)}_t\, A_{\text{OAR-P}}(\tau^{(i)})_t,\; \mathrm{clip}\big(\rho^{(i)}_t,\, 1-\varepsilon,\, 1+\varepsilon\big)\, A_{\text{OAR-P}}(\tau^{(i)})_t\right)\right]

The method adds $O(T)$ forward passes per sampled trajectory. A batched implementation can parallelize all masked queries, yielding approximately 4.2× the cost of vanilla GRPO.
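Step 4 above is the standard clipped surrogate evaluated with per-token advantages instead of a broadcast sequence advantage. A minimal NumPy sketch, assuming per-token log-probs are available (the clip range ε = 0.2 is a conventional PPO default, not taken from the paper):

```python
import numpy as np

def clipped_token_loss(logp_new, logp_old, adv_tok, eps=0.2):
    """PPO-style clipped objective with per-token reshaped advantages.
    logp_new / logp_old: per-token log-probs under the current and
    behavior policies; adv_tok: A_OAR-P for each token."""
    rho = np.exp(logp_new - logp_old)                # importance ratios
    unclipped = rho * adv_tok
    clipped = np.clip(rho, 1 - eps, 1 + eps) * adv_tok
    # maximize the surrogate, so return its negation as a loss
    return float(-np.mean(np.minimum(unclipped, clipped)))
```

When the current and behavior policies coincide, the ratios are 1 and the loss reduces to the negative mean token advantage, as expected.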

6. Empirical Validation: Mathematical Reasoning Benchmarks

OAR-P sets the empirical upper bound for critic-free token attribution in LLMs:

  • On Qwen2.5-7B, average Pass@$k$ improved from 51.3 to 53.7 (+2.4) on five mathematical reasoning sets (AIME25/24, AMC23, MATH500, GSM8K).
  • On Qwen2.5-Math-7B, Pass@$k$ increased from 57.3 to 59.5 (+2.2).
  • The gradient proxy variant (OAR-G) closes most of the gap with much lower computational overhead.
  • Training curves indicate improved stability and avoidance of entropy collapse.
  • Causal-token attribution strongly outperformed entropy-based weighting in recall experiments against Oracle masks (Li et al., 12 Jan 2026).

7. Implementation Details, Limitations, and Extensibility

  • Typical hyperparameters: threshold $\tau = 0.4$ (boost top 60%), boost coefficient $\beta = 2.0$, numerical $\epsilon \approx 10^{-6}$.
  • Efficient batching of masked forward queries is essential to keep overhead manageable.
  • The surrogate outcome signal assumes that changes in the model’s answer distribution align with external (verifier) rewards; misalignment can occur.
  • OAR-P provides a high-fidelity upper bound on token attribution with more computational cost, while OAR-G leverages input-gradient sensitivity for faster approximation.
  • The OAR framework extends to any RL paradigm featuring delayed or non-differentiable rewards, provided suitable outcome probes are definable (e.g., next-token logits, answer span likelihood).

OAR-P thus enables high-resolution, outcome-sensitive assignment of policy gradients in autoregressive sequence models. By employing robust counterfactual perturbations and conservative reshaping, OAR-P advances the optimization frontier for critic-free mathematical reasoning tasks in large-scale LLMs (Li et al., 12 Jan 2026).
