Self-Hint Aligned GRPO (SAGE)

Updated 5 February 2026
  • The paper introduces SAGE, a framework that injects compact privileged hints to overcome advantage collapse in sparse-reward settings.
  • SAGE employs online self-hinting and adaptive curriculum scheduling to maintain effective gradient signals and enhance policy learning.
  • Empirical results demonstrate SAGE's gains of 2–6 points in accuracy and improved exploration compared to baseline GRPO methods.

Self-Hint Aligned GRPO with Privileged Supervision (SAGE) is an on-policy reinforcement learning (RL) framework developed to address the challenge of aligning LLMs using group-based sparse-reward protocols. SAGE modifies the Group Relative Policy Optimization (GRPO) procedure by injecting privileged, compact hints, such as plans or solution decompositions, during training. These hints reshape the model’s rollout distribution without affecting the verifier reward, mitigating the advantage collapse that stalls RL under sparse rewards. During inference, the model operates without hints, reflecting only the improved capabilities learned during privileged supervision (Liao et al., 3 Feb 2026).

1. Theoretical Foundation and Training Objective

GRPO is a group-based RL protocol in which a policy $\pi_\theta$ generates rollouts $\tau = (y_1, \ldots, y_T)$ given a prompt $x \sim D$ and receives a binary terminal reward $R(x, \tau) \in \{0, 1\}$. Within a group of $G$ rollouts $\{\tau_i\}_{i=1}^G$, normalized advantages are computed as $A_i = (R_i - \bar R)/(\sigma + \epsilon)$, with $\bar R = \frac{1}{G} \sum_i R_i$ and $\sigma^2 = \frac{1}{G} \sum_i (R_i - \bar R)^2$. The GRPO gradient loss is:

$$L_{\text{GRPO}}(\theta) = -\mathbb{E}_{x, \tau_i} \left[ \frac{1}{G} \sum_{i=1}^G A_i \sum_{t=1}^{|\tau_i|} \nabla_\theta \log \pi_\theta(y_{i,t} \mid x, y_{i,<t}) \right]$$

GRPO often suffers from advantage collapse in the sparse-reward regime ($p_0(x) \ll 1$), where all rollouts within a group receive zero reward, resulting in $\sigma = 0$ and vanishing gradients. SAGE introduces privileged hints $h$ obtained by compressing reference solutions $\tau^*$ into compact forms via a sampled hint-strength variable $\ell$: $h \sim q(h \mid x, \tau^*, \ell)$.
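The advantage collapse described above can be checked numerically. The following is a minimal sketch using only the standard library; the function name `group_advantages` and the default value of `eps` are illustrative assumptions, not taken from the paper.

```python
import math
from typing import List


def group_advantages(rewards: List[float], eps: float = 1e-6) -> List[float]:
    """Group-normalized advantages A_i = (R_i - mean) / (std + eps)."""
    G = len(rewards)
    mean = sum(rewards) / G
    # Population variance over the group, matching sigma^2 = (1/G) * sum (R_i - mean)^2.
    var = sum((r - mean) ** 2 for r in rewards) / G
    std = math.sqrt(var)
    return [(r - mean) / (std + eps) for r in rewards]


# One success in the group: advantages are nonzero, so the update carries signal.
mixed = group_advantages([1, 0, 0, 0])

# No success in the group: every advantage is exactly zero (advantage collapse).
collapsed = group_advantages([0, 0, 0, 0])
```

In the collapsed case each numerator $R_i - \bar R$ is zero, so the group contributes nothing to the policy gradient regardless of how $\epsilon$ is chosen.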

For each prompt $x$, the SAGE workflow is:

  • Sample hint level $\ell \sim p(\ell \mid x, \theta)$
  • Sample hint $h \sim q(h \mid x, \tau^*, \ell)$
  • Generate group rollouts $\tau_i \sim \pi_\theta(\cdot \mid x, h)$ and compute rewards $R_i = R(x, \tau_i)$
  • Compute advantages $A_i$
  • Aggregate the policy gradient and an optional KL penalty against the reference policy $\pi_{\text{ref}}$

The SAGE loss:

$$L_{\text{SAGE}}(\theta) = \mathbb{E}_{x \sim D, \ell, h, \tau_i}\left[ L_{\text{PG}} + \beta L_{\text{KL}} \right]$$

where $L_{\text{PG}}$ is the group advantage-weighted policy gradient and $L_{\text{KL}}$ is the reference KL penalty.
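As a rough illustration of how the two terms combine for a single hinted group, the sketch below computes a scalar surrogate from per-token log-probabilities. Several things here are assumptions rather than details from the source: the function name `sage_group_loss`, the use of a surrogate whose gradient (with advantages held constant) gives the advantage-weighted policy gradient, the choice to condition the reference policy on the same hinted prompt, and the KL estimator (a mean per-token log-ratio over sampled tokens), which the text does not specify.

```python
from typing import List


def sage_group_loss(
    advantages: List[float],
    logps: List[List[float]],
    ref_logps: List[List[float]],
    beta: float,
) -> float:
    """Surrogate L_PG + beta * L_KL for one group of rollouts.

    logps[i][t] stands for log pi_theta(y_{i,t} | x, h, y_{i,<t});
    ref_logps[i][t] is the same quantity under the reference policy.
    """
    G = len(advantages)
    # Policy-gradient surrogate: advantages are treated as constants.
    pg = -sum(a * sum(lp) for a, lp in zip(advantages, logps)) / G
    # Assumed KL estimate: average log-ratio over all sampled tokens.
    n_tokens = sum(len(lp) for lp in logps)
    kl = sum(
        l - r
        for lp, rp in zip(logps, ref_logps)
        for l, r in zip(lp, rp)
    ) / n_tokens
    return pg + beta * kl


loss = sage_group_loss(
    advantages=[1.0, -1.0],
    logps=[[-1.0, -1.0], [-2.0, -2.0]],
    ref_logps=[[-1.5, -1.5], [-2.0, -2.0]],
    beta=0.1,
)
```

The hint enters only through the conditioning of the log-probabilities; the rewards behind `advantages` are computed from $(x, \tau_i)$ alone, consistent with the verifier being hint-independent.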

2. Advantage Preservation through Hint Injection

Under Bernoulli rewards, group variance is $\sigma^2 = \bar R (1 - \bar R)$, collapsing when all rewards are identical. The gate-opening probability, i.e. the probability that group variance is nonzero, is $u(p) = 1 - (1 - p)^G - p^G$, maximized at $p = 1/2$. SAGE raises the conditional success rate $p(x, h) = \Pr_\tau[R(x, \tau) = 1 \mid x, h]$ by injecting hints, avoiding the $Gp < 1$ degeneracy (a group expected to contain fewer than one success) and preserving gradient signals.
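To get a feel for how quickly the gate closes at low success rates, the formula for $u(p)$ can be evaluated directly. This is a small sketch; the function name `gate_open_prob` and the group size of 8 are illustrative choices.

```python
def gate_open_prob(p: float, G: int) -> float:
    """Probability that a group of G Bernoulli(p) rewards is not all-equal."""
    return 1.0 - (1.0 - p) ** G - p ** G


# Compare a hard prompt, a moderately hard prompt, and the optimum p = 1/2.
for p in (0.01, 0.1, 0.5):
    print(f"p={p:<5} u(p)={gate_open_prob(p, 8):.4f}")
```

For small $p$, $u(p) \approx Gp$, so when $Gp < 1$ most groups contain no success and produce no gradient; moving $p(x, h)$ toward $1/2$ makes almost every group informative.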

Jensen’s inequality shows that, at fixed mean pp, mixing across many hint distributions dilutes gate-opening probability. SAGE allocates compute to a single well-calibrated hint per prompt, optimizing non-degenerate group formation and learning efficiency.
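The Jensen step can be spelled out as follows. The notation $\bar p$ for the mean success rate over hints is introduced here for illustration. For $G \ge 2$, $u$ is concave on $[0, 1]$:

$$u''(p) = -G(G-1)\left[ (1 - p)^{G-2} + p^{G-2} \right] \le 0$$

so for any distribution over hints with $\bar p = \mathbb{E}_h[p(x, h)]$:

$$\mathbb{E}_h\left[ u(p(x, h)) \right] \le u\left( \mathbb{E}_h[p(x, h)] \right) = u(\bar p)$$

As a numerical check with $G = 8$, an even mix of hints with $p = 0.1$ and $p = 0.9$ has $\bar p = 0.5$ but a gate-opening probability of about $0.57$, whereas a single hint with $p = 0.5$ gives about $0.99$.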

3. On-Policy SAGE Training Workflow

The SAGE training loop operates as follows:

  • Input: Dataset of pairs $(x, \tau^*)$, policy $\pi_\theta$, reference $\pi_{\text{ref}}$, group size $G$, hint level cap $L$, KL weight $\beta$, stabilizer $\epsilon$
  • Hint Generation: $q(h \mid x, \tau^*, \ell)$ via $\pi_\theta$ prompting
  • Scheduling: per-prompt hint-level policy $\ell(x)$, set either at the epoch level (SAGE-LIGHT) or by group degeneracy (full SAGE)
  • Loop:
    • For each batch $(x_b, \tau^*_b)$, select $\ell(x_b)$ and sample $h \sim q(h \mid x_b, \tau^*_b, \ell)$
    • Optionally probe with a group of size $G_0$; if all $R = 0$, increment $\ell(x_b)$
    • Roll out $G$ trajectories, aggregate rewards, compute advantages
    • Compute and apply policy gradient and KL penalty
    • Update parameters via gradient descent

SAGE-LIGHT relies on epoch-wide empirical success for hint scheduling, while SAGE monitors per-prompt collapse directly.
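The per-prompt collapse monitoring in full SAGE can be sketched as a toy simulation. Everything below is a simplified assumption: the policy and verifier are replaced by a hypothetical `success_prob` function mapping hint level to success probability, the policy itself never improves, the level starts at 0, and at most one increment happens per step. The names `next_hint_level` and `run_prompt` are illustrative.

```python
import random
from typing import Callable, List, Tuple


def next_hint_level(level: int, probe_rewards: List[int], max_level: int) -> int:
    """Raise the hint level by one (capped) if the probe group has no success."""
    if probe_rewards and sum(probe_rewards) == 0:
        return min(level + 1, max_level)
    return level


def run_prompt(
    success_prob: Callable[[int], float],  # hypothetical: hint level -> P(R = 1)
    max_level: int,
    probe_size: int,
    group_size: int,
    steps: int,
    seed: int = 0,
) -> List[Tuple[int, List[int]]]:
    """Simulate the schedule for one prompt; returns (level, group rewards) per step."""
    rng = random.Random(seed)
    level = 0  # assumed starting point: no hint
    trace = []
    for _ in range(steps):
        # Probe with G_0 rollouts at the current hint level.
        probe = [int(rng.random() < success_prob(level)) for _ in range(probe_size)]
        level = next_hint_level(level, probe, max_level)
        # Main group of G rollouts at the (possibly raised) level.
        rewards = [int(rng.random() < success_prob(level)) for _ in range(group_size)]
        trace.append((level, rewards))
    return trace


# A prompt that is unsolvable below hint level 2 and always solved from level 2 on.
trace = run_prompt(lambda lvl: 0.0 if lvl < 2 else 1.0,
                   max_level=3, probe_size=4, group_size=8, steps=4)
levels = [lvl for lvl, _ in trace]
```

In a real training run the rewards from the main group would feed the advantage computation and the parameter update, and the success probability at each level would shift as $\pi_\theta$ improves.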

4. Adaptive Curriculum and Online Self-Hinting

Self-hinting, the generation of hints by a lagged copy of $\pi_\theta$, establishes an adaptive curriculum, ensuring hints remain matched to the learner’s proficiency. Fixed hint generators (initial policy, external LLM) often produce hints that become suboptimal as $\theta$ evolves, either too easy (no gradient) or too hard (collapse). Online self-hinting keeps $p(x, h)$ near its optimal range, maximizing update frequency and learning traction.

Empirical ablations demonstrate that online self-hinting is superior to both offline self-hinting and external-teacher hinting (e.g., GPT-5 scaffold), with a 1–2 point accuracy benefit, even when sampling multiple offline hints.

5. Deployment Protocol and Distributional Generalization

During evaluation, SAGE sets $\ell = 0$, enforcing $h = \varnothing$. The model executes its policy $\pi_\theta(\cdot \mid x)$ without any privileged information, and since the reward function never depended on hints, test-time accuracy directly reflects reasoning improvements acquired during privileged supervision.

6. Empirical Results and Learning Dynamics

SAGE’s performance has been validated across six in-distribution (AIME24, AIME25, AMC23, MATH-500, Minerva-Math, Olympiad) and two out-of-distribution benchmarks (GPQA-diamond, MMLU-Pro). Experiments use three backbone LLMs (Llama-3.2-3B-Instruct, Qwen2.5-7B-Instruct, Qwen3-4B-Instruct), with SAGE compared against SFT, vanilla GRPO, LUFFY, and Scaf-GRPO.

| Model | SAGE Gain vs GRPO | SAGE Gain vs Best Baseline | Fraction Never Correct (GRPO → SAGE) |
|---|---|---|---|
| Llama-3.2-3B-Instruct | +6.1 pts | +2.0 pts | 40.2% → 30.0% |
| Qwen2.5-7B-Instruct | +4.5 pts | n/a | n/a |
| Qwen3-4B-Instruct | +4.2 pts | n/a | n/a |

SAGE achieves the highest average accuracy in both ID and OOD tests. SAGE-LIGHT scores approximately 1 point below full SAGE but is only 1.2× slower than GRPO, making it the more compute-efficient variant.

Training reward curves show SAGE achieves continuous improvement while GRPO plateaus. SAGE sustains moderate $\pi_\theta$ entropy (on-policy exploration), in contrast to collapsed exploration in Scaf-GRPO and off-policy oscillation in LUFFY. SAGE also increases response length during training, correlating with accuracy enhancement.

–––––––––––––––––––––––––––––––––––––

In summary, Self-Hint Aligned GRPO with Privileged Supervision (SAGE) leverages online self-hint injection and adaptive curriculum scheduling to address advantage collapse in group-based RL. SAGE maintains on-policy update integrity, dynamically calibrates hint difficulty, and delivers empirical gains of 2–6 points in accuracy over strong baselines across multiple math and general knowledge benchmarks for diverse LLM backbones (Liao et al., 3 Feb 2026).
