Self-Hint Aligned GRPO (SAGE)

Updated 5 February 2026

The paper introduces SAGE, a framework that injects compact privileged hints to overcome advantage collapse in sparse-reward settings.
SAGE employs online self-hinting and adaptive curriculum scheduling to maintain effective gradient signals and enhance policy learning.
Empirical results demonstrate SAGE's gains of 2–6 points in accuracy and improved exploration compared to baseline GRPO methods.

Self-Hint Aligned GRPO with Privileged Supervision (SAGE) is an on-policy reinforcement learning (RL) framework for aligning LLMs under group-based, sparse-reward training. SAGE modifies the Group Relative Policy Optimization (GRPO) procedure by injecting compact privileged hints, such as plans or solution decompositions, during training. These hints reshape the model's rollout distribution without changing the verifier reward, which counteracts the advantage collapse that stalls GRPO when rewards are sparse. At inference the model runs without hints, so its performance reflects only the capabilities learned under privileged supervision (Liao et al., 3 Feb 2026).

1. Theoretical Foundation and Training Objective

GRPO is a group-based RL protocol in which a policy $\pi_\theta$ generates rollouts $\tau=(y_1, \ldots, y_T)$ given a prompt $x \sim D$ and receives a binary terminal reward $R(x, \tau) \in \{0, 1\}$. Within a group of $G$ rollouts $\{\tau_i\}_{i=1}^G$ with rewards $R_i = R(x, \tau_i)$, normalized advantages are computed as $A_i = (R_i - \bar R)/(\sigma + \epsilon)$, with $\bar R = \frac{1}{G} \sum_i R_i$ and $\sigma^2 = \frac{1}{G} \sum_i (R_i - \bar R)^2$. Treating the advantages as constants, the GRPO loss is:

$L_{\text{GRPO}}(\theta) = -\mathbb{E}_{x, \tau_i} \left[ \frac{1}{G} \sum_{i=1}^G A_i \sum_{t=1}^{|\tau_i|} \log \pi_\theta(y_{i,t}|x, y_{i,<t}) \right]$

so its gradient is the advantage-weighted sum of the score terms $\nabla_\theta \log \pi_\theta(y_{i,t}|x, y_{i,<t})$.
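The group normalization can be sketched in a few lines. This is a minimal illustration, assuming rewards arrive as a plain Python list of 0/1 values; the function name `group_advantages` and the default `eps` are illustrative choices, not taken from the paper.

```python
import math

def group_advantages(rewards, eps=1e-6):
    """Group-normalized advantages A_i = (R_i - mean) / (std + eps)."""
    g = len(rewards)
    mean = sum(rewards) / g
    # Population variance (divide by G), matching the definition in the text.
    var = sum((r - mean) ** 2 for r in rewards) / g
    std = math.sqrt(var)
    return [(r - mean) / (std + eps) for r in rewards]

# A group with one success: the successful rollout gets a positive advantage.
mixed = group_advantages([1, 0, 0, 0])

# A group with identical rewards: every advantage is exactly zero,
# so the advantage-weighted gradient vanishes for this prompt.
collapsed = group_advantages([0, 0, 0, 0])
```

With identical rewards the numerator is zero for every rollout, so the stabilizer `eps` only prevents division by zero; it cannot restore a learning signal.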

GRPO often suffers from advantage collapse in the sparse-reward regime (per-rollout success probability $p \ll 1$), where all rollouts within a group receive zero reward, resulting in $A_i = 0$ for every $i$ and vanishing gradients. SAGE introduces privileged hints $h$ obtained by compressing reference solutions $\tau^\star$ into compact forms via a sampled hint-strength variable $\ell$; $h \sim q(\cdot \mid x, \tau^\star, \ell)$, and $\ell = 0$ gives the empty hint.

For each prompt $x$, SAGE’s workflow is:

- Sample hint level $\ell \in \{0, \ldots, \ell_{\max}\}$
- Sample hint $h \sim q(\cdot \mid x, \tau^\star, \ell)$
- Generate group rollouts $\tau_i \sim \pi_\theta(\cdot \mid x, h)$, compute rewards $R_i = R(x, \tau_i)$
- Compute advantages $A_i = (R_i - \bar R)/(\sigma + \epsilon)$
- Aggregate policy gradient and optional KL penalty against reference policy $\pi_{\text{ref}}$

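The SAGE loss:

$L_{\text{SAGE}}(\theta) = L_{\text{PG}}(\theta) + \beta\, L_{\text{KL}}(\theta)$

where $L_{\text{PG}}$ is the group advantage-weighted policy gradient term computed on hinted rollouts, $L_{\text{KL}}$ is the reference KL penalty, and $\beta$ is the KL weight.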
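One hinted group (hint, rollouts, rewards, advantages) might look like the following sketch. Everything here is a stand-in: the hint compressor simply reveals the first `level` reference steps, and `toy_policy` and `toy_verifier` are hypothetical placeholders for $\pi_\theta$ and the verifier, not the paper's components. The property the sketch aims to preserve is that the verifier receives only the prompt and the rollout, never the hint.

```python
import math
import random

def sage_group_step(prompt, reference_steps, level, policy, verifier,
                    group_size=8, eps=1e-6):
    """One hinted group: hint -> rollouts -> rewards -> advantages."""
    # Hypothetical hint compressor: reveal the first `level` reference steps.
    hint = " ".join(reference_steps[:level])
    rollouts = [policy(prompt, hint) for _ in range(group_size)]
    # The verifier sees only the prompt and the rollout, never the hint.
    rewards = [verifier(prompt, t) for t in rollouts]
    mean = sum(rewards) / group_size
    std = math.sqrt(sum((r - mean) ** 2 for r in rewards) / group_size)
    advantages = [(r - mean) / (std + eps) for r in rewards]
    return hint, rewards, advantages

rng = random.Random(0)

def toy_policy(prompt, hint):
    # Stand-in for pi_theta: longer hints make the right answer more likely.
    p_success = min(1.0, 0.05 + 0.3 * len(hint.split()))
    return "42" if rng.random() < p_success else "wrong"

def toy_verifier(prompt, rollout):
    # Binary terminal reward that depends only on the rollout.
    return 1 if rollout == "42" else 0

hint, rewards, advantages = sage_group_step(
    "What is 6 * 7?", ["multiply", "six", "by", "seven"], level=2,
    policy=toy_policy, verifier=toy_verifier)
```

Passing `level=0` produces an empty hint, which reduces the step to an ordinary unhinted GRPO group.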

2. Advantage Preservation through Hint Injection

Under Bernoulli rewards with per-rollout success probability $p = \Pr[R = 1 \mid x, h]$, the reward variance is $p(1-p)$, and the empirical group variance collapses to zero when all rewards are identical. The gate-opening probability that group variance is nonzero is $1 - p^G - (1-p)^G$, maximized at $p = 1/2$. SAGE raises the conditional success rate $p$ by injecting hints, avoiding the $p \approx 0$ degeneracy and preserving gradient signals.
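For independent Bernoulli rewards with success probability $p$ and group size $G$, a group carries signal only if it contains both a success and a failure, which happens with probability $1 - p^G - (1-p)^G$. The short sketch below evaluates that expression on a grid; the function name and the choice $G = 8$ are illustrative assumptions.

```python
def gate_open_probability(p, group_size):
    """Probability that a group of i.i.d. Bernoulli(p) rewards is not all-0 or all-1."""
    return 1.0 - p ** group_size - (1.0 - p) ** group_size

G = 8  # illustrative group size
grid = [i / 100 for i in range(101)]
best_p = max(grid, key=lambda p: gate_open_probability(p, G))

sparse = gate_open_probability(0.01, G)    # hard prompt without a hint
balanced = gate_open_probability(0.5, G)   # success rate after a well-matched hint
```

On this grid the maximizer is `best_p == 0.5`, and a prompt with a 1% success rate opens the gate far less often than one whose hinted success rate is near one half.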

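Jensen’s inequality shows that, at fixed mean $\mathbb{E}_h[p]$, mixing across many hint distributions dilutes gate-opening probability: $1 - p^G - (1-p)^G$ is concave in $p$, so its average over hints with different success rates is at most its value at the mean success rate. SAGE therefore allocates compute to a single well-calibrated hint per prompt, optimizing non-degenerate group formation and learning efficiency.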
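The dilution effect can be checked numerically. The sketch below compares one hint with success rate 0.5 against an even mixture of two hints with success rates 0.1 and 0.9, which has the same mean; the specific rates and $G = 8$ are assumed values chosen only for illustration.

```python
def gate_open_probability(p, group_size):
    """Probability that a group of i.i.d. Bernoulli(p) rewards has mixed outcomes."""
    return 1.0 - p ** group_size - (1.0 - p) ** group_size

def mixture_gate_probability(success_rates, group_size):
    """Average gate-opening probability when each hint is used equally often."""
    total = sum(gate_open_probability(p, group_size) for p in success_rates)
    return total / len(success_rates)

G = 8
single = gate_open_probability(0.5, G)              # one calibrated hint
mixture = mixture_gate_probability([0.1, 0.9], G)   # same mean success rate of 0.5
```

Because the gate-opening probability is concave in $p$, `mixture` comes out below `single` even though both settings share the same average success rate.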

3. On-Policy SAGE Training Workflow

The SAGE training loop operates as follows:

Input: Dataset $D$, policy $\pi_\theta$, reference $\pi_{\text{ref}}$, group size $G$, hint level cap $\ell_{\max}$, KL weight $\beta$, stabilizer $\epsilon$
Hint Generation: $h \sim q(\cdot \mid x, \tau^\star, \ell)$ via prompting of the hint generator (in SAGE, a lagged copy of $\pi_\theta$)
Scheduling: per-prompt hint level $\ell(x)$, set either from epoch-level success statistics (SAGE-LIGHT) or from group degeneracy (full SAGE)
Loop:
- For each batch $B \subset D$, select $\ell(x)$, sample $h$
- Optionally probe with a group of size $G$; if all $R_i = 0$, increment $\ell(x)$
- Roll out $G$ trajectories, aggregate rewards, compute advantages
- Compute and apply policy gradient and KL penalty
- Update parameters via gradient descent

SAGE-LIGHT relies on epoch-wide empirical success for hint scheduling, while SAGE monitors per-prompt collapse directly.
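The group-degeneracy rule ("if every probe rollout fails, strengthen the hint") can be sketched as below. This is one possible reading, under the assumption that escalation repeats until a probe contains a success or the cap is reached; `choose_hint_level`, `success_prob`, and the simulated probe are hypothetical stand-ins for real rollouts and are not taken from the paper.

```python
import random

def choose_hint_level(success_prob, level_cap, probe_size, rng):
    """Raise the hint level while a probe group comes back all-zero."""
    level = 0
    while level < level_cap:
        # Simulated probe: each rollout succeeds with probability success_prob(level).
        probe = [1 if rng.random() < success_prob(level) else 0
                 for _ in range(probe_size)]
        if any(probe):   # at least one success: stop escalating
            break
        level += 1       # degenerate all-zero probe: strengthen the hint
    return level

rng = random.Random(0)
# Hypothetical prompt that is unsolvable below hint level 2.
level = choose_hint_level(lambda lvl: 0.0 if lvl < 2 else 1.0,
                          level_cap=3, probe_size=4, rng=rng)
```

An epoch-level variant in the spirit of SAGE-LIGHT would replace the probe with each prompt's success statistics from the previous epoch, trading per-prompt precision for fewer extra rollouts.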

4. Adaptive Curriculum and Online Self-Hinting

Self-hinting, the generation of hints by a lagged copy of $\pi_\theta$, establishes an adaptive curriculum, ensuring hints remain matched to the learner’s proficiency. Fixed hint generators (initial policy, external LLM) often produce hints that become suboptimal as $\pi_\theta$ evolves, leaving the task either too easy (all rollouts succeed, so no gradient) or too hard (all rollouts fail, so the group collapses). Online self-hinting keeps the hinted success rate $p$ near its optimum, maximizing update frequency and learning traction.
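A lagged copy can be kept with a periodic snapshot of the learner's parameters. The sketch below assumes a fixed refresh interval (`sync_every`) and a dictionary of parameters; both are illustrative assumptions, since the source does not specify how the lag is implemented.

```python
import copy

class LaggedHinter:
    """Holds a periodically refreshed snapshot of the learner's parameters."""

    def __init__(self, learner_params, sync_every):
        self.params = copy.deepcopy(learner_params)
        self.sync_every = sync_every

    def maybe_sync(self, step, learner_params):
        # Refresh the snapshot every `sync_every` steps; otherwise stay lagged.
        if step % self.sync_every == 0:
            self.params = copy.deepcopy(learner_params)

learner = {"w": 0.0}
hinter = LaggedHinter(learner, sync_every=3)
history = []
for step in range(1, 7):
    learner["w"] += 1.0               # stand-in for a gradient update
    hinter.maybe_sync(step, learner)
    history.append(hinter.params["w"])
```

The hinter tracks the learner with a bounded delay, in contrast with a fixed generator that would keep the initial parameters for the whole run.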

Empirical ablations demonstrate that online self-hinting is superior to both offline self-hinting and external-teacher hinting (e.g., GPT-5 scaffold), with a 1–2 point accuracy benefit, even when sampling multiple offline hints.

5. Deployment Protocol and Distributional Generalization

During evaluation, SAGE sets $\ell = 0$, enforcing $h = \varnothing$. The model executes its policy $\pi_\theta(\cdot \mid x)$ without any privileged information, and since the reward function never depended on hints, test-time accuracy directly reflects reasoning improvements acquired during privileged supervision.
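In code, the train/test asymmetry amounts to an optional hint field in the prompt builder. The template below is a hypothetical illustration (the source does not give the actual prompt format); the point is only that an empty hint leaves the question unchanged.

```python
def build_prompt(question, hint=""):
    """Attach a hint during training; return the bare question when the hint is empty."""
    if hint:
        # Hypothetical template; the real hint formatting is not specified in the source.
        return f"{question}\n\nHint: {hint}"
    return question

train_prompt = build_prompt("What is 6 * 7?", hint="Multiply the two numbers.")
eval_prompt = build_prompt("What is 6 * 7?")  # hint level 0: no privileged input
```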

6. Empirical Results and Learning Dynamics

SAGE’s performance has been validated across six in-distribution (AIME24, AIME25, AMC23, MATH-500, Minerva-Math, Olympiad) and two out-of-distribution benchmarks (GPQA-diamond, MMLU-Pro). Three backbone LLMs (Llama-3.2-3B-Instruct, Qwen2.5-7B-Instruct, Qwen3-4B-Instruct) are trained with SAGE and compared against four baselines: SFT, vanilla GRPO, LUFFY, and Scaf-GRPO.

Model	SAGE Gain vs GRPO	SAGE Gain vs Best Baseline	Fraction Never Correct (GRPO→SAGE)
Llama-3.2-3B-Instruct	+6.1 pts	+2.0 pts	40.2% → 30.0%
Qwen2.5-7B-Instruct	+4.5 pts	n/a	n/a
Qwen3-4B-Instruct	+4.2 pts	n/a	n/a

SAGE achieves the highest average accuracy in both ID and OOD tests. SAGE-LIGHT trails full SAGE by approximately 1 point but is only 1.2× slower than GRPO, making it the more compute-efficient variant.

Training reward curves show SAGE achieves continuous improvement while GRPO plateaus. SAGE sustains moderate policy entropy (on-policy exploration), in contrast to collapsed exploration in Scaf-GRPO and off-policy oscillation in LUFFY. SAGE also increases response length during training, which correlates with the accuracy gains.

–––––––––––––––––––––––––––––––––––––

In summary, Self-Hint Aligned GRPO with Privileged Supervision (SAGE) leverages online self-hint injection and adaptive curriculum scheduling to address advantage collapse in group-based RL. SAGE maintains on-policy update integrity, dynamically calibrates hint difficulty, and delivers empirical gains of 2–6 points in accuracy over strong baselines across multiple math and general knowledge benchmarks for diverse LLM backbones (Liao et al., 3 Feb 2026).

References (1)

Self-Hinting Language Models Enhance Reinforcement Learning (2026)
