Self-Hint Aligned GRPO (SAGE)
- The paper introduces SAGE, a framework that injects compact privileged hints to overcome advantage collapse in sparse-reward settings.
- SAGE employs online self-hinting and adaptive curriculum scheduling to maintain effective gradient signals and enhance policy learning.
- Empirical results demonstrate SAGE's gains of 2–6 points in accuracy and improved exploration compared to baseline GRPO methods.
Self-Hint Aligned GRPO with Privileged Supervision (SAGE) is an on-policy reinforcement learning (RL) framework developed to address the challenge of aligning large language models (LLMs) using group-based sparse-reward protocols. SAGE modifies the Group Relative Policy Optimization (GRPO) procedure by injecting privileged, compact hints, such as plans or solution decompositions, during training. These hints reshape the model’s rollout distribution without affecting the verifier reward, overcoming critical bottlenecks in RL from sparse rewards. During inference, the model operates without hints, so its performance reflects only the capabilities learned during privileged supervision (Liao et al., 3 Feb 2026).
1. Theoretical Foundation and Training Objective
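GRPO is a group-based RL protocol in which a policy $\pi_\theta$ generates rollouts $\tau$ given a prompt $x$, each receiving a binary terminal reward $R(x, \tau) \in \{0, 1\}$. Within a group of $G$ rollouts $\{\tau_i\}_{i=1}^{G}$ with rewards $R_i = R(x, \tau_i)$, normalized advantages are computed as $A_i = (R_i - \bar{R}) / (\sigma_R + \epsilon)$, with $\bar{R} = \frac{1}{G}\sum_{i=1}^{G} R_i$ and $\sigma_R = \sqrt{\frac{1}{G}\sum_{i=1}^{G} (R_i - \bar{R})^2}$, where $\epsilon$ is a small stabilizer. The GRPO gradient loss is:

$$\mathcal{L}_{\mathrm{GRPO}}(\theta) = -\,\mathbb{E}_{x}\left[\frac{1}{G}\sum_{i=1}^{G} A_i \log \pi_\theta(\tau_i \mid x)\right]$$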
GRPO often suffers from advantage collapse in the sparse-reward regime (per-rollout success probability $p \ll 1$), where all rollouts within a group receive zero reward, resulting in $A_i = 0$ for every $i$ and vanishing gradients. SAGE introduces privileged hints $h$ obtained by compressing reference solutions $\tau^\star$ into compact forms via a sampled hint-strength variable $\ell \in \{0, \dots, L\}$; $\ell = 0$ corresponds to the empty hint.
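The collapse can be illustrated with a minimal sketch of group-normalized advantages. The function name `group_advantages`, the use of the population standard deviation, and the default stabilizer value are assumptions made for illustration, not details confirmed by the source.

```python
import math

def group_advantages(rewards, eps=1e-6):
    """Group-normalized advantages: (R_i - mean) / (std + eps).

    Assumption: population standard deviation and a small additive
    stabilizer `eps`; the source does not specify either choice.
    """
    g = len(rewards)
    mean = sum(rewards) / g
    std = math.sqrt(sum((r - mean) ** 2 for r in rewards) / g)
    return [(r - mean) / (std + eps) for r in rewards]

# One success in the group: the successful rollout gets a positive
# advantage and the failures get negative ones.
mixed = group_advantages([1, 0, 0, 0])

# No success in the group: every advantage is exactly zero, so the
# advantage-weighted gradient vanishes (advantage collapse).
collapsed = group_advantages([0, 0, 0, 0])

print(mixed)
print(collapsed)
```

An all-success group collapses in the same way, which is why the signal depends on groups with mixed outcomes rather than on high reward as such.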
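For each prompt $x$, SAGE’s workflow is:
- Sample a hint level $\ell$
- Sample a hint $h$ at level $\ell$ from the reference solution $\tau^\star$
- Generate group rollouts $\{\tau_i\}_{i=1}^{G}$ conditioned on $(x, h)$ and compute rewards $R_i = R(x, \tau_i)$
- Compute advantages $A_i$ from the group rewards
- Aggregate the policy gradient and an optional KL penalty against the reference policy $\pi_{\mathrm{ref}}$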
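The SAGE loss:

$$\mathcal{L}_{\mathrm{SAGE}}(\theta) = \mathbb{E}_{x,\,\ell,\,h}\left[\mathcal{L}_{\mathrm{PG}}(\theta; x, h) + \beta\,\mathcal{L}_{\mathrm{KL}}(\theta; x, h)\right]$$

where $\mathcal{L}_{\mathrm{PG}}$ is the group advantage-weighted policy gradient, $\mathcal{L}_{\mathrm{KL}}$ is the reference KL penalty with weight $\beta$, and the expectation runs over prompts $x$, hint levels $\ell$, and hints $h$.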
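A minimal sketch of how the two terms might be combined for a single hinted group is given below. It assumes sequence-level log-probabilities, no ratio clipping, and per-rollout KL estimates supplied by the caller. The function name `sage_group_loss` and all argument names are hypothetical, and the paper's exact surrogate may differ.

```python
def sage_group_loss(logps, advantages, kl_terms, beta=0.01):
    """Surrogate loss for one group of hinted rollouts (illustrative only).

    logps      -- log-probability of each rollout under the current policy,
                  conditioned on the prompt and the hint (assumed inputs)
    advantages -- group-normalized advantages, treated as constants
    kl_terms   -- per-rollout KL estimates against the reference policy
    beta       -- KL weight
    """
    g = len(logps)
    # Advantage-weighted policy-gradient term (negated so it is minimized).
    pg_loss = -sum(a * lp for a, lp in zip(advantages, logps)) / g
    # Mean KL penalty against the reference policy.
    kl_loss = sum(kl_terms) / g
    return pg_loss + beta * kl_loss

# Mixed group: both terms contribute.
loss = sage_group_loss([-2.0, -3.0], [1.0, -1.0], [0.1, 0.3], beta=0.5)

# Collapsed group: advantages are zero, so only the KL term remains.
collapsed_loss = sage_group_loss([-2.0, -3.0], [0.0, 0.0], [0.1, 0.3], beta=0.5)

print(loss, collapsed_loss)
```

In an autodiff framework the advantages would be detached from the graph so that gradients flow only through the log-probabilities and the KL term.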
2. Advantage Preservation through Hint Injection
Under Bernoulli rewards, group variance is $\sigma_R^2 = \bar{R}(1 - \bar{R})$, collapsing when all rewards are identical. For $G$ independent rollouts with per-rollout success probability $p$, the gate-opening probability that group variance is nonzero is $1 - p^G - (1-p)^G$, maximized at $p = 1/2$. SAGE raises the conditional success rate by injecting hints, avoiding degeneracy and preserving gradient signals.
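The gate-opening probability can be tabulated directly. The sketch below assumes the $G$ rollouts in a group are independent Bernoulli trials with a common success probability $p$, so the probability of a non-degenerate group is $1 - p^G - (1-p)^G$. The function name `gate_open_prob` and the group size of 8 are illustrative choices.

```python
def gate_open_prob(p, g):
    """Probability that a group of g i.i.d. Bernoulli(p) rewards is not all-equal."""
    return 1.0 - p ** g - (1.0 - p) ** g

G = 8  # illustrative group size, not taken from the source

# Sparse-reward regime: most groups are degenerate.
sparse = gate_open_prob(0.01, G)

# Balanced regime: almost every group carries a gradient signal.
balanced = gate_open_prob(0.5, G)

# Grid search confirms the maximum sits at p = 0.5.
grid = [i / 100 for i in range(101)]
best_p = max(grid, key=lambda p: gate_open_prob(p, G))

print(f"p=0.01: {sparse:.4f}  p=0.5: {balanced:.4f}  argmax: {best_p}")
```

With these settings a prompt solved 1% of the time opens the gate in under 8% of groups, while a hinted prompt solved half the time opens it in over 99% of groups.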
Because the gate-opening probability is concave in the success rate, Jensen’s inequality shows that, at fixed mean success rate $\bar{p}$, mixing across many hint distributions dilutes gate-opening probability. SAGE therefore allocates compute to a single well-calibrated hint per prompt, optimizing non-degenerate group formation and learning efficiency.
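The concavity step behind this argument can be written out explicitly. The derivation below assumes i.i.d. Bernoulli rollouts within a group and writes $u(p)$ for the gate-opening probability and $p_h$ for the success rate under hint $h$; both symbols are introduced here only for illustration.

$$
u(p) = 1 - p^{G} - (1-p)^{G}, \qquad
u''(p) = -G(G-1)\left[p^{G-2} + (1-p)^{G-2}\right] \le 0 \quad (G \ge 2)
$$

$$
\mathbb{E}_{h}\left[u(p_h)\right] \;\le\; u\left(\mathbb{E}_{h}[p_h]\right) = u(\bar{p})
$$

As a worked instance with $G = 8$: an even mixture of hints with $p_h = 0.1$ and $p_h = 0.9$ has mean $\bar{p} = 0.5$ but opens the gate with probability $1 - 0.1^8 - 0.9^8 \approx 0.570$, whereas a single hint with $p = 0.5$ gives $u(0.5) = 1 - 2 \cdot 0.5^8 \approx 0.992$.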
3. On-Policy SAGE Training Workflow
The SAGE training loop operates as follows:
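- Input: Dataset $\mathcal{D}$, policy $\pi_\theta$, reference $\pi_{\mathrm{ref}}$, group size $G$, hint level cap $L$, KL weight $\beta$, stabilizer $\epsilon$
- Hint Generation: $h \sim q(\cdot \mid x, \tau^\star, \ell)$ via prompting
- Scheduling: per-prompt hint level $\ell_x$, set either by epoch-level success (SAGE-LIGHT) or by group degeneracy (full SAGE)
- Loop:
  - For each batch $\mathcal{B} \subset \mathcal{D}$, select $\ell_x$ and sample $h$ for each prompt
  - Optionally probe with a group of size $G'$; if all $R_i = 0$, increment $\ell_x$
  - Roll out trajectories, aggregate rewards, compute advantages
  - Compute and apply policy gradient and KL penalty
  - Update parameters via gradient descent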
SAGE-LIGHT relies on epoch-wide empirical success for hint scheduling, while SAGE monitors per-prompt collapse directly.
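The per-prompt collapse rule used by full SAGE can be sketched as a small scheduling function: when a probe group returns no successes, the hint level for that prompt is raised by one, up to the cap. The function names, the toy success model, and the probe size of 4 are hypothetical and stand in for real policy rollouts; whether hint levels are ever lowered again is not stated in the source and is not modeled here.

```python
import random

def schedule_hint_level(level, probe_rewards, max_level):
    """Raise the hint level by one if the probe group is all-zero, capped at max_level."""
    if all(r == 0 for r in probe_rewards) and level < max_level:
        return level + 1
    return level

def toy_probe_rewards(level, group_size, rng):
    """Stand-in for policy rollouts (assumption): stronger hints raise the success rate."""
    p = min(1.0, 0.02 + 0.2 * level)
    return [1 if rng.random() < p else 0 for _ in range(group_size)]

rng = random.Random(0)  # seeded for reproducibility
level, max_level = 0, 3

for step in range(20):
    probe = toy_probe_rewards(level, group_size=4, rng=rng)
    level = schedule_hint_level(level, probe, max_level)

print("final hint level:", level)
```

SAGE-LIGHT would replace the per-group probe with an epoch-wide success estimate per prompt, trading some responsiveness for fewer extra rollouts.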
4. Adaptive Curriculum and Online Self-Hinting
Self-hinting, the generation of hints by a lagged copy of $\pi_\theta$, establishes an adaptive curriculum that keeps hints matched to the learner’s proficiency. Fixed hint generators (initial policy, external LLM) often produce hints that become suboptimal as $\pi_\theta$ evolves: either too easy (every rollout succeeds, so no gradient) or too hard (every rollout fails, so the group collapses). Online self-hinting keeps the hinted success rate near its optimum, maximizing update frequency and learning traction.
Empirical ablations demonstrate that online self-hinting is superior to both offline self-hinting and external-teacher hinting (e.g., GPT-5 scaffold), with a 1–2 point accuracy benefit, even when sampling multiple offline hints.
5. Deployment Protocol and Distributional Generalization
During evaluation, SAGE sets $\ell = 0$, enforcing $h = \varnothing$. The model executes its policy without any privileged information, and since the reward function never depended on hints, test-time accuracy directly reflects reasoning improvements acquired during privileged supervision.
6. Empirical Results and Learning Dynamics
SAGE’s performance has been validated across six in-distribution (AIME24, AIME25, AMC23, MATH-500, Minerva-Math, Olympiad) and two out-of-distribution benchmarks (GPQA-diamond, MMLU-Pro). Three backbone LLMs (Llama-3.2-3B-Instruct, Qwen2.5-7B-Instruct, Qwen3-4B-Instruct) are trained with SAGE and compared against SFT, vanilla GRPO, LUFFY, and Scaf-GRPO.
| Model | SAGE Gain vs GRPO | SAGE Gain vs Best Baseline | Fraction Never Correct (GRPO→SAGE) |
|---|---|---|---|
| Llama-3.2-3B-Instruct | +6.1 pts | +2.0 pts | 40.2% → 30.0% |
| Qwen2.5-7B-Instruct | +4.5 pts | n/a | n/a |
| Qwen3-4B-Instruct | +4.2 pts | n/a | n/a |
SAGE achieves the highest average accuracy in both ID and OOD tests. SAGE-LIGHT scores approximately 1 point below full SAGE but is only 1.2× slower than GRPO, making it the more compute-efficient variant.
Training reward curves show SAGE achieves continuous improvement while GRPO plateaus. SAGE sustains moderate entropy (on-policy exploration), in contrast to collapsed exploration in Scaf-GRPO and off-policy oscillation in LUFFY. SAGE also increases response length during training, correlating with accuracy enhancement.
–––––––––––––––––––––––––––––––––––––
In summary, Self-Hint Aligned GRPO with Privileged Supervision (SAGE) leverages online self-hint injection and adaptive curriculum scheduling to address advantage collapse in group-based RL. SAGE maintains on-policy update integrity, dynamically calibrates hint difficulty, and delivers empirical gains of 2–6 points in accuracy over strong baselines across multiple math and general knowledge benchmarks for diverse LLM backbones (Liao et al., 3 Feb 2026).