Beyond GRPO and On-Policy Distillation: An Empirical Sparse-to-Dense Reward Principle for Language-Model Post-Training

Published 12 May 2026 in cs.LG and cs.AI | (2605.12483v3)

Abstract: We present a four-stage post-training workflow for LLM reasoning that allocates scarce labeled training data more effectively than standard recipes. The stages are: (1) sparse-reward RL on a larger teacher; (2a) forward-KL warmup on teacher rollouts; (2b) on-policy distillation under student rollouts; (3) optional sparse-reward RL on the deployment student using any held-out labeled data. On verifiable math with a Qwen3-1.7B deployment student, the workflow reaches $79.3\%$ MATH and $25.2\%$ AIME 2024 (avg@16), versus $75.9\%$ and $19.8\%$ for direct GRPO on the same student. We justify the workflow through a reward-density principle: each gradient step of on-policy distillation is a local trust-region update under a dense teacher-induced implicit reward, informative only when the teacher is itself reward-shaped (condition C1) and lies within a small KL of the student (condition C2). Stages 1 and 2a are the explicit devices that enforce C1 and C2. A single component ablation confirms that each stage is load-bearing: replacing the RL-improved teacher with a raw teacher costs $7.8$ MATH points, removing the forward-KL warmup costs $1.7$ points, and removing the on-policy distillation stage costs $3.3$ points. The recipe replicates on Llama-3.1-8B-Instruct with a Llama-3.3-70B-Instruct teacher.


Authors (6)

Summary

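The paper proposes a teacher-first protocol in three stages (four steps when the two phases of the dense-transfer bridge are counted separately, as in the abstract): sparse-reward RL on a larger teacher, dense transfer of the reward-shaped behavior to the student, and optional student-side RL.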
It unifies sparse sequence-level and dense token-level rewards under a KL-regularized framework, demonstrating improved performance on MATH and AIME benchmarks.
Empirical results show that the teacher-first strategy improves on direct GRPO by 3.4 points on MATH and 5.4 points on AIME 2024 for a Qwen3-1.7B student, making better use of scarce labeled data.

Empirical Allocation of Sparse and Dense Reward in LLM Post-Training

Overview

The paper "Beyond GRPO and On-Policy Distillation: An Empirical Sparse-to-Dense Reward Principle for Language-Model Post-Training" (2605.12483) systematically interrogates the role of reward signal density and allocation in post-training protocols for LLMs. The authors critically analyze the widespread practice of direct sparse-reward RL (e.g., GRPO/PPO) on deployment models and instead propose an alternative, evidence-driven reward allocation strategy that leverages the interplay between sparse sequence-level and dense token-level rewards. Their central claim is that scarce, verifiable, labeled data is best allocated first to RL fine-tuning of a capable "teacher," with subsequent dense teacher-to-student transfer, only later followed by student-side RL if additional data remains.

Theoretical Foundation: Reward-Density Principle and Unified Objective

Sparse (sequence-level) RL with reward $R(x, y)$ and dense (token-level) KL-based teacher supervision with reward $r_T(s, y) = \beta \log \pi_T(y \mid s)$ sit at the two ends of a unified policy optimization framework. Both are instances of a KL-regularized RL objective, with $\lambda$ controlling the interpolation between the dense and sparse regimes. The paper formalizes this, showing that on-policy distillation (OPD) is maximum-entropy RL with a dense per-token reward derived from a teacher, whereas conventional RL is the $\lambda=1$ limit with only sparse rewards. This unification motivates the central allocation principle: sparse reward should be deployed where exploration is tractable, and dense reward should be deployed where behavioral compression is desired.
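
The OPD half of this claim rests on a standard identity: at a single decoding state $s$, $\mathbb{E}_{y \sim \pi}[\beta \log \pi_T(y \mid s)] + \beta H(\pi) = -\beta\,\mathrm{KL}(\pi \,\|\, \pi_T)$, so maximizing the dense teacher reward plus an entropy bonus is the same as minimizing the reverse KL to the teacher. The sketch below checks this numerically on a toy vocabulary. The distributions, the value $\beta = 0.5$, and the function names are illustrative assumptions, not taken from the paper.

```python
import numpy as np

def softmax(logits):
    """Numerically stable softmax over a 1-D array of logits."""
    e = np.exp(logits - logits.max())
    return e / e.sum()

def maxent_objective(student, teacher, beta):
    """E_{y~student}[beta * log teacher(y)] + beta * H(student) at one state."""
    dense_reward = beta * np.log(teacher)            # r_T(s, y) = beta * log pi_T(y | s)
    expected_reward = np.sum(student * dense_reward)
    entropy = -np.sum(student * np.log(student))
    return expected_reward + beta * entropy

def reverse_kl(student, teacher):
    """KL(student || teacher) at one state."""
    return np.sum(student * (np.log(student) - np.log(teacher)))

rng = np.random.default_rng(0)
student = softmax(rng.normal(size=8))   # toy student next-token distribution
teacher = softmax(rng.normal(size=8))   # toy teacher next-token distribution
beta = 0.5                              # illustrative coefficient (assumption)

lhs = maxent_objective(student, teacher, beta)
rhs = -beta * reverse_kl(student, teacher)
print(np.isclose(lhs, rhs))
```

The objective reaches its maximum of zero exactly when the student matches the teacher, which is why the dense reward is only as useful as the teacher it comes from.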

Empirical Protocol: From Teacher-Side RL to Dense Transfer

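The paper proposes a three-stage allocation protocol (the abstract counts four steps because it labels the two phases of the second stage separately, as 2a and 2b):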

Teacher-side RL: Apply available verifiable labeled data to RL fine-tuning of a strong teacher (large model) to maximize the extraction of reward-shaped behavior.
Dense Transfer via Two-Stage Bridge: Transfer reward-shaped behavior to the deployment student via a forward-KL (off-policy teacher-rollout) warmup, followed by on-policy distillation (student-rollout OPD). The forward-KL stage addresses support mismatch and stabilizes the subsequent on-policy phase.
Post-Bridge Student RL: Optional RL post-training on the student with any held-out portion of the labeled data.
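
The two bridge phases differ in which policy generates the rollouts and in the direction of the KL. The toy sketch below works at a single decoding state and assumes full next-token distributions are available for both models. The function names and the three-token vocabulary are hypothetical, and the paper's actual losses may use sampled-token estimators instead.

```python
import numpy as np

def softmax(z):
    """Numerically stable softmax over a 1-D array of logits."""
    e = np.exp(z - z.max())
    return e / e.sum()

def forward_kl(teacher, student):
    """KL(teacher || student): the warmup direction, evaluated on teacher rollouts."""
    return float(np.sum(teacher * (np.log(teacher) - np.log(student))))

def reverse_kl(student, teacher):
    """KL(student || teacher): the OPD direction, evaluated on student rollouts."""
    return float(np.sum(student * (np.log(student) - np.log(teacher))))

def warmup_step(student_logits, teacher, lr=1.0):
    """One gradient step on KL(teacher || softmax(logits)).

    The gradient with respect to the logits is softmax(logits) - teacher.
    """
    return student_logits - lr * (softmax(student_logits) - teacher)

teacher = np.array([0.70, 0.25, 0.05])
logits = np.log(np.array([0.05, 0.25, 0.70]))   # student starts far from the teacher

before = forward_kl(teacher, softmax(logits))
for _ in range(20):
    logits = warmup_step(logits, teacher)
after = forward_kl(teacher, softmax(logits))

print(f"forward KL before warmup: {before:.4f}, after: {after:.4f}")
print(f"reverse KL after warmup:  {reverse_kl(softmax(logits), teacher):.4f}")
```

In this toy setting the warmup pulls student mass onto the tokens the teacher prefers, so that subsequent on-policy rollouts land where the teacher's per-token signal is meaningful.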

Direct RL on a weak student is empirically shown to be suboptimal due to limited rollout diversity and severe credit assignment issues.
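
One mechanism consistent with this observation, offered as an illustration rather than as the paper's own analysis: GRPO-style methods standardize rewards within a group of rollouts for the same prompt, so a group in which every rollout receives the same binary reward produces zero advantage and therefore no gradient. The sketch assumes binary verifier rewards, and `group_advantages` is a hypothetical helper.

```python
import numpy as np

def group_advantages(rewards, eps=1e-6):
    """Group-relative advantages: standardize rewards across rollouts of one prompt."""
    rewards = np.asarray(rewards, dtype=float)
    return (rewards - rewards.mean()) / (rewards.std() + eps)

weak_group = [0, 0, 0, 0, 0, 0, 0, 0]    # every rollout fails the verifier
mixed_group = [1, 0, 0, 1, 0, 1, 0, 0]   # some rollouts succeed

print(group_advantages(weak_group))      # all zeros: no learning signal
print(group_advantages(mixed_group))     # positive for successes, negative for failures
```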

Experimental Results

Large-scale experiments are presented on the MATH dataset, AIME-2024/2025, using Qwen3 and Llama model families. Key results include:

At fixed student model size (Qwen3-1.7B), teacher-RL + two-stage dense transfer outperforms direct GRPO by notable margins (e.g., $79.3\%$ vs. $75.9\%$ on MATH).
Scale alone is insufficient: Transferring from raw or only SFT-trained teachers yields weaker students than direct RL; only RL-improved (reward-shaped) teachers provide optimal downstream supervision.
The two-stage bridge (FKL warmup + OPD) consistently surpasses OPD-only or teacher-sample SFT variants by $1.5$–$3$ points on core metrics, validating the necessity of addressing occupancy and support mismatch.
Student-side RL lifts post-bridge student performance by an additional $2$–$3$ points, but only after the student has absorbed the reward-shaped policy through dense transfer. Replay experiments confirm that these gains are attributable to new labeled data, not mere additional updates.
Ordering is robust across families: Raw-teacher transfer $<$ direct RL $<$ RL-teacher transfer is replicated on Llama students with large Llama-3.3-70B teachers.
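
The accuracies quoted for these comparisons are avg@16. Under the common reading of that metric (an assumption, since the evaluation harness is not described in this summary), each problem is sampled 16 times, per-problem accuracy is the fraction of correct samples, and the benchmark score is the mean over problems. The outcomes below are synthetic and serve only to exercise the hypothetical `avg_at_k` helper.

```python
import numpy as np

def avg_at_k(correct):
    """Mean per-problem accuracy in percent.

    correct: array of shape (num_problems, k) holding 0/1 verifier outcomes.
    """
    correct = np.asarray(correct, dtype=float)
    return 100.0 * correct.mean(axis=1).mean()

rng = np.random.default_rng(0)
synthetic_outcomes = rng.random((30, 16)) < 0.25   # 30 problems, 16 samples each
print(f"avg@16 = {avg_at_k(synthetic_outcomes):.1f}%")
```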

Below is a compact comparison of critical configurations and representative numerical accuracies (avg@16):

Configuration	MATH (%)	AIME 2024 (%)
Direct GRPO (Qwen3-1.7B Student)	75.9	19.8
Raw Qwen3-8B Teacher (Bridge)	71.5	15.0
RL'd Qwen3-8B Teacher (Bridge)	79.3	25.2
Bridge + Student-side RL (Half Split)	78.5	23.7
Teacher-sample SFT (RL'd 8B)	76.0	22.4
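
The margins relative to direct GRPO follow by subtraction. A short script makes the arithmetic explicit, with the accuracies copied from the rows above; the variable and function names are illustrative.

```python
# Accuracies (avg@16, percent) as (MATH, AIME 2024), copied from the table.
results = {
    "Direct GRPO (Qwen3-1.7B Student)": (75.9, 19.8),
    "Raw Qwen3-8B Teacher (Bridge)": (71.5, 15.0),
    "RL'd Qwen3-8B Teacher (Bridge)": (79.3, 25.2),
    "Bridge + Student-side RL (Half Split)": (78.5, 23.7),
    "Teacher-sample SFT (RL'd 8B)": (76.0, 22.4),
}

baseline = results["Direct GRPO (Qwen3-1.7B Student)"]

def delta_vs_baseline(name):
    """Point difference from direct GRPO, rounded to one decimal place."""
    math_acc, aime_acc = results[name]
    return round(math_acc - baseline[0], 1), round(aime_acc - baseline[1], 1)

for name in results:
    d_math, d_aime = delta_vs_baseline(name)
    print(f"{name:40s} {d_math:+.1f} MATH  {d_aime:+.1f} AIME 2024")
```

The RL'd-teacher bridge comes out 3.4 MATH points and 5.4 AIME 2024 points above direct GRPO, while the raw-teacher bridge falls 4.4 and 4.8 points below it. The 7.8-point MATH gap between those two rows matches the ablation figure quoted in the abstract.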

Implications and Limitations

The empirical evidence supports a paradigm shift: reward-density allocation should prioritize teacher-side exploration with sparse reward followed by dense supervised transfer to the deployment model. This not only yields higher endpoint accuracy on mathematical reasoning benchmarks but also rationalizes data use when labeled examples are limited.

Practically, this protocol is directly actionable in model-family training regimes, especially for organizations maintaining multiple parameter scales with shared tokenization. Practitioners should note, however, that the protocol depends on the teacher and student having compatible tokenizers. Further scaling studies (e.g., 70B students, 400B teachers) are needed to confirm that the observed teacher-first allocation benefit persists.

The bridge design (two-stage FKL + OPD) is shown to be critical: off-policy-only and OPD-only strategies are systematically worse. Moreover, while post-bridge student RL provides incremental improvement, it is not a substitute for upstream teacher shaping and dense transfer.

Limitations include:

All tasks focus on verifiable math with precise evaluators; extension to code, instruction following, or open-ended reasoning with weaker verifiers remains open.
All strong conclusions are within the scale of up to 14B (Qwen) and 70B (Llama) teachers and up to 8B students. Allocation tradeoffs at substantially larger scales are speculated upon but not determined here.

Theoretical and Practical Impact

Theoretically, this work unifies RL and on-policy distillation as ends of reward density within a shared policy optimization formalism. The empirical findings challenge default pipelines (direct cold RL on deployment models) and elevate the data allocation axis to a first-class design consideration. Offering a clear operational recipe based on verifiable outcomes, the results also inform new investigations into dense/sparse reward scheduling, bridge variants, and multi-teacher or multi-task settings.

Looking forward, advances could include:

Systematic studies in ultra-large model regimes,
Adapting the allocation rule for tasks with more ambiguous or costly verification,
Automated allocation schemes exploiting reward density and verifier noise.

Conclusion

This paper delivers a compelling case, both formal and empirical, for allocating scarce labeled reward data to RL shaping of strong teachers, with downstream dense transfer—rather than direct cold RL—producing superior deployment models for mathematical reasoning. The two-stage dense bridge is validated as an essential component, and incremental student-side RL is shown to augment, not replace, this allocation strategy. This teacher-first, reward-density-informed allocation principle will likely shape future post-training/finetuning best practices for LLM development (2605.12483).
