Bi-Level GRPO: Two-Level Policy Optimization
- Bi-Level GRPO is a structural generalization of group-relative policy optimization that decomposes the process into two coupled layers to resolve credit-assignment issues.
- It employs mechanisms such as learnable token preferences, co-distillation between models, and phase-specific decompositions to improve signal propagation across tokens and phases.
- Empirical results across various benchmarks demonstrate enhanced training stability, sample efficiency, and performance gains compared to standard GRPO, despite increased computational complexity.
Bi-Level GRPO denotes a class of Group Relative Policy Optimization formulations in which optimization is explicitly decomposed across two coupled levels rather than treated as a single uniform GRPO update. In the materials associated with recent GRPO variants, this bi-level structure appears in several distinct forms: a learnable token-preference outer variable in λ-GRPO (Wang et al., 8 Oct 2025), joint teacher–student co-distillation in CoDistill-GRPO (Kwon et al., 9 May 2026), a group-to-token construction in TEPO (Lin et al., 10 Oct 2025), explorer–learner separation in S2L-PO (Ren et al., 29 May 2026), phase-specific generation–ranking decomposition in F-GRPO (Surana et al., 13 May 2026), and sequential reasoning–self-correction layers in Multi-Layer GRPO (Ding et al., 5 Jun 2025). Across these formulations, the common motivation is the same: standard GRPO applies a single group-derived training signal too uniformly, which creates credit-assignment pathologies such as length bias, sparse-reward inefficiency, cross-phase interference, or weak exploration.
1. Conceptual definition and scope
In standard GRPO-style training, a prompt is paired with a group of sampled responses, each response receives a group-normalized advantage, and a PPO-style clipped surrogate is optimized over the tokens of those responses. Several recent works reinterpret this basic mechanism as insufficiently structured for problems in which the source of error or success is heterogeneous across tokens, policies, or phases of generation (Wang et al., 8 Oct 2025).
Within that perspective, “bi-level GRPO” is not a single algorithm but a family of designs that introduce an additional optimization layer, auxiliary policy, or second decision phase. The lower level typically performs a standard or near-standard GRPO-style policy update under fixed auxiliary structure, while the upper level either learns that structure, supplies exploration data, or optimizes a coupled phase-specific objective. This suggests that bi-level GRPO is best understood as a structural generalization of GRPO rather than a narrowly defined method.
The main variants documented in the supplied literature differ in what is elevated to the second level. In λ-GRPO, the second level is a learnable token-preference parameter that controls length-adaptive weighting (Wang et al., 8 Oct 2025). In CoDistill-GRPO, it is the joint interaction between a large model and a small model under distinct GRPO objectives (Kwon et al., 9 May 2026). In TEPO, the two levels are the group reward and a token-level aggregation mediated by Markov Likelihood (Lin et al., 10 Oct 2025). In S2L-PO, the lower level is a frozen small explorer policy and the upper level is a trainable large learner optimized on a mixture rollout distribution (Ren et al., 29 May 2026). In F-GRPO, the two levels correspond to generation and ranking phases with separate group-relative advantages (Surana et al., 13 May 2026). In MGRPO, the two layers are sequential GRPO passes for initial reasoning and self-correction (Ding et al., 5 Jun 2025).
2. Standard GRPO as the baseline that bi-level methods modify
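A recurring starting point is the contemporary GRPO setup in which every sampled response $o_i$ ($i = 1, \dots, G$) for a prompt $q$ is assigned a group-normalized advantage
$$\hat{A}_i = \frac{r_i - \operatorname{mean}(\{r_j\}_{j=1}^{G})}{\operatorname{std}(\{r_j\}_{j=1}^{G})}.$$
Each token then receives the same scalar advantage, $\hat{A}_{i,t} = \hat{A}_i$, and vanilla GRPO aggregates by
$$f(o_i) = \frac{1}{|o_i|},$$
leading to the clipped-surrogate loss
$$\mathcal{J}_{\text{GRPO}}(\theta) = \mathbb{E}\left[\frac{1}{G}\sum_{i=1}^{G} f(o_i)\sum_{t=1}^{|o_i|}\min\big(\rho_{i,t}(\theta)\,\hat{A}_i,\ \operatorname{clip}(\rho_{i,t}(\theta),\,1-\epsilon,\,1+\epsilon)\,\hat{A}_i\big)\right],$$
where
$$\rho_{i,t}(\theta) = \frac{\pi_\theta(o_{i,t}\mid q,\,o_{i,<t})}{\pi_{\theta_{\text{old}}}(o_{i,t}\mid q,\,o_{i,<t})}.$$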
This formulation is the baseline against which several bi-level variants are defined (Wang et al., 8 Oct 2025).
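As a rough illustration of this baseline, the following sketch computes group-normalized advantages and a length-normalized clipped surrogate from per-token log-probabilities. The function names, the `eps` stabilizer, and the default clip range of 0.2 are assumptions chosen for illustration and are not taken from the cited papers.

```python
import math


def group_advantages(rewards, eps=1e-8):
    """Group-normalized advantages: (r_i - mean) / (std + eps)."""
    mean = sum(rewards) / len(rewards)
    var = sum((r - mean) ** 2 for r in rewards) / len(rewards)
    return [(r - mean) / (math.sqrt(var) + eps) for r in rewards]


def clipped_term(ratio, adv, clip_eps=0.2):
    """PPO-style clipped surrogate term for a single token."""
    clipped = min(max(ratio, 1.0 - clip_eps), 1.0 + clip_eps)
    return min(ratio * adv, clipped * adv)


def grpo_objective(new_logps, old_logps, rewards, clip_eps=0.2):
    """Vanilla aggregation: mean over tokens per response, then mean over the group.

    new_logps / old_logps: one list of per-token log-probabilities per response.
    """
    advs = group_advantages(rewards)
    total = 0.0
    for lp_new, lp_old, adv in zip(new_logps, old_logps, advs):
        terms = [clipped_term(math.exp(n - o), adv, clip_eps)
                 for n, o in zip(lp_new, lp_old)]
        total += sum(terms) / len(terms)  # the 1/|o_i| weighting
    return total / len(rewards)
```

Every token of a response shares the same advantage here, which is exactly the uniform broadcast that the bi-level variants discussed below set out to restructure.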
The central criticism is that the same advantage is broadcast uniformly over all tokens of a response, while aggregation heuristics are fixed in advance. In the length-bias analysis of λ-GRPO, vanilla GRPO’s fixed per-response length normalization $f(o_i) = 1/|o_i|$ is said to “implicitly treat each response equally at the sequence level but uniformly dilutes advantage over its tokens,” creating “a tug-of-war in which very long sequences dilute signal and very short ones overconcentrate it” (Wang et al., 8 Oct 2025). TEPO makes a related but distinct criticism: sparse group rewards are linked to token optimization only through undifferentiated token-level handling, and entropy-based fixes can induce entropy collapse or model collapse (Lin et al., 10 Oct 2025). F-GRPO identifies an analogous issue at the phase level: applying a single scalar end-of-sequence reward to a rollout that both generates a candidate set and ranks it conflates generation and ranking errors (Surana et al., 13 May 2026).
These diagnoses explain why bi-level formulations recur in GRPO research. They do not abandon group-relative normalization; rather, they preserve it while inserting an intermediate structure that changes where and how the group signal is redistributed.
3. Learnable token preference and explicit bi-level optimization in λ-GRPO
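Among the cited works, λ-GRPO gives the most explicit optimization-theoretic definition of bi-level GRPO. It unifies GRPO-style frameworks through a general “unified token-preference” objective
$$\mathcal{J}(\theta) = \mathbb{E}\left[\frac{1}{G}\sum_{i=1}^{G} f(o_i)\sum_{t=1}^{|o_i|}\min\big(\rho_{i,t}(\theta)\,\hat{A}_i,\ \operatorname{clip}(\rho_{i,t}(\theta),\,1-\epsilon,\,1+\epsilon)\,\hat{A}_i\big)\right],$$
where $\rho_{i,t}$ is the token-level probability ratio, $\hat{A}_i$ is the group-normalized advantage, and the choice of the weighting function $f(o_i)$ distinguishes the variants, so that DAPO and Dr. GRPO are treated as special cases (Wang et al., 8 Oct 2025).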
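The key modification is to replace a fixed weighting function $f(o_i)$ with a learnable, length-adaptive weighting $f_\lambda(o_i)$. For a group of $G$ responses with lengths $|o_i|$, the method computes the within-group length statistics
$$\mu_\ell = \frac{1}{G}\sum_{i=1}^{G}|o_i|, \qquad \sigma_\ell = \sqrt{\frac{1}{G}\sum_{i=1}^{G}\big(|o_i| - \mu_\ell\big)^2},$$
standardizes the lengths and shifts them around a fixed offset, then applies an exponent $\lambda$ and a softmax normalization across the group to obtain the weights $f_\lambda(o_i)$.
Under this parameterization, $\lambda < 0$ biases training toward shorter replies, $\lambda > 0$ toward longer ones, and $\lambda = 0$ recovers length-neutral token weighting (Wang et al., 8 Oct 2025).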
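To make the length-adaptive weighting more concrete, here is a minimal sketch of one way such weights could be computed. The offset of 1.0, the `scale` factor applied to the standardized lengths, the positivity floor, and the name `length_weights` are all assumptions made for this illustration; the exact constants used in the paper are not reproduced here.

```python
import math


def length_weights(lengths, lam, offset=1.0, scale=0.25, eps=1e-8):
    """Softmax weights over a group, driven by standardized response length.

    lam < 0 favors shorter responses, lam > 0 favors longer ones,
    and lam == 0 yields uniform weights.
    """
    n = len(lengths)
    mean = sum(lengths) / n
    std = math.sqrt(sum((x - mean) ** 2 for x in lengths) / n)
    # Standardize, rescale, and shift around the (assumed) offset.
    # The floor keeps the base positive so that h ** lam stays real-valued.
    h = [max(offset + scale * (x - mean) / (std + eps), 0.05) for x in lengths]
    scores = [hi ** lam for hi in h]
    top = max(scores)  # subtract the max for a numerically stable softmax
    exps = [math.exp(s - top) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]
```

For example, `length_weights([10, 20, 30], lam=0.0)` gives equal weights, while a positive `lam` shifts weight toward the 30-token response and a negative `lam` toward the 10-token one.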
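The bi-level structure is then written directly as an inner and outer problem. The inner problem fixes $\lambda$ and updates the policy:
$$\theta^{*}(\lambda) = \arg\max_{\theta}\ \mathcal{J}(\theta;\lambda),$$
where
$$\mathcal{J}(\theta;\lambda) = \mathbb{E}\left[\frac{1}{G}\sum_{i=1}^{G} f_\lambda(o_i)\sum_{t=1}^{|o_i|}\min\big(\rho_{i,t}(\theta)\,\hat{A}_i,\ \operatorname{clip}(\rho_{i,t}(\theta),\,1-\epsilon,\,1+\epsilon)\,\hat{A}_i\big)\right],$$
with $\rho_{i,t}$ the token-level probability ratio and $\hat{A}_i$ the group-normalized advantage.
The outer problem updates the preference parameter:
$$\lambda^{*} = \arg\max_{\lambda}\ \mathcal{J}\big(\theta^{*}(\lambda);\lambda\big).$$
The paper also states that one may equivalently take stochastic-gradient steps on $\lambda$ simultaneously with $\theta$, using
$$\lambda \leftarrow \lambda + \eta_\lambda\,\nabla_\lambda \mathcal{J}(\theta;\lambda),$$
with
$$\nabla_\lambda \mathcal{J}(\theta;\lambda) = \mathbb{E}\left[\frac{1}{G}\sum_{i=1}^{G} \nabla_\lambda f_\lambda(o_i)\sum_{t=1}^{|o_i|}\min\big(\rho_{i,t}(\theta)\,\hat{A}_i,\ \operatorname{clip}(\rho_{i,t}(\theta),\,1-\epsilon,\,1+\epsilon)\,\hat{A}_i\big)\right].$$
Because the gradient $\nabla_\lambda \mathcal{J}$ flows through the softmax weights $f_\lambda(o_i)$, the method “automatically adapts $\lambda$ to the observed correlation between response length and reward advantages” (Wang et al., 8 Oct 2025).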
The reported empirical result is that on eight mathematical reasoning benchmarks (GSM8K, MATH500, Minerva, Gaokao, OlympiadBench, College Math, AIME@32, and AMC@32), Qwen2.5 models fine-tuned with λ-GRPO achieve consistent gains over GRPO and DAPO. The averages reported are 37.8% for Qwen2.5-1.5B, 43.8% for Qwen2.5-3B, and 53.5% for Qwen2.5-7B, each an improvement over the GRPO baseline at the same scale, with no modifications to the training data and “no extra compute” in the sense of the same batch size, rollouts, and GPU time (Wang et al., 8 Oct 2025). The same source states that the method yields more stable training curves, higher token-entropy, and controlled response length.
4. Coupled policies as bi-level GRPO: co-distillation and small-to-large exploration
A second major interpretation of bi-level GRPO treats the two levels as interacting policies rather than inner and outer variables. CoDistill-GRPO is formulated around a large model and a small model, written here as $\pi_L$ and $\pi_S$, each optimized with its own GRPO objective in a single joint loop (Kwon et al., 9 May 2026).
For the small model, the task reward is augmented by an on-policy knowledge-distillation (KD) term computed from the large model on the small model's own rollouts.
Advantages are centered across the chosen rollouts, and the small-model GRPO loss uses the PPO-style ratio
$$\rho_{i,t} = \frac{\pi_S(o_{i,t}\mid q,\,o_{i,<t})}{\pi_{S,\text{old}}(o_{i,t}\mid q,\,o_{i,<t})}.$$
The paper states that because the KD term itself depends on the parameters being optimized, differentiation requires care, and Lemma 4.1 shows that the resulting gradient is unbiased once the learning rate is appropriately rescaled. Theorem 4.2 is presented as establishing that the distillation term pushes the small policy toward the large one (Kwon et al., 9 May 2026).
The large model is updated on off-policy rollouts generated by the small model, corrected using per-token importance weights
$$w_{i,t} = \frac{\pi_L(o_{i,t}\mid q,\,o_{i,<t})}{\pi_S(o_{i,t}\mid q,\,o_{i,<t})}.$$
Conceptually, the bi-level structure is the co-training loop itself: the small model learns from the large model’s distribution via KD reward, while the large model learns from small-model rollouts with importance reweighting (Kwon et al., 9 May 2026).
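The off-policy correction on the large-model side can be illustrated with a small sketch that turns per-token log-probabilities from both models into importance weights. The helper name `importance_weights` and the truncation cap `max_weight` are assumptions added for numerical safety in this example; the source does not say whether the weights are truncated.

```python
import math


def importance_weights(large_logps, small_logps, max_weight=5.0):
    """Per-token ratio pi_large / pi_small on tokens sampled from the small model.

    large_logps / small_logps: per-token log-probabilities of the same rollout
    under the large (learner) and small (behavior) model respectively.
    """
    return [min(math.exp(lp_large - lp_small), max_weight)
            for lp_large, lp_small in zip(large_logps, small_logps)]
```

A weight above 1 means the large model finds the sampled token more likely than the small model did, so that token's contribution to the large model's update is scaled up; the cap limits the variance introduced by rare, very large ratios.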
This joint setup is motivated by sparse rewards on difficult tasks, which the paper argues often prevent GRPO from improving small models. On Qwen2.5-Math-1.5B, average accuracy across Minerva, MATH500, AMC, and OlympiadBench is reported as 32.6% for the base SFT model, 47.1% for standard GRPO, and up to 49.9% for the best-performing CoDistill-GRPO setting. On Minerva alone, the best CoDistill-GRPO run reaches 32.0% versus 25.9% for vanilla GRPO, a 6.1 pp jump. For the larger Qwen2.5-Math-7B, pure CoDistill-GRPO reaches 53.7% average versus 57.9% for standard GRPO, but CoDistill-GRPO with continued training reaches 57.3%, “nearly” matching standard GRPO while reducing rollout generation time per iteration by approximately 18% (Kwon et al., 9 May 2026).
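S2L-PO presents a different two-policy construction. Here the lower-level policy is a frozen small explorer, written here as $\pi_{\text{exp}}$, selected for high policy-level diversity under a cost constraint, and the upper-level policy $\pi_\theta$ is a trainable large learner optimized via a GRPO surrogate on a mixed rollout distribution (Ren et al., 29 May 2026). The mixture is governed by an annealing schedule that sets the explorer's share of each rollout group and adjusts it progressively over training. That share determines the rollout counts: of the rollouts sampled per prompt, one part is drawn from $\pi_{\text{exp}}$ and the remainder from the current learner $\pi_\theta$. The learner's objective is the GRPO clipped surrogate evaluated over this mixed group, with group-relative advantages computed across trajectories from both sources (Ren et al., 29 May 2026).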
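The rollout-mixing step can be sketched as a schedule that maps the training step to an explorer share, followed by a split of the rollout group. The linear decay, the starting share of 0.5, and both function names are assumptions for illustration only; the source specifies an annealing schedule but its exact form is not reproduced here.

```python
def explorer_fraction(step, total_steps, start=0.5, end=0.0):
    """Assumed linear annealing of the explorer's share of each rollout group."""
    progress = min(max(step / total_steps, 0.0), 1.0)
    return start + (end - start) * progress


def rollout_counts(group_size, fraction):
    """Split a rollout group into (explorer, learner) counts."""
    n_explorer = round(group_size * fraction)
    return n_explorer, group_size - n_explorer
```

With a group of 16 rollouts, this assumed schedule starts at 8 explorer and 8 learner rollouts and ends with all 16 drawn from the learner, which mirrors the idea of relying on the small explorer early and phasing it out later.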
The stated theoretical distinction is between token-level randomness and policy-level perturbation. Token-level randomness is said to inject i.i.d. noise that causes prefix-match probability to decay and yields only slow variance growth, whereas policy-level perturbation via a fixed explorer policy produces coherent additive terms whose contributions reinforce one another and yield faster signal growth (Ren et al., 29 May 2026). Empirically, the reported Pass@1 gains on AIME24 under 16-rollout evaluation are from 15.0 to 23.8 for Qwen3-8B using a 1.7B explorer, from 18.0 to 24.4 for Qwen3-14B using a 4B explorer, and from 0.1 to 4.6 for InternLM2.5-7B using a 1.8B explorer; the paper also reports comparable gains on AIME25, MATH-500, and OlympiadBench, faster peak convergence in effective steps, and a reduction in total rollout FLOPs from reusing off-policy rollouts (Ren et al., 29 May 2026).
5. Group-to-token and phase-specific decompositions
Another bi-level interpretation restructures credit assignment within a single model rather than across two interacting policies. TEPO formulates one level at the group reward and another at token-level aggregation through Markov Likelihood (Lin et al., 10 Oct 2025).
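At the group level, for a prompt $q$ and responses $\{o_i\}_{i=1}^{G}$, a normalized advantage is computed as
$$\hat{A}_i = \frac{r_i - \operatorname{mean}(\{r_j\}_{j=1}^{G})}{\operatorname{std}(\{r_j\}_{j=1}^{G})}.$$
At the token-aggregation level, TEPO exploits the first-order Markov factorization
$$\pi_\theta(o_i \mid q) = \prod_{t=1}^{|o_i|}\pi_\theta(o_{i,t}\mid q,\,o_{i,<t})$$
to define a sequence-level importance ratio through a geometric mean:
$$s_i(\theta) = \left(\prod_{t=1}^{|o_i|}\frac{\pi_\theta(o_{i,t}\mid q,\,o_{i,<t})}{\pi_{\theta_{\text{old}}}(o_{i,t}\mid q,\,o_{i,<t})}\right)^{1/|o_i|}.$$
The resulting loss is
$$\mathcal{J}_{\text{TEPO}}(\theta) = \mathbb{E}\left[\frac{1}{G}\sum_{i=1}^{G}\min\big(s_i(\theta)\,\hat{A}_i,\ \operatorname{clip}(s_i(\theta),\,1-\epsilon,\,1+\epsilon)\,\hat{A}_i\big)\right],$$
with gradient, in the unclipped region,
$$\nabla_\theta \mathcal{J}_{\text{TEPO}}(\theta) = \mathbb{E}\left[\frac{1}{G}\sum_{i=1}^{G}\hat{A}_i\,s_i(\theta)\,\frac{1}{|o_i|}\sum_{t=1}^{|o_i|}\nabla_\theta\log\pi_\theta(o_{i,t}\mid q,\,o_{i,<t})\right].$$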
This is described as a bi-level link in which the group-level reward modulates a sequence-level reweighting, which then back-propagates uniformly to token-level policy gradients (Lin et al., 10 Oct 2025).
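The geometric-mean ratio is easiest to compute in log space, as the following sketch shows. The function names and the default clip range are illustrative assumptions; only the geometric-mean construction and the sequence-level clipping come from the description above.

```python
import math


def sequence_ratio(new_logps, old_logps):
    """Geometric mean of per-token probability ratios, computed in log space."""
    diffs = [n - o for n, o in zip(new_logps, old_logps)]
    return math.exp(sum(diffs) / len(diffs))


def sequence_clipped_term(ratio, adv, clip_eps=0.2):
    """Clipped surrogate applied once per sequence instead of once per token."""
    clipped = min(max(ratio, 1.0 - clip_eps), 1.0 + clip_eps)
    return min(ratio * adv, clipped * adv)
```

Because the log-ratios are averaged rather than summed, a long response with many slightly shifted tokens does not produce an exploding ratio, which is the length-normalizing effect attributed to the geometric mean.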
The paper emphasizes that no explicit entropy bonus is added; instead, the geometric mean in the Markov Likelihood is said to normalize by sequence length, tame variance spikes, and prevent entropy collapse, while clipping the sequence-level ratio enforces a trust region (Lin et al., 10 Oct 2025). On seven mathematical reasoning benchmarks using Qwen2.5-7B, the reported average accuracy improves from 30.30% for a GRPO baseline to 32.04% for TEPO, with MATH-500 improving from 72.40% to 77.20%, AIME24 from 11.56% to 12.18%, AIME25 from 6.25% to 7.60%, and AMC from 40.47% to 43.56% (Lin et al., 10 Oct 2025).
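F-GRPO applies a different decomposition in tasks where a single autoregressive rollout must both generate a slate and rank it. Writing $S$ for the generated slate and $\tau$ for its ranking given an input $x$, the joint policy
$$\pi_\theta(S, \tau \mid x) = \pi_\theta(S \mid x)\,\pi_\theta(\tau \mid x, S)$$
is paired with two rewards: an order-invariant coverage reward $R^{\text{cov}}(S)$ and a position-aware utility reward $R^{\text{util}}(\tau)$.
Group baselines are formed separately over the $G$ rollouts sampled for the same input,
$$b^{\text{cov}} = \frac{1}{G}\sum_{i=1}^{G} R^{\text{cov}}(S_i), \qquad b^{\text{util}} = \frac{1}{G}\sum_{i=1}^{G} R^{\text{util}}(\tau_i),$$
yielding phase-specific advantages
$$A_i^{\text{gen}} = R^{\text{cov}}(S_i) - b^{\text{cov}}, \qquad A_i^{\text{rank}} = R^{\text{util}}(\tau_i) - b^{\text{util}}.$$
The optimization target applies $A_i^{\text{gen}}$ to the slate tokens and $A_i^{\text{rank}}$ to the ranking tokens of each rollout, plus optional KL-regularization (Surana et al., 13 May 2026).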
This factorization is explicitly motivated as a remedy to “cross-phase reward contamination.” The gradients on slate tokens are weighted only by the generation-phase advantage and those on ranking tokens only by the ranking-phase advantage, which the paper presents as eliminating the ambiguity introduced by a single scalar end-of-sequence reward (Surana et al., 13 May 2026). Across LastFM, MovieLens, HotpotQA, and MuSiQue, F-GRPO is reported to yield relative improvements in Recall@5 and NDCG@5 over vanilla GRPO, along with gains over decoupled SFT pipelines and supervised fine-tuning, while remaining competitive with specialized rerankers such as MonoT5, DuoT5, RankZephyr, and LiT5 (Surana et al., 13 May 2026).
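A small sketch can show how phase isolation might look at the token level: each reward is centered within its own group, and each token picks up the advantage of the phase it belongs to. The mean-only centering, the `"gen"`/`"rank"` phase labels, and the function names are assumptions for this example rather than details confirmed by the source.

```python
def group_centered(rewards):
    """Subtract the group-mean baseline from each reward."""
    mean = sum(rewards) / len(rewards)
    return [r - mean for r in rewards]


def token_advantages(phase_labels, gen_adv, rank_adv):
    """Assign each token the advantage of its own phase.

    phase_labels: one label per token, either "gen" (slate) or "rank" (ranking).
    """
    return [gen_adv if label == "gen" else rank_adv for label in phase_labels]
```

For a rollout whose slate had good coverage but whose ranking was poor, the slate tokens receive a positive advantage while the ranking tokens receive a negative one, instead of both sharing a single blended scalar.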
6. Sequential layers and self-correction in Multi-Layer GRPO
Multi-Layer GRPO (MGRPO) instantiates bi-level GRPO as two sequential GRPO passes over the same problem instance. The first layer performs standard GRPO on initial reasoning traces; the second layer performs GRPO again on self-corrections conditioned on those initial responses (Ding et al., 5 Jun 2025).
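In Layer 1, the policy $\pi_\theta$ is prompted with the original query $q$ and produces $G$ candidate reasoning traces $\{o_i\}_{i=1}^{G}$. A rule-based verifier assigns each $o_i$ a binary reward $r_i \in \{0, 1\}$, and the standard GRPO objective is applied:
$$\mathcal{J}_{\text{L1}}(\theta) = \mathbb{E}\left[\frac{1}{G}\sum_{i=1}^{G}\frac{1}{|o_i|}\sum_{t=1}^{|o_i|}\min\big(\rho_{i,t}(\theta)\,\hat{A}_i,\ \operatorname{clip}(\rho_{i,t}(\theta),\,1-\epsilon,\,1+\epsilon)\,\hat{A}_i\big)\right].$$
Here,
$$\rho_{i,t}(\theta) = \frac{\pi_\theta(o_{i,t}\mid q,\,o_{i,<t})}{\pi_{\theta_{\text{old}}}(o_{i,t}\mid q,\,o_{i,<t})}, \qquad \hat{A}_i = \frac{r_i - \operatorname{mean}(\{r_j\}_{j=1}^{G})}{\operatorname{std}(\{r_j\}_{j=1}^{G})}.$$
Each first-layer response $o_i$ is then concatenated with the original query to form a new prompt
$$q_i' = q \oplus o_i \oplus g,$$
where $g$ is a short, randomly sampled guiding phrase such as “Wait, let me double-check that,” intended to encourage self-reflection (Ding et al., 5 Jun 2025). In Layer 2, the same policy samples $H$ corrected or refined trajectories $\{o'_{i,h}\}_{h=1}^{H}$ for each $q_i'$, verifies them, and retains only successful corrections or confirmations. A second GRPO update is then computed over the retained refinements:
$$\mathcal{J}_{\text{L2}}(\theta) = \mathbb{E}\left[\frac{1}{H}\sum_{h=1}^{H}\frac{1}{|o'_{i,h}|}\sum_{t=1}^{|o'_{i,h}|}\min\big(\rho'_{i,h,t}(\theta)\,\hat{A}'_{i,h},\ \operatorname{clip}(\rho'_{i,h,t}(\theta),\,1-\epsilon,\,1+\epsilon)\,\hat{A}'_{i,h}\big)\right],$$
where $\rho'_{i,h,t}$ and $\hat{A}'_{i,h}$ are the probability ratio and group-normalized advantage defined as in Layer 1 but conditioned on $q_i'$.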
The motivation is the reward-collapse problem in one-shot GRPO: a single mistake in a long chain of reasoning produces zero final reward, erasing credit for earlier correct steps. MGRPO is described as recycling “failed” attempts into learning signals by rewarding successful repair, thereby approximating process-level supervision without requiring explicit dense annotations (Ding et al., 5 Jun 2025).
The reported evaluation uses Qwen2.5-Math-7B-base on GSM8K, MATH500, Minerva Math, and OlympiadBench. MGRPO is said to match standard one-round GRPO at Layer 1 and then improve after Layer 2 to 95.6% versus 83.4% on GSM8K, 90.4% versus 80.9% on MATH500, 39.3% versus 35.1% on Minerva Math, and 50.4% versus 39.9% on OlympiadBench. The reported positive incorrect-to-correct rate of 2–5% and negligible correct-to-incorrect rate of approximately 0.3% are presented as evidence that the second layer predominantly converts failures into successes without substantially harming initially correct answers (Ding et al., 5 Jun 2025). The same source notes that the extra cost of sampling and verifying the second-layer refinements for each initial answer “roughly doubles compute.”
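The data flow between the two layers can be sketched as follows: build a correction prompt from the query, a first-layer response, and a guiding phrase, then keep only the refinements that a verifier accepts. The newline-joined prompt layout, the second guiding phrase, and both function names are hypothetical choices for illustration; the verifier is passed in as a plain callable rather than modeled on any particular implementation.

```python
import random

# The first phrase is quoted in the text; the second is a hypothetical extra.
GUIDING_PHRASES = [
    "Wait, let me double-check that.",
    "Let me verify the previous steps.",
]


def build_correction_prompt(query, first_response, rng):
    """Concatenate query, first-layer response, and a sampled guiding phrase."""
    phrase = rng.choice(GUIDING_PHRASES)
    return f"{query}\n{first_response}\n{phrase}"


def retain_successful(refinements, verifier):
    """Keep only the refinements that the verifier marks as correct."""
    return [r for r in refinements if verifier(r)]
```

A usage pattern would be to call `build_correction_prompt` once per first-layer response with a seeded `random.Random`, sample refinements from the policy for each resulting prompt, and feed only the output of `retain_successful` into the second GRPO update.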
7. Common motivations, advantages, and limitations
Across these formulations, the principal motivation for bi-level GRPO is improved credit assignment under sparse or structurally ambiguous rewards. In λ-GRPO, the problem is length bias induced by fixed token-level aggregation (Wang et al., 8 Oct 2025). In CoDistill-GRPO, it is sparse reward that prevents small models from improving on difficult tasks (Kwon et al., 9 May 2026). In TEPO, it is the instability of entropy-based adjustments under sparse token rewards in chain-of-thought optimization (Lin et al., 10 Oct 2025). In S2L-PO, it is insufficient rollout diversity and the incoherence introduced by token-level randomness (Ren et al., 29 May 2026). In F-GRPO, it is the inability of a single terminal reward to distinguish poor generation from poor ranking (Surana et al., 13 May 2026). In MGRPO, it is reward vanishing caused by one error invalidating an entire reasoning chain (Ding et al., 5 Jun 2025).
The practical benefits reported in these works are correspondingly heterogeneous. Some methods improve accuracy with little or no additional compute relative to a chosen GRPO baseline, as stated for λ-GRPO (Wang et al., 8 Oct 2025). Others trade added structural complexity for better sample efficiency or rollout efficiency, as in CoDistill-GRPO’s approximate 18% reduction in rollout generation time and S2L-PO’s reported reduction in total rollout FLOPs (Kwon et al., 9 May 2026, Ren et al., 29 May 2026). F-GRPO emphasizes stability and sample efficiency from phase-isolated gradients (Surana et al., 13 May 2026), while MGRPO emphasizes self-correction at the cost of substantially more sampling (Ding et al., 5 Jun 2025).
Several limitations are also explicit. TEPO notes that it still propagates the same advantage to every token of a sequence and does not distinguish which tokens were most critical to success (Lin et al., 10 Oct 2025). MGRPO states that instances for which no corrected refinement is possible are discarded and contribute no learning signal, and that the added second layer roughly doubles compute (Ding et al., 5 Jun 2025). CoDistill-GRPO reports that the large model trained purely with co-distillation lags standard GRPO unless followed by continued training (Kwon et al., 9 May 2026). S2L-PO identifies a capacity-limit issue in the small explorer and therefore relies on progressive annealing to avoid mid-training performance drops (Ren et al., 29 May 2026). These caveats indicate that bi-level GRPO should not be understood as a universally dominant replacement for standard GRPO, but as a family of targeted restructurings designed to address particular GRPO failure modes.
A plausible implication is that “bi-level” in GRPO research now functions as a general design principle for inserting structure between group-level reward formation and final policy updates. The cited works realize that principle at different loci—token weighting, multi-policy distillation, sequence-level likelihood aggregation, phase decomposition, and self-correction—while preserving the group-relative normalization that defines GRPO itself.