
Bi-Level GRPO: Two-Level Policy Optimization

Updated 4 July 2026
  • Bi-Level GRPO is a structural generalization of group-relative policy optimization that decomposes the process into two coupled layers to resolve credit-assignment issues.
  • It employs mechanisms such as learnable token preferences, co-distillation between models, and phase-specific decompositions to improve signal propagation across tokens and phases.
  • Empirical results across various benchmarks demonstrate enhanced training stability, sample efficiency, and performance gains compared to standard GRPO, despite increased computational complexity.

Bi-Level GRPO denotes a class of Group Relative Policy Optimization formulations in which optimization is explicitly decomposed across two coupled levels rather than treated as a single uniform GRPO update. In the materials associated with recent GRPO variants, this bi-level structure appears in several distinct forms: a learnable token-preference outer variable in λ-GRPO (Wang et al., 8 Oct 2025), joint teacher–student co-distillation in CoDistill-GRPO (Kwon et al., 9 May 2026), a group-to-token construction in TEPO (Lin et al., 10 Oct 2025), explorer–learner separation in S2L-PO (Ren et al., 29 May 2026), phase-specific generation–ranking decomposition in F-GRPO (Surana et al., 13 May 2026), and sequential reasoning–self-correction layers in Multi-Layer GRPO (Ding et al., 5 Jun 2025). Across these formulations, the common motivation is the same: standard GRPO applies a single group-derived training signal too uniformly, which creates credit-assignment pathologies such as length bias, sparse-reward inefficiency, cross-phase interference, or weak exploration.

1. Conceptual definition and scope

In standard GRPO-style training, a prompt is paired with a group of sampled responses, each response receives a group-normalized advantage, and a PPO-style clipped surrogate is optimized over the tokens of those responses. Several recent works reinterpret this basic mechanism as insufficiently structured for problems in which the source of error or success is heterogeneous across tokens, policies, or phases of generation (Wang et al., 8 Oct 2025).

Within that perspective, “bi-level GRPO” is not a single algorithm but a family of designs that introduce an additional optimization layer, auxiliary policy, or second decision phase. The lower level typically performs a standard or near-standard GRPO-style policy update under fixed auxiliary structure, while the upper level either learns that structure, supplies exploration data, or optimizes a coupled phase-specific objective. This suggests that bi-level GRPO is best understood as a structural generalization of GRPO rather than a narrowly defined method.

The main variants documented in the supplied literature differ in what is elevated to the second level. In λ-GRPO, the second level is a learnable token-preference parameter $\lambda$ that controls length-adaptive weighting (Wang et al., 8 Oct 2025). In CoDistill-GRPO, it is the joint interaction between a large model $\pi_\theta$ and a small model $\pi_\phi$ under distinct GRPO objectives (Kwon et al., 9 May 2026). In TEPO, the two levels are the group reward and a token-level aggregation mediated by Markov Likelihood (Lin et al., 10 Oct 2025). In S2L-PO, the lower level is a frozen small explorer policy and the upper level is a trainable large learner optimized on a mixture rollout distribution (Ren et al., 29 May 2026). In F-GRPO, the two levels correspond to generation and ranking phases with separate group-relative advantages (Surana et al., 13 May 2026). In MGRPO, the two layers are sequential GRPO passes for initial reasoning and self-correction (Ding et al., 5 Jun 2025).

2. Standard GRPO as the baseline that bi-level methods modify

A recurring starting point is the contemporary GRPO setup in which every sampled response $o_1,\dots,o_G$ for a prompt $x$ is assigned a group-normalized advantage

$$A_{i,t} = \frac{R(x,o_i) - \frac{1}{G}\sum_{j} R(x,o_j)}{\sqrt{\frac{1}{G}\sum_{j}\bigl(R(x,o_j) - \frac{1}{G}\sum_{k} R(x,o_k)\bigr)^2}}.$$

Each token $o_{i,t}$ then receives the same scalar advantage, and vanilla GRPO aggregates by

$$f_{\rm GRPO}(o_i) = \frac{\mu}{|o_i|}, \quad \mu = \frac{1}{G}\sum_{j}|o_j|,$$

leading to the clipped-surrogate loss

λ\lambda0

where

λ\lambda1

This formulation is the baseline against which several bi-level variants are defined (Wang et al., 8 Oct 2025).
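
The group-normalized advantage and the vanilla length weight defined in this section can be illustrated with a minimal NumPy sketch. The function names and the small `eps` added to the denominator are assumptions made for illustration; they are not taken from any of the cited papers.

```python
import numpy as np

def group_normalized_advantages(rewards, eps=1e-8):
    """Center and scale rewards within one prompt's group of G responses."""
    rewards = np.asarray(rewards, dtype=float)
    centered = rewards - rewards.mean()
    # Population standard deviation (1/G normalization), as in the formula above.
    std = rewards.std()
    # eps is an assumption that avoids division by zero when all rewards agree.
    return centered / (std + eps)

def grpo_length_weights(lengths):
    """Vanilla aggregation weight f_GRPO(o_i) = mu / |o_i|, with mu the mean length."""
    lengths = np.asarray(lengths, dtype=float)
    return lengths.mean() / lengths

# Toy group of G = 4 responses with binary rewards and different lengths.
rewards = [1.0, 0.0, 0.0, 1.0]
lengths = [10, 20, 30, 40]
adv = group_normalized_advantages(rewards)   # one scalar per response
weights = grpo_length_weights(lengths)       # shorter responses get larger weights
```

Every token of response $o_i$ shares the scalar `adv[i]`, which is the uniform broadcast that the bi-level variants below restructure.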

The central criticism is that the same advantage is broadcast uniformly over all tokens of a response, while aggregation heuristics are fixed in advance. In the length-bias analysis of λ-GRPO, vanilla GRPO’s fixed weighting $f_{\rm GRPO}$ is said to “implicitly treat each response equally at the sequence level but uniformly dilutes advantage over its tokens,” creating “a tug-of-war in which very long sequences dilute signal and very short ones overconcentrate it” (Wang et al., 8 Oct 2025). TEPO makes a related but distinct criticism: sparse group rewards are linked to token optimization only through undifferentiated token-level handling, and entropy-based fixes can induce entropy collapse or model collapse (Lin et al., 10 Oct 2025). F-GRPO identifies an analogous issue at the phase level: applying a single scalar end-of-sequence reward to a rollout that both generates a candidate set and ranks it conflates generation and ranking errors (Surana et al., 13 May 2026).

These diagnoses explain why bi-level formulations recur in GRPO research. They do not abandon group-relative normalization; rather, they preserve it while inserting an intermediate structure that changes where and how the group signal is redistributed.

3. Learnable token preference and explicit bi-level optimization in λ-GRPO

Among the cited works, λ-GRPO gives the most explicit optimization-theoretic definition of bi-level GRPO. It unifies GRPO-style frameworks through a general “unified token-preference” objective

λ\lambda6

in which DAPO and Dr. GRPO are treated as special cases (Wang et al., 8 Oct 2025).

The key modification is to replace a fixed weighting function $f(o_i)$ with a learnable, length-adaptive weighting. The method computes within-group statistics

λ\lambda8

standardizes lengths and shifts them around λ\lambda9,

λ\lambda0

then applies an exponent $\lambda$ and a softmax normalization,

λ\lambda2

Under this parameterization, $\lambda<0$ biases training toward shorter replies, $\lambda>0$ toward longer ones, and $\lambda=0$ recovers length-neutral token weighting (Wang et al., 8 Oct 2025).
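
The pipeline described above (standardize lengths, shift, raise to a learnable exponent, apply a softmax) can be sketched as follows. This is an illustration under stated assumptions rather than the paper's implementation: the shift around 1, the scale `r`, the clipping floor, and the final rescaling by the group size are all hypothetical choices, and `lam` stands for the learnable preference parameter.

```python
import numpy as np

def length_adaptive_weights(lengths, lam, r=0.1, eps=1e-8):
    """Hypothetical length-adaptive weights: softmax over (shifted z-score) ** lam."""
    lengths = np.asarray(lengths, dtype=float)
    # Within-group standardization of response lengths.
    z = (lengths - lengths.mean()) / (lengths.std() + eps)
    # Assumed shift around 1 so the base stays positive before exponentiation.
    h = np.clip(1.0 + r * z, 1e-3, None)
    scores = h ** lam
    # Numerically stable softmax over the group.
    exp_scores = np.exp(scores - scores.max())
    softmax = exp_scores / exp_scores.sum()
    # Assumed rescaling by G so that the weights average to 1.
    return len(lengths) * softmax

lengths = [10, 20, 30, 40]
w_neutral = length_adaptive_weights(lengths, lam=0.0)   # all weights equal
w_long = length_adaptive_weights(lengths, lam=2.0)      # longer responses weighted more
w_short = length_adaptive_weights(lengths, lam=-2.0)    # shorter responses weighted more
```

In this sketch an exponent of zero makes every score equal, so the softmax is uniform; the sign of `lam` then decides whether longer or shorter responses receive more weight.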

The bi-level structure is then written directly as an inner and outer problem. The inner problem fixes $\lambda$ and updates the policy:

λ\lambda7

where

λ\lambda8

The outer problem updates the preference parameter:

λ\lambda9

The paper also states that one may equivalently take stochastic-gradient steps on $\lambda$ simultaneously with $\theta$, using

πθ\pi_\theta2

with

πθ\pi_\theta3

Because the gradient with respect to $\lambda$ flows through the softmax weights, the method “automatically adapts $\lambda$ to the observed correlation between response length and reward advantages” (Wang et al., 8 Oct 2025).

The reported empirical result is that on eight mathematical reasoning benchmarks (GSM8K, MATH500, Minerva, Gaokao, OlympiadBench, College Math, AIME@32, and AMC@32), Qwen2.5 models fine-tuned with λ-GRPO achieve consistent gains over GRPO and DAPO. The averages reported are 37.8% for Qwen2.5-1.5B, 43.8% for Qwen2.5-3B, and 53.5% for Qwen2.5-7B, each an improvement over GRPO at the same model size, with no modifications to the training data and “no extra compute” in the sense of the same batch size, rollouts, and GPU time (Wang et al., 8 Oct 2025). The same source states that the method yields more stable training curves, higher token entropy, and controlled response length.

4. Coupled policies as bi-level GRPO: co-distillation and small-to-large exploration

A second major interpretation of bi-level GRPO treats the two levels as interacting policies rather than inner and outer variables. CoDistill-GRPO is formulated around a large model with parameters $\theta$ and a small model with parameters $\phi$, each optimized with its own GRPO objective in a single joint loop (Kwon et al., 9 May 2026).

For the small model, the reward is augmented by an on-policy KD term from the large model:

πϕ\pi_\phi3

Advantages are centered across the chosen rollouts, and the small-model GRPO loss uses the PPO-style ratio

πϕ\pi_\phi4

The paper states that because the KD term depends on the small policy $\pi_\phi$ itself, differentiation requires care, and Lemma 4.1 shows that the resulting gradient is unbiased once the learning rate is suitably rescaled. Theorem 4.2 is summarized as

πϕ\pi_\phi7

which is presented as establishing that the distillation term pushes the small policy toward the large one (Kwon et al., 9 May 2026).

The large model is updated on off-policy rollouts generated by the small model $\pi_\phi$, corrected using per-token importance weights

πϕ\pi_\phi9

Conceptually, the bi-level structure is the co-training loop itself: the small model learns from the large model’s distribution via KD reward, while the large model learns from small-model rollouts with importance reweighting (Kwon et al., 9 May 2026).
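
The off-policy correction for the large model can be illustrated with a generic per-token importance ratio. This is a sketch of standard importance weighting, not the paper's exact expression: the function name and the optional cap `clip_max` are hypothetical, and the inputs are assumed to be log-probabilities that each model assigns to the tokens the small model sampled.

```python
import numpy as np

def per_token_importance_weights(logp_large, logp_small, clip_max=None):
    """Ratio pi_large(token) / pi_small(token) for tokens sampled by the small model."""
    logp_large = np.asarray(logp_large, dtype=float)
    logp_small = np.asarray(logp_small, dtype=float)
    # Work in log space and exponentiate once for numerical stability.
    ratio = np.exp(logp_large - logp_small)
    if clip_max is not None:
        # Hypothetical cap that limits the variance contributed by rare tokens.
        ratio = np.minimum(ratio, clip_max)
    return ratio

# Two tokens: the models agree on the first; the large model prefers the second.
raw_weights = per_token_importance_weights([-1.0, -2.0], [-1.0, -3.0])
capped_weights = per_token_importance_weights([-1.0, -2.0], [-1.0, -3.0], clip_max=2.0)
```

A weight of 1 means the small model's sample is as likely under the large model; larger weights up-weight tokens the large model would have favored.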

This joint setup is motivated by sparse rewards on difficult tasks, which the paper argues often prevent GRPO from improving small models. On Qwen2.5-Math-1.5B, average accuracy across Minerva, MATH500, AMC, and OlympiadBench is reported as 32.6% for the base SFT model, 47.1% for standard GRPO, and up to 49.9% for the best-performing CoDistill-GRPO configuration. On Minerva alone, the best CoDistill-GRPO run reaches 32.0% versus 25.9% for vanilla GRPO, a 6.1 pp jump. For the larger Qwen2.5-Math-7B, pure CoDistill-GRPO reaches 53.7% average versus 57.9% for standard GRPO, but CoDistill-GRPO with continued training reaches 57.3%, “nearly” matching standard GRPO while reducing rollout generation time per iteration by approximately 18% (Kwon et al., 9 May 2026).

S2L-PO presents a different two-policy construction. Here the lower-level policy is a frozen small explorer selected for high policy-level diversity under a cost constraint, and the upper-level policy is a trainable large learner optimized via a GRPO surrogate on a mixed rollout distribution (Ren et al., 29 May 2026). The mixture is governed by an annealing schedule:

o1,,oGo_1,\dots,o_G4

with rollout counts

o1,,oGo_1,\dots,o_G5

The learner’s objective is

o1,,oGo_1,\dots,o_G6

where some of the trajectories in each group come from the frozen explorer and the remainder from the learner, in the proportions set by the rollout counts (Ren et al., 29 May 2026).
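
The idea of splitting each rollout group between explorer and learner under an annealed mixture can be sketched as below. The linear decay, the starting fraction `alpha_start`, and both function names are assumptions chosen only for illustration; the schedule actually used in S2L-PO is not reproduced here.

```python
def explorer_fraction(step, total_steps, alpha_start=0.5):
    """Hypothetical linear annealing of the explorer's share of rollouts to zero."""
    progress = min(max(step / total_steps, 0.0), 1.0)
    return alpha_start * (1.0 - progress)

def split_rollouts(group_size, alpha):
    """Return (explorer rollouts, learner rollouts) for one prompt's group."""
    n_explorer = int(round(alpha * group_size))
    return n_explorer, group_size - n_explorer

# With a group of 16 rollouts, the explorer's share shrinks as training proceeds.
start_split = split_rollouts(16, explorer_fraction(0, 100))
mid_split = split_rollouts(16, explorer_fraction(50, 100))
end_split = split_rollouts(16, explorer_fraction(100, 100))
```

Under this assumed schedule the learner ends training fully on-policy, which matches the stated goal of relying less on the small explorer over time.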

The stated theoretical distinction is between token-level randomness and policy-level perturbation. Token-level randomness is said to inject i.i.d. noise that causes prefix-match probability to decay and yields only limited variance growth, whereas policy-level perturbation via a fixed explorer produces coherent additive terms whose contributions reinforce one another (Ren et al., 29 May 2026). Empirically, the reported Pass@1 gains on AIME24 under 16-rollout evaluation are from 15.0 to 23.8 for Qwen3-8B using a 1.7B explorer, from 18.0 to 24.4 for Qwen3-14B using a 4B explorer, and from 0.1 to 4.6 for InternLM2.5-7B using a 1.8B explorer; the paper also reports gains on AIME25, MATH-500, and OlympiadBench, faster peak convergence in effective steps, and a reduction in total rollout FLOPs from reusing off-policy rollouts (Ren et al., 29 May 2026).

5. Group-to-token and phase-specific decompositions

Another bi-level interpretation restructures credit assignment within a single model rather than across two interacting policies. TEPO formulates one level at the group reward and another at token-level aggregation through Markov Likelihood (Lin et al., 10 Oct 2025).

At the group level, for a prompt $x$ and responses $o_1,\dots,o_G$, a normalized advantage is computed as

Ai,t=R(x,oi)    1GjR(x,oj)1Gj(R(x,oj)1GkR(x,ok))2.A_{i,t} = \frac{R(x,o_i)\;-\;\frac1G\sum_{j}R(x,o_j)}{\sqrt{\frac1G\sum_{j}\bigl(R(x,o_j)-\frac1G\sum_{k}R(x,o_k)\bigr)^2}}\,.1

At the token-aggregation level, TEPO exploits the first-order Markov factorization

Ai,t=R(x,oi)    1GjR(x,oj)1Gj(R(x,oj)1GkR(x,ok))2.A_{i,t} = \frac{R(x,o_i)\;-\;\frac1G\sum_{j}R(x,o_j)}{\sqrt{\frac1G\sum_{j}\bigl(R(x,o_j)-\frac1G\sum_{k}R(x,o_k)\bigr)^2}}\,.2

to define a sequence-level importance ratio through a geometric mean:

Ai,t=R(x,oi)    1GjR(x,oj)1Gj(R(x,oj)1GkR(x,ok))2.A_{i,t} = \frac{R(x,o_i)\;-\;\frac1G\sum_{j}R(x,o_j)}{\sqrt{\frac1G\sum_{j}\bigl(R(x,o_j)-\frac1G\sum_{k}R(x,o_k)\bigr)^2}}\,.3

The resulting loss is

Ai,t=R(x,oi)    1GjR(x,oj)1Gj(R(x,oj)1GkR(x,ok))2.A_{i,t} = \frac{R(x,o_i)\;-\;\frac1G\sum_{j}R(x,o_j)}{\sqrt{\frac1G\sum_{j}\bigl(R(x,o_j)-\frac1G\sum_{k}R(x,o_k)\bigr)^2}}\,.4

with gradient

Ai,t=R(x,oi)    1GjR(x,oj)1Gj(R(x,oj)1GkR(x,ok))2.A_{i,t} = \frac{R(x,o_i)\;-\;\frac1G\sum_{j}R(x,o_j)}{\sqrt{\frac1G\sum_{j}\bigl(R(x,o_j)-\frac1G\sum_{k}R(x,o_k)\bigr)^2}}\,.5

This is described as a bi-level link in which the group-level reward modulates a sequence-level reweighting, which then back-propagates uniformly to token-level policy gradients (Lin et al., 10 Oct 2025).
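
A geometric-mean sequence ratio with a clipped surrogate can be sketched as follows. This is a generic illustration assuming per-token log-probabilities are available under the current and old policies; the function names and the clipping width `clip_eps` are hypothetical and the exact TEPO objective may differ.

```python
import math
import numpy as np

def sequence_ratio(logp_new, logp_old):
    """Geometric mean of per-token ratios, computed as exp(mean log-ratio)."""
    diff = np.asarray(logp_new, dtype=float) - np.asarray(logp_old, dtype=float)
    # Averaging in log space normalizes by sequence length.
    return float(np.exp(diff.mean()))

def clipped_sequence_objective(ratio, advantage, clip_eps=0.2):
    """PPO-style clipped surrogate applied once per sequence."""
    clipped = min(max(ratio, 1.0 - clip_eps), 1.0 + clip_eps)
    return min(ratio * advantage, clipped * advantage)

# Two-token response whose tokens both became more likely under the new policy.
ratio = sequence_ratio([-1.0, -2.0], [-1.5, -2.5])
objective_pos = clipped_sequence_objective(ratio, advantage=1.0)
objective_neg = clipped_sequence_objective(ratio, advantage=-1.0)
```

Because the ratio is a single scalar per sequence, its gradient is shared by all tokens of that sequence, which is the uniform back-propagation noted above.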

The paper emphasizes that no explicit entropy bonus is added; instead, the geometric mean in the Markov Likelihood is said to normalize by sequence length, tame variance spikes, and prevent entropy collapse, while clipping the sequence-level ratio enforces a trust region (Lin et al., 10 Oct 2025). On seven mathematical reasoning benchmarks using Qwen2.5-7B, the reported average accuracy improves from 30.30% for a GRPO baseline to 32.04% for TEPO, with MATH-500 improving from 72.40% to 77.20%, AIME24 from 11.56% to 12.18%, AIME25 from 6.25% to 7.60%, and AMC from 40.47% to 43.56% (Lin et al., 10 Oct 2025).

F-GRPO applies a different decomposition in tasks where a single autoregressive rollout must both generate a slate and rank it. The joint policy

Ai,t=R(x,oi)    1GjR(x,oj)1Gj(R(x,oj)1GkR(x,ok))2.A_{i,t} = \frac{R(x,o_i)\;-\;\frac1G\sum_{j}R(x,o_j)}{\sqrt{\frac1G\sum_{j}\bigl(R(x,o_j)-\frac1G\sum_{k}R(x,o_k)\bigr)^2}}\,.6

is paired with two rewards: an order-invariant coverage reward

Ai,t=R(x,oi)    1GjR(x,oj)1Gj(R(x,oj)1GkR(x,ok))2.A_{i,t} = \frac{R(x,o_i)\;-\;\frac1G\sum_{j}R(x,o_j)}{\sqrt{\frac1G\sum_{j}\bigl(R(x,o_j)-\frac1G\sum_{k}R(x,o_k)\bigr)^2}}\,.7

and a position-aware utility reward

Ai,t=R(x,oi)    1GjR(x,oj)1Gj(R(x,oj)1GkR(x,ok))2.A_{i,t} = \frac{R(x,o_i)\;-\;\frac1G\sum_{j}R(x,o_j)}{\sqrt{\frac1G\sum_{j}\bigl(R(x,o_j)-\frac1G\sum_{k}R(x,o_k)\bigr)^2}}\,.8

Group baselines are formed separately,

Ai,t=R(x,oi)    1GjR(x,oj)1Gj(R(x,oj)1GkR(x,ok))2.A_{i,t} = \frac{R(x,o_i)\;-\;\frac1G\sum_{j}R(x,o_j)}{\sqrt{\frac1G\sum_{j}\bigl(R(x,o_j)-\frac1G\sum_{k}R(x,o_k)\bigr)^2}}\,.9

yielding phase-specific advantages

oi,to_{i,t}0

The optimization target is

oi,to_{i,t}1

plus optional KL-regularization (Surana et al., 13 May 2026).

This factorization is explicitly motivated as a remedy to “cross-phase reward contamination.” The gradients on slate tokens are weighted only by the generation-phase advantage and those on ranking tokens only by the ranking-phase advantage, which the paper presents as eliminating the ambiguity introduced by a single scalar end-of-sequence reward (Surana et al., 13 May 2026). Across LastFM, MovieLens, HotpotQA, and MuSiQue, F-GRPO is reported to yield relative improvements in Recall@5 and NDCG@5 over vanilla GRPO, along with gains over decoupled SFT pipelines and supervised fine-tuning, while remaining competitive with specialized rerankers such as MonoT5, DuoT5, RankZephyr, and LiT5 (Surana et al., 13 May 2026).
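
The phase-isolated credit assignment can be sketched as two separately baselined advantages routed to tokens by a phase mask. The sketch assumes mean-only group baselines and a 0/1 mask marking slate versus ranking tokens; the function names, the mask encoding, and the toy reward values are hypothetical.

```python
import numpy as np

def phase_advantages(coverage_rewards, utility_rewards):
    """Subtract a separate group mean from each phase's reward."""
    cov = np.asarray(coverage_rewards, dtype=float)
    util = np.asarray(utility_rewards, dtype=float)
    return cov - cov.mean(), util - util.mean()

def token_advantages(phase_mask, a_gen, a_rank):
    """Assign the generation advantage to slate tokens (mask 0) and the
    ranking advantage to ranking tokens (mask 1)."""
    mask = np.asarray(phase_mask)
    return np.where(mask == 0, a_gen, a_rank)

# Group of 4 rollouts with one coverage reward and one utility reward each.
a_gen, a_rank = phase_advantages([1.0, 0.0, 1.0, 0.0], [0.2, 0.4, 0.6, 0.8])
# First rollout: two slate tokens followed by three ranking tokens.
per_token = token_advantages([0, 0, 1, 1, 1], a_gen[0], a_rank[0])
```

In the toy group the first rollout has good coverage but below-average ranking utility, so its slate tokens are reinforced while its ranking tokens are penalized.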

6. Sequential layers and self-correction in Multi-Layer GRPO

Multi-Layer GRPO (MGRPO) instantiates bi-level GRPO as two sequential GRPO passes over the same problem instance. The first layer performs standard GRPO on initial reasoning traces; the second layer performs GRPO again on self-corrections conditioned on those initial responses (Ding et al., 5 Jun 2025).

In Layer 1, the policy is prompted with the original query and produces a group of candidate reasoning traces. A rule-based verifier assigns each trace a binary reward, and the standard GRPO objective is applied:

fGRPO(oi)=μoi,μ=1Gjoj,f_{\rm GRPO}(o_i)=\tfrac{\mu}{|o_i|},\quad \mu=\tfrac1G\sum_{j}|o_j|\,,2

Here,

fGRPO(oi)=μoi,μ=1Gjoj,f_{\rm GRPO}(o_i)=\tfrac{\mu}{|o_i|},\quad \mu=\tfrac1G\sum_{j}|o_j|\,,3

Each first-layer response is then concatenated with the original query to form a new prompt

fGRPO(oi)=μoi,μ=1Gjoj,f_{\rm GRPO}(o_i)=\tfrac{\mu}{|o_i|},\quad \mu=\tfrac1G\sum_{j}|o_j|\,,5

which also contains a short, randomly sampled guiding phrase such as “Wait, let me double-check that,” intended to encourage self-reflection (Ding et al., 5 Jun 2025). In Layer 2, the same policy samples a group of corrected or refined trajectories for each first-layer response, verifies them, and retains only successful corrections or confirmations. A second GRPO update is then computed over the retained refinements:

λ\lambda00

The motivation is the reward-collapse problem in one-shot GRPO: a single mistake in a long chain of reasoning produces zero final reward, erasing credit for earlier correct steps. MGRPO is described as recycling “failed” attempts into learning signals by rewarding successful repair, thereby approximating process-level supervision without requiring explicit dense annotations (Ding et al., 5 Jun 2025).
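
The second-layer data flow (build a self-correction prompt, sample refinements, keep only verified ones) can be sketched as below. The concatenation order, the newline separators, the single-entry phrase list, and the function names are assumptions for illustration; only the example guiding phrase comes from the description above.

```python
import random

# The phrase is quoted in the text; treating it as a one-item pool is an assumption.
GUIDING_PHRASES = ["Wait, let me double-check that"]

def build_layer2_prompt(query, first_response, rng):
    """Concatenate query, first-layer response, and a sampled guiding phrase."""
    phrase = rng.choice(GUIDING_PHRASES)
    return f"{query}\n{first_response}\n{phrase}"

def retain_verified(refinements, verifier):
    """Keep only refinements that the rule-based verifier accepts."""
    return [r for r in refinements if verifier(r)]

rng = random.Random(0)
prompt = build_layer2_prompt("What is 2 + 2?", "The answer is 5.", rng)
# Hypothetical verifier: accept a refinement only if it states the right answer.
kept = retain_verified(["The answer is 4.", "The answer is 5."],
                       lambda text: text.endswith("4."))
```

Only the retained refinements would enter the second GRPO update, so a prompt whose refinements all fail verification contributes nothing at this layer.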

The reported evaluation uses Qwen2.5-Math-7B-base on GSM8K, MATH500, Minerva Math, and OlympiadBench. MGRPO is said to match standard one-round GRPO at Layer 1 and then improve after Layer 2 to 95.6% versus 83.4% on GSM8K, 90.4% versus 80.9% on MATH500, 39.3% versus 35.1% on Minerva Math, and 50.4% versus 39.9% on OlympiadBench. The reported rate of 2–5% at which initially incorrect answers become correct, together with a negligible rate of approximately 0.3% at which initially correct answers become incorrect, is presented as evidence that the second layer predominantly converts failures into successes without substantially harming initially correct answers (Ding et al., 5 Jun 2025). The same source notes that the extra cost of sampling and verifying refinements for each initial answer “roughly doubles compute.”

7. Common motivations, advantages, and limitations

Across these formulations, the principal motivation for bi-level GRPO is improved credit assignment under sparse or structurally ambiguous rewards. In λ-GRPO, the problem is length bias induced by fixed token-level aggregation (Wang et al., 8 Oct 2025). In CoDistill-GRPO, it is sparse reward that prevents small models from improving on difficult tasks (Kwon et al., 9 May 2026). In TEPO, it is the instability of entropy-based adjustments under sparse token rewards in chain-of-thought optimization (Lin et al., 10 Oct 2025). In S2L-PO, it is insufficient rollout diversity and the incoherence introduced by token-level randomness (Ren et al., 29 May 2026). In F-GRPO, it is the inability of a single terminal reward to distinguish poor generation from poor ranking (Surana et al., 13 May 2026). In MGRPO, it is reward vanishing caused by one error invalidating an entire reasoning chain (Ding et al., 5 Jun 2025).

The practical benefits reported in these works are correspondingly heterogeneous. Some methods improve accuracy with little or no additional compute relative to a chosen GRPO baseline, as stated for λ-GRPO (Wang et al., 8 Oct 2025). Others trade added structural complexity for better sample efficiency or rollout efficiency, as in CoDistill-GRPO’s approximately 18% reduction in rollout generation time and S2L-PO’s reported reduction in total rollout FLOPs (Kwon et al., 9 May 2026, Ren et al., 29 May 2026). F-GRPO emphasizes stability and sample efficiency from phase-isolated gradients (Surana et al., 13 May 2026), while MGRPO emphasizes self-correction at the cost of substantially more sampling (Ding et al., 5 Jun 2025).

Several limitations are also explicit. TEPO notes that it still propagates the same advantage to every token of a sequence and does not distinguish which tokens were most critical to success (Lin et al., 10 Oct 2025). MGRPO states that instances for which no corrected refinement is possible are discarded and contribute no learning signal, and that the added second layer roughly doubles compute (Ding et al., 5 Jun 2025). CoDistill-GRPO reports that the large model trained purely with co-distillation lags standard GRPO unless followed by continued training (Kwon et al., 9 May 2026). S2L-PO identifies a capacity-limit issue in the small explorer and therefore relies on progressive annealing to avoid mid-training performance drops (Ren et al., 29 May 2026). These caveats indicate that bi-level GRPO should not be understood as a universally dominant replacement for standard GRPO, but as a family of targeted restructurings designed to address particular GRPO failure modes.

A plausible implication is that “bi-level” in GRPO research now functions as a general design principle for inserting structure between group-level reward formation and final policy updates. The cited works realize that principle at different loci—token weighting, multi-policy distillation, sequence-level likelihood aggregation, phase decomposition, and self-correction—while preserving the group-relative normalization that defines GRPO itself.
