
Zipformer: Median-Centered GRPO

Updated 9 February 2026
  • Zipformer (MC-GRPO) is a reinforcement learning method that replaces the mean baseline with a median to reduce the risk of sign flips in small-rollout regimes.
  • It improves advantage estimation by robustly mitigating the effect of reward outliers, ensuring stable policy updates in low-budget RL and RLHF settings.
  • Empirical results show that Zipformer achieves faster convergence and nearly matches high-group-size performance with minimal computational overhead.

Zipformer is not a distinct technique but refers specifically to Median-Centered Group Relative Policy Optimization (MC-GRPO), a robust extension of Group Relative Policy Optimization (GRPO) for reinforcement learning in the small-rollout regime. MC-GRPO replaces the mean baseline of GRPO with a median-based baseline, ensuring substantial improvements in stability and accuracy when the rollout budget is minimal. This methodology is especially significant for reinforcement learning from human feedback (RLHF) and related low-budget RL scenarios in LLM post-training.

1. Background: Group Relative Policy Optimization

Group Relative Policy Optimization (GRPO) generalizes classic policy-gradient objectives to settings where $G$ rollouts are sampled per prompt and each completion’s advantage is computed relative to its contemporaneous group. For a given prompt $q$, GRPO generates $G$ completions $o_1, \ldots, o_G \sim \pi_\theta(\cdot \mid q)$, computes scalar rewards $r_i = R(q, o_i)$, forms the group mean baseline $b(q) = \frac{1}{G}\sum_{j=1}^{G} r_j$, and defines advantages

$$A_i = r_i - b(q) \quad \text{or, normalized,} \quad A_i = \frac{r_i - b(q)}{s_r(q) + \epsilon}$$

where $s_r(q)$ is the within-group reward standard deviation.
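As a concrete illustration, the mean-baseline advantage computation can be sketched in a few lines (a minimal sketch with invented reward values, not the paper's implementation):

```python
import numpy as np

def grpo_advantages(rewards, eps=1e-8, normalize=True):
    """Mean-baseline (standard GRPO) advantages for one rollout group."""
    r = np.asarray(rewards, dtype=float)
    baseline = r.mean()                 # b(q): the group mean
    adv = r - baseline                  # A_i = r_i - b(q)
    if normalize:
        adv = adv / (r.std() + eps)     # divide by within-group std s_r(q)
    return adv

# Toy group of G = 4 rewards: advantages sum to (numerically) zero
# by construction, since the baseline is the group mean.
adv = grpo_advantages([0.2, 0.8, 0.5, 0.9])
```

Note that the worst completion in the group always receives a negative advantage and the best a positive one, regardless of their absolute reward level.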

The per-token policy update is a clipped-surrogate Proximal Policy Optimization (PPO)-style objective:

$$J(\theta) = \mathbb{E}_{q,\; o_i \sim \pi_{\theta_{\text{old}}}} \Bigg[ \frac{1}{G} \sum_{i=1}^{G} \sum_{t=1}^{|o_i|} \min \big( \rho_{i,t}(\theta) A_i,\; \hat{\rho}_{i,t}(\theta) A_i \big) \Bigg]$$

with $\rho_{i,t}$ denoting the token-level importance ratio and $\hat{\rho}_{i,t}$ its clipped counterpart.
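The per-token clipped term can be evaluated directly from this objective. The sketch below (NumPy, invented ratio values; the clip range of 0.2 is an assumed conventional default, not taken from the paper) computes min(ρA, clip(ρ)A) for a vector of token ratios:

```python
import numpy as np

def clipped_surrogate(ratios, advantage, clip_eps=0.2):
    """PPO-style clipped surrogate per token: min(rho * A, clip(rho) * A).

    `ratios` holds token-level importance ratios rho_{i,t}; `advantage`
    is the sequence-level A_i broadcast to every token of the rollout.
    """
    rho = np.asarray(ratios, dtype=float)
    clipped = np.clip(rho, 1.0 - clip_eps, 1.0 + clip_eps)
    return np.minimum(rho * advantage, clipped * advantage)

# With a positive advantage, ratios above 1 + eps are clipped (no extra
# gain from moving further); with a negative advantage, the min() keeps
# the more pessimistic of the two terms.
terms = clipped_surrogate([0.5, 1.0, 1.5], advantage=1.0)
```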

Though effective and stable for large group sizes ($G = 8\text{–}32$), GRPO is highly sensitive to estimation noise for small group sizes. In particular, with $G = 2$ or $G = 4$, a single reward outlier can dominate the group mean, leading to incorrect advantage sign assignments and, consequently, reversed or destructive parameter updates (Kim, 30 Jan 2026).

2. The Sign-Flip Pathology in Small-Rollout GRPO

With small $G$, the mean baseline $b(q)$ becomes a noisy estimator of the true expected reward. A high-reward outlier can shift $b(q)$ enough to flip the computed advantage’s sign for multiple other rollouts. The rate of such “sign flips” (i.e., the fraction of completions whose advantage sign disagrees with the oracle sign derived from a large reference batch) empirically exceeds 25% for $G = 2$ and remains above 15% for $G = 4$. Synthetic injection of sign flips into policy updates shows that even a 5% sign-flip rate can reduce downstream accuracy by roughly 4 percentage points (Kim, 30 Jan 2026).
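The pathology is easy to reproduce synthetically. The sketch below (an illustration only, not the paper's experiment; the outlier-contaminated reward distribution is an assumption) estimates how often a small group's mean-baseline advantage sign disagrees with the oracle sign under the true expected reward:

```python
import numpy as np

def sign_flip_rate(G, n_groups=20000, outlier_p=0.05, outlier_val=10.0, seed=0):
    """Fraction of rollouts whose advantage sign under the group-mean
    baseline disagrees with the sign under the true expected reward."""
    rng = np.random.default_rng(seed)
    # Rewards: standard normal, with occasional large positive outliers.
    r = rng.standard_normal((n_groups, G))
    mask = rng.random((n_groups, G)) < outlier_p
    r = np.where(mask, outlier_val, r)
    true_mean = outlier_p * outlier_val           # E[r] for this mixture
    oracle_sign = np.sign(r - true_mean)
    group_sign = np.sign(r - r.mean(axis=1, keepdims=True))
    return float(np.mean(oracle_sign != group_sign))

# The flip rate shrinks as the group grows, mirroring the trend above.
flip2, flip8 = sign_flip_rate(G=2), sign_flip_rate(G=8)
```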

This phenomenon renders naive GRPO unstable and ineffective in the low-$G$ regime, as it induces the optimizer to frequently penalize beneficial trajectories or reinforce suboptimal ones.

3. Median-Centered GRPO: The MC-GRPO Method

MC-GRPO addresses the instability of mean-based baselines by using the median as a robust center. Specifically:

  1. Sampling: For each prompt $q$, sample an odd-sized group of $G+1$ rollouts $o_1, \ldots, o_{G+1} \sim \pi_{\theta_{\text{old}}}(\cdot \mid q)$.
  2. Reward Computation: Compute each $r_i = R(q, o_i)$.
  3. Baseline and Advantage:

    • Compute the group median $b_{\text{med}}(q) = \mathrm{median}\{r_1, \ldots, r_{G+1}\}$.
    • Calculate the median absolute deviation $\mathrm{MAD}(r) = \mathrm{median}(|r_i - b_{\text{med}}|)$.
    • Form robust normalized advantages:

    $$A_i = \frac{r_i - b_{\text{med}}(q)}{\mathrm{MAD}(r) + \epsilon}, \quad i = 1, \ldots, G+1$$

  4. Pivot Exclusion: Exclude the unique median-pivot rollout $i^*$ (for which $A_{i^*} = 0$) from backpropagation; the remaining $G$ samples contribute to the gradient as in standard GRPO.
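A small worked example (invented reward values) shows the mechanism: one high-reward outlier drags the mean above every typical reward, flipping their advantage signs, while the median baseline leaves them intact:

```python
import numpy as np

rewards = np.array([0.10, 0.15, 0.18, 0.20, 5.0])   # G+1 = 5 rollouts, one outlier
eps = 1e-8

# Mean baseline: 1.126, so all four typical rewards get negative
# advantages, even the best typical completion (0.20).
mean_adv = rewards - rewards.mean()

# Median baseline: 0.18, so signs reflect rank among typical completions.
b_med = np.median(rewards)
mad = np.median(np.abs(rewards - b_med))             # median absolute deviation
adv = (rewards - b_med) / (mad + eps)                # robust normalized advantages

# Exclude the median-pivot (its advantage is exactly zero); G samples remain.
keep = [i for i in range(len(rewards)) if rewards[i] != b_med]
```

Under the mean baseline the 0.20-reward completion would be penalized; under the median baseline it is correctly reinforced.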

Algorithm 1 in (Kim, 30 Jan 2026) shows that MC-GRPO is a drop-in replacement for GRPO in any PPO-style pipeline and otherwise preserves all associated mechanisms, including surrogate clipping, KL-penalties, and importance sampling.

4. Theoretical and Empirical Justification for Median-Centering

The median is statistically less sensitive to outliers than the mean; using it as the baseline minimizes the risk of sign flips at small group sizes. For an odd-sized group, excluding the median-pivot (whose advantage is exactly zero) from gradient computation discards no useful signal, and the number of gradient-contributing rollouts per prompt remains identical to standard $G$-rollout training (Kim, 30 Jan 2026).

Controlled experiments with Qwen3-1.7B on GSM8K and other LLMs demonstrate the following:

Setting                | Standard GRPO (G=2) | GRPO (G=8) | MC-GRPO (G=2) | Gap Reduction
GSM8K, exact match (%) | 78.9                | 84.5       | 83.5          | 5.6pp → 1.0pp

Similar improvements (2–4 percentage points at $G = 2$ and 2–3pp at $G = 4$) occur across Math-500 and multiple model scales (Kim, 30 Jan 2026). Training curves exhibit faster convergence and reduced reward fluctuations when using MC-GRPO at low group sizes. The computational cost of the extra rollout is negligible (a few percent of total runtime at most), as backward passes dominate.

5. Pseudocode and Integration in Existing RL Pipelines

The only modification required to implement MC-GRPO is the addition of one extra rollout per prompt, replacement of the mean baseline by the median, and exclusion of the median-pivot from the surrogate loss computation. All other components—PPO clipped surrogate, per-token broadcasting of sequence reward, KL penalty, and importance ratios—remain unchanged, guaranteeing full compatibility with pipelines using GRPO, DAPO, DR-GRPO, or similar approaches (Kim, 30 Jan 2026).

Algorithmic sketch:

import numpy as np

epsilon = 1e-8  # numerical stability constant for the MAD denominator

for q in dataset:
    # 1. Sample G+1 completions from the behavior policy
    rollouts = [sample(pi_old, q) for _ in range(G + 1)]
    rewards = [reward(q, o) for o in rollouts]
    # 2. Compute robust baseline and scale
    median = np.median(rewards)
    mad = np.median([abs(r - median) for r in rewards])
    # 3. Compute robust normalized advantages
    advantages = [(r - median) / (mad + epsilon) for r in rewards]
    # 4. Exclude the median-pivot (its advantage is exactly zero)
    keep = [i for i, a in enumerate(advantages) if a != 0.0]
    # 5. Policy update step for the remaining G rollouts
    for i in keep:
        # Broadcast A_i to every token of rollout i, compute the
        # PPO clipped surrogate, apply the KL penalty, etc.
        ...

6. Generalization and Practical Impact

MC-GRPO robustifies group-relative policy optimization in any setting with small batch sizes per prompt, including but not limited to RLHF-style training, rule-model-based reinforcement learning, and latency- or memory-constrained environments. The method's empirical robustness and negligible runtime overhead make it suitable for high-throughput inference and resource-limited RL training, restoring stability and final task accuracy to within 1% of high-group-size training. Its generality extends directly to other GRPO-family surrogates and domains without further tuning (Kim, 30 Jan 2026).

7. Limitations and Future Directions

While MC-GRPO nearly eliminates baseline-induced sign flips in small-$G$ regimes and addresses outlier sensitivity, the underlying assumption is that the reward distribution is adequately well-behaved for the median to provide a stable location estimate. In extreme multimodal reward settings or those with pathological median behavior, further robustification may be necessary. Extensions to other forms of robust statistics, adaptive group size, and integration with dynamic advantage normalization or ensemble-based baselines are plausible avenues. Initial experiments confirm improved out-of-distribution generalization, but future work may explore real-world human-in-the-loop RLHF, adversarial prompt settings, and broader model classes (Kim, 30 Jan 2026).


References

  • "MC-GRPO: Median-Centered Group Relative Policy Optimization for Small-Rollout Reinforcement Learning" (Kim, 30 Jan 2026)