Zipformer: Median-Centered GRPO
- Zipformer (MC-GRPO) is a reinforcement learning method that replaces the mean baseline with a median to reduce the risk of sign flips in small-rollout regimes.
- It improves advantage estimation by robustly mitigating the effect of reward outliers, ensuring stable policy updates in low-budget RL and RLHF settings.
- Empirical results show that Zipformer achieves faster convergence and nearly matches high-group-size performance with minimal computational overhead.
Zipformer is not a distinct technique but refers specifically to Median-Centered Group Relative Policy Optimization (MC-GRPO), a robust extension of Group Relative Policy Optimization (GRPO) for reinforcement learning in the small-rollout regime. MC-GRPO replaces GRPO's mean baseline with a median-based baseline, yielding substantial improvements in stability and accuracy when the rollout budget is minimal. The method is especially relevant to reinforcement learning from human feedback (RLHF) and other low-budget RL scenarios in LLM post-training.
1. Background: Group Relative Policy Optimization
Group Relative Policy Optimization (GRPO) generalizes classic policy-gradient objectives to settings where $G$ rollouts are sampled per prompt and each completion's advantage is computed relative to its contemporaneous group. For a given prompt $q$, GRPO generates completions $o_1, \dots, o_G \sim \pi_{\theta_{\mathrm{old}}}(\cdot \mid q)$, computes scalar rewards $r_i = R(q, o_i)$, forms the group mean baseline $\bar{r} = \frac{1}{G}\sum_{i=1}^{G} r_i$, and defines advantages

$$A_i = \frac{r_i - \bar{r}}{\sigma},$$

where $\sigma$ is the within-group reward standard deviation.
The per-token policy update is a clipped-surrogate Proximal Policy Optimization (PPO)-style objective:

$$\mathcal{J}(\theta) = \mathbb{E}\left[\frac{1}{G}\sum_{i=1}^{G}\frac{1}{|o_i|}\sum_{t=1}^{|o_i|} \min\big(\rho_{i,t} A_i,\ \operatorname{clip}(\rho_{i,t},\, 1-\varepsilon,\, 1+\varepsilon)\, A_i\big)\right],$$

with $\rho_{i,t} = \pi_\theta(o_{i,t} \mid q, o_{i,<t}) / \pi_{\theta_{\mathrm{old}}}(o_{i,t} \mid q, o_{i,<t})$ denoting the token-level importance ratio.
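As a concrete illustration of the group-relative advantage computation, here is a minimal sketch with made-up rewards for $G = 4$ rollouts of one prompt (all values are hypothetical):

```python
import numpy as np

# Hypothetical rewards for G = 4 rollouts of one prompt (e.g., 1 = correct answer).
rewards = np.array([1.0, 0.0, 1.0, 0.0])

baseline = rewards.mean()        # group mean baseline, here 0.5
sigma = rewards.std() + 1e-8     # within-group std; epsilon avoids division by zero
advantages = (rewards - baseline) / sigma

print(advantages)  # correct completions get ~+1, incorrect ~-1
```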
Though effective and stable at large group sizes (e.g., $G = 8$, as in the experiments below), GRPO is highly sensitive to estimation noise at small group sizes. In particular, for very small groups such as $G = 2$, a single reward outlier can dominate the group mean, leading to incorrect advantage sign assignments and, consequently, reversed or destructive parameter updates (Kim, 30 Jan 2026).
2. The Sign-Flip Pathology in Small-Rollout GRPO
With small $G$, the mean baseline becomes a noisy estimator of the true expected reward. A single high-reward outlier can shift $\bar{r}$ by enough to flip the computed advantage's sign for multiple other rollouts. The rate of such "sign flips" (i.e., the fraction of completions whose advantage sign disagrees with the oracle sign derived from a large reference batch) empirically exceeds 25% at the smallest group sizes and remains above 15% at moderately larger ones. Synthetic injection of sign flips into policy updates shows that even a 5% sign-flip rate can reduce downstream accuracy by roughly 4 percentage points (Kim, 30 Jan 2026).
This phenomenon renders naive GRPO unstable and ineffective in the low- regime, as it induces the optimizer to frequently penalize beneficial trajectories or reinforce suboptimal ones.
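The pathology is easy to reproduce with a toy group. In the hypothetical $G = 3$ group below (values are illustrative, not from the paper), a single high-reward outlier drags the mean above both ordinary rollouts, so the mean baseline penalizes them both, while a median baseline preserves their relative ordering:

```python
import numpy as np

# Hypothetical G = 3 group: two ordinary rollouts plus one high-reward outlier.
rewards = np.array([0.6, 0.5, 5.0])

mean_adv = rewards - rewards.mean()        # mean baseline ~= 2.03, inflated by the outlier
median_adv = rewards - np.median(rewards)  # median baseline = 0.6, unaffected

print(np.sign(mean_adv))    # [-1. -1.  1.]  both ordinary rollouts penalized
print(np.sign(median_adv))  # [ 0. -1.  1.]  pivot zeroed; relative ordering intact
```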
3. Median-Centered GRPO: The MC-GRPO Method
MC-GRPO addresses the instability of mean-based baselines by using the median as a robust center. Specifically:
- Sampling: For each prompt $q$, sample an *odd*-sized group of $G + 1$ rollouts $o_1, \dots, o_{G+1} \sim \pi_{\theta_{\mathrm{old}}}(\cdot \mid q)$.
- Reward Computation: Compute each reward $r_i = R(q, o_i)$.
- Baseline and Advantage:
- Compute the group median $m = \operatorname{median}(r_1, \dots, r_{G+1})$.
- Calculate the median absolute deviation $\mathrm{MAD} = \operatorname{median}_i |r_i - m|$.
- Form robust normalized advantages: $A_i = \dfrac{r_i - m}{\mathrm{MAD} + \epsilon}$.
- Exclude the unique median-pivot rollout (for which $A_i = 0$) from backpropagation; the remaining $G$ samples contribute to the gradient as in standard GRPO.
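The baseline-and-advantage steps above can be sketched as a small helper (a sketch under the stated definitions, not the paper's reference implementation; the input rewards are illustrative):

```python
import numpy as np

def mc_grpo_advantages(rewards, eps=1e-8):
    """Median-centered advantages; returns (advantages, keep_mask)."""
    r = np.asarray(rewards, dtype=float)
    m = np.median(r)                      # robust baseline (odd-sized group)
    mad = np.median(np.abs(r - m)) + eps  # median absolute deviation, plus epsilon
    adv = (r - m) / mad
    keep = adv != 0.0                     # drop the unique median-pivot rollout
    return adv, keep

adv, keep = mc_grpo_advantages([0.2, 0.9, 0.4, 1.0, 0.3])  # G + 1 = 5 rollouts
print(adv, keep.sum())  # 4 of 5 rollouts contribute to the gradient
```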
Algorithm 1 in (Kim, 30 Jan 2026) shows that MC-GRPO is a drop-in replacement for GRPO in any PPO-style pipeline and otherwise preserves all associated mechanisms, including surrogate clipping, KL-penalties, and importance sampling.
4. Theoretical and Empirical Justification for Median-Centering
The median is statistically far less sensitive to outliers than the mean (its breakdown point is 50%, versus 0% for the mean); using it as the baseline sharply reduces the risk of sign flips at small group sizes. For an odd-sized group, withholding the median-pivot from the gradient computation prevents a zero-advantage sample from adding noise to the update, while keeping the number of gradient-contributing rollouts per prompt identical to standard $G$-rollout training (Kim, 30 Jan 2026).
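A quick Monte Carlo sketch makes the robustness argument concrete. Under an assumed symmetric, outlier-contaminated reward model (illustrative, not the paper's setup), the small-group mean baseline flips advantage signs relative to the oracle sign far more often than the median baseline:

```python
import numpy as np

rng = np.random.default_rng(0)
n_trials, G = 20_000, 3

# Assumed reward model: mostly N(0, 1), with a 10% chance of a symmetric
# large-magnitude outlier. True expected reward is 0 by symmetry.
base = rng.normal(0.0, 1.0, size=(n_trials, G))
is_outlier = rng.random((n_trials, G)) < 0.10
shift = rng.choice([-10.0, 10.0], size=(n_trials, G))
rewards = np.where(is_outlier, base + shift, base)

oracle = np.sign(rewards)  # oracle sign: reward vs. true expected reward (0)

mean_sign = np.sign(rewards - rewards.mean(axis=1, keepdims=True))
med_sign = np.sign(rewards - np.median(rewards, axis=1, keepdims=True))

flip_mean = (mean_sign != oracle).mean()
nonpivot = med_sign != 0  # MC-GRPO drops the pivot from the update anyway
flip_med = (med_sign != oracle)[nonpivot].mean()
print(f"mean baseline flip rate:   {flip_mean:.3f}")
print(f"median baseline flip rate: {flip_med:.3f}")
```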
Controlled experiments with Qwen3-1.7B on GSM8K and other LLMs demonstrate the following:
| Setting | Standard GRPO (G=2) | GRPO (G=8) | MC-GRPO (G=2) | Gap Reduction |
|---|---|---|---|---|
| GSM8K, exact match (%) | 78.9 | 84.5 | 83.5 | 5.6pp → 1.0pp |
Similar improvements of 2–4 percentage points at the smallest group sizes and 2–3pp at moderately larger ones occur across Math-500 and multiple model scales (Kim, 30 Jan 2026). Training curves exhibit faster convergence and reduced reward fluctuations when using MC-GRPO at low group sizes. The computational cost of the extra rollout is negligible (a few percent increase in total runtime at most), since backward passes dominate.
5. Pseudocode and Integration in Existing RL Pipelines
The only modification required to implement MC-GRPO is the addition of one extra rollout per prompt, replacement of the mean baseline by the median, and exclusion of the median-pivot from the surrogate loss computation. All other components—PPO clipped surrogate, per-token broadcasting of sequence reward, KL penalty, and importance ratios—remain unchanged, guaranteeing full compatibility with pipelines using GRPO, DAPO, DR-GRPO, or similar approaches (Kim, 30 Jan 2026).
Algorithmic sketch:
```python
import numpy as np

for q in dataset:
    # 1. Sample G + 1 completions
    rollouts = [sample(pi_old, q) for _ in range(G + 1)]
    rewards = [reward(q, o) for o in rollouts]

    # 2. Compute robust baseline and scale
    median = np.median(rewards)
    MAD = np.median([abs(r - median) for r in rewards]) + epsilon

    # 3. Compute advantages
    advantages = [(r - median) / MAD for r in rewards]

    # 4. Identify and exclude the median-pivot
    keep = [i for i, adv in enumerate(advantages) if adv != 0]

    # 5. Policy update step for the remaining G rollouts
    for i in keep:
        # Broadcast A_i to the trajectory, compute the PPO surrogate, etc.
        ...
```
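For completeness, the per-rollout surrogate that the kept advantages feed into can be sketched as follows. This is a numpy sketch of the standard clipped objective with the sequence-level advantage broadcast to every token; a real pipeline would use autograd tensors, and the log-probabilities below are illustrative:

```python
import numpy as np

def clipped_surrogate(logp_new, logp_old, adv, eps=0.2):
    """Token-level PPO clipped surrogate for one kept rollout.

    logp_new / logp_old: per-token log-probs under the current / behavior
    policy; adv: the sequence-level MC-GRPO advantage, broadcast per token.
    """
    ratio = np.exp(logp_new - logp_old)              # token importance ratio
    unclipped = ratio * adv
    clipped = np.clip(ratio, 1 - eps, 1 + eps) * adv
    return np.minimum(unclipped, clipped).mean()     # maximize this surrogate

# Toy check: a positive-advantage rollout whose tokens became slightly more likely.
logp_old = np.array([-2.0, -1.5, -3.0])
logp_new = logp_old + 0.1
print(clipped_surrogate(logp_new, logp_old, adv=1.2))
```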
6. Generalization and Practical Impact
MC-GRPO robustifies group-relative policy optimization in any setting with small batch sizes per prompt, including but not limited to RLHF-style training, rule-model-based reinforcement learning, and latency- or memory-constrained environments. The method's empirical robustness and negligible runtime overhead make it suitable for high-throughput inference and resource-limited RL training, restoring stability and final task accuracy to within 1% of high-group-size training. Its generality extends directly to other GRPO-family surrogates and domains without further tuning (Kim, 30 Jan 2026).
7. Limitations and Future Directions
While MC-GRPO nearly eliminates baseline-induced sign flips in small- regimes and addresses outlier sensitivity, the underlying assumption is that the reward distribution is adequately well-behaved for the median to provide a stable location estimate. In extreme multimodal reward settings or those with pathological median behavior, further robustification may be necessary. Extensions to other forms of robust statistics, adaptive group size, and integration with dynamic advantage normalization or ensemble-based baselines are plausible avenues. Initial experiments confirm improved out-of-distribution generalization, but future work may explore real-world human-in-the-loop RLHF, adversarial prompt settings, and broader model classes (Kim, 30 Jan 2026).
References
- "MC-GRPO: Median-Centered Group Relative Policy Optimization for Small-Rollout Reinforcement Learning" (Kim, 30 Jan 2026)