Median-Centered GRPO (MC-GRPO)
- MC-GRPO is a reinforcement learning algorithm that uses a median baseline to reduce the high-variance errors common in small rollout scenarios.
- The method excludes the median sample from policy updates, mitigating erroneous advantage sign flips and promoting stable gradient computations.
- Empirical results demonstrate that MC-GRPO narrows the accuracy gap between low and high rollout settings, improving overall model performance.
Median-Centered Group Relative Policy Optimization (MC-GRPO) is a reinforcement learning algorithm designed to enhance the stability and sample efficiency of sequence-generation model training under small-rollout budgets. It is a robust modification to the Group-Relative Policy Optimization (GRPO) framework, addressing the high-variance failure modes that arise when the number of rollouts per prompt is small. By replacing the conventional mean baseline with a median baseline, MC-GRPO reduces the prevalence of erroneous advantage sign flips, thereby improving both the stability and accuracy of policy gradients even in resource-constrained regimes (Kim, 30 Jan 2026).
1. Formal Problem Setup
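Consider the standard GRPO setting for training LLMs via reinforcement learning. Let $\mathcal{B}$ denote a batch of prompts. For each prompt $x \in \mathcal{B}$, the policy $\pi_\theta$ parametrized by $\theta$ generates $G$ independent rollouts (completions) $y_1, \dots, y_G$. Each rollout $y_i$ defines a trajectory with scalar return $r_i$. The core GRPO method computes within-prompt group statistics, namely the sample mean $\mu$ and standard deviation $\sigma$:
$$\mu = \frac{1}{G} \sum_{i=1}^{G} r_i, \qquad \sigma = \sqrt{\frac{1}{G} \sum_{i=1}^{G} (r_i - \mu)^2}$$
The estimated advantage for each rollout is computed as $\hat{A}_i = (r_i - \mu) / \sigma$, and the policy gradient step is
$$\nabla_\theta J(\theta) \approx \frac{1}{G} \sum_{i=1}^{G} \hat{A}_i \, \nabla_\theta \log \pi_\theta(y_i \mid x)$$
This baseline normalization is sensitive to outlier rewards, especially when $G$ is small.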
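The group statistics and normalized advantages described above can be sketched in a few lines of plain Python. This is a minimal illustration rather than a reference implementation: the function name `grpo_advantages` and the stabilizing constant `eps` are assumptions introduced here, and the reward values are hypothetical.

```python
import math

def grpo_advantages(rewards, eps=1e-6):
    """Mean/std-normalized advantages for one prompt's group of rollouts.

    `eps` is an assumed guard against division by zero when all rewards tie.
    """
    g = len(rewards)
    mu = sum(rewards) / g
    sigma = math.sqrt(sum((r - mu) ** 2 for r in rewards) / g)
    return [(r - mu) / (sigma + eps) for r in rewards]

# Hypothetical binary rewards for a group of four rollouts.
adv = grpo_advantages([1.0, 0.0, 0.0, 1.0])
```

With these rewards the mean and standard deviation are both 0.5, so correct rollouts receive an advantage close to +1 and incorrect ones close to -1.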
2. Baseline Instabilities: Mean vs. Median
The traditional GRPO mean baseline, $\mu = \frac{1}{G} \sum_{i} r_i$, is vulnerable under small $G$. A single outlier reward $r_k$ can significantly skew the mean, resulting in “advantage sign flips,” where the estimated advantage $\hat{A}_i$ for a rollout switches sign relative to the true (oracle) advantage. This leads to noisy or even adversarial policy updates, as the gradient update direction can be reversed. Formally, contamination of the mean by an outlier $r_k \to r_k + \delta$ shifts the baseline to $\mu' = \mu + \delta / G$, so that for every other rollout $i \neq k$
$$r_i - \mu' = (r_i - \mu) - \frac{\delta}{G}$$
which may invert the sign whenever $|\delta| / G$ exceeds $|r_i - \mu|$. In contrast, the group median as a baseline, $b = \operatorname{median}(r_1, \dots, r_G)$, is substantially less sensitive to single outliers: roughly half the group must be contaminated before the median can be moved arbitrarily far. Thus, $\operatorname{sign}(r_i - b)$ is more stable.
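A small worked example makes the contrast concrete. The sketch below uses hypothetical reward values, replaces one reward with an outlier, and compares advantage signs under the mean and the median baseline; the helper name `signs` is introduced here for illustration only.

```python
from statistics import mean, median

def signs(rewards, baseline):
    """Sign (-1, 0, +1) of each reward minus the baseline."""
    return [(r > baseline) - (r < baseline) for r in rewards]

clean = [2.0, 3.0, 4.0, 5.0, 6.0]            # hypothetical rewards
contaminated = [2.0, 3.0, 4.0, 5.0, 100.0]   # last reward replaced by an outlier

reference = signs(clean, mean(clean))                      # baseline 4.0
mean_signs = signs(contaminated, mean(contaminated))       # baseline 22.8
median_signs = signs(contaminated, median(contaminated))   # baseline 4.0
```

Under the contaminated mean, the rollout with reward 5.0 changes from a positive to a negative advantage, while the median baseline leaves every sign unchanged.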
3. MC-GRPO Algorithmic Structure
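MC-GRPO modifies GRPO by using the median of $G+1$ rollout rewards as the baseline and omitting the zero-advantage sample from the policy gradient:
- For each prompt $x$, generate $G+1$ rollouts $y_1, \dots, y_{G+1}$.
- Compute corresponding rewards $r_1, \dots, r_{G+1}$.
- Determine the median index $m$ so that $r_m = \operatorname{median}(r_1, \dots, r_{G+1})$.
- For all $i$, compute raw advantage $A_i = r_i - r_m$. Optionally, normalize by the robust scale (MAD), $\hat{A}_i = A_i / \mathrm{MAD}$, with $\mathrm{MAD} = \operatorname{median}_j |r_j - r_m|$.
- Exclude the median sample $y_m$ (where $A_m = 0$) from the gradient update.
- Policy gradient step:
$$\nabla_\theta J(\theta) \approx \frac{1}{G} \sum_{i \neq m} \hat{A}_i \, \nabla_\theta \log \pi_\theta(y_i \mid x)$$
where $\hat{A}_i = A_i$ if the optional normalization is skipped. This approach preserves the core computational cost per update, as only $G$ samples contribute to the backward pass, identical to standard $G$-rollout GRPO.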
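The steps above can be sketched as a single function that returns the advantages of the $G$ rollouts that remain after the median sample is dropped. This is an illustrative sketch under stated assumptions, not the author's implementation: the name `mc_grpo_advantages`, the `use_mad` flag, the `eps` guard, and the requirement that the group size be odd (so the median is an actual sample) are all choices made here.

```python
from statistics import median

def mc_grpo_advantages(rewards, use_mad=False, eps=1e-6):
    """Median-centered advantages for a group of G + 1 rollouts.

    Returns (kept_indices, advantages), each of length G, with the single
    median sample removed. Ties with the median keep a zero advantage.
    """
    n = len(rewards)
    if n % 2 == 0:
        # Assumption of this sketch: G + 1 is odd, so the median is a sample.
        raise ValueError("expected an odd number of rollouts (G + 1)")
    # Index of the median sample: the middle element after sorting by reward.
    order = sorted(range(n), key=lambda i: rewards[i])
    m = order[n // 2]
    raw = [r - rewards[m] for r in rewards]
    scale = 1.0
    if use_mad:
        # Median absolute deviation around the median reward; eps is assumed.
        scale = median([abs(a) for a in raw]) + eps
    kept = [i for i in range(n) if i != m]
    return kept, [raw[i] / scale for i in kept]

# Hypothetical rewards for G + 1 = 5 rollouts.
rewards = [0.1, 0.9, 0.4, 0.5, 0.7]
kept, adv = mc_grpo_advantages(rewards)
kept_mad, adv_mad = mc_grpo_advantages(rewards, use_mad=True)
```

Here the median reward is 0.5 (index 3), so four rollouts remain for the update. Note that with binary rewards the MAD can be zero, in which case the unnormalized form is the safer choice in this sketch.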
4. Theoretical Properties and Empirical Analysis
Empirical evaluation measures the “sign-flip rate,” defined as the proportion of rollouts whose $\operatorname{sign}(\hat{A}_i)$ under the estimated baseline disagrees with the “oracle” large-$G$ reference. In the small-$G$ settings evaluated, MC-GRPO reduces sign-flip rates by 50–80% relative to mean-baseline GRPO. Artificially injecting a proportion $p$ of random sign flips into $\hat{A}_i$ causes a linear degradation in final accuracy, confirming the detrimental impact of sign-flip errors.
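The sign-flip rate defined above is straightforward to compute once a reference baseline is available. The following sketch assumes the oracle is supplied as a single scalar baseline (for instance, one estimated from a much larger group) and counts a zero-versus-nonzero disagreement as a flip; both choices, along with the function names, are assumptions of this example.

```python
def sign(x):
    """Return -1, 0, or +1 according to the sign of x."""
    return (x > 0) - (x < 0)

def sign_flip_rate(rewards, baseline, oracle_baseline):
    """Fraction of rollouts whose advantage sign under `baseline`
    disagrees with the sign under `oracle_baseline`."""
    flips = sum(
        sign(r - baseline) != sign(r - oracle_baseline) for r in rewards
    )
    return flips / len(rewards)

rewards = [2.0, 3.0, 4.0, 5.0, 100.0]   # hypothetical group with one outlier
oracle = 4.0                             # hypothetical large-group reference
rate_mean = sign_flip_rate(rewards, sum(rewards) / len(rewards), oracle)
rate_median = sign_flip_rate(rewards, sorted(rewards)[len(rewards) // 2], oracle)
```

For this group, the mean baseline (22.8) disagrees with the reference on two of five rollouts, while the median baseline (4.0) disagrees on none.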
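For variance and bias, the mean baseline is unbiased with variance $\sigma_r^2 / G$, where $\sigma_r^2$ is the variance of the reward distribution. The median baseline incurs slight bias in small samples but admits asymptotic variance $1 / (4 G f(\tilde{m})^2)$, where $f(\tilde{m})$ is the reward density at the population median; this stays small for heavy-tailed reward distributions, resulting in improved robustness for small $G$.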
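As a worked comparison, the standard large-sample approximation for the variance of a sample median can be evaluated for two textbook reward distributions. The distributions are chosen here purely for illustration and are not taken from the source; the approximation is asymptotic, so it is only indicative at small $G$.

```latex
% Large-sample variance of each baseline for i.i.d. rewards with density f,
% population median \tilde{m}, and variance \sigma_r^2:
\operatorname{Var}(\text{mean}) = \frac{\sigma_r^2}{G}, \qquad
\operatorname{Var}(\text{median}) \approx \frac{1}{4 G f(\tilde{m})^2}

% Gaussian rewards: f(\tilde{m}) = 1 / (\sigma_r \sqrt{2\pi})
\operatorname{Var}(\text{median}) \approx \frac{\pi}{2} \cdot \frac{\sigma_r^2}{G}

% Laplace rewards with scale b: \sigma_r^2 = 2 b^2, \; f(\tilde{m}) = 1 / (2b)
\operatorname{Var}(\text{median}) \approx \frac{b^2}{G} = \frac{1}{2} \cdot \frac{\sigma_r^2}{G}
```

Under light (Gaussian) tails the median costs roughly a factor of $\pi/2$ in variance, whereas under the heavier Laplace tails it halves the variance of the mean, which is consistent with the robustness argument above.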
5. Practical Considerations and Hyperparameter Choices
MC-GRPO is a drop-in modification to existing GRPO implementations:
- Recommended rollout number ($G$): MC-GRPO is most beneficial at small rollout sizes (for example, $G = 2$ or $G = 4$ in the reported GSM8K results); improvements diminish by $G = 8$ as mean-baseline stability increases.
- Learning rate and optimizer: The same settings as the GRPO baseline are recommended (e.g., the same learning rate with AdamW).
- Median computation: Requires sorting $G+1$ scalar rewards per prompt; the computational cost is negligible for small $G$.
- Robust normalization: Optional use of MAD for further outlier resistance.
- Update semantics: Excluding the median sample ensures exactly $G$ samples contribute, with no change to KL penalty, reward clipping, or other loss terms.
6. Experimental Outcomes
Empirical validation includes training on GSM8K with Qwen3-1.7B and Llama-3.2-3B, and on Math-500 with Qwen variants. For Qwen3-1.7B on GSM8K, the measured accuracy demonstrates that MC-GRPO significantly closes the performance gap between small and large rollout settings:
| Rollouts $G$ | GRPO Accuracy (%) | MC-GRPO Accuracy (%) | $\Delta$ (points) |
|---|---|---|---|
| 2 | 78.9 | 83.5 | +4.6 |
| 4 | 81.3 | 84.0 | +2.7 |
| 8 | 84.5 | 84.6 | +0.1 |
The accuracy gap between $G = 2$ and $G = 8$ is reduced from 5.6 points under GRPO to 1.1 points under MC-GRPO. Similar results were observed for other tasks, model families, and GRPO variants such as DAPO and DR-GRPO, with negligible wall-clock overhead (the one additional rollout per prompt slightly increases generation time; gradient computation remains the dominant cost).
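The per-row improvements and the small-versus-large rollout gaps can be re-derived directly from the table. The snippet below only restates the table's Qwen3-1.7B GSM8K figures; the variable names are illustrative.

```python
# G -> (GRPO accuracy %, MC-GRPO accuracy %), copied from the table above.
results = {2: (78.9, 83.5), 4: (81.3, 84.0), 8: (84.5, 84.6)}

# Per-row improvement of MC-GRPO over GRPO, in percentage points.
deltas = {g: round(mc - grpo, 1) for g, (grpo, mc) in results.items()}

# Accuracy gap between G = 8 and G = 2 for each method.
gap_grpo = round(results[8][0] - results[2][0], 1)
gap_mc = round(results[8][1] - results[2][1], 1)
```

The rounding to one decimal place matches the precision of the reported accuracies and avoids floating-point noise in the differences.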
7. Significance and Integration
MC-GRPO provides a principled and practical means of attaining robust policy optimization in scenarios where rollout efficiency is paramount. The method directly targets and attenuates the sign-flip pathology endemic to small-group mean baselines, notably in reinforcement learning for LLM fine-tuning. The minimal algorithmic changes, computational efficiency, and systematic empirical improvements position MC-GRPO as a particularly attractive method for low-budget RL training regimes (Kim, 30 Jan 2026).