
Median-Centered GRPO (MC-GRPO)

Updated 1 June 2026
  • MC-GRPO is a reinforcement learning algorithm that uses a median baseline to reduce the high-variance errors common in small rollout scenarios.
  • The method centers advantages on the group median and excludes the zero-advantage median sample from policy updates, mitigating erroneous advantage sign flips and stabilizing gradient estimates.
  • Empirical results demonstrate that MC-GRPO narrows the accuracy gap between low and high rollout settings, improving overall model performance.

Median-Centered Group Relative Policy Optimization (MC-GRPO) is a reinforcement learning algorithm designed to improve the stability and sample efficiency of sequence-generation model training under small-rollout budgets. It is a robust modification of the Group Relative Policy Optimization (GRPO) framework, addressing the high-variance failure modes that arise when the number of rollouts per prompt is small. By replacing the conventional mean baseline with a median baseline, MC-GRPO reduces the prevalence of erroneous advantage sign flips, thereby improving both the stability and accuracy of policy gradients even in resource-constrained regimes (Kim, 30 Jan 2026).

1. Formal Problem Setup

Consider the standard GRPO setting for training LLMs via reinforcement learning. Let $\mathcal{Q}$ denote a batch of prompts. For each prompt $q \in \mathcal{Q}$, the policy $\pi_\theta(o \mid q)$, parametrized by $\theta$, generates $G$ independent rollouts (completions) $o_1, \ldots, o_G$. Each rollout $o_i$ defines a trajectory $\tau_i = (q, o_i)$ with scalar return $R_i = R(q, o_i)$. The core GRPO method computes within-prompt group statistics, namely the sample mean $\bar{R}$ and standard deviation $\sigma_R$:

$$\bar{R} = \frac{1}{G}\sum_{i=1}^{G} R_i, \qquad \sigma_R = \sqrt{\frac{1}{G}\sum_{i=1}^{G}\left(R_i - \bar{R}\right)^2}$$

The estimated advantage for each rollout is computed as $\hat{A}_i = (R_i - \bar{R})/\sigma_R$, and the policy gradient step is

$$\nabla_\theta J(\theta) = \mathbb{E}_{q \in \mathcal{Q}}\left[\frac{1}{G}\sum_{i=1}^{G} \hat{A}_i \, \nabla_\theta \log \pi_\theta(o_i \mid q)\right]$$

This baseline normalization is sensitive to outlier rewards, especially when $G$ is small.
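The group statistics and normalized advantage described above can be sketched in a few lines. This is a minimal illustration rather than a reference implementation: the function name `grpo_advantages`, the use of the population standard deviation (dividing by $G$), and the small `eps` added for numerical safety are assumptions of this sketch.

```python
import statistics

def grpo_advantages(rewards, eps=1e-8):
    """Mean-centered, std-normalized advantages for one prompt's rewards."""
    mean = statistics.fmean(rewards)
    # Population standard deviation (divide by G); an assumption of this sketch.
    std = statistics.pstdev(rewards)
    # eps guards against division by zero when all rewards are equal.
    return [(r - mean) / (std + eps) for r in rewards]

# Four rollouts with binary rewards: mean 0.5, std 0.5.
advantages = grpo_advantages([1.0, 0.0, 0.0, 1.0])
print(advantages)  # approximately [1, -1, -1, 1]
```

With only a handful of rewards per prompt, both `mean` and `std` move noticeably when a single reward changes, which is the sensitivity the section points to.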

2. Baseline Instabilities: Mean vs. Median

The traditional GRPO mean baseline, $\bar{R}$, is vulnerable under small $G$. A single outlier reward $R_j$ can significantly skew the mean, resulting in “advantage sign flips,” where the estimated advantage $\hat{A}_i$ for a rollout switches sign relative to the true (oracle) advantage. The resulting policy updates are noisy or even adversarial, because the gradient direction for the affected rollouts is reversed. Formally, if an outlier shifts $R_j$ by $\Delta$, the contaminated mean $\bar{R}'$ shifts by $\Delta/G$, so for every other rollout $i \neq j$

$$R_i - \bar{R}' = (R_i - \bar{R}) - \frac{\Delta}{G}$$

which may invert the sign when $|\Delta|/G$ exceeds $|R_i - \bar{R}|$. In contrast, the group median as a baseline, $\operatorname{median}(R_1, \ldots, R_G)$, is substantially less sensitive to single outliers: roughly half the group must be contaminated before the median can be shifted arbitrarily far. Thus, $\operatorname{sign}(\hat{A}_i)$ is more stable.
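A small worked example makes the contrast concrete. The rewards and the "oracle" baseline of 0.45 below are hypothetical values chosen for illustration, and `count_sign_flips` is a helper name introduced here, not something taken from the source.

```python
import statistics

def sign(x):
    return (x > 0) - (x < 0)

def count_sign_flips(rewards, baseline, oracle_baseline):
    """Count rollouts whose advantage sign is opposite to the oracle's."""
    return sum(
        sign(r - baseline) * sign(r - oracle_baseline) < 0
        for r in rewards
    )

rewards = [0.2, 0.4, 0.5, 0.6, 5.0]  # the last reward is an outlier
oracle_baseline = 0.45               # hypothetical "true" expected reward

mean_flips = count_sign_flips(rewards, statistics.fmean(rewards), oracle_baseline)
median_flips = count_sign_flips(rewards, statistics.median(rewards), oracle_baseline)
print(mean_flips, median_flips)  # 2 0
```

The outlier drags the mean to 1.34, so the rollouts with rewards 0.5 and 0.6 receive negative advantages even though they sit above the oracle baseline. The median stays at 0.5 and produces no sign disagreement in this example.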

3. MC-GRPO Algorithmic Structure

MC-GRPO modifies GRPO by using the median of $G+1$ rollout rewards as the baseline and omitting the zero-advantage sample from the policy gradient:

  1. For each prompt $q$, generate $G+1$ rollouts $o_1, \ldots, o_{G+1}$.
  2. Compute corresponding rewards $R_1, \ldots, R_{G+1}$.
  3. Determine the median index $m$ so that $R_m = \operatorname{median}(R_1, \ldots, R_{G+1})$.
  4. For all $i$, compute raw advantage $A_i = R_i - R_m$. Optionally, normalize by the robust scale (MAD), $s = \operatorname{median}_i |R_i - R_m|$, with $\hat{A}_i = A_i / (s + \epsilon)$ for a small constant $\epsilon > 0$; without normalization, $\hat{A}_i = A_i$.
  5. Exclude the median sample $o_m$ (where $A_m = 0$) from the gradient update.
  6. Policy gradient step:

$$\nabla_\theta J(\theta) = \mathbb{E}_{q \in \mathcal{Q}}\left[\frac{1}{G}\sum_{i \neq m} \hat{A}_i \, \nabla_\theta \log \pi_\theta(o_i \mid q)\right]$$

This approach preserves the core computational cost per update, as only $G$ samples contribute to the backward pass, identical to standard $G$-rollout GRPO.
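The advantage computation in the steps above might look like the following sketch. The function name `mc_grpo_advantages`, the `eps` default, and the requirement that the group size be odd (so that the median is an actual sample) are assumptions made for this illustration; the source does not specify them.

```python
import statistics

def mc_grpo_advantages(rewards, use_mad=True, eps=1e-6):
    """Return (advantages, kept_indices) for one prompt's G+1 rewards."""
    n = len(rewards)
    if n % 2 == 0:
        raise ValueError("expected an odd group size, i.e. G + 1 with even G")
    # Index of the median sample: middle element of the reward-sorted order.
    order = sorted(range(n), key=lambda i: rewards[i])
    m = order[n // 2]
    raw = [r - rewards[m] for r in rewards]
    if use_mad:
        # Median absolute deviation around the median reward.
        mad = statistics.median([abs(a) for a in raw])
        advantages = [a / (mad + eps) for a in raw]
    else:
        advantages = raw
    kept = [i for i in range(n) if i != m]  # the median sample is dropped
    return advantages, kept

rewards = [0.2, 0.9, 0.5, 0.4, 5.0]  # G = 4, so G + 1 = 5 rollouts
advantages, kept = mc_grpo_advantages(rewards)
print(kept)  # [0, 1, 3, 4]
```

In this sketch, when many rewards tie (for example with binary rewards) the MAD can be zero and the scale collapses to `eps`, so passing `use_mad=False` keeps the raw median-centered advantages instead.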

4. Theoretical Properties and Empirical Analysis

Empirical evaluation measures the “sign-flip rate,” defined as the proportion of rollouts whose $\operatorname{sign}(\hat{A}_i)$ under the estimated baseline disagrees with the “oracle” large-$G$ reference. At small rollout counts, MC-GRPO reduces sign-flip rates by 50–80% relative to mean-baseline GRPO. Artificially injecting a proportion $p$ of random sign flips into $\hat{A}_i$ causes a linear degradation in final accuracy, confirming the detrimental impact of sign-flip errors.
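The sign-flip metric and the noise-injection probe described above can be sketched as follows. Flipping each advantage independently with probability `p` is an assumption of this sketch (the source only states that a proportion of signs is flipped), and both function names are hypothetical.

```python
import random

def inject_sign_flips(advantages, p, rng):
    """Flip the sign of each advantage independently with probability p."""
    return [-a if rng.random() < p else a for a in advantages]

def sign_flip_rate(estimated, reference):
    """Fraction of entries whose sign is opposite to the reference's."""
    flips = sum(e * r < 0 for e, r in zip(estimated, reference))
    return flips / len(reference)

rng = random.Random(0)
reference = [1.0, -1.0] * 500  # stand-in for oracle advantages
noisy = inject_sign_flips(reference, p=0.2, rng=rng)
rate = sign_flip_rate(noisy, reference)
print(rate)  # close to the injected proportion of 0.2
```

In a real experiment the reference would come from a large-$G$ estimate of the baseline for the same prompt rather than from synthetic values.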

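For variance and bias, the mean baseline is unbiased with variance $\sigma^2/G$, where $\sigma^2$ is the per-rollout reward variance. The median baseline incurs slight bias in small samples, but its variance stays controlled for heavy-tailed reward distributions (asymptotically $1/(4 G f(\mu_{1/2})^2)$ for a continuous reward density $f$ with population median $\mu_{1/2}$, by the standard result for sample medians), resulting in improved robustness for small $G$.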
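A quick Monte Carlo check illustrates the variance argument under an extreme heavy-tailed case. Drawing rewards from a standard Cauchy distribution is purely an illustrative assumption (real reward distributions are task-specific), and the group size and trial count are arbitrary choices for this sketch.

```python
import math
import random
import statistics

def cauchy(rng):
    """Standard Cauchy draw via the inverse-CDF transform."""
    return math.tan(math.pi * (rng.random() - 0.5))

def baseline_variances(group_size, trials, rng):
    """Empirical variance of the mean and median baselines across trials."""
    means, medians = [], []
    for _ in range(trials):
        rewards = [cauchy(rng) for _ in range(group_size)]
        means.append(statistics.fmean(rewards))
        medians.append(statistics.median(rewards))
    return statistics.pvariance(means), statistics.pvariance(medians)

var_mean, var_median = baseline_variances(
    group_size=5, trials=5000, rng=random.Random(0)
)
print(var_median < var_mean)  # True
```

The mean of Cauchy samples is itself Cauchy-distributed, so its empirical variance is dominated by a few extreme draws, while the median of five samples has finite variance.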

5. Practical Considerations and Hyperparameter Choices

MC-GRPO is a drop-in modification to existing GRPO implementations:

  • Recommended rollout number ($G$): MC-GRPO is most beneficial at small rollout sizes (in the GSM8K results below, the gains are +4.6 and +2.7 points at 2 and 4 rollouts); improvements diminish at larger sizes (+0.1 at 8 rollouts) as mean baseline stability increases.
  • Learning rate and optimizer: Use the same settings as the existing GRPO run (e.g., the same learning rate with AdamW).
  • Median computation: Sorting the $G+1$ scalar rewards per prompt; the computational cost is negligible at these group sizes.
  • Robust normalization: Optional use of MAD for further outlier resistance.
  • Update semantics: Excluding the median sample ensures exactly $G$ samples contribute, with no change to KL penalty, reward clipping, or other loss terms.
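The update semantics in the last bullet can be illustrated with a plain-Python surrogate loss. This sketch covers only the policy-gradient term (no KL penalty or clipping), treats per-rollout log-probabilities as given numbers, and uses a hypothetical function name.

```python
def masked_policy_loss(log_probs, advantages, median_index):
    """Surrogate loss -(1/G) * sum over i != m of A_i * log pi(o_i | q)."""
    G = len(advantages) - 1  # G + 1 rollouts were generated
    total = sum(
        a * lp
        for i, (a, lp) in enumerate(zip(advantages, log_probs))
        if i != median_index
    )
    return -total / G

log_probs = [-2.0, -0.5, -1.0]   # hypothetical sequence log-probabilities
advantages = [-1.0, 1.0, 0.0]    # median sample (index 2) has zero advantage
loss = masked_policy_loss(log_probs, advantages, median_index=2)
print(loss)  # -0.75
```

Because the median sample's advantage is zero, skipping it leaves the value of this term unchanged; the practical effect in a training framework is that no forward or backward pass is spent on that rollout.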

6. Experimental Outcomes

Empirical validation includes training on GSM8K with Qwen3-1.7B and Llama-3.2-3B, and on Math-500 with Qwen variants. For Qwen3-1.7B on GSM8K, the measured accuracy demonstrates that MC-GRPO significantly closes the performance gap between small and large rollout settings:

Rollouts ($G$)   GRPO Accuracy (%)   MC-GRPO Accuracy (%)   $\Delta$
2                78.9                83.5                   +4.6
4                81.3                84.0                   +2.7
8                84.5                84.6                   +0.1

The accuracy gap between $G = 2$ and $G = 8$ is reduced from 5.6 points under GRPO (84.5 vs. 78.9) to 1.1 points under MC-GRPO (84.6 vs. 83.5). Similar results were observed for other tasks and model families, and for GRPO-family variants such as DAPO and DR-GRPO, with negligible wall-clock overhead (the one additional rollout per prompt adds only to generation time; gradient computation remains the dominant cost).
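The per-row improvements and the small-versus-large rollout gap can be recomputed directly from the table. The snippet assumes the gap in question is between the 2-rollout and 8-rollout rows.

```python
# rollouts -> (GRPO accuracy %, MC-GRPO accuracy %), copied from the table
accuracy = {2: (78.9, 83.5), 4: (81.3, 84.0), 8: (84.5, 84.6)}

deltas = {g: round(mc - base, 1) for g, (base, mc) in accuracy.items()}
grpo_gap = round(accuracy[8][0] - accuracy[2][0], 1)
mc_gap = round(accuracy[8][1] - accuracy[2][1], 1)

print(deltas)            # {2: 4.6, 4: 2.7, 8: 0.1}
print(grpo_gap, mc_gap)  # 5.6 1.1
```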

7. Significance and Integration

MC-GRPO provides a principled and practical means of attaining robust policy optimization in scenarios where rollout efficiency is paramount. The method directly targets and attenuates the sign-flip pathology endemic to small-group mean baselines, notably in reinforcement learning for LLM fine-tuning. The minimal algorithmic changes, computational efficiency, and systematic empirical improvements position MC-GRPO as a particularly attractive method for low-budget RL training regimes (Kim, 30 Jan 2026).
