
Geometric-Mean Policy Optimization (GMPO)

Updated 29 July 2025
  • Geometric-Mean Policy Optimization (GMPO) is a reinforcement learning method that stabilizes policy updates by using the geometric mean of reward- or advantage-weighted ratios.
  • It mitigates high variance and outlier effects by clipping token-level importance ratios and promoting controlled, robust updates in complex LLM tasks.
  • Empirical evaluations show GMPO delivers improved performance on math and multimodal benchmarks, with smoother reward trajectories and sustained token entropy during training.

Geometric-Mean Policy Optimization (GMPO) is a reinforcement learning algorithmic principle and family of methods that perform policy improvement by maximizing the geometric mean of reward- or advantage-weighted policy ratios, rather than the more traditional arithmetic mean. This design reduces sensitivity to outlier importance sampling ratios and leads to substantially improved update stability, particularly in LLM reinforcement learning and other settings where unstable policy updates can degrade performance. GMPO has been theoretically and empirically validated, notably in the context of fine-tuning LLMs for complex reasoning and multimodal tasks, and is positioned as a stabilized alternative to arithmetic-mean-based objectives such as Group Relative Policy Optimization (GRPO) and related policy optimization algorithms (Zhao et al., 28 Jul 2025).

1. Motivation and Theoretical Foundation

GMPO arises from the observation that optimizing the arithmetic mean of token-level (or step-level) reward-weighted policy ratios, as in GRPO and related RL algorithms, exposes policy updates to high variance and instability. This instability is driven by outlier importance sampling ratios, i.e., tokens or actions for which the ratio between the current policy $\pi_\theta$ and the reference or old policy $\pi_{\text{old}}$ becomes exceedingly large or small. Such outliers can trigger excessively aggressive policy updates or, in some cases, reward collapse.

To address this, GMPO formulates the objective over the geometric mean, which aggregates token-level (or step-level) contributions multiplicatively and then applies a $1/n$-th root ($n$ = sequence length, i.e., number of tokens), effectively dampening the influence of any single extreme value. The GMPO objective, in the canonical LLM RL case, is

$$J_{\text{GMPO}}(\pi_\theta) = \mathbb{E}_{q,\{o_i\}}\left[ \frac{1}{G} \sum_{i=1}^{G} \left\{ \prod_{t=1}^{|o_i|} \left( \frac{\pi_\theta(o_{i,t} \mid q, o_{i,<t})}{\pi_{\text{old}}(o_{i,t} \mid q, o_{i,<t})} \, \bigl|\hat{A}_i\bigr| \right) \right\}^{\frac{1}{|o_i|} \operatorname{sgn}(\hat{A}_i)} \right],$$

where $\hat{A}_i$ is a per-sequence advantage estimate and $\operatorname{sgn}(\cdot)$ ensures correct handling of positive and negative advantage values. This multiplicative, root-averaging structure (the geometric mean) narrows the overall value range and reduces update variance, leading to a more robust and predictable optimization trajectory (Zhao et al., 28 Jul 2025).
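
The following is a minimal sketch, assuming PyTorch, of how this sequence-level geometric-mean objective could be computed from per-token log-probabilities; the function name, tensor shapes, and the small numerical safeguards are illustrative assumptions, not the reference implementation.

```python
import torch

def gmpo_sequence_objective(logp_new, logp_old, advantages, mask):
    """Hypothetical sketch of the geometric-mean objective above.

    logp_new, logp_old: (G, T) token log-probabilities under pi_theta and pi_old.
    advantages: (G,) per-sequence advantage estimates A_i.
    mask: (G, T) with 1.0 on response tokens and 0.0 on padding.
    """
    # Per-token log importance ratios; padded positions contribute nothing.
    log_ratio = (logp_new - logp_old) * mask
    seq_len = mask.sum(dim=-1).clamp(min=1.0)              # |o_i|
    sign = torch.sign(advantages)                          # sgn(A_i)
    log_abs_adv = advantages.abs().clamp(min=1e-8).log()   # log |A_i| (guarded)

    # log of prod_t (ratio_t * |A_i|) = sum_t log ratio_t + |o_i| * log |A_i|
    log_prod = log_ratio.sum(dim=-1) + seq_len * log_abs_adv
    # Geometric mean via the signed 1/|o_i| exponent, then average over the group.
    return torch.exp(sign / seq_len * log_prod).mean()
```

Working in log-space keeps the long token-level product numerically stable; the exponential is taken only after the signed $1/|o_i|$ exponent has been applied.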

2. Mechanisms for Stability: Importance Ratios and Clipping

Central to PPO- and GRPO-like RL methods is the use of importance sampling ratios between $\pi_\theta$ and $\pi_{\text{old}}$. In the arithmetic mean formulation, these ratios are summed (or averaged) across tokens. When any ratio is extreme due to stochasticity or rare events in sequence modeling, it dominates the mean and destabilizes learning.

GMPO employs the geometric mean to aggregate these ratios, dramatically reducing the susceptibility to such outliers. Additionally, GMPO incorporates token-level clipping, where, for thresholds $[\epsilon_1, \epsilon_2]$, the per-token importance ratio is clipped within this interval. The geometric mean structure ensures that the effect of the clipping is global: any token exceeding the thresholds has its impact mitigated multiplicatively for the entire sequence. This property further stabilizes the magnitude and variance of policy updates.
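
As a small illustration, and assuming the clipping is applied to per-token log-ratios before the geometric aggregation, the operation might look like the following; the threshold values here are placeholders, not the ones used in the paper.

```python
import math
import torch

def clip_token_log_ratios(log_ratio, eps_low, eps_high):
    """Clamp each token's importance ratio to [eps_low, eps_high].

    Operating on log-ratios keeps the subsequent product and root
    (the geometric mean) numerically stable.
    """
    return torch.clamp(log_ratio, min=math.log(eps_low), max=math.log(eps_high))

# Example: a single extreme ratio (5.0) is pulled back to the upper threshold.
log_ratios = torch.log(torch.tensor([0.9, 1.1, 5.0]))
print(clip_token_log_ratios(log_ratios, eps_low=0.5, eps_high=2.0).exp())
# -> approximately [0.9, 1.1, 2.0]
```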

A key effect of the geometric mean is to allow the use of wider clipping thresholds without incurring instability; in GMPO, greater exploration via higher token entropy during training can be encouraged with minimal risk of policy collapse (Zhao et al., 28 Jul 2025).

3. Comparative Analysis: GMPO vs. Arithmetic-Mean Methods

GMPO is positioned as a stabilized alternative to methods such as GRPO and PPO, both of which optimize arithmetic mean-based objectives. The arithmetic mean is particularly sensitive to high-variance, heavy-tailed distributional patterns in token-level rewards, as occur in non-i.i.d. or highly structured outputs (e.g., LLM reasoning chains). As a result, such methods require tight clipping and additional heuristics to maintain stable learning.

The geometric mean, by construction, is less affected by the magnitude of any single term, and so naturally constrains variance propagation through update steps. Analytical and empirical studies demonstrate that GMPO produces:

  • Lower reward variance in the policy update objective.
  • More stable KL divergence between the trained policy and the pre-trained (reference) model.
  • Higher token-level entropy, reflecting improved exploration without collapse.
  • Superior accuracy on challenging language and multimodal reasoning benchmarks, with reported Pass@1 accuracy improvements of 4.1% (math tasks) and 1.4% (multimodal tasks) relative to GRPO (Zhao et al., 28 Jul 2025).
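
To make the contrast concrete, here is a toy comparison with illustrative numbers only: a handful of well-behaved token ratios plus a single outlier shift the arithmetic mean dramatically, while the geometric mean barely moves.

```python
# Illustrative token-level importance ratios: five near 1.0 plus one outlier.
ratios = [0.9, 1.1, 1.0, 0.95, 1.05, 20.0]

arithmetic_mean = sum(ratios) / len(ratios)     # ~4.17, dominated by the outlier

geometric_mean = 1.0
for r in ratios:
    geometric_mean *= r
geometric_mean **= 1.0 / len(ratios)            # ~1.64, only mildly affected

print(arithmetic_mean, geometric_mean)
```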

4. Empirical Performance and Evaluation

GMPO has been evaluated on a suite of mathematical and multimodal reasoning benchmarks, including AIME24, AMC, MATH500, OlympiadBench, Minerva, and Geometry3K. Across these diverse datasets, GMPO-7B showed a consistent advantage in Pass@1 accuracy relative to arithmetic-mean-based GRPO.

Empirical indicators during training, such as the reward trajectory, KL divergence from a base model, and token entropy, were monitored. GMPO exhibited stable and smooth reward curves, slower accumulation of KL divergence (indicating less overfitting to idiosyncratic reward spikes), and higher, sustained entropy throughout RL training. These characteristics are directly attributable to the dampening effect of geometric averaging (Zhao et al., 28 Jul 2025).
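
The paper's exact diagnostics are not reproduced here; as an assumption-laden sketch, the KL-to-reference and token-entropy indicators could be estimated from sampled-token log-probabilities with simple Monte Carlo estimators like the following.

```python
import torch

def training_indicators(logp_policy, logp_ref, mask):
    """Hypothetical helper: rough Monte Carlo estimates of two indicators,
    KL(pi_theta || pi_ref) and token entropy, computed only from log-probs
    of the sampled response tokens.
    """
    n = mask.sum().clamp(min=1.0)
    kl_estimate = ((logp_policy - logp_ref) * mask).sum() / n   # E[log pi - log pi_ref]
    entropy_estimate = -(logp_policy * mask).sum() / n          # E[-log pi] over samples
    return kl_estimate, entropy_estimate
```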

5. Practical Deployment and Code Availability

The practical implementation of GMPO involves the following steps, sketched in code after the list:

  • Computing token-level log-probabilities for both $\pi_\theta$ and $\pi_{\text{old}}$ on the training data.
  • Clipping per-token importance ratios within pre-specified thresholds.
  • Aggregating the clipped ratios using sequence-wise geometric means (with the appropriate advantage sign and magnitude scaling).
  • Backpropagating the resulting objective through the model.
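
A minimal end-to-end sketch of these steps, assuming PyTorch and a Hugging Face-style causal LM whose forward pass returns `.logits`; the function names, clipping thresholds, and masking convention are illustrative assumptions, and the official repository linked below should be treated as the reference implementation.

```python
import math
import torch
import torch.nn.functional as F

def token_log_probs(model, input_ids, attention_mask):
    """Per-token log-probs of the observed next tokens under a causal LM."""
    logits = model(input_ids=input_ids, attention_mask=attention_mask).logits
    log_probs = F.log_softmax(logits[:, :-1, :], dim=-1)
    targets = input_ids[:, 1:].unsqueeze(-1)
    return log_probs.gather(dim=-1, index=targets).squeeze(-1)   # (B, T-1)

def gmpo_step(policy, old_policy, optimizer, input_ids, attention_mask,
              response_mask, advantages, eps_low=0.5, eps_high=2.0):
    """One GMPO update following the steps listed above (illustrative values)."""
    # Step 1: token-level log-probabilities under pi_theta and pi_old.
    logp_new = token_log_probs(policy, input_ids, attention_mask)
    with torch.no_grad():
        logp_old = token_log_probs(old_policy, input_ids, attention_mask)

    # Step 2: clip per-token importance ratios (applied to log-ratios).
    log_ratio = torch.clamp(logp_new - logp_old,
                            min=math.log(eps_low), max=math.log(eps_high))
    log_ratio = log_ratio * response_mask      # response_mask aligned to (B, T-1)

    # Step 3: sequence-wise geometric mean with advantage sign/magnitude scaling.
    seq_len = response_mask.sum(dim=-1).clamp(min=1.0)
    log_prod = log_ratio.sum(dim=-1) + seq_len * advantages.abs().clamp(min=1e-8).log()
    objective = torch.exp(torch.sign(advantages) / seq_len * log_prod).mean()

    # Step 4: backpropagate the negated objective and update the policy.
    loss = -objective
    optimizer.zero_grad()
    loss.backward()
    optimizer.step()
    return loss.item()
```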

The official codebase for GMPO is made available at https://github.com/callsys/GMPO, providing practical resources, pseudo-code, and reference implementations for training and evaluating on diverse RL and LLM fine-tuning benchmarks (Zhao et al., 28 Jul 2025).

6. Broader Implications and Theoretical Context

The adoption of geometric-mean based objectives fits within a larger trend of leveraging geometry and distributional structure for robust policy optimization. By focusing on the geometric rather than arithmetic aggregation of policy statistics, GMPO shares conceptual ground with mirror descent and information geometric policy gradients, though its primary stabilization benefit in high-variance, sequence-level RL is empirically established.

Parallel advances—such as using Fisher-Rao metrics in policy update constraints, or geometry-aware variance reduction—identify similar advantages of robust geometric averaging in RL (Lascu et al., 4 Jun 2025). GMPO’s principles may plausibly be adapted or extended to other reinforcement learning modalities beyond LLMs, particularly wherever per-decision or per-token ratios govern policy stability and the system is vulnerable to outlier-driven reward dynamics.

7. Significance in the Development of Sequence-Level RL and LLMs

GMPO directly addresses a central challenge in policy optimization for large, compositional spaces: controlling instability arising from rare or extreme local observations. By ensuring stable and moderate updates, GMPO enables more reliable and effective reinforcement learning for LLMs and other models whose policies involve high-dimensional, structured output distributions. The method’s ability to promote exploration while constraining policy deviation—reflected in more stable KL and entropy dynamics—has been empirically validated as necessary for success in advanced mathematical and multimodal reasoning tasks, critical for the ongoing scaling of LLM reasoning ability (Zhao et al., 28 Jul 2025).


In summary, Geometric-Mean Policy Optimization represents a principled, empirically validated, and practical method for stabilizing policy improvement in reinforcement learning by leveraging geometric, rather than arithmetic, aggregation of importance-weighted updates. Its theoretical and experimental advantages make it a compelling choice for reinforcement learning with LLMs and other structured prediction domains where sensitivity to outliers can be problematic.

References (2)