Power-Mean Policy Optimization
- PMPO is a generalized policy optimization framework that unifies geometric and arithmetic mean methods through a dynamic power-mean exponent.
- It leverages a clip-aware effective sample size matching rule to balance aggressive and conservative updates based on trajectory reliability.
- Empirical benchmarks in mathematical reasoning tasks demonstrate that PMPO achieves faster convergence and improved performance over fixed-parameter methods.
Power-Mean Policy Optimization (PMPO) is a generalized policy optimization framework for group-based reinforcement learning (RL) in LLM reasoning tasks. PMPO unifies the geometric mean (GMPO) and arithmetic mean (GRPO) approaches under a single parameterized method, allowing dynamic adaptation of aggregation geometry through the power-mean exponent $p$. This flexible geometry enables effective interpolation between aggressive and conservative update regimes on a per-trajectory basis, adapting to the heterogeneous and evolving stability of generated trajectories. PMPO leverages an adaptive, scale-invariant effective sample size (ESS) matching rule driven by observed clipping behavior in PPO-style training, resulting in empirically superior performance on mathematical reasoning RLHF benchmarks (Zhao et al., 30 Jan 2026).
1. The Power-Mean Operator and Parameterization
The power mean, or generalized mean, of positive scalars $x_1, \dots, x_n$ with order $p \neq 0$ is defined by:

$$M_p(x_1, \dots, x_n) = \left( \frac{1}{n} \sum_{i=1}^{n} x_i^{p} \right)^{1/p}$$

Two canonical special cases are:
- Arithmetic mean: $p = 1$, $M_1 = \frac{1}{n} \sum_{i=1}^{n} x_i$ (as in GRPO)
- Geometric mean: $p \to 0$, $M_0 = \left( \prod_{i=1}^{n} x_i \right)^{1/n}$ (as in GMPO)

As $p$ increases, $M_p$ emphasizes the largest $x_i$ more heavily, producing more aggressive updates. As $p$ decreases toward $0$, the mean approaches the geometric mean, yielding conservative aggregation. This one-parameter family provides a continuum of aggregation geometries, subsuming existing approaches.
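The limiting behavior of the power mean can be sketched in a few lines of Python (the function name `power_mean` and the cutoff for switching to the geometric-mean limit are ad-hoc choices, not from the paper):

```python
import math

def power_mean(xs, p, eps=1e-8):
    """Generalized (power) mean M_p of positive scalars xs.

    p = 1 recovers the arithmetic mean; as p -> 0 the value
    approaches the geometric mean, returned explicitly below.
    """
    n = len(xs)
    if abs(p) < eps:  # geometric-mean limit
        return math.exp(sum(math.log(x) for x in xs) / n)
    return (sum(x ** p for x in xs) / n) ** (1.0 / p)
```

For example, `power_mean([1, 2, 4], 1)` gives the arithmetic mean 7/3, `power_mean([1, 2, 4], 0)` gives the geometric mean 2, and the value increases monotonically in `p`.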
2. PMPO Objective and Gradient Computation
PMPO operates in a group-based bandit RL setting applied to language modeling for mathematical reasoning. Given a prompt $q$, the previous policy $\pi_{\theta_{\text{old}}}$ is used to sample $G$ independent full reasoning trajectories $\{o_i\}_{i=1}^{G}$, each resulting in a scalar reward $R_i$. The trajectory-level advantage is set as the group-normalized reward:

$$A_i = \frac{R_i - \operatorname{mean}\left( \{R_j\}_{j=1}^{G} \right)}{\operatorname{std}\left( \{R_j\}_{j=1}^{G} \right)}$$
For each token position $t$ in the generated response $o_i$ of trajectory $i$, define the log-probability difference (log-diff):

$$\ell_{i,t} = \log \pi_{\theta}(o_{i,t} \mid q, o_{i,<t}) - \log \pi_{\theta_{\text{old}}}(o_{i,t} \mid q, o_{i,<t})$$

The per-token importance ratio is $r_{i,t} = \exp(\ell_{i,t})$.
For each trajectory, the effective importance ratio is aggregated via the power mean over its $|o_i|$ tokens:

$$\bar{r}_i(p_i) = \left( \frac{1}{|o_i|} \sum_{t=1}^{|o_i|} r_{i,t}^{\,p_i} \right)^{1/p_i}$$

The surrogate objective is:

$$\mathcal{J}(\theta) = \mathbb{E} \left[ \frac{1}{G} \sum_{i=1}^{G} \bar{r}_i(p_i)\, A_i \right]$$
PPO-style log-domain clipping is applied: each log-diff $\ell_{i,t}$ is clipped with threshold $\epsilon$ to produce $\tilde{\ell}_{i,t} = \operatorname{clip}(\ell_{i,t}, -\epsilon, \epsilon)$, after which the power mean is re-computed using the clipped ratios $\tilde{r}_{i,t} = \exp(\tilde{\ell}_{i,t})$.
The gradient of $\bar{r}_i(p)$ w.r.t. $\theta$ gives a weighted token-level policy gradient with softmax weights $w_{i,t} \propto \exp(p\, \ell_{i,t})$, where $p$ acts as the inverse temperature. As $p \to 0$, the weighting approaches uniform (conservative); as $p \to 1$, it sharpens toward the arithmetic-mean weighting (aggressive). For $p > 1$, the update becomes even more peaked.
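The softmax-weight view of the gradient can be illustrated directly (a minimal sketch; `token_weights` is a hypothetical helper, and subtracting the maximum before exponentiating is a standard numerical-stability choice, not a detail from the paper):

```python
import math

def token_weights(log_diffs, p):
    """Softmax weights over token log-diffs with inverse temperature p.

    w_t is proportional to exp(p * l_t), the per-token contribution
    to the gradient of the power-mean aggregated ratio.
    """
    m = max(p * l for l in log_diffs)  # subtract max for stability
    exps = [math.exp(p * l - m) for l in log_diffs]
    z = sum(exps)
    return [e / z for e in exps]
```

At `p = 0` the weights are exactly uniform; as `p` grows, mass concentrates on the tokens with the largest log-diffs.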
3. Adaptive Control of Power-Mean via Clip-Aware ESS
PMPO introduces a dynamic scheme for setting the aggregation order $p_i$ on a per-trajectory basis, reflecting the local reliability and heterogeneity of trajectory gradients. The procedure is as follows:
- Clipping Fraction: Compute the fraction $c_i$ of tokens in trajectory $i$ whose log-diff exceeds the clipping threshold, i.e. $|\ell_{i,t}| > \epsilon$.
- Target ESS: Map the observed clipping fraction $c_i$ (in $[0, 1]$) to a target normalized effective sample size (ESS) via a fixed monotone map: heavier clipping is assigned a higher target ESS, i.e. more conservative, near-uniform weighting.
- Order Parameter Selection ($p_i$): Solve for $p_i$ such that the normalized ESS under the softmax weights $w_{i,t} \propto \exp(p_i\, \ell_{i,t})$, given by $\mathrm{ESS}(p_i) = \left( \sum_t w_{i,t} \right)^2 / \left( |o_i| \sum_t w_{i,t}^2 \right)$, matches the target. Since the normalized ESS is strictly decreasing in $p_i$, this is performed efficiently by monotone bisection.

This ESS-matching mechanism provably interpolates between the extremes: $p_i \to 0$ (geometric mean, when clipping is pervasive) and $p_i \to 1$ (arithmetic mean, when clipping is negligible), allowing the algorithm to adaptively balance update aggression and conservatism in response to the trust-region saturation signaled by clipping (Zhao et al., 30 Jan 2026).
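The ESS computation and the monotone bisection can be sketched as follows (a simplified illustration: the bracket `[0, p_max]`, the iteration count, and the clamping behavior at unreachable targets are assumptions, not details from the paper):

```python
import math

def normalized_ess(log_diffs, p):
    """Normalized effective sample size of softmax weights exp(p * l_t)."""
    m = max(p * l for l in log_diffs)            # stabilize exponentials
    w = [math.exp(p * l - m) for l in log_diffs]
    z = sum(w)
    return z * z / (len(w) * sum(x * x for x in w))

def solve_p(log_diffs, target_ess, p_max=2.0, iters=60):
    """Find p in [0, p_max] with normalized_ess(p) ~= target_ess.

    The normalized ESS decreases strictly in p, so plain bisection
    works; unreachable targets clamp to the interval endpoints.
    """
    if normalized_ess(log_diffs, p_max) > target_ess:
        return p_max
    lo, hi = 0.0, p_max
    for _ in range(iters):
        mid = 0.5 * (lo + hi)
        if normalized_ess(log_diffs, mid) > target_ess:
            lo = mid
        else:
            hi = mid
    return 0.5 * (lo + hi)
```

At `p = 0` the weights are uniform and the normalized ESS equals 1, so any target below 1 has a unique preimage within the bracket.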
4. Training Workflow and Implementation
The training procedure for PMPO in each minibatch consists of:
- Sampling $G$ trajectories under $\pi_{\theta_{\text{old}}}$, computing their rewards $R_i$ and group-normalized advantages $A_i$.
- For each trajectory:
- Compute per-token log-diffs and apply log-domain clipping.
- Determine clipping fraction and target ESS.
- Numerically solve for $p_i$ to match the target ESS.
- Aggregate per-token importance ratios via the power mean at order $p_i$.
- Form the minibatch surrogate loss:

$$\mathcal{L}(\theta) = -\frac{1}{G} \sum_{i=1}^{G} \bar{r}_i^{\,\mathrm{clip}}(p_i)\, A_i$$

then take its gradient and update the parameters $\theta$.
- Periodically synchronize $\pi_{\theta_{\text{old}}} \leftarrow \pi_{\theta}$.
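Putting the per-trajectory steps together, one trajectory's surrogate term might look like the sketch below (the clip threshold `eps = 0.2`, the ESS floor `ess_min = 0.8`, the affine clip-fraction-to-ESS map, and `p_max = 2.0` are all illustrative assumptions, not values from the paper):

```python
import math

def _ess(ld, p):
    """Normalized ESS of softmax weights exp(p * l_t)."""
    m = max(p * l for l in ld)
    w = [math.exp(p * l - m) for l in ld]
    z = sum(w)
    return z * z / (len(w) * sum(x * x for x in w))

def pmpo_surrogate(log_diffs, advantage, eps=0.2, ess_min=0.8, p_max=2.0):
    """Surrogate term r_bar(p_i) * A_i for a single trajectory."""
    n = len(log_diffs)
    clip_frac = sum(abs(l) > eps for l in log_diffs) / n
    clipped = [max(-eps, min(eps, l)) for l in log_diffs]
    # Heavier clipping -> higher target ESS -> smaller p (conservative).
    target = ess_min + clip_frac * (1.0 - ess_min)
    if _ess(clipped, p_max) > target:
        p = p_max
    else:
        lo, hi = 0.0, p_max
        for _ in range(60):  # monotone bisection on p
            mid = 0.5 * (lo + hi)
            if _ess(clipped, mid) > target:
                lo = mid
            else:
                hi = mid
        p = 0.5 * (lo + hi)
    # Power mean of exp(l_t), computed stably in log space.
    if p < 1e-6:  # geometric-mean limit
        r_bar = math.exp(sum(clipped) / n)
    else:
        m = max(p * l for l in clipped)
        log_sum = math.log(sum(math.exp(p * l - m) for l in clipped) / n)
        r_bar = math.exp((m + log_sum) / p)
    return r_bar * advantage
```

The minibatch loss is then the negated mean of these terms over the $G$ trajectories in the group.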
Ablation studies confirm that:
- Fixed ("static compromise") underperforms the adaptive scheme.
- Omitting log-domain clipping destabilizes training.
- Alternative heuristics for setting $p$ (length-based, entropy-based, or fixed schedules) are inferior to clip-aware ESS matching.
5. Theoretical Properties and Behavior
Theoretical analysis reveals key properties:
- Monotonicity: For $p \le q$, $M_p(x) \le M_q(x)$ (the power-mean inequality), with equality iff all $x_i$ are equal.
- Softmax-Temperature Interpretation: The power-mean order $p$ serves as the inverse temperature in the softmax weighting of token log-diffs, interpolating from uniform ($p \to 0$) to sharply peaked weightings (large $p$).
- ESS Monotonicity: The normalized ESS decreases strictly with increasing $p$ (for non-constant log-diffs), enabling reliable numerical inversion in the adaptive mechanism.
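Both monotonicity properties are easy to check numerically (a quick sanity sketch with arbitrary sample values):

```python
import math

def power_mean(xs, p):
    if abs(p) < 1e-9:  # geometric-mean limit
        return math.exp(sum(math.log(x) for x in xs) / len(xs))
    return (sum(x ** p for x in xs) / len(xs)) ** (1.0 / p)

def normalized_ess(ld, p):
    w = [math.exp(p * l) for l in ld]
    z = sum(w)
    return z * z / (len(w) * sum(x * x for x in w))

# M_p is nondecreasing in p ...
xs = [0.5, 1.0, 2.0]
means = [power_mean(xs, p) for p in (0.0, 0.5, 1.0, 2.0)]
assert all(a <= b for a, b in zip(means, means[1:]))

# ... while the normalized ESS is nonincreasing in p.
ld = [0.3, -0.2, 0.1]
esses = [normalized_ess(ld, p) for p in (0.0, 0.5, 1.0, 2.0)]
assert all(a >= b for a, b in zip(esses, esses[1:]))
```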
This unification enables one algorithm to smoothly interpolate between GRPO and GMPO, adapting to the quality of information in trajectory gradients.
6. Empirical Results and Benchmarks
Empirical evaluation was conducted on mathematical reasoning benchmarks, including AIME24, AMC, MATH500, Minerva, and OlympiadBench, using both Qwen2.5-Math (1.5B, 7B) and DeepSeek-R1-Distill-Qwen-7B. All models used the same group size, a maximum sequence length of 3000 tokens, and standard clipping thresholds.
| Method | 7B Pass@1 (%) |
|---|---|
| GRPO ($p = 1$) | 51.4 |
| GMPO ($p \to 0$) | 52.7 |
| PMPO (adaptive) | 54.2 |
On Qwen2.5-Math-7B, PMPO outperformed GMPO across sub-benchmarks by margins of +5.4% (AIME24), +7.3% (AMC), +1.8% (MATH500), +1.4% (Minerva), and +3.1% (OlympiadBench). PMPO exhibited faster convergence, higher early-stage entropy (more exploration), and avoided the gradient-norm spikes observed in GRPO.
7. Significance and Scope
PMPO provides a unified and adaptive framework for group-based RL in language modeling tasks where trajectory quality and stability vary widely. By dynamically modulating the aggregation geometry per trajectory using the power-mean exponent and a principled, clip-aware ESS matching approach, PMPO accommodates both stable and noisy update regimes within a single algorithm. Empirical results demonstrate state-of-the-art performance on challenging mathematical reasoning RLHF tasks, and ablation studies support the necessity of adaptive geometry and log-domain clipping. PMPO thus addresses inherent limitations of fixed-geometry aggregation and sets a new standard for robust RL in group-based LLM optimization (Zhao et al., 30 Jan 2026).