
Power-Mean Policy Optimization

Updated 6 February 2026
  • PMPO is a generalized policy optimization framework that unifies geometric and arithmetic mean methods through a dynamic power-mean exponent.
  • It leverages a clip-aware effective sample size matching rule to balance aggressive and conservative updates based on trajectory reliability.
  • Empirical benchmarks in mathematical reasoning tasks demonstrate that PMPO achieves faster convergence and improved performance over fixed-parameter methods.

Power-Mean Policy Optimization (PMPO) is a generalized policy optimization framework for group-based reinforcement learning (RL) in LLM reasoning tasks. PMPO unifies the geometric-mean (GMPO) and arithmetic-mean (GRPO) approaches under a single parameterized method, allowing dynamic adaptation of the aggregation geometry through the power-mean exponent $p$. This flexible geometry enables effective interpolation between aggressive and conservative update regimes on a per-trajectory basis, adapting to the heterogeneous and evolving stability of generated trajectories. PMPO leverages an adaptive, scale-invariant effective sample size (ESS) matching rule driven by observed clipping behavior in PPO-style training, resulting in empirically superior performance on mathematical reasoning RLHF benchmarks (Zhao et al., 30 Jan 2026).

1. The Power-Mean Operator and Parameterization

The power mean, or generalized mean, of $N$ positive scalars $a_1, \dots, a_N > 0$ with order $p$ is defined by:

$$M_p(a_1,\dots,a_N) = \left(\frac{1}{N}\sum_{i=1}^N a_i^p\right)^{1/p}$$

Two canonical special cases are:

  • Arithmetic mean: $M_{p=1}(a) = \frac{1}{N}\sum_i a_i$ (as in GRPO)
  • Geometric mean: $\lim_{p\to 0} M_p(a) = \exp\left(\frac{1}{N}\sum_i \ln a_i\right)$ (as in GMPO)

As $p$ increases, $M_p$ emphasizes the largest $a_i$ more heavily, producing more aggressive updates. As $p$ decreases toward $0$, the mean approaches the geometric mean, yielding conservative aggregation. This one-parameter family provides a continuum of aggregation geometries, subsuming existing approaches.
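As a quick numerical illustration of this continuum, the power mean and its geometric-mean limit can be sketched in a few lines (NumPy; the values are illustrative, not from the paper):

```python
import numpy as np

def power_mean(a, p):
    """Power mean M_p of positive values a; p -> 0 gives the geometric mean."""
    a = np.asarray(a, dtype=float)
    if abs(p) < 1e-12:  # geometric-mean limit: exp of the mean log
        return float(np.exp(np.mean(np.log(a))))
    return float(np.mean(a ** p) ** (1.0 / p))

ratios = [0.5, 1.0, 2.0]
print(power_mean(ratios, 0.0))  # geometric mean = 1.0 (GMPO-style)
print(power_mean(ratios, 1.0))  # arithmetic mean ~ 1.1667 (GRPO-style)
print(power_mean(ratios, 2.0))  # emphasizes the largest ratio even more
```

The printed values increase with $p$, reflecting the monotonicity of $M_p$ discussed below.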

2. PMPO Objective and Gradient Computation

PMPO operates in a group-based bandit RL setting applied to language modeling for mathematical reasoning. Given a prompt $x$, the previous policy $\pi_\text{old}$ is used to sample $K$ independent full reasoning trajectories $\{\tau^{(k)}\}$, each yielding a scalar reward $R^{(k)} \in \{0,1\}$. The trajectory-level advantage is set as:

$$A^{(k)} = R^{(k)} - \frac{1}{K} \sum_{j=1}^K R^{(j)}$$

For each token $t$ in the generated response of trajectory $k$, define the log-probability difference (log-diff):

$$\Delta \ell_t^{(k)} = \log\pi_\theta(a_t^{(k)} \mid s_t^{(k)}) - \log\pi_\text{old}(a_t^{(k)} \mid s_t^{(k)})$$

The per-token importance ratio is $r_t^{(k)} = \exp(\Delta \ell_t^{(k)})$.

For each trajectory, the effective importance ratio is aggregated via the power mean:

$$\hat r_p^{(k)} = M_p\big(r_1^{(k)}, \dots, r_{n^{(k)}}^{(k)}\big) = \left(\frac{1}{n^{(k)}}\sum_{t\in S^{(k)}} \big[r_t^{(k)}\big]^p\right)^{1/p}$$

The surrogate loss is:

$$L_p(\theta) = -\frac{1}{K} \sum_{k=1}^K A^{(k)} \hat r_p^{(k)}$$

PPO-style log-domain clipping is applied: each $\Delta\ell_t^{(k)}$ is clipped at threshold $\epsilon$ to produce $\widetilde{\Delta\ell}_t^{(k)}$, after which the power mean is recomputed from the clipped values.

The gradient of $L_p$ with respect to $\theta$ yields a weighted token-level policy gradient with weights $w_t^{(k)}(p) = \operatorname{softmax}_{t\in S^{(k)}}\big(p\,\Delta\ell_t^{(k)}\big)$, where $p$ acts as the inverse temperature. As $p\to 0$, $w_t$ approaches uniform weighting (conservative); as $p\to 1$, the weighting sharpens toward the arithmetic mean (aggressive). For $p>1$, the update becomes even more peaked.
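The per-trajectory surrogate and its softmax token weighting can be sketched as follows (NumPy; function names and the clip value `eps=0.2` are illustrative assumptions, not the authors' implementation):

```python
import numpy as np

def pmpo_trajectory_surrogate(logp_new, logp_old, advantage, p, eps=0.2):
    """Sketch of the per-trajectory PMPO surrogate with log-domain clipping.

    logp_new / logp_old: per-token log-probs under pi_theta and pi_old;
    eps is a PPO-style clip threshold (illustrative value).
    """
    delta = np.clip(logp_new - logp_old, -eps, eps)   # clipped log-diffs
    r_hat = np.mean(np.exp(delta) ** p) ** (1.0 / p)  # power-mean ratio M_p
    return -advantage * r_hat

def token_weights(delta, p):
    """Gradient weights w_t(p) = softmax(p * delta_t); p is the inverse temperature."""
    z = p * np.asarray(delta)
    e = np.exp(z - z.max())  # stable softmax
    return e / e.sum()

delta = np.array([-0.1, 0.0, 0.3])
print(token_weights(delta, 0.01))  # near-uniform: conservative update
print(token_weights(delta, 1.0))   # peaked toward the largest log-diff: aggressive
```

Varying `p` in `token_weights` makes the inverse-temperature role concrete: small `p` spreads the gradient evenly across tokens, while larger `p` concentrates it on the tokens with the largest log-diffs.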

3. Adaptive Control of Power-Mean via Clip-Aware ESS

PMPO introduces a dynamic scheme for setting the aggregation order $p$ on a per-trajectory basis, reflecting the local reliability and heterogeneity of trajectory gradients. The procedure is as follows:

  • Clipping fraction: compute the fraction $f_\text{clip}^{(k)}$ of tokens in trajectory $k$ for which $|\Delta\ell_t^{(k)}| > \epsilon_\mathrm{ESS}$ (typically $\epsilon_\mathrm{ESS}=0.1$).
  • Target ESS: map the observed clipping fraction $f_\text{clip}^{(k)} \in [0,1]$ to a target normalized effective sample size (ESS) $n^* \in [1/n, 1]$ via $n^* = 1/n + f_\text{clip}(1-1/n)$.
  • Order parameter selection: solve for $p$ such that the normalized ESS under $p$, given by $\mathrm{ESS}_{\mathrm{norm}}(p) = \frac{1}{n \sum_{t=1}^n [w_t(p)]^2}$, matches $n^*$. Since $\mathrm{ESS}_{\mathrm{norm}}$ is monotone in $p$, this is performed efficiently by bisection.

This ESS-matching mechanism provably interpolates between the extremes $p=1$ (when $f_\text{clip}=0$) and $p\to 0$ (when $f_\text{clip}=1$), allowing the algorithm to adaptively balance update aggression and conservatism in response to the trust-region saturation signaled by clipping (Zhao et al., 30 Jan 2026).
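The three steps above can be sketched numerically as follows (NumPy; the search interval $[0,1]$, the toy log-diffs, and all names are illustrative assumptions):

```python
import numpy as np

def ess_norm(delta, p):
    """Normalized ESS of the softmax weights w_t(p) over token log-diffs."""
    z = p * np.asarray(delta)
    w = np.exp(z - z.max())
    w /= w.sum()
    return 1.0 / (len(delta) * np.sum(w ** 2))

def solve_p(delta, target, lo=1e-6, hi=1.0, iters=60):
    """Bisection for ESS_norm(p) = target, using that ESS_norm decreases in p.

    If the target is unreachable on [lo, hi], the search clamps to a boundary.
    """
    for _ in range(iters):
        mid = 0.5 * (lo + hi)
        if ess_norm(delta, mid) > target:
            lo = mid  # ESS still too high -> need a larger (more aggressive) p
        else:
            hi = mid
    return 0.5 * (lo + hi)

delta = np.random.default_rng(0).normal(0.0, 0.5, size=64)  # toy log-diffs
n = len(delta)
f_clip = float(np.mean(np.abs(delta) > 0.1))  # clipping fraction, eps_ESS = 0.1
target = 1.0 / n + f_clip * (1.0 - 1.0 / n)   # target normalized ESS n*
p = solve_p(delta, target)                    # per-trajectory aggregation order
```

Monotonicity of `ess_norm` in `p` is what makes plain bisection sufficient here; no gradient-based root finding is needed.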

4. Training Workflow and Implementation

The training procedure for PMPO in each minibatch consists of:

  1. Sample $K$ trajectories under $\pi_\text{old}$, computing their rewards and advantages.
  2. For each trajectory:
    • Compute per-token log-diffs and apply log-domain clipping.
    • Determine the clipping fraction and the target ESS.
    • Numerically solve for $p$ to match the target ESS.
    • Aggregate the per-token importance ratios via the power mean of order $p$.
  3. Form the minibatch surrogate loss

$$L(\theta) = -\frac{1}{K} \sum_{k=1}^K A^{(k)} \hat r_p^{(k)}$$

then take its gradient and update the parameters.

  4. Periodically synchronize $\pi_\text{old} \leftarrow \pi_\theta$.
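Putting these steps together, a toy, self-contained minibatch sketch might look like the following (NumPy; all names, constants, and the synthetic data are illustrative assumptions, not the authors' code):

```python
import numpy as np

def power_mean(a, p):
    return float(np.mean(np.asarray(a) ** p) ** (1.0 / p))

def ess_norm(delta, p):
    z = p * delta
    w = np.exp(z - z.max()); w /= w.sum()
    return 1.0 / (len(delta) * np.sum(w ** 2))

def solve_p(delta, target, lo=1e-6, hi=1.0, iters=50):
    for _ in range(iters):  # bisection: ESS_norm is decreasing in p
        mid = 0.5 * (lo + hi)
        lo, hi = (mid, hi) if ess_norm(delta, mid) > target else (lo, mid)
    return 0.5 * (lo + hi)

def pmpo_minibatch_loss(logp_new, logp_old, rewards, eps=0.2, eps_ess=0.1):
    """One PMPO minibatch over K trajectories (lists of per-token log-probs)."""
    rewards = np.asarray(rewards, dtype=float)
    adv = rewards - rewards.mean()                     # group-relative advantage
    loss = 0.0
    for k in range(len(rewards)):
        raw = logp_new[k] - logp_old[k]
        f_clip = float(np.mean(np.abs(raw) > eps_ess))  # clipping fraction
        delta = np.clip(raw, -eps, eps)                 # log-domain clipping
        n = len(delta)
        target = 1.0 / n + f_clip * (1.0 - 1.0 / n)     # clip-aware target ESS
        p = solve_p(delta, target)                      # solve for p
        loss += -adv[k] * power_mean(np.exp(delta), p)  # power-mean aggregation
    return loss / len(rewards)

rng = np.random.default_rng(0)
K, n = 4, 32
logp_old = [rng.normal(-1.0, 0.3, n) for _ in range(K)]
logp_new = [lp + rng.normal(0.0, 0.2, n) for lp in logp_old]
rewards = [1, 0, 1, 0]
print(pmpo_minibatch_loss(logp_new, logp_old, rewards))
```

In a real training loop, this scalar would be computed on autodiff tensors rather than NumPy arrays so that its gradient can drive the parameter update, with $\pi_\text{old}$ refreshed periodically.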

Ablation studies confirm that:

  • Fixed $p=0.5$ (a "static compromise") underperforms the adaptive scheme.
  • Omitting log-domain clipping destabilizes training.
  • Alternative heuristics (length-based, entropy-based, schedule) are inferior to clip-aware ESS matching.

5. Theoretical Properties and Behavior

Theoretical analysis reveals key properties:

  • Monotonicity: for $p_1 < p_2$, $M_{p_1}(a) \le M_{p_2}(a)$.
  • Softmax-temperature interpretation: the power-mean order $p$ serves as the inverse temperature in the softmax weighting of token log-diffs, interpolating from uniform to sharply peaked weightings.
  • ESS monotonicity: normalized ESS decreases strictly with increasing $p$, enabling reliable numerical inversion for the adaptive mechanism.
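These properties are easy to verify numerically; a minimal check with illustrative values (NumPy):

```python
import numpy as np

def power_mean(a, p):
    a = np.asarray(a, dtype=float)
    return np.exp(np.mean(np.log(a))) if p == 0 else np.mean(a ** p) ** (1.0 / p)

def ess_norm(delta, p):
    z = p * np.asarray(delta)
    w = np.exp(z - z.max()); w /= w.sum()
    return 1.0 / (len(delta) * np.sum(w ** 2))

a = np.array([0.7, 1.0, 1.5, 2.2])   # toy importance ratios
ps = [0.0, 0.25, 0.5, 1.0, 2.0]

means = [power_mean(a, p) for p in ps]            # M_p nondecreasing in p
assert all(m1 <= m2 for m1, m2 in zip(means, means[1:]))

esss = [ess_norm(np.log(a), p) for p in ps[1:]]   # ESS strictly decreasing in p
assert all(e1 > e2 for e1, e2 in zip(esss, esss[1:]))
```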

This unification enables one algorithm to smoothly interpolate between GRPO and GMPO, adapting to the quality of information in trajectory gradients.

6. Empirical Results and Benchmarks

Empirical evaluation was conducted on mathematical reasoning benchmarks, including AIME24, AMC, MATH500, Minerva, and OlympiadBench, using both Qwen2.5-Math (1.5B, 7B) and DeepSeek-R1-Distill-Qwen-7B. All models used group size $K=8$, sequence length 3000, and standard clipping thresholds.

| Method | 7B Pass@1 (%) |
|---|---|
| GRPO ($p=1$) | 51.4 |
| GMPO ($p\to 0$) | 52.7 |
| PMPO (adaptive) | 54.2 |

On Qwen2.5-Math-7B, PMPO outperformed GMPO across sub-benchmarks by margins of +5.4% (AIME24), +7.3% (AMC), +1.8% (MATH500), +1.4% (Minerva), and +3.1% (OlympiadBench). PMPO exhibited faster convergence, higher early-stage entropy (exploration), and avoided the gradient-norm spikes observed in GRPO.

7. Significance and Scope

PMPO provides a unified and adaptive framework for group-based RL in language modeling tasks where trajectory quality and stability vary widely. By dynamically modulating the aggregation geometry per trajectory using the power-mean exponent and a principled, clip-aware ESS matching approach, PMPO accommodates both stable and noisy update regimes within a single algorithm. Empirical results demonstrate state-of-the-art performance on challenging mathematical reasoning RLHF tasks, and ablation studies support the necessity of adaptive geometry and log-domain clipping. PMPO thus addresses inherent limitations of fixed-geometry aggregation and sets a new standard for robust RL in group-based LLM optimization (Zhao et al., 30 Jan 2026).
