
Power-Mean Policy Optimization

Updated 6 February 2026
  • PMPO is a generalized policy optimization framework that unifies geometric and arithmetic mean methods through a dynamic power-mean exponent.
  • It leverages a clip-aware effective sample size matching rule to balance aggressive and conservative updates based on trajectory reliability.
  • Empirical benchmarks in mathematical reasoning tasks demonstrate that PMPO achieves faster convergence and improved performance over fixed-parameter methods.

Power-Mean Policy Optimization (PMPO) is a generalized policy optimization framework for group-based reinforcement learning (RL) in LLM reasoning tasks. PMPO unifies the geometric-mean (GMPO) and arithmetic-mean (GRPO) approaches under a single parameterized method, allowing dynamic adaptation of the aggregation geometry through the power-mean exponent $p$. This flexible geometry enables effective interpolation between aggressive and conservative update regimes on a per-trajectory basis, adapting to the heterogeneous and evolving stability of generated trajectories. PMPO leverages an adaptive, scale-invariant effective sample size (ESS) matching rule driven by observed clipping behavior in PPO-style training, resulting in empirically superior performance on mathematical reasoning RLHF benchmarks (Zhao et al., 30 Jan 2026).

1. The Power-Mean Operator and Parameterization

The power mean, or generalized mean, of $N$ positive scalars $a_1, \dots, a_N > 0$ with order $p$ is defined by:

$$M_p(a_1,\dots,a_N) = \left(\frac{1}{N}\sum_{i=1}^N a_i^p\right)^{1/p}$$

Two canonical special cases are:

  • Arithmetic mean: $M_{p=1}(a) = \frac{1}{N}\sum_i a_i$ (as in GRPO)
  • Geometric mean: $\lim_{p\to 0} M_p(a) = \exp\left(\frac{1}{N}\sum_i \ln a_i\right)$ (as in GMPO)

As $p$ increases, $M_p$ emphasizes the largest $a_i$ more heavily, producing more aggressive updates. As $p$ decreases toward $0$, the mean approaches the geometric mean, yielding conservative aggregation. This one-parameter family provides a continuum of aggregation geometries, subsuming existing approaches.
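As a quick numerical illustration of this continuum, the power mean and its geometric-mean limit can be sketched in a few lines (NumPy; the values are illustrative, not from the paper):

```python
import numpy as np

def power_mean(a, p):
    """Power mean M_p of positive values a; p -> 0 gives the geometric mean."""
    a = np.asarray(a, dtype=float)
    if abs(p) < 1e-12:  # geometric-mean limit: exp of the mean log
        return float(np.exp(np.mean(np.log(a))))
    return float(np.mean(a ** p) ** (1.0 / p))

ratios = [0.5, 1.0, 2.0]
print(power_mean(ratios, 0.0))  # geometric mean = 1.0 (GMPO-style)
print(power_mean(ratios, 1.0))  # arithmetic mean ~ 1.1667 (GRPO-style)
print(power_mean(ratios, 2.0))  # emphasizes the largest ratio even more
```

The printed values increase with $p$, reflecting the monotonicity of $M_p$ discussed below.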

2. PMPO Objective and Gradient Computation

PMPO operates in a group-based bandit RL setting applied to language modeling for mathematical reasoning. Given a prompt $x$, the previous policy $\pi_\text{old}$ is used to sample $K$ independent full reasoning trajectories $\{\tau^{(k)}\}$, each yielding a scalar reward $R^{(k)} \in \{0,1\}$. The trajectory-level advantage is set as:

$$A^{(k)} = R^{(k)} - \frac{1}{K} \sum_{j=1}^K R^{(j)}$$

For each token $t$ in the generated response of trajectory $k$, define the log-probability difference (log-diff):

$$\Delta \ell_t^{(k)} = \log\pi_\theta(a_t^{(k)} \mid s_t^{(k)}) - \log\pi_\text{old}(a_t^{(k)} \mid s_t^{(k)})$$

The per-token importance ratio is $r_t^{(k)} = \exp(\Delta \ell_t^{(k)})$.

For each trajectory, the effective importance ratio is aggregated via the power mean:

$$\hat r_p^{(k)} = M_p\big(r_1^{(k)}, \dots, r_{n^{(k)}}^{(k)}\big) = \left(\frac{1}{n^{(k)}}\sum_{t\in S^{(k)}} \big[r_t^{(k)}\big]^p\right)^{1/p}$$

The surrogate loss is:

$$L_p(\theta) = -\frac{1}{K} \sum_{k=1}^K A^{(k)} \hat r_p^{(k)}$$

PPO-style log-domain clipping is applied: each $\Delta\ell_t^{(k)}$ is clipped at threshold $\epsilon$ to produce $\widetilde{\Delta\ell}_t^{(k)}$, after which the power mean is recomputed from the clipped values.

The gradient of $L_p$ with respect to $\theta$ yields a weighted token-level policy gradient with weights $w_t^{(k)}(p) = \operatorname{softmax}_{t\in S^{(k)}}\big(p\,\Delta\ell_t^{(k)}\big)$, where $p$ acts as the inverse temperature. As $p\to 0$, $w_t$ approaches uniform weighting (conservative); as $p\to 1$, the weighting sharpens toward the arithmetic mean (aggressive). For $p>1$, the update becomes even more peaked.
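The per-trajectory surrogate and its softmax token weighting can be sketched as follows (NumPy; function names and the clip value `eps=0.2` are illustrative assumptions, not the authors' implementation):

```python
import numpy as np

def pmpo_trajectory_surrogate(logp_new, logp_old, advantage, p, eps=0.2):
    """Sketch of the per-trajectory PMPO surrogate with log-domain clipping.

    logp_new / logp_old: per-token log-probs under pi_theta and pi_old;
    eps is a PPO-style clip threshold (illustrative value).
    """
    delta = np.clip(logp_new - logp_old, -eps, eps)   # clipped log-diffs
    r_hat = np.mean(np.exp(delta) ** p) ** (1.0 / p)  # power-mean ratio M_p
    return -advantage * r_hat

def token_weights(delta, p):
    """Gradient weights w_t(p) = softmax(p * delta_t); p is the inverse temperature."""
    z = p * np.asarray(delta)
    e = np.exp(z - z.max())  # stable softmax
    return e / e.sum()

delta = np.array([-0.1, 0.0, 0.3])
print(token_weights(delta, 0.01))  # near-uniform: conservative update
print(token_weights(delta, 1.0))   # peaked toward the largest log-diff: aggressive
```

Varying `p` in `token_weights` makes the inverse-temperature role concrete: small `p` spreads the gradient evenly across tokens, while larger `p` concentrates it on the tokens with the largest log-diffs.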

3. Adaptive Control of Power-Mean via Clip-Aware ESS

PMPO introduces a dynamic scheme for setting the aggregation order $p$ on a per-trajectory basis, reflecting the local reliability and heterogeneity of trajectory gradients. The procedure is as follows:

  • Clipping fraction: compute the fraction $f_\text{clip}^{(k)}$ of tokens in trajectory $k$ for which $|\Delta\ell_t^{(k)}| > \epsilon_\mathrm{ESS}$ (typically $\epsilon_\mathrm{ESS}=0.1$).
  • Target ESS: map the observed clipping fraction $f_\text{clip}^{(k)} \in [0,1]$ to a target normalized effective sample size (ESS) $n^* \in [1/n, 1]$ via $n^* = 1/n + f_\text{clip}(1-1/n)$.
  • Order parameter selection: solve for $p$ such that the normalized ESS under $p$, given by $\mathrm{ESS}_{\mathrm{norm}}(p) = \frac{1}{n \sum_{t=1}^n [w_t(p)]^2}$, matches $n^*$. Since $\mathrm{ESS}_{\mathrm{norm}}$ is monotone in $p$, this is performed efficiently by bisection.

This ESS-matching mechanism provably interpolates between the extremes $p=1$ (when $f_\text{clip}=0$) and $p\to 0$ (when $f_\text{clip}=1$), allowing the algorithm to adaptively balance update aggression and conservatism in response to the trust-region saturation signaled by clipping (Zhao et al., 30 Jan 2026).
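The three steps above can be sketched numerically as follows (NumPy; the search interval $[0,1]$, the toy log-diffs, and all names are illustrative assumptions):

```python
import numpy as np

def ess_norm(delta, p):
    """Normalized ESS of the softmax weights w_t(p) over token log-diffs."""
    z = p * np.asarray(delta)
    w = np.exp(z - z.max())
    w /= w.sum()
    return 1.0 / (len(delta) * np.sum(w ** 2))

def solve_p(delta, target, lo=1e-6, hi=1.0, iters=60):
    """Bisection for ESS_norm(p) = target, using that ESS_norm decreases in p.

    If the target is unreachable on [lo, hi], the search clamps to a boundary.
    """
    for _ in range(iters):
        mid = 0.5 * (lo + hi)
        if ess_norm(delta, mid) > target:
            lo = mid  # ESS still too high -> need a larger (more aggressive) p
        else:
            hi = mid
    return 0.5 * (lo + hi)

delta = np.random.default_rng(0).normal(0.0, 0.5, size=64)  # toy log-diffs
n = len(delta)
f_clip = float(np.mean(np.abs(delta) > 0.1))  # clipping fraction, eps_ESS = 0.1
target = 1.0 / n + f_clip * (1.0 - 1.0 / n)   # target normalized ESS n*
p = solve_p(delta, target)                    # per-trajectory aggregation order
```

Monotonicity of `ess_norm` in `p` is what makes plain bisection sufficient here; no gradient-based root finding is needed.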

4. Training Workflow and Implementation

The training procedure for PMPO in each minibatch consists of:

  1. Sample $K$ trajectories under $\pi_\text{old}$, computing their rewards and advantages.
  2. For each trajectory:
    • Compute per-token log-diffs and apply log-domain clipping.
    • Determine the clipping fraction and the target ESS.
    • Numerically solve for $p$ to match the target ESS.
    • Aggregate the per-token importance ratios via the power mean of order $p$.
  3. Form the minibatch surrogate loss

$$L(\theta) = -\frac{1}{K} \sum_{k=1}^K A^{(k)} \hat r_p^{(k)}$$

then take its gradient and update the parameters.

  4. Periodically synchronize $\pi_\text{old} \leftarrow \pi_\theta$.
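Putting these steps together, a toy, self-contained minibatch sketch might look like the following (NumPy; all names, constants, and the synthetic data are illustrative assumptions, not the authors' code):

```python
import numpy as np

def power_mean(a, p):
    return float(np.mean(np.asarray(a) ** p) ** (1.0 / p))

def ess_norm(delta, p):
    z = p * delta
    w = np.exp(z - z.max()); w /= w.sum()
    return 1.0 / (len(delta) * np.sum(w ** 2))

def solve_p(delta, target, lo=1e-6, hi=1.0, iters=50):
    for _ in range(iters):  # bisection: ESS_norm is decreasing in p
        mid = 0.5 * (lo + hi)
        lo, hi = (mid, hi) if ess_norm(delta, mid) > target else (lo, mid)
    return 0.5 * (lo + hi)

def pmpo_minibatch_loss(logp_new, logp_old, rewards, eps=0.2, eps_ess=0.1):
    """One PMPO minibatch over K trajectories (lists of per-token log-probs)."""
    rewards = np.asarray(rewards, dtype=float)
    adv = rewards - rewards.mean()                     # group-relative advantage
    loss = 0.0
    for k in range(len(rewards)):
        raw = logp_new[k] - logp_old[k]
        f_clip = float(np.mean(np.abs(raw) > eps_ess))  # clipping fraction
        delta = np.clip(raw, -eps, eps)                 # log-domain clipping
        n = len(delta)
        target = 1.0 / n + f_clip * (1.0 - 1.0 / n)     # clip-aware target ESS
        p = solve_p(delta, target)                      # solve for p
        loss += -adv[k] * power_mean(np.exp(delta), p)  # power-mean aggregation
    return loss / len(rewards)

rng = np.random.default_rng(0)
K, n = 4, 32
logp_old = [rng.normal(-1.0, 0.3, n) for _ in range(K)]
logp_new = [lp + rng.normal(0.0, 0.2, n) for lp in logp_old]
rewards = [1, 0, 1, 0]
print(pmpo_minibatch_loss(logp_new, logp_old, rewards))
```

In a real training loop, this scalar would be computed on autodiff tensors rather than NumPy arrays so that its gradient can drive the parameter update, with $\pi_\text{old}$ refreshed periodically.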

Ablation studies confirm that:

  • Fixed $p=0.5$ (a "static compromise") underperforms the adaptive scheme.
  • Omitting log-domain clipping destabilizes training.
  • Alternative heuristics (length-based, entropy-based, schedule) are inferior to clip-aware ESS matching.

5. Theoretical Properties and Behavior

Theoretical analysis reveals key properties:

  • Monotonicity: for $p_1 < p_2$, $M_{p_1}(a) \le M_{p_2}(a)$.
  • Softmax-temperature interpretation: the power-mean order $p$ serves as the inverse temperature in the softmax weighting of token log-diffs, interpolating from uniform to sharply peaked weightings.
  • ESS monotonicity: normalized ESS decreases strictly with increasing $p$, enabling reliable numerical inversion for the adaptive mechanism.
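These properties are easy to verify numerically; a minimal check with illustrative values (NumPy):

```python
import numpy as np

def power_mean(a, p):
    a = np.asarray(a, dtype=float)
    return np.exp(np.mean(np.log(a))) if p == 0 else np.mean(a ** p) ** (1.0 / p)

def ess_norm(delta, p):
    z = p * np.asarray(delta)
    w = np.exp(z - z.max()); w /= w.sum()
    return 1.0 / (len(delta) * np.sum(w ** 2))

a = np.array([0.7, 1.0, 1.5, 2.2])   # toy importance ratios
ps = [0.0, 0.25, 0.5, 1.0, 2.0]

means = [power_mean(a, p) for p in ps]            # M_p nondecreasing in p
assert all(m1 <= m2 for m1, m2 in zip(means, means[1:]))

esss = [ess_norm(np.log(a), p) for p in ps[1:]]   # ESS strictly decreasing in p
assert all(e1 > e2 for e1, e2 in zip(esss, esss[1:]))
```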

This unification enables one algorithm to smoothly interpolate between GRPO and GMPO, adapting to the quality of information in trajectory gradients.

6. Empirical Results and Benchmarks

Empirical evaluation was conducted on mathematical reasoning benchmarks, including AIME24, AMC, MATH500, Minerva, and OlympiadBench, using both Qwen2.5-Math (1.5B, 7B) and DeepSeek-R1-Distill-Qwen-7B. All models used group size $K=8$, sequence length 3000, and standard clipping thresholds.

| Method | 7B Pass@1 (%) |
|---|---|
| GRPO ($p=1$) | 51.4 |
| GMPO ($p\to 0$) | 52.7 |
| PMPO (adaptive) | 54.2 |

On Qwen2.5-Math-7B, PMPO outperformed GMPO across sub-benchmarks by margins of +5.4% (AIME24), +7.3% (AMC), +1.8% (MATH500), +1.4% (Minerva), and +3.1% (OlympiadBench). PMPO exhibited faster convergence, higher early-stage entropy (exploration), and avoided the gradient-norm spikes observed in GRPO.

7. Significance and Scope

PMPO provides a unified and adaptive framework for group-based RL in language modeling tasks where trajectory quality and stability vary widely. By dynamically modulating the aggregation geometry per trajectory using the power-mean exponent and a principled, clip-aware ESS matching approach, PMPO accommodates both stable and noisy update regimes within a single algorithm. Empirical results demonstrate state-of-the-art performance on challenging mathematical reasoning RLHF tasks, and ablation studies support the necessity of adaptive geometry and log-domain clipping. PMPO thus addresses inherent limitations of fixed-geometry aggregation and sets a new standard for robust RL in group-based LLM optimization (Zhao et al., 30 Jan 2026).
