Weak Policy Optimization (WPO) in RL and RLHF
- Weak Policy Optimization (WPO) is a reinforcement learning framework that optimizes policy distributions using Wasserstein gradient flows or weighted preference schemes.
- It employs a particle-based approach that combines Stein variational updates with Wasserstein penalties to ensure trust-region benefits and convergence guarantees.
- This method enhances sample efficiency and convergence speed compared to traditional algorithms like TRPO, PPO, and SAC, proving effective in both classical RL and RLHF contexts.
Weak Policy Optimization (WPO) refers to a class of policy optimization frameworks in reinforcement learning (RL) that perform optimization directly in the space of policy distributions, rather than parameter vectors, using metric structures such as the Wasserstein distance. A related recent acronym, Weighted Preference Optimization (also WPO), denotes a specific reweighting scheme for policy learning from human or off-policy preferences within the RLHF paradigm. Both frameworks leverage “weak topology” or “distributional” viewpoints distinct from standard parameter-space optimization. This entry covers both foundational and modern WPO variants as introduced in "Policy Optimization as Wasserstein Gradient Flows" (Zhang et al., 2018) and "WPO: Enhancing RLHF with Weighted Preference Optimization" (Zhou et al., 17 Jun 2024), tracing their mathematical formalism, algorithmic formulation, connections to existing methods, and empirical impact.
1. Mathematical Foundations: Policy Optimization as Wasserstein Gradient Flows
Weak Policy Optimization, as introduced in (Zhang et al., 2018), recasts policy optimization into probability-measure space. Let $\Theta$ denote the parameter space of policies and $\mathcal{P}(\Theta)$ the space of Borel probability measures over $\Theta$.
Optimization Objective
Define the objective functional over distributions $\mu \in \mathcal{P}(\Theta)$:
$$F(\mu) = -\,\mathbb{E}_{\theta \sim \mu}\big[J(\theta)\big] + \beta \int_{\Theta} \mu(\theta)\,\log \mu(\theta)\,d\theta,$$
where $J(\theta)$ is the expected return and the second term is (up to a constant) the negative Shannon entropy of $\mu$. This is equivalent, up to constants, to the Kullback–Leibler divergence:
$$F(\mu) = \beta\,\mathrm{KL}\big(\mu \,\|\, \mu^{*}\big) + \mathrm{const}, \qquad \mu^{*}(\theta) \propto \exp\big(J(\theta)/\beta\big),$$
with temperature $\beta > 0$.
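This identity is easy to verify numerically on a discrete parameter grid. The following minimal NumPy sketch (with an arbitrary illustrative return function `J`; all names and constants are placeholders) checks that $F(\mu)$ and $\beta\,\mathrm{KL}(\mu\,\|\,\mu^{*})$ differ exactly by the constant $\beta \log Z$:

```python
import numpy as np

# Discrete grid of policy parameters and an illustrative return function J(theta).
theta = np.linspace(-3.0, 3.0, 200)
J = -(theta - 1.0) ** 2            # placeholder expected return
beta = 0.5                         # temperature

# Arbitrary policy distribution mu over the grid (normalized histogram).
mu = np.exp(-(theta + 0.5) ** 2)
mu /= mu.sum()

# Gibbs target mu* ∝ exp(J / beta) and its normalizer Z.
unnorm = np.exp(J / beta)
Z = unnorm.sum()
mu_star = unnorm / Z

# F(mu) = -E_mu[J] + beta * sum mu log mu   (discrete analogue)
F = -(mu * J).sum() + beta * (mu * np.log(mu)).sum()

# beta * KL(mu || mu*) equals F(mu) + beta * log Z.
kl = (mu * (np.log(mu) - np.log(mu_star))).sum()
print(F + beta * np.log(Z), beta * kl)   # the two numbers agree
```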
Wasserstein Geometry
The metric structure is endowed by the 2-Wasserstein distance
$$W_2^2(\mu, \nu) = \inf_{\gamma \in \Gamma(\mu, \nu)} \int_{\Theta \times \Theta} \|\theta - \theta'\|^2 \, d\gamma(\theta, \theta'),$$
where $\Gamma(\mu, \nu)$ denotes the set of couplings between $\mu$ and $\nu$. The functional $F$ is displacement convex and well-posed under this geometry, leading to unique solutions and exponential convergence properties in continuous time.
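Since the practical algorithm below relies on an entropically regularized estimate of $W_2^2$ between particle clouds, a minimal, self-contained Sinkhorn sketch in plain NumPy may be useful (uniform particle weights are assumed; `lam` is the entropic regularization strength and all values are placeholders):

```python
import numpy as np

def entropic_w2(x, y, lam=0.5, iters=200):
    """Entropically regularized squared-W2 estimate between two particle
    clouds x (N, d) and y (M, d) with uniform weights, via Sinkhorn iterations."""
    n, m = len(x), len(y)
    a, b = np.full(n, 1.0 / n), np.full(m, 1.0 / m)
    C = ((x[:, None, :] - y[None, :, :]) ** 2).sum(-1)  # pairwise squared distances
    K = np.exp(-C / lam)                                 # Gibbs kernel
    u = np.ones(n)
    for _ in range(iters):                               # Sinkhorn fixed-point updates
        v = b / (K.T @ u)
        u = a / (K @ v)
    P = u[:, None] * K * v[None, :]                      # approximate optimal coupling
    return (P * C).sum()                                 # transport cost <P, C>

# Example: two Gaussian particle clouds whose means differ by 1.5 per coordinate.
rng = np.random.default_rng(0)
x = rng.normal(0.0, 1.0, size=(128, 2))
y = rng.normal(1.5, 1.0, size=(128, 2))
print(entropic_w2(x, y))   # roughly 2 * 1.5**2 = 4.5, plus a small entropic bias
```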
2. Algorithmic Formulation: JKO Scheme and Particle Approximation
The weak policy optimization approach operationalizes distributional gradient descent via the Jordan–Kinderlehrer–Otto (JKO) variational implicit Euler scheme
$$\mu_{k+1} = \operatorname*{arg\,min}_{\mu \in \mathcal{P}(\Theta)} \; F(\mu) + \frac{1}{2h}\, W_2^2(\mu, \mu_k),$$
where $h > 0$ is a step size.
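As a point of reference (a standard fact about JKO schemes for free-energy functionals of this form, consistent with the continuous-time convergence claims above), as $h \to 0$ the iterates approximate the Wasserstein gradient flow of $F$, which here is a Fokker–Planck equation:
$$\partial_t \mu_t = \nabla \cdot \Big(\mu_t\, \nabla \tfrac{\delta F}{\delta \mu}(\mu_t)\Big) = -\,\nabla \cdot \big(\mu_t \,\nabla_\theta J(\theta)\big) + \beta\, \Delta \mu_t,$$
whose stationary distribution is the Gibbs measure $\mu^{*}(\theta) \propto \exp\big(J(\theta)/\beta\big)$.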
Particle-based Numerical Method
The practical algorithm employs a particle approximation:
- Represent $\mu_k \approx \frac{1}{N}\sum_{i=1}^{N} \delta_{\theta_i^{(k)}}$ by a set of $N$ particles $\{\theta_i^{(k)}\}_{i=1}^{N}$.
- Each particle is updated using a combination of:
- A Stein variational term for the KL objective (Stein Variational Gradient Descent, SVGD),
- A Wasserstein penalty term, typically entropically regularized.
Explicit Particle Update
For particle $\theta_i$, the update aggregates two forces:
- SVGD gradient (the standard Stein variational direction toward the Gibbs target $\mu^{*} \propto \exp(J/\beta)$):
$$\phi(\theta_i) = \frac{1}{N}\sum_{j=1}^{N}\Big[k(\theta_j, \theta_i)\,\nabla_{\theta_j}\log \mu^{*}(\theta_j) + \nabla_{\theta_j} k(\theta_j, \theta_i)\Big],$$
with kernel $k(\cdot,\cdot)$ and $\nabla_\theta \log \mu^{*}(\theta) = \nabla_\theta J(\theta)/\beta$;
- Wasserstein term: the negative gradient of the entropically regularized proximal penalty, $-\nabla_{\theta_i}\,\tfrac{1}{2h} W_{2,\lambda}^{2}(\mu, \mu_k)$, with $\lambda$ the entropic scale.
The overall update:
$$\theta_i \leftarrow \theta_i + \epsilon\Big[\phi(\theta_i) - \nabla_{\theta_i}\,\tfrac{1}{2h} W_{2,\lambda}^{2}(\mu, \mu_k)\Big].$$
Hyperparameters include the number of particles $N$, the step sizes $h$ (JKO) and $\epsilon$ (inner update), and the kernel bandwidth.
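A compact NumPy sketch of one such JKO step is given below. It is illustrative rather than a reproduction of the original algorithm: the return $J$ is a toy quadratic, the SVGD direction is the standard one, and the gradient of the entropic Wasserstein penalty is approximated by a pull toward the barycentric projection under a Sinkhorn coupling with the previous iterate; all function names and constants are placeholders.

```python
import numpy as np

rng = np.random.default_rng(0)

def grad_log_target(theta, beta=0.5):
    """∇_θ log μ*(θ) = ∇_θ J(θ)/β for an illustrative quadratic return J."""
    return -(theta - 1.0) / beta                        # J(θ) = -0.5 * ||θ - 1||²  (placeholder)

def rbf_kernel(x, bandwidth=0.5):
    """RBF kernel matrix K and the gradients ∇_{x_j} k(x_j, x_i)."""
    diff = x[:, None, :] - x[None, :, :]                # (N, N, d)
    K = np.exp(-(diff ** 2).sum(-1) / (2 * bandwidth ** 2))
    grad_K = -(K[:, :, None] * diff) / bandwidth ** 2   # gradient w.r.t. the first argument
    return K, grad_K

def svgd_direction(x, beta):
    """Standard SVGD update direction toward the Gibbs target μ* ∝ exp(J/β)."""
    K, grad_K = rbf_kernel(x)
    score = grad_log_target(x, beta)                    # (N, d)
    return (K @ score + grad_K.sum(0)) / len(x)

def sinkhorn_plan(x, y, lam=0.5, iters=100):
    """Entropic coupling between uniform particle clouds x and y."""
    n, m = len(x), len(y)
    a, b = np.full(n, 1.0 / n), np.full(m, 1.0 / m)
    C = ((x[:, None, :] - y[None, :, :]) ** 2).sum(-1)
    K = np.exp(-C / lam)
    u = np.ones(n)
    for _ in range(iters):
        v = b / (K.T @ u)
        u = a / (K @ v)
    return u[:, None] * K * v[None, :]

def wasserstein_pull(x, x_prev, h, lam=0.5):
    """Approximate -∇_{x_i} (1/2h) W²_{2,λ}(μ, μ_k): pull each particle toward its
    barycentric projection under the entropic coupling with the previous iterate."""
    P = sinkhorn_plan(x, x_prev, lam)
    barycenter = (P @ x_prev) / P.sum(1, keepdims=True)
    return -(x - barycenter) / h

# One JKO step, solved approximately by inner particle updates.
N, d = 64, 2
beta, h, eps = 0.5, 1.0, 0.1
theta_prev = rng.normal(-2.0, 0.5, size=(N, d))         # particles from previous JKO iterate
theta = theta_prev.copy()
for _ in range(50):                                      # inner iterations of one JKO step
    theta = theta + eps * (svgd_direction(theta, beta) + wasserstein_pull(theta, theta_prev, h))
print(theta.mean(0))                                     # drifts from -2 toward the target mean 1
```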
3. Relationship to Classical and Modern Policy Optimization
Weak Policy Optimization generalizes and unifies several reinforcement learning algorithms:
| Method | Core Principle | Link to WPO |
|---|---|---|
| TRPO | KL trust-region penalty | Small-step WPO with linearization recovers TRPO surrogate optimization |
| PPO | Clipped likelihood ratio surrogate | Approximates proximal-Wasserstein step with a clipped KL surrogate |
| SAC/Soft-Q | Maximum-entropy RL objective | WPO's entropy term mirrors the max-entropy objective; the Wasserstein penalty enforces a trust region in policy space |
| SVPG | Stein variational updates for Bayesian RL | WPO augments SVPG with Wasserstein geometry (proximal term) |
This reveals that WPO incorporates regularization, trust-region, and variational perspectives within a distributional optimization framework.
4. Weighted Preference Optimization: WPO for RLHF with Off-policy Data
A distinct line of work uses the acronym WPO to denote Weighted Preference Optimization within RLHF settings (Zhou et al., 17 Jun 2024). Here, policy objectives are determined by human- or model-annotated pairwise preferences. The focus is on addressing the "distributional gap" between the data-collection policy $\pi_{\mathrm{data}}$ and the current policy $\pi_\theta$.
WPO Objective in RLHF
To more accurately estimate the on-policy expected loss from off-policy data, WPO importance-weights each sample:
- For each preference triplet $(x, y_w, y_l)$ in dataset $\mathcal{D}$, use the importance weight
$$w(x, y) = \frac{\pi_\theta(y \mid x)}{\pi_{\mathrm{data}}(y \mid x)}.$$
In practice, $\pi_{\mathrm{data}}$ is often unknown, so weights are simplified to the sequence probability under the current policy,
$$w(x, y) = \pi_\theta(y \mid x),$$
possibly with length normalization:
$$w(x, y) = \pi_\theta(y \mid x)^{1/|y|}.$$
The weighted loss is
$$\mathcal{L}_{\mathrm{WPO}}(\theta) = -\,\mathbb{E}_{(x, y_w, y_l) \sim \mathcal{D}}\Big[\, w(x, y_w)\, w(x, y_l)\; \log \sigma\big(s_\theta(x, y_w) - s_\theta(x, y_l)\big)\Big],$$
where $s_\theta$ is a score function, e.g., a scaled log-probability ratio $s_\theta(x, y) = \beta \log \frac{\pi_\theta(y \mid x)}{\pi_{\mathrm{ref}}(y \mid x)}$.
Stability Techniques
Weights are computed in log space, and "detached" from the gradient to prevent destabilizing feedback loops. Empirical alignment methods (greedy and sampled) are used to normalize weights. No additional clipping is required. Hyperparameters are reported for popular models (Mistral-7B, Llama-3-8B).
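A minimal PyTorch sketch of this weighted objective, assuming summed per-response log-probabilities are already available, is shown below; the greedy/sampled weight-alignment step is omitted, and all names are illustrative rather than taken from the released implementation.

```python
import torch
import torch.nn.functional as F

def wpo_loss(policy_logps_w, policy_logps_l, ref_logps_w, ref_logps_l,
             len_w, len_l, beta=0.1):
    """Hypothetical sketch of a weighted, DPO-style preference loss.

    *_logps_*: summed log-probabilities of the chosen (w) / rejected (l)
    responses under the current policy and a frozen reference model.
    len_*: response lengths in tokens, used for length normalization.
    """
    # DPO-style score: scaled log-probability ratio against the reference model.
    score_w = beta * (policy_logps_w - ref_logps_w)
    score_l = beta * (policy_logps_l - ref_logps_l)

    # Simplified importance weights: length-normalized sequence probability
    # under the current policy, computed in log space and detached so the
    # weights do not feed back into the gradient.
    with torch.no_grad():
        log_w = policy_logps_w / len_w + policy_logps_l / len_l
        weights = torch.exp(log_w)

    # Weighted Bradley-Terry / DPO log-sigmoid loss over each preference pair.
    return -(weights * F.logsigmoid(score_w - score_l)).mean()

# Toy usage with random log-probabilities for a batch of 4 preference pairs.
b = 4
pw, pl = -torch.rand(b) * 50, -torch.rand(b) * 60
rw, rl = -torch.rand(b) * 50, -torch.rand(b) * 60
print(wpo_loss(pw, pl, rw, rl,
               len_w=torch.full((b,), 50.0), len_l=torch.full((b,), 60.0)))
```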
5. Empirical Results and Performance Gains
WPO (in both senses) has demonstrated consistent improvements on standard benchmarks:
Particle-based WPO (Zhang et al., 2018)
- Bayesian Regression (UCI): WPO improves test log-likelihoods by 1–5% over SVGD.
- Indirect Policy Learning (IP-WGF): 20–50% faster convergence and higher reward relative to SVPG.
- Direct Policy Learning (DP-WGF-V, MuJoCo): Achieves reward thresholds in 30–50% fewer samples than SAC or TRPO. On Humanoid, DP-WGF-V achieves ≈3,100 average return in ≈18,000 episodes (vs. SAC’s 2,200/26,000 and TRPO’s 5,400/32,000).
Weighted Preference Optimization (Zhou et al., 17 Jun 2024)
- Mistral-7B (off-policy): DPO reaches a 20.6% length-controlled win rate against GPT-4-turbo, while WPO reaches 24.4% (+3.8 points); the MT-bench score jumps from 50% to 60.1%.
- Llama-3-8B (off-policy): the length-controlled win rate rises from 28.2% (DPO) to 33.8% (WPO).
- Hybrid (on+off-policy): WPO delivers consistent 1–10 point increases across all metrics; for Llama-3, WPO achieves a new SOTA of 48.6% against GPT-4-turbo.
Ablation studies show sampled weight alignment is essential; weighting only the “loser” nearly matches full WPO, but only weighting the “winner” underperforms DPO. These findings indicate the empirical importance of negative preference samples in this weighting framework.
6. Practical and Theoretical Considerations
The WPO (Wasserstein-based) method yields a convex subproblem per iteration (for displacement convex $F$), is inherently compatible with deep neural approximations via particles, and unifies trust-region, variational, and distributional optimization perspectives. The empirical loss-based WPO (RLHF) approach is loss-agnostic and can be integrated with various preference-based training objectives (e.g., IPO, SimPO, KTO).
Computational complexity is dominated by particle–particle interactions ($O(N^2)$ per step for the Wasserstein-based WPO), but minibatching and kernel acceleration can alleviate this overhead. In RLHF, the leading cost is the forward computation of sequence probabilities and scores.
7. Significance and Synthesis
WPO (in both the distributional and preference-reweighting incarnations) offers a principled framework for policy optimization grounded in both optimal transport theory and importance sampling, with theoretical assurances (convexity, global optimality, monotonic improvement) and practical, reproducible gains across diverse problem domains. It establishes a unifying lens on classical RL algorithms, reveals the operational impact of metric-aware trust regions, and, in human preference optimization, closes the longstanding distributional gap of off-policy RLHF with a lightweight, model-agnostic reweighting scheme. The results collectively advance the state-of-the-art in sample efficiency, convergence speed, and real-world benchmark performance.