
Ranking Policy Gradient Methods

Updated 9 October 2025
  • Ranking Policy Gradient (RPG) is a reinforcement learning approach that optimizes policies using the relative ordering of actions, leading to improved sample efficiency and controlled fairness.
  • RPG methods, employing techniques like pairwise and listwise ranking, reduce gradient variance and enable effective off-policy imitation through reparameterization and imitation learning.
  • Applications of RPG span learning-to-rank, robust control in robotics, and policy customization, demonstrating practical benefits in fairness constraints and efficient policy adaptation.

Ranking Policy Gradient (RPG) is a class of reinforcement learning and structured prediction methods that optimize policies by leveraging the ordinal structure of actions, rankings, or outputs, rather than relying solely on absolute value estimates. In RPG, the optimization target is typically defined over relative orderings—such as the probability of one action (or ranking outcome) being preferred over another—leading to improved sample efficiency, lower variance, and, in many applications, direct control over fairness, diversity, or robustness constraints. RPG has found applications in learning-to-rank, sample-efficient RL, robust MDPs, and policy fine-tuning frameworks with explicit exposure or fairness constraints.

1. Methodological Foundations and Core Formulations

RPG methods distinguish themselves from standard policy gradient (PG) approaches by replacing the estimation of absolute expected returns with the optimization of ranking-based objectives (Lin et al., 2019). A fundamental instance is the pairwise ranking policy, where for each state $s$ and pair of discrete actions $a_i, a_j$, the model learns relative preferences specified by

$$p_{ij} = \frac{\exp(\lambda(s, a_i) - \lambda(s, a_j))}{1 + \exp(\lambda(s, a_i) - \lambda(s, a_j))},$$

where $\lambda(s, a)$ denotes a learned score function. The selection probability for action $a_i$ is then the product of its pairwise preferences, $\pi(a_i \mid s) = \prod_{j \ne i} p_{ij}$. The gradient estimator for the expected long-term reward $J(\theta)$ consequently focuses on optimizing the ordering:

$$\nabla_\theta J(\theta) \approx \mathbb{E}_{\tau \sim \pi_\theta}\left[ \sum_t \nabla_\theta \left( \frac{1}{2}\sum_{j \ne i} (\lambda_i - \lambda_j) \right) r(\tau) \right].$$

Alternative listwise RPG approaches employ a softmax as the ranking distribution:

$$\pi(a_i \mid s) = \frac{\exp(\lambda(s, a_i))}{\sum_j \exp(\lambda(s, a_j))}.$$
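
As a concrete illustration of the two parameterizations above, the following sketch computes both policies in NumPy, assuming an arbitrary score vector $\lambda(s,\cdot)$ over a small discrete action set and renormalizing the pairwise product so it forms a proper distribution:

```python
import numpy as np

def pairwise_policy(scores):
    """Pairwise ranking policy: score each action by prod_{j != i} p_ij with
    p_ij = sigmoid(lambda_i - lambda_j); renormalized here so the action
    probabilities sum to one."""
    diffs = scores[:, None] - scores[None, :]      # lambda_i - lambda_j for all pairs
    p = 1.0 / (1.0 + np.exp(-diffs))               # pairwise preferences p_ij
    np.fill_diagonal(p, 1.0)                       # drop the j == i terms from the product
    unnorm = p.prod(axis=1)
    return unnorm / unnorm.sum()

def listwise_policy(scores):
    """Listwise (softmax) ranking policy over the same scores."""
    z = np.exp(scores - scores.max())
    return z / z.sum()

scores = np.array([2.0, 0.5, -1.0])                # lambda(s, a) for three actions
print(pairwise_policy(scores))
print(listwise_policy(scores))
```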

In the context of learning-to-rank, RPG extends to distributions over whole permutations. For example, using the Plackett–Luce model, the probability of a ranking $r$ given query $q$ is

$$\pi_\theta(r \mid q) = \prod_{i=1}^{n_q} \frac{\exp(h_\theta(x_{q, r(i)}))}{\sum_{j=i}^{n_q} \exp(h_\theta(x_{q, r(j)}))},$$

with $h_\theta$ a differentiable scorer (Singh et al., 2019, Gao et al., 2023, Yadav et al., 2019).
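
A minimal sketch of the Plackett–Luce ranking policy, with the learned scorer $h_\theta$ stubbed out as a fixed score vector, showing how rankings are sampled sequentially and how their log-probabilities (needed for REINFORCE-style updates) are computed:

```python
import numpy as np

def plackett_luce_logprob(scores, ranking):
    """Log-probability of a full ranking: at each position, the chosen item is drawn
    by a softmax over the items not yet placed."""
    logp = 0.0
    remaining = list(range(len(scores)))
    for item in ranking:
        s = scores[remaining]
        logp += scores[item] - (s.max() + np.log(np.exp(s - s.max()).sum()))
        remaining.remove(item)
    return logp

def sample_ranking(scores, rng):
    """Sample a ranking by sequentially drawing items without replacement."""
    remaining = list(range(len(scores)))
    ranking = []
    while remaining:
        s = scores[remaining]
        p = np.exp(s - s.max())
        p /= p.sum()
        item = int(rng.choice(remaining, p=p))
        ranking.append(item)
        remaining.remove(item)
    return ranking

rng = np.random.default_rng(0)
scores = np.array([1.2, 0.3, -0.5, 2.0])           # h_theta(x_{q, d}) per candidate document
r = sample_ranking(scores, rng)
print(r, plackett_luce_logprob(scores, r))
```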

2. Off-Policy Imitation, Sample Efficiency, and Variance Reduction

RPG achieves heightened sample efficiency by reframing return maximization as an imitation or preference learning objective. A notable equivalence is established between maximizing a lower bound on the return and imitating a uniformly near-optimal policy (UNOP), effectively transforming the RL objective into supervised learning (Lin et al., 2019):

$$\max_\theta \; \sum_{s,a} p_{\pi^*}(s,a) \log \pi_\theta(a \mid s),$$

where $p_{\pi^*}(s,a)$ is the state-action distribution induced by near-optimal trajectories. This decouples credit assignment over time, reduces gradient variance, and allows off-policy sample reuse.
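
The reduction to supervised learning can be sketched as plain cross-entropy training on state-action pairs drawn from buffered near-optimal trajectories. The network below is a hypothetical score model $\lambda_\theta(s,\cdot)$ standing in for whatever scorer the ranking policy uses:

```python
import torch
import torch.nn as nn

class ScoreNet(nn.Module):
    """Hypothetical score network producing lambda(s, a) for every discrete action."""
    def __init__(self, state_dim, n_actions):
        super().__init__()
        self.net = nn.Sequential(nn.Linear(state_dim, 64), nn.ReLU(),
                                 nn.Linear(64, n_actions))

    def forward(self, s):
        return self.net(s)

def imitation_step(model, optimizer, states, actions):
    """One off-policy 'imitation' update: maximize log pi_theta(a|s) under the
    listwise (softmax) policy on (s, a) pairs from high-return trajectories,
    i.e. minimize cross-entropy, the supervised-learning form of the RPG bound."""
    logits = model(states)                      # lambda(s, a) scores
    loss = nn.functional.cross_entropy(logits, actions)
    optimizer.zero_grad()
    loss.backward()
    optimizer.step()
    return loss.item()
```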

Other RPG instantiations leverage direct reparameterization: model-free methods incorporate reward gradients without explicit dynamics models (Lan et al., 2021), and trajectory-level generative models enable multimodal policy learning, further enhancing exploration and data efficiency (Huang et al., 2023).

Empirically, RPG methods have demonstrated order-of-magnitude reductions in sample complexity for Atari 2600 games and superior convergence rates for MuJoCo and robotic control tasks (Lin et al., 2019, Lan et al., 2021, Huang et al., 2023). Pairwise or listwise ranking, combined with off-policy buffers of high-return trajectories, underlies the observed efficiency gains.

3. RPG for Fairness and Exposure Constraints in Ranking

Several RPG algorithms have been explicitly devised to enforce fairness and exposure constraints in learning-to-rank systems (Singh et al., 2019, Yadav et al., 2019), addressing both the exogenous bias from user interactions and endogenous bias from policy design. The Fair-PG-Rank and FULTR frameworks optimize

$$\max_\pi \; \mathbb{E}_{q}\left[ U(\pi \mid q) \right] \quad \text{s.t.} \quad \mathbb{E}_{q}\left[ D(\pi \mid q) \right] \leq \delta,$$

with $U(\pi \mid q)$ an expected ranking metric (e.g., nDCG) and $D(\pi \mid q)$ a disparity measure of fairness (e.g., exposure proportional to item merit). Relaxing via Lagrangian duality yields the unconstrained trade-off objective

$$J(\theta) = \frac{1}{N} \sum_q \left[ U(\pi_\theta \mid q) - \lambda D(\pi_\theta \mid q) \right].$$

Exposure is computed as the expected position bias, e.g. $v_j = 1/\log(1+j)$, and proportionality constraints are imposed to control individual or group fairness:

$$\frac{\operatorname{Exposure}(d_i \mid \pi)}{M_i} \leq \frac{\operatorname{Exposure}(d_j \mid \pi)}{M_j} \quad \forall \, M_i \geq M_j.$$

FULTR further employs counterfactual inverse propensity scoring to correct for presentation bias and click noise (Yadav et al., 2019).
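
A schematic per-query computation of the Lagrangian trade-off $U - \lambda D$ for a single sampled ranking, using DCG as the utility and an illustrative variance-of-exposure-to-merit ratio as the disparity term (the exact disparity estimators in Fair-PG-Rank and FULTR differ):

```python
import numpy as np

def dcg(ranking, relevance, k=None):
    """Utility U: discounted cumulative gain of a single ranking."""
    ranked_rel = relevance[ranking][:k]
    positions = np.arange(1, len(ranked_rel) + 1)
    return float(np.sum(ranked_rel / np.log2(1 + positions)))

def exposure_disparity(ranking, merit):
    """Disparity D: spread of the exposure-to-merit ratios under the position-bias
    model v_j = 1 / log(1 + j); zero when exposure is proportional to merit.
    This penalty form is an illustrative choice, not the papers' exact estimator."""
    exposure = np.empty_like(merit, dtype=float)
    exposure[ranking] = 1.0 / np.log(1 + np.arange(1, len(ranking) + 1))
    ratio = exposure / np.maximum(merit, 1e-8)
    return float(np.var(ratio))

def lagrangian_objective(ranking, relevance, merit, lam):
    """Per-query trade-off U - lambda * D used as the reward signal in the PG update."""
    return dcg(ranking, relevance) - lam * exposure_disparity(ranking, merit)

relevance = np.array([3.0, 1.0, 0.0, 2.0])
merit = relevance + 1e-3                        # merit derived from relevance here
print(lagrangian_objective(np.array([0, 3, 1, 2]), relevance, merit, lam=0.1))
```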

4. RPG in Robust and Constrained Policy Optimization

RPG has notable applications in robust reinforcement learning for policies under model or transition uncertainty (Kumar et al., 2023, Lin et al., 1 Jun 2024). In rectangular robust MDPs, RPG explicitly computes gradients under the worst-case transition kernel, exploiting closed-form expressions:

$$\partial_\pi \rho_U^{(p)} = \sum_{s,a} \left[ d^{(p)}_{p_0,\mu}(s) - c^{(p)}(s) \right] Q_U^{(p)}(s,a) \, \nabla_\theta \pi_s(a),$$

where the worst-case occupation measure leverages rank-one perturbations of the nominal kernel.

Single-loop robust variants (SRPG) address the robust min-max objective

$$\min_{\pi\in\Pi} \max_{p\in\mathcal{P}} J_\rho(\pi,p),$$

with an intertwined update for both the policy and adversarial transition kernel. These methods utilize Moreau-Yosida regularization and exploit gradient dominance properties of the robust objective to obtain convergence guarantees with reduced computational cost (Lin et al., 1 Jun 2024).
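
A schematic single-loop step for this min-max objective, assuming a differentiable Monte Carlo estimator `J_fn(theta, p)` of $J_\rho$ and leaf tensors `theta`, `p` with `requires_grad=True`; the Moreau-Yosida-style proximal term is modeled here simply as a quadratic pull toward the nominal kernel:

```python
import torch

def single_loop_robust_step(theta, p, p_nom, J_fn, lr_pi=1e-2, lr_p=1e-1, mu=1.0):
    """One intertwined update: gradient descent on the policy parameters theta and
    proximally regularized gradient ascent on the adversarial kernel parameters p.
    Illustrative sketch only; the cited analysis works with exact occupancy measures."""
    J = J_fn(theta, p)
    g_theta, g_p = torch.autograd.grad(J, [theta, p])
    with torch.no_grad():
        theta -= lr_pi * g_theta                        # min over the policy
        p += lr_p * (g_p - mu * (p - p_nom))            # max over the kernel, kept near p_nom
    return theta, p
```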

In constrained MDPs, RPG-related primal-dual algorithms (RPG-PD) cast the constrained objective as an entropy-regularized saddle-point problem, ensuring last-iterate convergence, stability, and constraint satisfaction via single-time-scale coupled updates (Ding et al., 2023).
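
A minimal single-time-scale primal-dual step for a constrained, entropy-regularized objective (maximize the reward value $J_r$ plus an entropy bonus subject to $J_c \le b$), with `Jr_fn`, `Jc_fn`, `H_fn` assumed to be differentiable estimators; this is a generic sketch, not the exact RPG-PD update of the cited work:

```python
import torch

def primal_dual_step(theta, lam, Jr_fn, Jc_fn, H_fn, b, lr=1e-2, tau=1e-2):
    """Coupled single-time-scale update: ascend the entropy-regularized Lagrangian
    in theta, then take a projected dual-ascent step on lam (kept nonnegative)."""
    L = Jr_fn(theta) + tau * H_fn(theta) - lam * (Jc_fn(theta) - b)
    (g_theta,) = torch.autograd.grad(L, [theta])
    with torch.no_grad():
        theta += lr * g_theta
        lam = torch.clamp(lam + lr * (Jc_fn(theta) - b), min=0.0)
    return theta, lam
```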

5. RPG for Policy Customization and Preference-Based Imitation

Residual Policy Gradient extends RPG to the domain of policy customization: given a pre-existing policy $\pi$, RPG facilitates adaptation to new task-specific requirements while respecting the baseline behavior (Wang et al., 14 Mar 2025). The derived form of the soft policy gradient integrates entropy regularization,

$$\nabla_\theta J(\pi_\theta) = \mathbb{E}_{\tau} \left[ \sum_t \nabla_\theta \log \pi_\theta(a_t \mid s_t) \sum_{t' \geq t} \gamma^{t'-t} \left( r(s_{t'},a_{t'}) - \alpha \log \pi_\theta(a_{t'} \mid s_{t'}) \right) \right],$$

which, when combined with a KL-regularized objective, balances the prior policy's properties with the new reward structure.
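
A compact sketch of the entropy-corrected return and the corresponding REINFORCE-style surrogate for a single trajectory; `logprobs` are the per-step $\log\pi_\theta(a_t\mid s_t)$ tensors, and treating the returns as constants when differentiating is an approximation to the exact soft policy gradient above:

```python
import torch

def soft_returns(rewards, logprobs, gamma=0.99, alpha=0.1):
    """Discounted reward-to-go with the entropy correction r_t' - alpha * log pi(a_t'|s_t')."""
    g, out = 0.0, []
    for r, lp in zip(reversed(rewards), reversed(logprobs)):
        g = (r - alpha * lp) + gamma * g
        out.append(g)
    return list(reversed(out))

def soft_pg_loss(logprobs, rewards, gamma=0.99, alpha=0.1):
    """Surrogate loss whose gradient approximates the soft policy gradient:
    -sum_t log pi(a_t|s_t) * G_t, with G_t the entropy-corrected return (detached)."""
    detached = [lp.detach().item() for lp in logprobs]
    returns = torch.tensor(soft_returns([float(r) for r in rewards], detached, gamma, alpha))
    return -(torch.stack(logprobs) * returns).sum()
```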

Imitation learning approaches have instantiated RPG as a bi-level ranking game (Sikchi et al., 2022). Pairwise or preference-based rankings over behaviors are enforced via a novel loss:

$$L_k(\mathcal{D}^p; R) = \mathbb{E}_{(\rho_i,\rho_j) \sim \mathcal{D}^p} \left[ \mathbb{E}_{(s,a)\sim\rho_i} (R(s,a))^2 + \mathbb{E}_{(s,a)\sim\rho_j} (R(s,a) - k)^2 \right].$$

Stackelberg game formulations allow for efficient sample usage, enhanced stability, and fusion of demonstrations with richer preference information.
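
In code, the ranking loss above reduces to two squared-error terms on the predicted rewards of the less-preferred and more-preferred behavior batches; here `R` is assumed to be any reward network mapping (state, action) batches to scalars, with the batch pairing supplied by the preference dataset:

```python
import torch

def ranking_loss(R, batch_lower, batch_higher, k=1.0):
    """L_k: regress R toward 0 on samples from the less-preferred behavior rho_i
    and toward k on samples from the more-preferred behavior rho_j."""
    r_low = R(batch_lower)
    r_high = R(batch_higher)
    return (r_low ** 2).mean() + ((r_high - k) ** 2).mean()
```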

6. Global Convergence and Theoretical Guarantees

Recent analyses reveal that RPG-type and standard policy gradient methods achieve global convergence under rank/order-preserving conditions in the function approximator. For finite-arm bandits with linear approximation, global convergence of Softmax PG and Natural PG is guaranteed if and only if the representation preserves the correct ordering of action values:

  • For natural PG: global convergence if the least squares projection of the reward vector preserves the optimal action’s rank.
  • For Softmax PG: convergence if there is non-domination and an ordering-preserving mapping from features to rewards (Mei et al., 2 Apr 2025); a toy numerical check of this order-preservation condition is sketched below.
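
As a toy illustration of the order-preservation condition (a three-armed bandit with a two-dimensional linear feature map, not an example from the cited paper), one can check whether the least-squares projection of the reward vector onto the feature span keeps the optimal arm ranked first:

```python
import numpy as np

Phi = np.array([[1.0, 0.0],
                [0.8, 0.3],
                [0.1, 1.0]])             # 3 arms, 2 features
r = np.array([1.0, 0.6, 0.2])            # true mean rewards

w, *_ = np.linalg.lstsq(Phi, r, rcond=None)
r_hat = Phi @ w                           # least-squares projection of r
print(np.argmax(r_hat) == np.argmax(r))   # True -> the optimal arm's rank is preserved
```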

Gradient domination properties have also been established for regularized RPG in entropy-regularized linear–quadratic control, enabling global convergence despite nonconvexity and stochasticity (Diaz et al., 3 Oct 2025).

7. Practical Implementations, Stability, and Extensions

Implementation of RPG variants encompasses techniques for variance reduction (e.g., baselines, entropy terms, Monte Carlo averaging), efficient sampling (Gumbel-Softmax for combinatorial ranking policies), and sample reuse. Stable reparameterization-based RPG methods now leverage PPO-inspired clipped surrogate objectives and KL regularization for high sample efficiency and robustness in continuous control (Zhong et al., 8 Aug 2025). Low-rank matrix approximations provide further parameter efficiency and improved convergence in large-scale or high-dimensional settings (Rozada et al., 27 May 2024).
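
One widely used sampling device for combinatorial ranking policies is the Gumbel trick: perturbing the scores with Gumbel noise and sorting yields an exact sample from the Plackett-Luce distribution in a single vectorized pass, as sketched below (the cited works may instead use relaxed Gumbel-Softmax variants to keep sampling differentiable):

```python
import numpy as np

def gumbel_sample_ranking(scores, rng):
    """Sample a full ranking from the Plackett-Luce distribution induced by `scores`
    by adding i.i.d. Gumbel(0, 1) noise and sorting in descending order."""
    gumbel = -np.log(-np.log(rng.uniform(size=scores.shape)))
    return np.argsort(-(scores + gumbel))

rng = np.random.default_rng(0)
print(gumbel_sample_ranking(np.array([1.2, 0.3, -0.5, 2.0]), rng))
```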

Empirically, RPG variants outperform standard baselines in sample efficiency and final-return quality across tasks such as Atari, MuJoCo locomotion/manipulation, multi-goal robotics under sparse reward, and text/document ranking for LLM retrieval (Lin et al., 2019, Huang et al., 2023, Gao et al., 2023, Luu et al., 2021).


RPG represents a principled rethinking of policy optimization by anchoring the learning signal in relative orderings and rankings, facilitating robust, fair, and sample-efficient learning in structured environments and policy design. Its intersection with fairness, robustness, and imitation has broadened the scope of policy gradient applications and deepened the theoretical understanding of convergence in complex RL and structured prediction domains.
