Quantile Reward Policy Optimization (QRPO)
- Quantile Reward Policy Optimization is a reinforcement learning paradigm that optimizes specific reward quantiles to address risk and safety concerns.
- It integrates value-based and policy-gradient methods using two-timescale updates and quantile regression for precise performance control.
- QRPO enhances robustness and reliability in domains like financial risk, inventory management, and language model alignment through tailored quantile objectives.
Quantile Reward Policy Optimization (QRPO) is a class of reinforcement learning (RL) and policy optimization algorithms that replaces the traditional expectation-based optimization objective with a quantile-based criterion. Instead of maximizing the mean of the return or reward distribution, QRPO methods directly optimize a specific quantile (for example, the 90th percentile), providing risk-sensitive control and robust performance that can be tailored to domain-specific requirements such as safety, fairness, or strict quality-of-service constraints. QRPO encompasses both value-based and policy-gradient methods, distributional RL algorithms, and reward transformation strategies, and is applied in domains ranging from classical RL control to LLM alignment.
1. Rationale for Quantile-Based Criteria in Policy Optimization
Traditional RL methods maximize the expected (mean) cumulative reward, which is often insufficient in environments with asymmetric, multimodal, or heavy-tailed reward distributions. Quantile-based criteria address well-documented limitations of this mean-based approach:
- In risk-sensitive or safety-critical applications, optimizing the mean can mask rare but catastrophic outcomes, as the average may be skewed by infrequent large deviations or outliers (Gilbert et al., 2016, Li et al., 2017).
- For ordinal feedback or settings where only a ranking or threshold is meaningful, quantile objectives are natural and robust (Gilbert et al., 2016).
- In decision-making contexts such as financial portfolio management (Value-at-Risk), cloud QoS, or robust control, performance guarantees are required w.r.t. the tails (e.g., "ensure with 99% probability that a loss does not exceed x") (Gilbert et al., 2016, Gilbert et al., 2016, Jung et al., 2022).
- For aligning LLMs with human feedback, quantile reward models can accurately represent diverse, possibly conflicting, or noisy preferences rather than collapsing them to a single mean (Dorka, 16 Sep 2024).
The quantile of a random variable $X$ at level $\tau \in (0,1)$ is defined as $q_\tau(X) = \inf\{x \in \mathbb{R} : F_X(x) \geq \tau\}$, where $F_X$ is the cumulative distribution function of $X$. Policies optimized for a quantile maximize the probability that returns exceed (or do not fall below) a specified threshold.
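As a concrete illustration, the $\tau$-quantile of a return distribution can be estimated from Monte Carlo rollouts. The minimal sketch below uses a plain order-statistic estimator; the function name and sampling setup are illustrative, not drawn from any of the cited papers.

```python
import numpy as np

def empirical_quantile(returns: np.ndarray, tau: float) -> float:
    """Estimate q_tau = inf{x : F(x) >= tau} from sampled returns."""
    # Sort the Monte Carlo returns and take the smallest value whose
    # empirical CDF reaches the target level tau.
    sorted_returns = np.sort(returns)
    k = int(np.ceil(tau * len(sorted_returns))) - 1
    return float(sorted_returns[max(k, 0)])

# Example: 90th percentile of simulated episode returns.
rng = np.random.default_rng(0)
sampled_returns = rng.normal(loc=1.0, scale=2.0, size=10_000)
print(empirical_quantile(sampled_returns, tau=0.9))
```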
2. Formalization and Algorithmic Foundations
Policy Evaluation and Quantile Criteria
In classical Markov Decision Processes (MDPs), policies are ranked by the expectation of the cumulative reward. In quantile-based MDPs, a policy $\pi$ is instead evaluated by a quantile of the reward distribution $F_\pi$ it induces. For episodic MDPs with ordered end states, two variants are defined (Gilbert et al., 2016, Gilbert et al., 2016):
- Lower $\tau$-quantile: $\underline{q}^{\,\tau}_\pi = \inf\{x : F_\pi(x) \geq \tau\}$, the smallest value whose cumulative probability under $\pi$ reaches $\tau$.
- Upper $\tau$-quantile: $\overline{q}^{\,\tau}_\pi = \sup\{x : \mathbb{P}_\pi(X \geq x) \geq 1-\tau\}$, the largest value reached or exceeded with probability at least $1-\tau$.
In general, quantile objectives are non-linear and not dynamically consistent, which complicates the direct application of standard dynamic programming techniques (Gilbert et al., 2016, Li et al., 2017).
Quantile Q-Learning (QQ-Learning) and Two-Timescale Stochastic Approximation
To optimize quantiles in value-based RL, the QQ-learning algorithm extends Q-learning via:
- Parameterized Reward Functions: For a running quantile threshold parameter $\delta$, an adjusted reward function $r_\delta$ is constructed such that, for a fixed $\delta$, maximizing the expected return under $r_\delta$ is equivalent to maximizing the probability of achieving a return of at least $\delta$, i.e., the desired quantile target (Gilbert et al., 2016).
- Two-Timescale Updates: The algorithm interleaves "fast" Q-value updates (as in standard Q-learning) with "slow" adjustments to the threshold $\delta$ until the value of the learned policy at $\delta$ matches the targeted fraction (e.g., $1-\tau$ for the upper $\tau$-quantile).
- Convergence: With appropriate separation of time scales (e.g., step sizes satisfying $\beta_t / \alpha_t \to 0$ for the slow and fast recursions), the procedure provably finds the quantile-optimal policy. This two-timescale structure generalizes to policy-gradient and deep RL settings (Gilbert et al., 2016, Jiang et al., 2022, Jiang et al., 2023); a minimal sketch of the coupled updates follows this list.
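The following is a minimal, schematic sketch of the two-timescale idea for a tabular episodic MDP. The 0/1 threshold reward, step sizes, $\varepsilon$-greedy exploration, and the `env` interface (`reset`, `step`, `start_state`) are illustrative assumptions, not the exact construction of Gilbert et al. (2016).

```python
import numpy as np

def qq_learning_sketch(env, n_states, n_actions, tau=0.9,
                       episodes=20_000, alpha=0.1, beta=0.001, gamma=1.0):
    """Schematic two-timescale QQ-learning: fast Q-updates on a
    threshold-indexed 0/1 reward, slow updates of the threshold delta."""
    Q = np.zeros((n_states, n_actions))
    delta = 0.0  # running quantile threshold on the episode return

    for _ in range(episodes):
        s, done, ep_return = env.reset(), False, 0.0
        while not done:
            a = Q[s].argmax() if np.random.rand() > 0.1 else np.random.randint(n_actions)
            s_next, r, done = env.step(a)
            ep_return += r
            # Fast timescale: Q-learning on the adjusted (threshold) reward,
            # which is 1 only if the final episode return reaches at least delta.
            target = float(ep_return >= delta) if done else gamma * Q[s_next].max()
            Q[s, a] += alpha * (target - Q[s, a])
            s = s_next
        # Slow timescale: nudge delta so that the success probability
        # estimated by Q at the start state approaches the target level 1 - tau.
        success_prob = Q[env.start_state].max()
        delta += beta * (success_prob - (1.0 - tau))
    return Q, delta
```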
Policy Gradient Estimation for Quantile Objectives
For policy-gradient QRPO (notably QPO and its variants (Jiang et al., 2022, Jiang et al., 2023)), the gradient of the $\alpha$-quantile $q_\alpha(\theta)$ of the cumulative reward with respect to the policy parameters $\theta$ is obtained by implicitly differentiating the identity $F_\theta(q_\alpha(\theta)) = \alpha$:

$$\nabla_\theta q_\alpha(\theta) = -\,\frac{\nabla_\theta F_\theta(x)\big|_{x=q_\alpha(\theta)}}{f_\theta\big(q_\alpha(\theta)\big)},$$

where $F_\theta$ is the CDF and $f_\theta$ the density of returns under policy $\pi_\theta$, and the gradient of the CDF is estimated using the likelihood-ratio method:

$$\nabla_\theta F_\theta(x) = \mathbb{E}_{\pi_\theta}\!\left[\mathbf{1}\{R \leq x\}\,\nabla_\theta \log p_\theta(\text{trajectory})\right].$$

This necessitates two coupled recursions, fast quantile tracking and slow policy parameter updates, which are foundational to QPO and its scalable variant QPPO (Jiang et al., 2022, Jiang et al., 2023); a schematic sketch follows.
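A rough sketch of the coupled recursions is given below; the step sizes, the simple box-kernel density estimate, and the hypothetical `policy.theta` parameter vector are illustrative assumptions, not the exact QPO/QPPO update rules.

```python
import numpy as np

def quantile_policy_gradient_step(policy, episodes, q_est, alpha=0.9,
                                  lr_q=0.05, lr_theta=0.005, bandwidth=1.0):
    """One schematic iteration of coupled quantile-tracking / policy updates.

    episodes: list of (return, grad_log_prob) pairs sampled under the policy.
    q_est:    current running estimate of the alpha-quantile of returns.
    """
    returns = np.array([ret for ret, _ in episodes])
    grads = [g for _, g in episodes]

    # Fast recursion: track the alpha-quantile by stochastic root finding
    # on F_theta(q) - alpha = 0, using the indicator 1{R <= q}.
    q_est += lr_q * (alpha - np.mean(returns <= q_est))

    # Slow recursion: likelihood-ratio estimate of grad F_theta(q_est),
    # then the implicit-differentiation formula grad q = -grad F / f.
    grad_F = sum(float(r <= q_est) * g for r, g in zip(returns, grads)) / len(episodes)
    density = np.mean(np.abs(returns - q_est) < bandwidth) / (2 * bandwidth)  # kernel estimate of f
    grad_q = -grad_F / max(density, 1e-6)

    # Gradient ascent on the quantile objective.
    policy.theta += lr_theta * grad_q
    return q_est
```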
3. Model Classes and Extensions
Distributional RL and Quantile Regression
Distributional RL methods (e.g., QR-DQN, IQN) approximate not just the expectation but the entire return distribution (Dabney et al., 2017, Dabney et al., 2018, Jeon et al., 18 Jul 2024):
- Return distributions are approximated using quantile regression, either with a fixed number of Dirac atoms at learned locations (QR-DQN) or via a parameterized quantile function mapping sampled fractions $\tau \in [0,1]$ to return values (IQN).
- The quantile regression (pinball) loss $\rho_\tau(u) = u\,(\tau - \mathbf{1}\{u < 0\})$ is minimized for each quantile, providing an unbiased stochastic gradient for minimizing the 1-Wasserstein distance between the approximate and target return distributions (Dabney et al., 2017); a minimal sketch of the Huberized variant follows this list.
- This enables risk-sensitive policies, as the full distribution enables computation of distorted expectations (e.g., value-at-risk, CVaR) and supports robust or risk-seeking decision-making (Dabney et al., 2018, Jeon et al., 18 Jul 2024).
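For concreteness, the following is a minimal NumPy sketch of the quantile Huber loss used in QR-DQN/IQN-style training; the tensor shapes and midpoint fractions are illustrative, and production implementations would use an autodiff framework.

```python
import numpy as np

def quantile_huber_loss(td_errors, taus, kappa=1.0):
    """Quantile Huber loss rho^kappa_tau(u) = |tau - 1{u<0}| * L_kappa(u) / kappa.

    td_errors: pairwise distributional TD errors u = target - prediction,
               shape (..., n_target_quantiles, n_predicted_quantiles).
    taus:      quantile fractions of the predicted quantiles, shape (n_predicted_quantiles,).
    """
    u = np.asarray(td_errors)
    # Huber part: quadratic near zero, linear in the tails.
    huber = np.where(np.abs(u) <= kappa,
                     0.5 * u ** 2,
                     kappa * (np.abs(u) - 0.5 * kappa))
    # Asymmetric quantile weighting |tau - 1{u < 0}|.
    weight = np.abs(taus - (u < 0).astype(float))
    return np.mean(weight * huber / kappa)

# Example: 32 predicted quantiles at midpoint fractions, random TD errors.
taus = (np.arange(32) + 0.5) / 32
errors = np.random.default_rng(1).normal(size=(64, 32, 32))  # batch x target x predicted
print(quantile_huber_loss(errors, taus))
```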
Policy Optimization with Quantile Regression
For continuous action spaces, policies can be implicitly represented via their quantile function (Richter et al., 2019):
- A neural network approximates the inverse CDF (quantile function) of the action distribution, mapping a state and a quantile fraction $\tau \in [0,1]$ to an action.
- Advantage-weighted quantile regression further re-weights the regression by the exponentiated advantage, biasing the update toward actions with higher-than-expected returns; a schematic sketch follows.
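A minimal sketch of advantage-weighted quantile regression for a one-dimensional action, assuming a simple linear-in-features quantile model; the feature construction, temperature `beta`, and function name are illustrative assumptions rather than the exact method of Richter et al. (2019).

```python
import numpy as np

def advantage_weighted_quantile_regression(states, actions, advantages,
                                           n_features, beta=1.0, lr=1e-2, steps=500):
    """Fit an implicit 1-D policy a = f(s, tau) = phi(s, tau) @ w by
    advantage-weighted pinball regression onto observed actions.

    states: array of shape (N, n_features); actions, advantages: shape (N,).
    """
    rng = np.random.default_rng(0)
    w = np.zeros(n_features + 1)                  # last weight multiplies tau
    adv_w = np.exp(advantages / beta)             # exponentiated-advantage weights
    adv_w /= adv_w.mean()

    for _ in range(steps):
        tau = rng.uniform(size=len(actions))      # sample quantile fractions
        phi = np.concatenate([states, tau[:, None]], axis=1)  # simple (s, tau) features
        pred = phi @ w
        u = actions - pred                        # pinball residuals
        # Subgradient of the weighted pinball loss mean_i adv_w_i * rho_tau_i(u_i).
        grad = -(phi * (adv_w * (tau - (u < 0))).reshape(-1, 1)).mean(axis=0)
        w -= lr * grad
    return w
```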
Quantile-Constrained RL (Safety and Constraints)
Safety-critical RL often requires explicit constraints on the tail event probabilities of costs or risks:
- Quantile constraints, such as requiring the $(1-\alpha)$-quantile of the cumulative cost to stay below a budget $d$ (equivalently, that the cost exceeds $d$ with probability at most $\alpha$), are enforced using Lagrangian duality and custom gradient estimation for quantile objectives (Jung et al., 2022, Li et al., 17 Dec 2024).
- Gradient estimation techniques include direct sampling-based estimation and Large Deviation Principle (LDP)-based tail modeling for rare event quantification.
- Advanced update strategies, e.g., "tilted quantile gradient updates," adapt the Lagrange multiplier update step size to compensate for the asymmetry of quantile gradients and avoid over-conservative policies, ensuring rigorous satisfaction of safety constraints while optimizing return (Li et al., 17 Dec 2024); a schematic Lagrangian outer-loop step follows this list.
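The following schematic outer-loop step illustrates how a quantile constraint can be handled with a Lagrange multiplier and asymmetric ("tilted") step sizes; the estimator, step sizes, and function name are illustrative assumptions, not the exact QCPO or TQPO procedures.

```python
import numpy as np

def quantile_constrained_update(cost_samples, lam, budget, alpha=0.95,
                                eta_up=0.05, eta_down=0.01):
    """One outer-loop step: estimate the alpha-quantile of episode costs and
    update the Lagrange multiplier with asymmetric ('tilted') step sizes."""
    # Empirical alpha-quantile of the cost under the current policy.
    cost_quantile = np.quantile(cost_samples, alpha)
    violation = cost_quantile - budget
    # Larger step when the constraint is violated, smaller when it is slack,
    # to avoid drifting into an over-conservative policy.
    eta = eta_up if violation > 0 else eta_down
    lam = max(0.0, lam + eta * violation)
    # The policy step then maximizes return - lam * cost_quantile, with the
    # quantile gradient estimated by the techniques described above.
    return lam, cost_quantile
```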
Offline QRPO and Causality
Offline settings with unmeasured confounders require robust identification and estimation steps:
- Causal-assisted approaches leverage instrumental variables or negative controls to identify structural quantile functions as solutions to nonlinear conditional moment restrictions (Chen et al., 8 Jun 2025).
- Minimax estimation and empirical-process techniques are employed for sample-efficient learning and quantile-optimality under modest coverage assumptions, yielding regret bounds that shrink with the sample size under suitable conditions (Chen et al., 8 Jun 2025).
4. Practical Applications and Empirical Evidence
QRPO algorithms have been evaluated across a range of settings:
- Episodic RL: In the "Who Wants to Be a Millionaire" benchmark, QQ-learning was shown to reliably converge to the targeted quantile, with empirical policy performance closely reflecting the desired performance guarantee for the upper quantile (Gilbert et al., 2016).
- Distributional Control: In Atari-2600 benchmarks, QR-DQN and IQN achieved higher human-normalized scores and improved sample efficiency relative to value-based baselines, demonstrating that quantile-based distributional learning improves both robustness and learning speed (Dabney et al., 2017, Dabney et al., 2018, Jeon et al., 18 Jul 2024).
- Financial Risk, Inventory, Portfolio Management: QPO/QPPO and quantile-constrained methods improved lower-tail performance (VaR), yielding policies that are less subject to catastrophic loss and more robust against adverse events (Jiang et al., 2022, Jiang et al., 2023, Jung et al., 2022).
- LLM Alignment: QRPO enables direct regression to a closed-form KL-regularized optimal policy using quantile-transformed robust rewards, outperforming DPO, REBEL, and SimPO across chat and code generation tasks. QRPO is less susceptible to length bias and is robust against loss of information typical in relative preference preprocessing (Matrenok et al., 10 Jul 2025).
- RLHF/Preference Modeling: Quantile regression applied directly to reward models (Quantile Reward Models, QRMs) achieves higher scores and better robustness in RLHF settings, enabling risk-aware policies that reduce the frequency of extremely negative responses (Dorka, 16 Sep 2024).
Empirical studies suggest that QRPO methods excel especially when:
- The reward distribution is multimodal, skewed, or heavy-tailed.
- Absolute, high-fidelity reward signals are available, enabling precise policy fitting without reliance on pairwise or relative rewards.
- Safety, fairness, or high-probability performance guarantees are paramount.
5. Methodological Analysis and Implementation Considerations
Algorithmic Structure
A unifying theme across QRPO realizations is multi-timescale stochastic approximation:
| Method | Fast Update | Slow Update | Extension/Notes |
|---|---|---|---|
| QQ-Learning | Q-learning on adjusted reward | Quantile threshold update | Two-timescale, tabular RL |
| QPO/QPPO | Quantile tracking (root finding) | Policy gradient step | Neural policies, off-policy extensions |
| QCPO/TQPO | Quantile gradient step | Lagrange multiplier (adaptive) | Safety-constrained, tilted Lagrange update |
| Distributional | Quantile regression on returns | Policy update via expected utility | Risk-sensitive, distributional RL |
- Practical implementations can use neural networks for both value and policy function approximation.
- Quantile estimation may require density estimation or kernel methods for accurate derivative computation, especially when using stochastic gradients in non-tabular settings (Jiang et al., 2022).
- Distributional methods support direct policy optimization with respect to a family of risk measures via weighting or non-uniform sampling over quantile indices (Dabney et al., 2018), as sketched after this list.
- For offline QRPO with confounding, nonparametric minimax estimation over function classes with uniform convergence guarantees is required—a substantial computational demand that may require further algorithmic advances for scale (Chen et al., 8 Jun 2025).
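To make the risk-measure point concrete, distorted expectations such as CVaR can be computed directly from a discrete set of quantile estimates. The helper names and the midpoint weighting below are illustrative assumptions rather than a specific paper's implementation.

```python
import numpy as np

def distorted_value(quantile_estimates, taus, distortion):
    """Approximate the distorted expectation int_0^1 F^{-1}(tau) d g(tau)
    from discrete quantile estimates, for a distortion function g on [0, 1]."""
    # Weight each quantile atom by the increment of the distortion function
    # over the surrounding midpoint interval.
    tau_edges = np.concatenate([[0.0], 0.5 * (taus[:-1] + taus[1:]), [1.0]])
    weights = np.diff(distortion(tau_edges))
    return float(np.sum(weights * quantile_estimates))

def cvar_distortion(level):
    """CVaR_level distortion: g(tau) = min(tau / level, 1)."""
    return lambda tau: np.minimum(tau / level, 1.0)

# Example: risk-averse greedy action selection over per-action return quantiles.
taus = (np.arange(32) + 0.5) / 32
per_action_quantiles = np.sort(np.random.default_rng(2).normal(size=(4, 32)), axis=1)
cvar_values = [distorted_value(q, taus, cvar_distortion(0.25)) for q in per_action_quantiles]
best_action = int(np.argmax(cvar_values))
```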
Computational and Scaling Notes
- The complexity of QRPO increases with the number of quantile targets and the necessity to estimate or store reward/return distributions.
- In model alignment (LLMs), QRPO leverages pre-computation scaling: performance can be improved by increasing the number of reference completions used to estimate the reference CDF and quantile reward, with diminishing returns once the estimate's precision saturates (Matrenok et al., 10 Jul 2025); a toy illustration of this quantile-reward transform follows this list.
- Scalability in distributional and policy-gradient QRPO is facilitated by sharing computations across batches and leveraging batch-based quantile regression (e.g., quantile Huber loss in QR-DQN/IQN).
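As a toy illustration of estimating a quantile-transformed reward from reference completions (the plain empirical-CDF transform and the function names are illustrative assumptions, not the exact QRPO recipe):

```python
import numpy as np

def quantile_transformed_reward(reward_fn, prompt, completion, reference_completions):
    """Map a raw reward to its quantile under the reference policy by ranking
    it against the rewards of reference completions for the same prompt."""
    ref_rewards = np.array([reward_fn(prompt, c) for c in reference_completions])
    r = reward_fn(prompt, completion)
    # Empirical CDF estimate: fraction of reference rewards not exceeding r.
    return float((ref_rewards <= r).mean())

# More reference completions yield a finer-grained CDF estimate (pre-computation
# scaling), with diminishing returns once the quantile estimate stabilizes.
```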
6. Connections, Extensions, and Ongoing Directions
Quantile Reward Policy Optimization unifies concepts across RL, control, distributional methods, and policy evaluation, motivating several further research directions:
- Multi-objective and Lexicographic QRPO: In domains where multiple quantiles (or objectives) must be simultaneously optimized, lexicographic preference orderings and backward induction algorithms yield hierarchical QRPO policies (Li et al., 2017).
- Combination with Distributional Reward Models: Integrating quantile reward modeling (as in QRMs) with distributional policy optimization has the potential to construct systems that are robust both to reward model uncertainty and value distribution tails, especially in RLHF and model alignment (Dorka, 16 Sep 2024, Matrenok et al., 10 Jul 2025).
- Efficient Causal QRPO in Confounded, Offline Data: Advances in identification, minimax estimation, and regularized optimization are enabling sample-efficient and robust quantile-optimal policy learning under unmeasured confounding and limited coverage (Chen et al., 8 Jun 2025).
- Exact Partition Functions and Regression-based Policy Learning: The use of quantile rewards to obtain analytically tractable loss functions for KL-regularized RL objectives (as in QRPO for LLMs) allows efficient, fully offline policy regression from pointwise signals; this is a key advance over existing relative/preference-based techniques (Matrenok et al., 10 Jul 2025).
- Safety Guarantees and Adaptive Constraint Satisfaction: Quantile constraint methods provide more rigorous control over tail risks than expectation-based analogs, but also require careful design of multiplier update rules and empirical density estimation for robust convergence and performance (Jung et al., 2022, Li et al., 17 Dec 2024).
7. Summary Table of Selected QRPO Algorithms
| Paper & Algorithm | Policy Objective | Key Feature | Domain/Application |
|---|---|---|---|
| (Gilbert et al., 2016) QQ-Learning | Upper/lower quantile | Two-timescale Q-learning | Episodic RL, TV game modeling |
| (Dabney et al., 2017, Dabney et al., 2018) QR-DQN/IQN | Return distribution | Quantile regression, distributional RL | Atari-2600 benchmarks |
| (Jiang et al., 2022, Jiang et al., 2023) QPO/QPPO | α-quantile (policy-grad.) | Coupled recursions, off/on-policy | Deep RL, finance, portfolio |
| (Jung et al., 2022) QCPO | Cost quantile constraint | Quantile Lagrangian, LDP tail modeling | Safe RL, outage probability |
| (Li et al., 17 Dec 2024) TQPO | Direct quantile constraint | Tilted Lagrange update, sampling grad. | Safety-critical RL |
| (Matrenok et al., 10 Jul 2025) QRPO (policy fitting) | KL-reg. RL (quantile rew.) | Analytic partition function, regression | LLM alignment, code gen. |
| (Dorka, 16 Sep 2024) QRM (reward model) | Distributional reward | Multimodal capture, RLHF, risk-aware | LLM feedback, safety |
| (Chen et al., 8 Jun 2025) Causal QRPO | Offline quantile-optimal | Moment restriction, minimax opt., IV/NC | Causal inference, policy learning |
References
- Gilbert et al., 2016
- Gilbert et al., 2016
- Li et al., 2017
- Dabney et al., 2017
- Li et al., 2017
- Dabney et al., 2018
- Richter et al., 2019
- Jiang et al., 2022
- Jung et al., 2022
- Xu et al., 2022
- Jiang et al., 2023
- Luis et al., 2023
- Jeon et al., 18 Jul 2024
- Dorka, 16 Sep 2024
- Li et al., 17 Dec 2024
- Chen et al., 8 Jun 2025
- Matrenok et al., 10 Jul 2025
Quantile Reward Policy Optimization thus forms a theoretically grounded, empirically validated, and methodologically diverse paradigm for achieving robust, risk-sensitive, and performance-guaranteed policy learning in reinforcement learning, control, and large-scale LLM alignment settings.