Ranked Return Regression for RL (R4)

Updated 21 January 2026
  • R4 is a reward learning method in reinforcement learning that infers reward functions from human ordinal ratings by regressing predicted trajectory returns to teacher-provided rankings.
  • It employs a novel ranking mean squared error loss combined with differentiable sorting to match predicted trajectory orderings with actual human ratings.
  • R4 comes with formal guarantees of minimality and completeness, and demonstrates improved policy performance and sample efficiency on benchmark robotic locomotion tasks.

Ranked Return Regression for RL (R4) is a method for reward learning in reinforcement learning (RL) that addresses the challenge of inferring reward functions from human ratings rather than explicit reward specification or binary preferences. R4 centers on a novel ranking mean squared error (rMSE) loss, treating trajectory ratings as ordinal targets, and utilizes differentiable sorting to match predicted trajectory rankings to teacher-provided ratings. Formal guarantees on minimality and completeness distinguish R4 within the class of rating-based reward learning algorithms, and empirical results demonstrate competitive or superior policy performance and sample efficiency in robotic locomotion benchmarks (Kharyal et al., 14 Jan 2026).

1. Problem Setting and Notation

R4 operates in a Markov Decision Process (MDP) without a built-in reward, supplemented by teacher ratings. The core elements are:

  • State space $\mathcal{S}$, action space $\mathcal{A}$, transition kernel $T(s' \mid s, a)$, initial state distribution $p_0(s)$, and discount factor $\gamma \in [0, 1)$.
  • Full trajectories $T = (s_0, a_0, \dots, s_H, a_H)$ are labeled by a teacher with an ordinal rating $c(T) \in \{0, 1, \dots, n-1\}$, where $0$ denotes "worst" and $n-1$ denotes "best."
  • The dataset is $D = \bigcup_{k=0}^{n-1} D_k$, where $D_k = \{T : c(T) = k\}$.
  • Reward function $f_\theta : \mathcal{S} \times \mathcal{A} \to \mathbb{R}$ with predicted trajectory return $G_\theta(T) = \sum_{t=0}^{H} \gamma^t f_\theta(s_t, a_t)$.
  • Objectives:

    1. Learn $\theta$ such that $G_\theta$ respects the ordinal classes $c(T)$.
    2. Use $f_\theta$ as the reward function for policy optimization, maximizing $\mathbb{E}[\sum_t \gamma^t f_\theta(s_t, a_t)]$ in place of a hand-crafted reward.
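The predicted trajectory return above is a plain discounted sum. A minimal sketch (names are illustrative: `trajectory` is a list of `(state, action)` pairs and `reward_fn` stands in for $f_\theta$):

```python
def discounted_return(trajectory, reward_fn, gamma=0.99):
    """Predicted return G_theta(T) = sum_t gamma^t * f_theta(s_t, a_t)."""
    return sum((gamma ** t) * reward_fn(s, a)
               for t, (s, a) in enumerate(trajectory))
```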

2. Ranking Mean Squared Error (rMSE) Loss

R4 employs the rMSE loss to regress predicted soft ranks to ordinal ratings:

  • In each gradient step, sample one trajectory $T_i$ from each class $D_{c_i}$ for $i = 0, \dots, n-1$, so that the ratings $c_i$ are strictly increasing.

  • Compute predicted returns $x_i = G_\theta(T_i)$.

  • Compute soft ranks $R_i = R^\varepsilon_i(x) \in [0, n-1]$ as a differentiable proxy for the integer rank of $x_i$.

  • The rMSE loss is defined as:

$$L_{\mathrm{rMSE}}(\theta) = \frac{1}{n} \sum_{i=0}^{n-1} (R_i - c_i)^2$$

This penalizes deviations between the continuous soft ranks and discrete ratings, encouraging correct trajectory ordering without imposing interval constraints within each class.
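Given soft ranks, the loss itself is a one-liner. A minimal sketch (the soft ranks are assumed to come from the differentiable sorting operator described below; here they are just a list of floats):

```python
def rmse_loss(soft_ranks, ratings):
    """Ranking MSE: mean squared gap between soft ranks R_i and ratings c_i."""
    n = len(ratings)
    return sum((r - c) ** 2 for r, c in zip(soft_ranks, ratings)) / n
```

A perfectly ordered batch (soft ranks equal to the ratings) yields zero loss; deviations in either direction are penalized quadratically.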

3. Differentiable Sorting: Soft Ranks

R4 utilizes a differentiable sorting operator for ranking:

  • Let $x \in \mathbb{R}^n$ be the vector of predicted returns $[x_0, \ldots, x_{n-1}]$.

  • Define the cost matrix $C(x)$ by $C_{ij} = |x_i - x_{(j)}|$, where $x_{(j)}$ denotes the $j$-th smallest entry of $x$.

  • The soft permutation matrix $P^\varepsilon \in \mathbb{R}^{n \times n}$ is obtained by solving the entropic optimal transport problem:

$$P^\varepsilon(x) = \arg\min_{P \in U_n} \langle P, C(x) \rangle - \varepsilon H(P)$$

where $U_n = \{P \geq 0 : P\mathbf{1} = \mathbf{1},\ P^\top \mathbf{1} = \mathbf{1}\}$ is the Birkhoff polytope and $H(P) = -\sum_{ij} P_{ij} \log P_{ij}$ is the entropy.

  • After Sinkhorn iterations, compute the soft-rank vector:

$$R^\varepsilon(x) = P^\varepsilon(x) \, [0, 1, \ldots, n-1]^\top$$

  • As $\varepsilon \to 0$, the operator recovers hard sorting; for $\varepsilon > 0$, gradients propagate through $x_i = G_\theta(T_i)$ to $\theta$.
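The construction above can be sketched in pure Python. This is a minimal illustration, not the paper's implementation: the cost compares each $x_i$ against the sorted copy of $x$, Sinkhorn scaling approximates the doubly stochastic $P^\varepsilon$, and each soft rank is the expected sorted position (0 = smallest):

```python
import math

def soft_rank(x, eps=0.05, n_iter=300):
    """Entropic-OT soft ranks: transport x onto its sorted copy via
    Sinkhorn, then read off each item's expected sorted position."""
    n = len(x)
    anchor = sorted(x)                      # x_(0) <= ... <= x_(n-1)
    # Gibbs kernel K_ij = exp(-C_ij / eps) with C_ij = |x_i - x_(j)|
    K = [[math.exp(-abs(xi - aj) / eps) for aj in anchor] for xi in x]
    u, v = [1.0] * n, [1.0] * n
    for _ in range(n_iter):                 # Sinkhorn: enforce unit row/col sums
        u = [1.0 / sum(K[i][j] * v[j] for j in range(n)) for i in range(n)]
        v = [1.0 / sum(K[i][j] * u[i] for i in range(n)) for j in range(n)]
    # doubly stochastic P^eps (approaches a hard permutation as eps -> 0)
    P = [[u[i] * K[i][j] * v[j] for j in range(n)] for i in range(n)]
    # soft rank: R_i = sum_j P_ij * j, i.e. P^eps(x) [0, ..., n-1]^T
    return [sum(P[i][j] * j for j in range(n)) for i in range(n)]
```

In an actual R4 implementation this would be written with autodiff tensors so that gradients flow from the soft ranks back to $\theta$; the scaling loop is identical.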

4. Training Algorithm

The R4 algorithm proceeds in two phases—reward learning and policy optimization—often cycling between the two:

  • Initialization:

    • Reward network $f_\theta$.
    • Policy $\pi$ (e.g., Soft Actor-Critic).
    • Replay buffer $\mathcal{B}$.
  • Training Loop:
    • For $k = 0, \dots, n-1$, sample $T^{(k)}$ from $D_k$ (or from $\mathcal{B}$).
    • Set $c_k = k$ and compute $x_k = G_\theta(T^{(k)})$.
    • Compute soft ranks $R = R^\varepsilon([x_0, \dots, x_{n-1}])$.
    • Compute $L_{\mathrm{rMSE}} = \frac{1}{n} \sum_k (R_k - c_k)^2$.
    • Update $\theta \leftarrow \theta - \eta_\theta \nabla_\theta (L_{\mathrm{rMSE}} + \lambda \|\theta\|^2)$.
    • Relabel all rewards in $\mathcal{B}$ using $f_\theta$.
    • Run a standard RL policy update (e.g., SAC) on $\mathcal{B}$ with $f_\theta$ rewards.
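The control flow of one round can be sketched structurally. This is a skeleton under stated assumptions, not the paper's code: the four callables (`predict_return`, `reward_step`, `relabel`, `policy_step`) are placeholders for $G_\theta$, the rMSE gradient step, buffer relabeling, and an RL (e.g., SAC) update:

```python
import random

def r4_round(datasets, predict_return, reward_step, relabel, policy_step):
    """One R4 training round (structural sketch): sample one trajectory
    per rating class, fit the reward to the induced ordering, relabel
    the replay buffer, then take a policy-optimization step."""
    n = len(datasets)
    batch = [random.choice(datasets[k]) for k in range(n)]   # T^(k) ~ D_k
    ratings = list(range(n))                                 # c_k = k
    returns = [predict_return(T) for T in batch]             # x_k = G_theta(T^(k))
    reward_step(returns, ratings)   # soft ranks + rMSE gradient step on theta
    relabel()                       # relabel all rewards in the buffer with f_theta
    policy_step()                   # e.g. one SAC update on the relabeled buffer
```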

5. Theoretical Guarantees: Minimality and Completeness

R4 provides formal guarantees on the resulting learned reward functions under mild assumptions:

  • Deterministic Realizability: a ground-truth reward $r^*$ exists, with true return $G^*(T) = \sum_t \gamma^t r^*(s_t, a_t)$.
  • Binning: teacher ratings result from thresholding $G^*(T)$ into $n$ ordinal bins.
  • Model Realizability: $r^*$ lies in the hypothesis class $\mathcal{H}$ of $f_\theta$.
  • Exact Differentiable Sorting: The soft-rank operator recovers exact ranks when returns are strictly ordered.
  • Feasible Set:

$$R = \{ r \in \mathcal{H} : c(T_i) < c(T_j) \implies G_r(T_i) < G_r(T_j) \ \ \forall\, T_i, T_j \in D \}$$

  • Consistency: $r^*$ minimizes $L_{\mathrm{rMSE}}$.
  • Completeness and Minimality: the set of rMSE minimizers $\{ f_\theta : \theta \in \arg\min_\theta L_{\mathrm{rMSE}}(\theta) \}$ equals $R$; that is, the solution set contains all and only reward functions that induce the teacher's ordering.
  • Relaxation: the guarantees are preserved under bounded soft-rank error ($|\hat{R}_i - \mathrm{rank}_i| \leq \varepsilon$), provided $\varepsilon$ is small relative to $n$.
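Membership in the feasible set $R$ is directly checkable on a labeled dataset. A minimal sketch (names illustrative: `G_r` maps a trajectory to its return under candidate reward $r$, `labeled` is a list of `(trajectory, rating)` pairs):

```python
def in_feasible_set(G_r, labeled):
    """r is in R iff every strictly higher-rated trajectory receives a
    strictly higher return G_r, over all pairs in the labeled dataset."""
    return all(G_r(Ti) < G_r(Tj)
               for Ti, ci in labeled
               for Tj, cj in labeled
               if ci < cj)
```

Note the check is pairwise over the dataset only, matching the definition of $R$: nothing is asserted about trajectories outside $D$, and within a rating class the returns are unconstrained.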

6. Empirical Evaluation

R4 has been empirically evaluated in simulated human feedback scenarios, comparing both offline and online feedback regimes against rating-based RL (RbRL) and preference-based methods (PEBBLE, SURF, QPA). The central performance metric is the downstream policy's undiscounted environment return.

  • Offline Feedback:
    • Domains: OpenAI Gym Reacher, Inverted Double Pendulum, HalfCheetah.
    • Dataset: 50–100 trajectories per class, labeled via ground-truth return thresholds.
    • Results: R4-reward policies learn statistically significantly faster ($p < 0.05$) and reach higher final returns than both SAC trained on the ground-truth reward and RbRL in Reacher.
  • Online Feedback:
    • Domains: DeepMind Control Suite (Walker-walk/stand, Cheetah-run, Quadruped-walk/run, Humanoid-stand).
    • Budget: Fixed rated trajectories (100–200); preference methods with equivalent feedback count.
    • Baselines: RbRL, PEBBLE, SURF, QPA with optimal sampling.
    • Results: R4 matches or outperforms baselines, achieving faster learning or higher final return in 4/6 tasks (Bonferroni-corrected $p < 0.0125$).

Across both regimes, R4 models yield a higher Trajectory Alignment Coefficient, remain robust to noisy labels (up to 80% noise), and are insensitive to the choice of the number of rating classes $n$.

7. Practical Considerations

R4 presents practical advantages and implementation choices:

  • Computational Cost: Each rMSE update involves $B$ network forward passes, a Sinkhorn solve (typically 5–20 iterations on an $n \times n$ matrix, worst-case $O(n^3)$), and $B$ backpropagations. With typical $n \leq 10$ and $B = 4$–$8$, the computational overhead is modest.
  • Hyperparameters:
    • Soft-sort regularization $\varepsilon \approx 0.01$–$1.0$; performance is robust over wide ranges.
    • Policy-learning batch size per SAC conventions (256).
    • Reward-network architecture: MLP with 2–4 hidden layers and ReLU activations; no dedicated output scaling.
    • Optional $\ell_2$ weight decay ($\lambda \approx 10^{-3}$–$10^{-2}$).
  • Online Feedback Enhancements:
    • Dynamic feedback schedule: high labeling frequency early, decaying over time.
    • Stratified trajectory sampling: combine top-return and low-return trajectories from the buffer of the last 50, randomly sampling sub-segments of length 8.
    • Ensemble of 3–5 reward networks for prediction stability.
  • Limitations & Open Problems:
    • Accuracy depends on the soft-sort approximation; heavy label noise or tied returns require careful $\varepsilon$ tuning.
    • Theoretical results depend on deterministic returns and model realizability; generalization to stochastic rewards and mis-specified $\mathcal{H}$ remains open.
    • Human-in-the-loop usability: preliminary pilots show robustness to varied labeling, but large-scale user studies are pending.

R4 is characterized as a simple, hyperparameter-lean method for learning reward models from multi-class trajectory ratings by regressing soft ranks to ordinal ratings under a differentiable sorting operator, inheriting formal guarantees and demonstrating empirically strong sample efficiency and policy performance in both offline and online RL applications (Kharyal et al., 14 Jan 2026).
