Ranked Return Regression for RL (R4)
- R4 is a reward learning method in reinforcement learning that infers reward functions from human ordinal ratings by regressing predicted trajectory returns to teacher-provided rankings.
- It employs a novel ranking mean squared error loss combined with differentiable sorting to match predicted trajectory orderings with actual human ratings.
- R4 achieves formal guarantees of minimality and completeness while demonstrating enhanced policy performance and sample efficiency in benchmark robotic locomotion tasks.
Ranked Return Regression for RL (R4) is a method for reward learning in reinforcement learning (RL) that addresses the challenge of inferring reward functions from human ratings rather than explicit reward specification or binary preferences. R4 centers on a novel ranking mean squared error (rMSE) loss, treating trajectory ratings as ordinal targets, and utilizes differentiable sorting to match predicted trajectory rankings to teacher-provided ratings. Formal guarantees on minimality and completeness distinguish R4 within the class of rating-based reward learning algorithms, and empirical results demonstrate competitive or superior policy performance and sample efficiency in robotic locomotion benchmarks (Kharyal et al., 14 Jan 2026).
1. Problem Setting and Notation
R4 operates in a Markov Decision Process (MDP) without a built-in reward, supplemented by teacher ratings. The core elements are:
- State space $\mathcal{S}$, action space $\mathcal{A}$, transition kernel $P(s' \mid s, a)$, initial state distribution $\rho_0$, and discount factor $\gamma \in [0, 1)$.
- Full trajectories $\tau = (s_0, a_0, s_1, a_1, \ldots)$ are labeled by a teacher with an ordinal rating $y \in \{0, 1, \ldots, C-1\}$, with $0$ denoting "worst" and $C-1$ denoting "best."
- The dataset is $\mathcal{D} = \{(\tau_i, y_i)\}_{i=1}^{N}$, where $y_i \in \{0, \ldots, C-1\}$.
- Reward function $\hat{r}_\theta(s, a)$ with predicted trajectory return $\hat{G}_\theta(\tau) = \sum_{t} \gamma^{t} \hat{r}_\theta(s_t, a_t)$.
- Objectives:
- Learn $\hat{r}_\theta$ such that $\hat{G}_\theta$ respects the ordinal classes: higher-rated trajectories receive higher predicted returns.
- Use $\hat{r}_\theta$ as the reward function for policy optimization, maximizing $\mathbb{E}_\pi\!\left[\sum_t \gamma^t \hat{r}_\theta(s_t, a_t)\right]$ in place of a hand-crafted reward.
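The predicted trajectory return is simply a discounted sum of learned per-step rewards. A minimal sketch (the `reward_fn` argument and function name are illustrative, not the paper's implementation):

```python
def predicted_return(reward_fn, trajectory, gamma=0.99):
    """Predicted return G_theta(tau): the discounted sum of learned
    per-step rewards over a trajectory of (state, action) pairs."""
    return sum(gamma ** t * reward_fn(s, a)
               for t, (s, a) in enumerate(trajectory))
```

A policy optimizer then treats `reward_fn` exactly like an environment reward signal.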
2. Ranking Mean Squared Error (rMSE) Loss
R4 employs the rMSE loss to regress predicted soft ranks to ordinal ratings:
- In each gradient step, sample one trajectory $\tau_c$ from each class $c = 0, \ldots, C-1$, so the target rank sequence $(0, 1, \ldots, C-1)$ is strictly increasing.
- Compute predicted returns $g_c = \hat{G}_\theta(\tau_c)$.
- Compute soft ranks $\hat{\rho}_c$ as a differentiable proxy for the integer rank of $g_c$ among $(g_0, \ldots, g_{C-1})$.
- The rMSE loss is defined as:

$$\mathcal{L}_{\mathrm{rMSE}}(\theta) = \frac{1}{C} \sum_{c=0}^{C-1} \left(\hat{\rho}_c - c\right)^2.$$
This penalizes deviations between the continuous soft ranks and discrete ratings, encouraging correct trajectory ordering without imposing interval constraints within each class.
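Given soft ranks from the differentiable sorter, the loss itself is a plain mean squared error against the integer ratings. A minimal NumPy sketch (function and argument names are illustrative):

```python
import numpy as np

def rmse_loss(soft_ranks, ratings):
    """Ranking MSE: mean squared deviation of continuous soft ranks
    from the ordinal ratings 0, 1, ..., C-1."""
    soft_ranks = np.asarray(soft_ranks, dtype=float)
    ratings = np.asarray(ratings, dtype=float)
    return float(np.mean((soft_ranks - ratings) ** 2))
```

Because the targets are the class indices themselves, a perfectly ordered batch drives the loss to zero without constraining return magnitudes within a class.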
3. Differentiable Sorting: Soft Ranks
R4 utilizes a differentiable sorting operator for ranking:
- Let $\mathbf{g} = (g_0, \ldots, g_{C-1})$ represent predicted returns $g_c = \hat{G}_\theta(\tau_c)$.
- Define the cost matrix $M \in \mathbb{R}^{C \times C}$ by $M_{ij} = (g_i - \bar{g}_j)^2$, where $\bar{\mathbf{g}}$ is the sorted copy of $\mathbf{g}$.
- The soft permutation matrix $P_\varepsilon^*$ is obtained by solving the entropic optimal transport problem:

$$P_\varepsilon^* = \arg\min_{P \in \mathcal{B}_C} \langle P, M \rangle - \varepsilon H(P),$$

where $\mathcal{B}_C$ is the set of doubly stochastic matrices (Birkhoff polytope), and $H(P) = -\sum_{ij} P_{ij} (\log P_{ij} - 1)$ is the entropy.
- After Sinkhorn iterations, compute the soft-rank vector $\hat{\rho}_i = \sum_j j \, \tilde{P}_{ij}$, where $\tilde{P}$ is $P_\varepsilon^*$ with rows normalized to sum to one.
- As $\varepsilon \to 0$, the operator recovers hard sorting; for $\varepsilon > 0$, gradients propagate through $P_\varepsilon^*$ to $\theta$.
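The entropic-OT soft rank can be sketched with plain NumPy Sinkhorn iterations. This is an illustrative implementation under stated assumptions (uniform marginals, the sorted copy of the returns as transport anchors), not the paper's exact code:

```python
import numpy as np

def soft_rank(g, eps=0.1, n_iter=100):
    """Differentiable (soft) ranks of the return vector g via
    entropic optimal transport, solved with Sinkhorn iterations.

    Cost M[i, j] = (g[i] - b[j])**2 matches each return to the j-th
    smallest value b[j]; the resulting plan mixes the integer ranks
    0..C-1, and the soft rank of g[i] is the expected rank under the
    row-normalized plan.
    """
    g = np.asarray(g, dtype=float)
    n = g.size
    b = np.sort(g)                          # anchors: sorted returns
    M = (g[:, None] - b[None, :]) ** 2      # squared-distance cost
    K = np.exp(-M / eps)                    # Gibbs kernel
    u = np.ones(n)
    v = np.ones(n)
    marg = np.full(n, 1.0 / n)              # uniform marginals
    for _ in range(n_iter):                 # Sinkhorn scaling updates
        u = marg / (K @ v)
        v = marg / (K.T @ u)
    P = u[:, None] * K * v[None, :]         # soft permutation plan
    P = P / P.sum(axis=1, keepdims=True)    # row-normalize
    return P @ np.arange(n, dtype=float)    # expected ranks
```

With small `eps` the output approaches the hard integer ranks; larger `eps` blends neighboring ranks but keeps useful gradients everywhere.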
4. Training Algorithm
The R4 algorithm proceeds in two phases, reward learning and policy optimization, often cycling between the two:
- Initialization:
  - Reward network $\hat{r}_\theta$.
  - Policy $\pi_\phi$ (e.g., Soft Actor-Critic).
  - Replay buffer $\mathcal{B}$.
- Training loop:
  1. Reward learning:
     - For $c = 0, \ldots, C-1$, sample $\tau_c$ from $\mathcal{D}$ (or from $\mathcal{B}$).
     - Compute predicted returns $g_c = \hat{G}_\theta(\tau_c)$ with targets $(0, 1, \ldots, C-1)$.
     - Compute the soft ranks $\hat{\rho}$ via Sinkhorn iterations.
     - Compute $\mathcal{L}_{\mathrm{rMSE}}(\theta)$.
     - Update $\theta \leftarrow \theta - \alpha \nabla_\theta \mathcal{L}_{\mathrm{rMSE}}$.
  2. Relabel all rewards in $\mathcal{B}$ using $\hat{r}_\theta$.
  3. Run a standard RL policy update (e.g., SAC) on $\mathcal{B}$ with the relabeled rewards.
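The relabeling step can be sketched as one pass over the replay buffer that swaps stored environment rewards for learned-reward predictions. The `(s, a, r, s', done)` tuple layout here is an assumption for illustration, not the paper's data structure:

```python
def relabel_buffer(buffer, reward_fn):
    """Replace each stored reward with the learned reward model's
    prediction, so the downstream policy update trains on r_theta."""
    return [(s, a, reward_fn(s, a), s_next, done)
            for (s, a, _, s_next, done) in buffer]
```

Relabeling the whole buffer keeps old experience consistent with the current reward model after each reward-learning phase.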
5. Theoretical Guarantees: Minimality and Completeness
R4 provides formal guarantees on the resulting learned reward functions under mild assumptions:
- Deterministic Realizability: a ground-truth reward $r^*$ exists such that each trajectory's return $G^*(\tau)$ is deterministic.
- Binning: teacher ratings result from thresholding $G^*(\tau)$ into $C$ ordinal bins.
- Model Realizability: $r^*$ lies in the hypothesis class of $\hat{r}_\theta$.
- Exact Differentiable Sorting: the soft-rank operator recovers the exact ranks when returns are strictly ordered.
- Feasible Set: let $\Theta^*$ denote the set of parameters $\theta$ whose predicted returns $\hat{G}_\theta$ order all trajectories consistently with the teacher's bins.
- Consistency: every $\theta \in \Theta^*$ minimizes $\mathcal{L}_{\mathrm{rMSE}}$.
- Completeness and Minimality: $\arg\min_\theta \mathcal{L}_{\mathrm{rMSE}} = \Theta^*$; i.e., the solution set contains all and only reward functions inducing the teacher's ordering.
- Relaxation to bounded soft-rank error (soft ranks within $\delta$ of the exact ranks) preserves the guarantees if $\delta$ is small relative to the unit spacing between adjacent ranks.
6. Empirical Evaluation
R4 has been empirically evaluated in simulated human feedback scenarios, comparing both offline and online feedback regimes against rating-based RL (RbRL) and preference-based methods (PEBBLE, SURF, QPA). The central performance metric is the downstream policy's undiscounted environment return.
- Offline Feedback:
- Domains: OpenAI Gym Reacher, Inverted Double Pendulum, HalfCheetah.
- Dataset: 50–100 trajectories per class, labeled via ground-truth return thresholds.
- Results: R4-reward policies learn statistically significantly faster and reach higher final returns than SAC trained on the ground-truth reward and than RbRL in Reacher.
- Online Feedback:
- Domains: DeepMind Control Suite (Walker-walk/stand, Cheetah-run, Quadruped-walk/run, Humanoid-stand).
- Budget: Fixed rated trajectories (100–200); preference methods with equivalent feedback count.
- Baselines: RbRL, PEBBLE, SURF, QPA with optimal sampling.
- Results: R4 matches or outperforms baselines, achieving faster learning or higher final return in 4/6 tasks (Bonferroni-corrected significance tests).
Across both regimes, R4 models yield a higher Trajectory Alignment Coefficient, maintain robustness to noisy labels (up to 80% label noise), and are insensitive to the choice of the entropic-regularization parameter $\varepsilon$.
7. Practical Considerations
R4 presents practical advantages and implementation choices:
- Computational Cost: each rMSE update involves $C$ network forward passes, a Sinkhorn solve (typically 5–20 iterations on a $C \times C$ cost matrix, worst-case $O(C^2)$ per iteration), and a backward pass through the soft-rank operator. Standard settings use small $C$ (at most 8), so the computational overhead is modest.
- Hyperparameters: chiefly the number of rating classes $C$, the entropic-regularization parameter $\varepsilon$, and the Sinkhorn iteration count.
- Online Feedback Enhancements:
- Dynamic feedback schedule: high labeling frequency early, decaying over time.
- Stratified trajectory sampling: combine top-return and low-return trajectories from a buffer of the most recent 50 trajectories, randomly sampling sub-segments of length 8.
- Ensemble of 3–5 reward networks for prediction stability.
- Limitations & Open Problems:
- Accuracy depends on the quality of the soft-sort approximation; large label noise or tied returns require careful tuning.
- Theoretical results depend on deterministic returns and model realizability; generalization to stochastic rewards and mis-specified model classes is open.
- Human-in-the-loop usability: preliminary pilots show robustness to varied labeling, but large-scale user studies are pending.
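The ensemble enhancement above can be sketched as averaging several independently trained reward networks at prediction time. Mean aggregation is an assumption here; the source does not state the combination rule:

```python
import numpy as np

def ensemble_reward(reward_fns, s, a):
    """Average the per-step predictions of an ensemble of 3-5
    independently initialized reward networks for stability."""
    return float(np.mean([f(s, a) for f in reward_fns]))
```

Averaging damps the variance of any single reward network, which is what makes the relabeled rewards more stable targets for the policy update.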
R4 is characterized as a simple, hyperparameter-lean method for learning reward models from multi-class trajectory ratings by regressing soft ranks to ordinal ratings under a differentiable sorting operator, inheriting formal guarantees and demonstrating empirically strong sample efficiency and policy performance in both offline and online RL applications (Kharyal et al., 14 Jan 2026).