Preference & Rating Based RL
- Preference-based and rating-based RL are human-in-the-loop paradigms that use pairwise comparisons and scalar ratings, respectively, to circumvent traditional reward design challenges.
- These methods integrate human feedback into reward inference pipelines, achieving near-optimal regret bounds and efficient policy updates through active query strategies.
- Empirical results highlight that rating-based RL offers quicker, less cognitively demanding feedback, while preference-based RL delivers finer-grained comparisons critical for nuanced learning.
Preference-based and rating-based reinforcement learning are two human-in-the-loop RL paradigms designed to circumvent the challenges of reward engineering by leveraging direct human feedback. The central distinction lies in whether feedback is provided via relative preferences between trajectory pairs, or via scalar/ordinal ratings of individual trajectories. Both approaches have motivated extensive algorithmic, theoretical, and empirical research in recent years, with rigorous comparisons, practical algorithm design, and statistical guarantees now available.
1. Formal Definitions and Core Protocols
In preference-based reinforcement learning (PbRL), the agent receives pairwise comparisons: given two trajectory segments $\sigma^1$ and $\sigma^2$, the supervisor provides a label indicating preference or tie. The standard stochastic model for preferences is the Bradley–Terry or logistic model, parameterizing the probability of a human preferring one segment over another as

$$P(\sigma^1 \succ \sigma^2) = \frac{\exp\big(\hat{R}(\sigma^1)\big)}{\exp\big(\hat{R}(\sigma^1)\big) + \exp\big(\hat{R}(\sigma^2)\big)},$$

where $\hat{R}(\sigma)$ is the predicted return of the segment, possibly inferred from a parametric or non-parametric reward model (Kim et al., 2023, Wang et al., 2023).
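A minimal sketch of this likelihood and the associated cross-entropy loss, assuming segment returns are sums of per-step predicted rewards (function names here are illustrative, not any paper's API):

```python
import numpy as np

def segment_return(rewards):
    """Predicted return of a segment: sum of per-step predicted rewards."""
    return float(np.sum(rewards))

def preference_prob(r_hat_1, r_hat_2):
    """Bradley-Terry / logistic probability that segment 1 is preferred over segment 2."""
    g1, g2 = segment_return(r_hat_1), segment_return(r_hat_2)
    # Softmax over the two predicted returns, i.e. a sigmoid of their difference.
    return 1.0 / (1.0 + np.exp(-(g1 - g2)))

def preference_loss(r_hat_1, r_hat_2, label):
    """Cross-entropy loss for one comparison; label = 1.0 (prefer 1), 0.0 (prefer 2), 0.5 (tie)."""
    p = preference_prob(r_hat_1, r_hat_2)
    eps = 1e-8
    return -(label * np.log(p + eps) + (1.0 - label) * np.log(1.0 - p + eps))

# Example: the labeler prefers the first of two 3-step segments.
loss = preference_loss(np.array([0.2, 0.5, 0.1]), np.array([0.0, 0.1, 0.0]), label=1.0)
```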
In rating-based reinforcement learning (RbRL), feedback is a scalar or multiclass label attached to a single trajectory. The rating model seeks to assign predicted returns to bins or scales, and is typically trained with a multi-class loss on human-labeled examples (White et al., 2023).
Both methods use the inferred reward information to update the policy via standard RL optimization, most often with a modular structure: (1) reward inference from human feedback, (2) RL or control via learned reward.
2. Preference-Based RL: Algorithms and Theoretical Guarantees
Modern PbRL algorithms follow a generic loop: (1) collect policy-generated trajectories, (2) sample candidate trajectory pairs, (3) query human or simulated oracle for preference feedback, (4) update a reward/reward-like model using cross-entropy (logistic) loss, and (5) improve the policy using the latest inferred rewards (Lee et al., 2021, Novoseller et al., 2019, Wu et al., 2023). Variants such as Dueling RL (Pacchiano et al., 2021) use upper-confidence exploration and feature-based generalized linear preference models.
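A schematic version of this loop is sketched below. All collaborators are passed in as duck-typed objects or callables; none of these interfaces come from a specific paper's codebase.

```python
def pbrl_loop(env, policy, reward_model, oracle, collect_fn, pair_sampler,
              n_iterations=100, pairs_per_iter=32):
    """Generic preference-based RL loop (sketch).

    Assumed interfaces:
      collect_fn(env, policy)        -> list of trajectories
      pair_sampler(trajectories, n)  -> list of (segment_a, segment_b) pairs
      oracle(segment_a, segment_b)   -> preference label in {1.0, 0.0, 0.5}
      reward_model.update / .predict -> reward inference from labeled pairs
      policy.update(env, reward_fn)  -> RL step against the learned reward
    """
    for _ in range(n_iterations):
        # (1) Collect trajectories with the current policy.
        trajectories = collect_fn(env, policy)
        # (2) Sample candidate segment pairs (e.g., by reward-model disagreement).
        pairs = pair_sampler(trajectories, pairs_per_iter)
        # (3) Query the human or simulated oracle for preference labels.
        labels = [oracle(seg_a, seg_b) for seg_a, seg_b in pairs]
        # (4) Fit the reward model with the logistic (cross-entropy) loss.
        reward_model.update(pairs, labels)
        # (5) Improve the policy against the latest inferred rewards.
        policy.update(env, reward_fn=reward_model.predict)
    return policy
```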
Key theoretical results:
- In tabular, linear, and general function-approximation MDPs, optimal or nearly-optimal regret and sample/query complexity bounds are attainable, matching those for scalar-reward RL up to polynomial factors in model complexity (Wu et al., 2023, Wang et al., 2023, Chen et al., 2022).
- For any utility-based preference model (e.g., logistic/Bradley-Terry), preference-based RL is not fundamentally more difficult than standard RL: sample and query complexity depend on the eluder dimension or other structural complexity measures of the reward/preference class (Wang et al., 2023).
- Key regret bounds scale as $\tilde{O}(d\sqrt{T})$ for $d$-dimensional generalized linear preference models, similar to reward-based bandits (Pacchiano et al., 2021, Novoseller et al., 2019).
Algorithmic advances include:
- Active preference querying via uncertainty, disagreement, or coverage to maximize feedback informativeness (Lee et al., 2021).
- Parameter-efficient approaches such as Inverse Preference Learning, which avoid explicit reward modeling and fit the Q-function directly to preference data via inverse Bellman operators (Hejna et al., 2023); a sketch follows this list.
- Direct policy optimization methods that align policies to human preferences without reward modeling, using contrastive or other pairwise objectives (An et al., 2023).
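A rough illustration of the inverse-Bellman idea behind IPL, simplified by omitting the entropy term of the soft operator used in practice; `q_values` and `next_values` are assumed inputs, not the authors' interface:

```python
import numpy as np

def implied_rewards(q_values, next_values, gamma=0.99):
    """Inverse Bellman operator (simplified): the per-step reward implied by a
    Q-function along a segment, r(s, a) = Q(s, a) - gamma * V(s')."""
    return np.asarray(q_values) - gamma * np.asarray(next_values)

def segment_logit(q_values, next_values, gamma=0.99):
    """Implied return over a segment; plugging this into the Bradley-Terry
    likelihood from Section 1 lets preference labels supervise Q directly."""
    return float(implied_rewards(q_values, next_values, gamma).sum())

# Example: Q-values and next-state values along a 3-step segment.
logit = segment_logit([1.0, 0.8, 0.5], [0.7, 0.4, 0.0])
```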
Recent work also integrates large pretrained models and selective human supervision to reduce annotation costs without sacrificing performance (Ghosh et al., 3 Feb 2025).
3. Rating-Based RL: Methods and Empirical Findings
Rating-based RL algorithms use scalar (e.g., 1–5 stars) or categorical ratings to fit a reward predictor directly by regression/classification. The multi-class loss is a generalization of the cross-entropy to ordinal bins, often incorporating adaptive binning based on observed label distributions (White et al., 2023). The RL agent then uses the learned scalar reward as the return for standard control methods.
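A minimal sketch of such a rating head, assuming quantile-based adaptive binning and a distance-to-bin-center soft assignment; this is an illustrative construction, not the exact loss of White et al. (2023):

```python
import numpy as np

def adaptive_bins(predicted_returns, n_classes):
    """Bin boundaries from the empirical distribution of predicted returns,
    so each rating class covers roughly an equal share of segments."""
    qs = np.linspace(0.0, 1.0, n_classes + 1)[1:-1]
    return np.quantile(predicted_returns, qs)

def rating_class_probs(predicted_return, boundaries, temperature=1.0):
    """Soft assignment of a predicted return to rating classes: logits are the
    negative distances to each bin's center (a simple ordinal-aware choice)."""
    centers = np.concatenate((
        [boundaries[0] - 1.0],
        (boundaries[:-1] + boundaries[1:]) / 2.0,
        [boundaries[-1] + 1.0],
    ))
    logits = -np.abs(predicted_return - centers) / temperature
    exp = np.exp(logits - logits.max())
    return exp / exp.sum()

def rating_loss(predicted_return, human_rating, boundaries):
    """Multi-class cross-entropy between the soft assignment and the human label."""
    probs = rating_class_probs(predicted_return, boundaries)
    return -np.log(probs[human_rating] + 1e-8)

# Example: five rating classes fitted from a batch of predicted returns.
bounds = adaptive_bins(np.random.randn(200), n_classes=5)
loss = rating_loss(predicted_return=0.3, human_rating=3, boundaries=bounds)
```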
Empirical and user studies show that RbRL offers important advantages over PbRL:
- RbRL achieves similar or superior policy performance with 30–60% fewer human feedback queries in both synthetic and real-user studies, attributed to the higher information content per rating (White et al., 2023).
- Ratings are more consistent for users, faster to provide (∼60% faster than preferences in timed studies), and less cognitively demanding, leading to higher confidence and lower frustration.
- However, ratings are subject to scale miscalibration, inter-rater variability, and bias; binning and adaptive normalization can mitigate but not eliminate such effects. Ratings may also lose fine granularity within bins or at scale boundaries.
A key limitation is that scalar ratings may obscure critical events within a trajectory, as the global score may fail to reveal temporally localized failures or successes that are easily surfaced by pairwise preferences (Kim et al., 2023).
4. Hybrid Algorithms and Extensions
Hybrid and generalized feedback frameworks have emerged:
- Multi-way ($K$-wise) preference queries generalize pairwise models, offering improved statistical efficiency; a single $K$-way query induces up to $\binom{K}{2}$ pairwise comparisons, reducing total query cost (Wang et al., 2023, Chen et al., 2022). A decomposition sketch follows this list.
- Preferences can be converted to ratings (e.g., via the Borda method), and ratings can generate pairwise preferences—a conversion used to unify theoretical analyses (White et al., 2023, An et al., 2023).
- Some methods (e.g., Preference Transformer (Kim et al., 2023)) model human feedback as a non-Markovian, temporally aware sequence-to-label problem, with transformer architectures attending to “critical events.”
- Vision-language models and self-supervised semantic adaptation have been used to both generate pseudo-feedback and enable efficient transfer across tasks, with uncertainty filtering maintaining label quality (Ghosh et al., 3 Feb 2025).
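A sketch of the decomposition mentioned in the first bullet above: a full $K$-way ranking induces $\binom{K}{2}$ ordered pairs that can reuse the ordinary pairwise loss (segment identifiers are placeholders):

```python
from itertools import combinations

def ranking_to_pairs(ranked_segments):
    """Decompose a K-way ranking (best first) into its induced pairwise comparisons.

    Returns (winner, loser) tuples; a ranking of K items induces K*(K-1)/2 pairs,
    which can be fed to the same Bradley-Terry loss as ordinary pairwise queries.
    """
    return [(ranked_segments[i], ranked_segments[j])   # i is ranked above j
            for i, j in combinations(range(len(ranked_segments)), 2)]

# Example: a single 4-way ranking yields 6 training comparisons.
pairs = ranking_to_pairs(["seg_a", "seg_b", "seg_c", "seg_d"])
assert len(pairs) == 6
```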
Reward inference pipelines may also incorporate auxiliary self-supervised representation learning to capture transition dynamics (as in REED (Metcalf et al., 2022)), which substantially reduces label requirements in PbRL.
5. Robustness, Sample Complexity, and Practical Considerations
Robustness to human “irrationality,” noise, and annotation error is a central concern:
- Simulated teacher models with parameters for rationality, myopia, random mistakes, skipping, and indifference have become standard in benchmarks (see B-Pref (Lee et al., 2021)); a sketch of such a teacher follows this list.
- Both PbRL and RbRL are affected by query selection: uncertainty-based sampling most consistently yields faster learning, while coverage-based and random sampling are less effective (Lee et al., 2021).
- Off-policy reward relabeling and pretraining for enhanced exploration further increase feedback efficiency.
- RbRL is more vulnerable to label imbalance (e.g., most ratings falling in the lowest bins), while PbRL is sensitive to ambiguity in comparisons near the decision boundary.
- Algorithms now achieve $\tilde O(\mathrm{poly}(d,H)\sqrt{K})$ regret with efficient query schedules under general function approximation (Chen et al., 2022). Randomized Thompson-style approaches yield optimal trade-offs between regret and query complexity (Wu et al., 2023).
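A sketch of such a simulated teacher, in the spirit of B-Pref's synthetic labelers (parameter names are illustrative, not the benchmark's exact interface): rationality enters as a Boltzmann temperature, myopia as a per-step discount, and thresholds model skipped or indifferent labels.

```python
import numpy as np

def simulated_teacher(seg1_rewards, seg2_rewards, beta=10.0, myopia=1.0,
                      eps_mistake=0.0, skip_thresh=-np.inf, equal_thresh=0.0,
                      rng=None):
    """B-Pref-style synthetic labeler (sketch; parameters are illustrative).

    Returns 1.0 (prefer segment 1), 0.0 (prefer segment 2), 0.5 (indifferent),
    or None (query skipped).
    """
    rng = rng or np.random.default_rng()
    seg1, seg2 = np.asarray(seg1_rewards, float), np.asarray(seg2_rewards, float)

    # Myopia: down-weight early steps so mostly the recent part of each segment counts.
    w1 = myopia ** np.arange(len(seg1))[::-1]
    w2 = myopia ** np.arange(len(seg2))[::-1]
    g1, g2 = float(np.sum(w1 * seg1)), float(np.sum(w2 * seg2))

    if max(g1, g2) < skip_thresh:      # both segments look poor: withhold the label
        return None
    if abs(g1 - g2) < equal_thresh:    # too close to call: declare indifference
        return 0.5

    # Boltzmann-rational choice; large beta approaches a perfectly rational teacher.
    p1 = 1.0 / (1.0 + np.exp(-beta * (g1 - g2)))
    label = 1.0 if rng.random() < p1 else 0.0
    if rng.random() < eps_mistake:     # occasional random mistake flips the label
        label = 1.0 - label
    return label
```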
6. Comparative Insights and Open Directions
| Aspect | Preference-Based RL | Rating-Based RL |
|---|---|---|
| Query | Pairwise or $K$-way comparison | Scalar or ordinal rating per trajectory |
| Loss | Cross-entropy (Bradley-Terry, logistic) | Multi-class or regression |
| Info per query | 1 bit (binary); more for $K$-way queries (see sketch below) | $\log_2 n$ bits ($n$ classes) |
| Sampling | Active (uncertainty, disagreement, coverage) | Uncertainty; ensemble variance |
| User studies | Slower, more cognitively demanding | Faster, less frustrating |
| Pitfalls | No sense of margin or absolute quality; harder to calibrate | Sensitive to scale miscalibration, label drift |
| Granularity | Captures relative, subtle distinctions | Lacks fine-grained intra-bin order |
| Theory | Regret/query bounds matching reward-based RL | Analogous guarantees under regression assumptions |
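A quick worked check on the "Info per query" row, under the idealized assumption of uniformly distributed labels (real human labels carry less because of noise and skew):

```python
import math

pairwise_bits = math.log2(2)    # 1.0 bit per binary preference
kway_top1_bits = math.log2(4)   # 2.0 bits if a 4-way query returns only the best segment
rating_bits = math.log2(5)      # ~2.32 bits per 1-5 star rating
print(pairwise_bits, kway_top1_bits, rating_bits)
```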
Both paradigms are theoretically competitive with reward-based RL under mild assumptions, and mixture/hybrid approaches (rating followed by local preferences, etc.) are promising open directions. Further challenges include robust scale calibration, adaptive querying, handling non-stationarity in human judgment, integrating richer forms of feedback (natural language, demonstrations), and validating results in large-scale real-world systems (White et al., 2023, Kim et al., 2023, Lee et al., 2021).
7. Notable Models, Benchmarks, and Architectures
- Preference Transformer: Non-Markovian sequence model with attention weights, extracting critical events and temporal dependencies in preferences (Kim et al., 2023).
- PEBBLE and PrefPPO: Preference-driven pipelines with off-policy relabeling and unsupervised pretraining (Lee et al., 2021, Metcalf et al., 2022).
- Inverse Preference Learning (IPL): Q-function based PbRL dispensing with explicit reward models (Hejna et al., 2023).
- Direct Preference Policy Optimization (DPPO): Contrastive policy optimization, aligning policies directly with preferences (An et al., 2023).
- B-Pref: Benchmark for systematic evaluation, including a suite of synthetic "irrational" teacher models (Lee et al., 2021).
- PrefVLM: Annotation-efficient preference RL via pre-trained vision-language models and selective human feedback (Ghosh et al., 3 Feb 2025).
- REED: Key innovation in integrating self-supervised dynamics modeling into reward inference, drastically improving data efficiency (Metcalf et al., 2022).
- Dueling Posterior Sampling: Bayesian posterior sampling over reward and dynamics in the preference setting, with information ratio based analysis (Novoseller et al., 2019).
- Dueling RL: Optimism-based exploration with trajectory preference models, tight finite-time regret bounds (Pacchiano et al., 2021).
These architectures—along with associated benchmarks—represent the current state of the art for leveraging preference and rating-based human feedback in reinforcement learning, offering practical and theoretically grounded frameworks for aligning RL agents with human intent.