Offline Preference-based RL
- Offline Preference-based RL is a framework that learns policies from fixed datasets using human trajectory preferences instead of explicit scalar rewards, enhancing safety and sample efficiency.
- It integrates reward inference methods like Bradley-Terry loss, listwise models, and diffusion-based approaches with offline algorithms such as CQL and IQL for robust policy optimization.
- This paradigm has proven effective in robotics, simulation, and language tasks by addressing challenges in reward engineering, label noise, and out-of-distribution generalization.
Offline Preference-based Reinforcement Learning (PbRL) is a paradigm in reinforcement learning that seeks to optimize policies from fixed datasets of trajectories using human or expert preferences over those trajectories, rather than explicit scalar rewards. This approach is especially valuable in domains where reward engineering is difficult or infeasible, and direct environment interaction is costly or unsafe. Offline PbRL intersects research in offline RL, reward inference from human feedback, credit assignment, sample efficiency, robustness to label noise, and safe policy optimization. Over the last several years, a rich body of work has advanced both the theoretical foundations and practical algorithms of offline PbRL, enabling strong performance across robotics, simulation, and language domains.
1. Core Problem Formulation
Offline PbRL operates within a Markov Decision Process (MDP) $\mathcal{M}$, but does not assume access to the true reward function $r$. Instead, the learner receives an offline dataset $\mathcal{D}$ of transition tuples or entire trajectories, and a finite preference dataset $\mathcal{D}_{\mathrm{pref}}$ with annotations comparing short trajectories or trajectory segments. Preferences may be binary, ternary, or ranked listwise judgments.
The typical workflow involves two phases:
- Reward Inference: Using preference data, infer a surrogate (implicit or explicit) reward model $\hat{r}_\psi$ such that, for two trajectory segments $\sigma^0$ and $\sigma^1$, the probability that $\sigma^1$ is preferred to $\sigma^0$ is modeled via a parametric likelihood, most commonly the Bradley–Terry model:

$$P_\psi(\sigma^1 \succ \sigma^0) = \frac{\exp\big(\sum_t \hat{r}_\psi(s^1_t, a^1_t)\big)}{\exp\big(\sum_t \hat{r}_\psi(s^0_t, a^0_t)\big) + \exp\big(\sum_t \hat{r}_\psi(s^1_t, a^1_t)\big)}$$
- Policy Optimization: Treat the inferred $\hat{r}_\psi$ as the reward function and apply any offline RL algorithm (e.g., CQL, IQL, Decision Transformer) to the offline data, yielding a policy $\pi$ (a code sketch of this two-phase pipeline follows).
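As a concrete illustration of the two-phase workflow, the following is a minimal PyTorch-style sketch of Bradley–Terry reward inference over trajectory segments. The class and helper names are hypothetical, segments are assumed to be tensors of concatenated state-action features, and the second phase (relabeling the offline data and running an offline RL algorithm) is only indicated in a comment.

```python
import torch
import torch.nn as nn

class SegmentRewardModel(nn.Module):
    """MLP reward model r_psi(s, a); a segment's return is the sum over its steps."""
    def __init__(self, obs_act_dim: int, hidden: int = 256):
        super().__init__()
        self.net = nn.Sequential(
            nn.Linear(obs_act_dim, hidden), nn.ReLU(),
            nn.Linear(hidden, hidden), nn.ReLU(),
            nn.Linear(hidden, 1),
        )

    def segment_return(self, segment: torch.Tensor) -> torch.Tensor:
        # segment: (batch, T, obs_act_dim) -> (batch,) summed predicted reward
        return self.net(segment).squeeze(-1).sum(dim=-1)

def bradley_terry_loss(model: SegmentRewardModel,
                       seg0: torch.Tensor,
                       seg1: torch.Tensor,
                       label: torch.Tensor) -> torch.Tensor:
    """label = 1 if seg1 is preferred, 0 if seg0 is preferred."""
    logits = model.segment_return(seg1) - model.segment_return(seg0)
    # P(seg1 > seg0) = sigmoid(R(seg1) - R(seg0)): the Bradley-Terry cross-entropy
    return nn.functional.binary_cross_entropy_with_logits(logits, label.float())

# Phase 2 (not shown): after training, relabel the offline transitions with r_psi
# and pass the relabeled dataset to an offline RL algorithm such as CQL or IQL.
```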
Variants extend or collapse these phases: direct contrastive policy optimization without explicit reward modeling (An et al., 2023), contextual/hindsight embedding (Kang et al., 2023), and credit assignment through auxiliary demonstration data (Gao et al., 21 Aug 2025).
2. Reward Inference Methodologies
The majority of offline PbRL algorithms infer scalar rewards using supervised preference-modeling losses. Key formulations include:
- Bradley-Terry Cross-Entropy Loss:
Minimize
$$\mathcal{L}(\psi) = -\mathbb{E}_{(\sigma^0, \sigma^1, y) \sim \mathcal{D}_{\mathrm{pref}}}\Big[(1-y)\,\log P_\psi(\sigma^0 \succ \sigma^1) + y\,\log P_\psi(\sigma^1 \succ \sigma^0)\Big]$$
using neural network parameterizations of $\hat{r}_\psi$ (Zhang et al., 2024, Xu et al., 2024, Tu et al., 2024).
- Listwise and Second-Order Models:
Listwise Reward Estimation (LiRE) (Choi et al., 2024) constructs a ranked list of trajectory segments for richer preference supervision, amplifying signal strength and improving credit assignment.
- Diffusion-based Models:
Diffusion Preference-based Reward (DPR) (Pang et al., 3 Mar 2025) learns generative, preference-conditioned distributions over state-action pairs and extracts discriminative rewards from denoising reconstruction errors, aiming to overcome the capacity limitations of MLP- and Transformer-based reward models.
- Contrastive Embedding-based Approaches:
SARA (Rajaram et al., 14 Jun 2025) trains a sequence encoder to cluster preferred trajectories in latent space; the reward becomes the cosine similarity to the "preferred prototype", yielding robustness to label noise (a minimal sketch of this idea follows this list).
- Direct Policy Optimization:
Bypassing reward modeling, DPPO (An et al., 2023) treats preference alignment as a contrastive objective over action-segment distances, updating the policy to imitate preferred behaviors directly.
- Action-Constraint and Pessimism:
PRC (Xu et al., 2023) restricts actions to the support of the offline dataset, enforcing hard pessimism to avoid reward hacking and out-of-distribution exploitation.
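The contrastive, embedding-based alternative can be sketched as follows. This is a simplified illustration, not the exact SARA procedure: it assumes a pretrained sequence encoder that maps a batch of segments to latent vectors, and it uses the mean embedding of preferred segments as the prototype.

```python
import torch
import torch.nn.functional as F

def prototype_reward(encoder: torch.nn.Module,
                     preferred_segments: torch.Tensor,
                     segments: torch.Tensor) -> torch.Tensor:
    """Reward = cosine similarity between a segment embedding and the mean
    ("prototype") embedding of preferred segments.

    preferred_segments: (N, T, obs_act_dim) segments labeled as preferred
    segments:           (B, T, obs_act_dim) segments to relabel with rewards
    """
    with torch.no_grad():
        prototype = encoder(preferred_segments).mean(dim=0)   # (latent_dim,)
        z = encoder(segments)                                  # (B, latent_dim)
    # Cosine similarity in latent space serves as the surrogate reward signal.
    return F.cosine_similarity(z, prototype.unsqueeze(0), dim=-1)
```

Because the reward depends on proximity to an aggregate prototype rather than on individual pairwise comparisons, occasional mislabeled preferences have a diluted effect, which is consistent with the noise robustness reported for this family of methods.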
3. Policy Learning and Regularization
Offline PbRL inherits the challenges of conventional offline RL—especially distributional shift and OOD generalization. Key techniques include:
- In-Dataset Trajectory Return Regularization (DTR):
Combines Decision Transformer imitation with TD-learning critics, ensuring the learned policy remains close to high-return trajectories from the data and mitigating unrealistic trajectory stitching (Tu et al., 2024).
- Credit Assignment with Demonstration Priors:
Search-Based Preference Weighting (SPW) (Gao et al., 21 Aug 2025) weights steps within preference segments by nearest-neighbor similarity to expert demonstrations, focusing reward gradients on critical transitions.
- Robust Planning and Conservatism:
Distributionally robust optimization and adversarial two-player game formulations (APPO) guarantee conservatism without requiring explicit confidence set constructions (Zhan et al., 2023, Kang et al., 7 Mar 2025). The agent alternately maximizes policy value while the adversary minimizes it within data-supported models.
- Ensemble Normalization and Pseudo-Label Filtering:
Ensemble aggregation of reward predictions and variance/confidence-based filtering of pseudo-labeled trajectories maintain reward differentiation and minimize incorrect reward assignments, boosting sample efficiency (Liu et al., 2024); a sketch of this filtering step follows this list.
- Online Preference Fine-Tuning and Safe Alignment:
Fine-tuning behavioral cloning policies (BRIDGE) (Macuglia et al., 30 Sep 2025) blends offline demonstrations with online preference queries in an uncertainty-weighted objective, achieving sublinear regret that vanishes as offline data grows. Offline Safe POHF/PreSa (Gong et al., 23 Dec 2025) integrates direct preference and binary safety feedback for constrained policy learning via Lagrangian optimization.
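A minimal sketch of the ensemble-based pseudo-label filtering idea, assuming an ensemble of segment-level reward models exposing a `segment_return` method like the one sketched earlier; the thresholds and the exact confidence criterion are illustrative choices, not the specific rules of (Liu et al., 2024).

```python
import torch

def filter_pseudo_labels(ensemble, seg0, seg1,
                         conf_threshold: float = 0.9,
                         var_threshold: float = 0.1):
    """Keep only synthetic preference pairs on which the reward ensemble is
    both confident (mean P(seg1 > seg0) far from 0.5) and consistent (low
    variance across ensemble members)."""
    # Per-member preference probabilities via the Bradley-Terry link.
    probs = torch.stack([
        torch.sigmoid(m.segment_return(seg1) - m.segment_return(seg0))
        for m in ensemble
    ])                                          # (num_members, batch)
    mean_p, var_p = probs.mean(dim=0), probs.var(dim=0)
    confident = (mean_p - 0.5).abs() >= (conf_threshold - 0.5)
    consistent = var_p <= var_threshold
    keep = confident & consistent
    # Pseudo-label: 1 if seg1 is judged preferred, 0 otherwise.
    return keep, (mean_p >= 0.5).long()
```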
4. Sample Efficiency, Robustness, and Theoretical Guarantees
Much recent work aims to maximize the utility of limited preference labels:
- Active Data Augmentation:
LEASE (Liu et al., 2024) uses learned transition models to generate synthetic rollouts, then pseudo-labels augmented trajectories via an ensemble reward model; only high-confidence, low-variance data are retained, dramatically improving label efficiency.
- Posterior Sampling / Active Querying:
PSPL (Agnihotri et al., 31 Jan 2025) maintains Bayesian posteriors over both reward and dynamics parameters, selecting policies to compare via top-two Thompson sampling (see the sketch after this list); provable bounds on Bayesian regret decay rapidly with the preference budget and offline coverage.
- Preference Elicitation from Simulated Rollouts:
Sim-OPRL (Pace et al., 2024) employs a learned transition model to elicit human feedback on simulated trajectories, pairing pessimistic planning (for OOD robustness) with optimistic query selection (maximizing reward uncertainty).
- Trajectory-wise Concentrability and Provability:
Recent theory (Zhan et al., 2023, Liu et al., 2024) establishes that trajectory-wise, not per-step, concentrability is necessary for minimizing suboptimality. Guarantees are now formulated in terms of how well the offline data "cover" the learned or target policy at the level of entire trajectories or state-action pairs.
- Generalization Bound for Reward Models:
With high probability, the empirical pseudo-label error, model capacity (Rademacher complexity), and number of labels control reward prediction error and policy performance (Liu et al., 2024). State-action based bounds now supersede trajectory-level rates.
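The top-two Thompson sampling query-selection step from the posterior-sampling approach can be sketched generically as follows. Here `posterior_sample_fn` and `candidate_return_fn` are hypothetical interfaces standing in for the Bayesian reward/dynamics posterior and a return estimate under sampled parameters; this is a schematic of the selection rule, not the exact PSPL algorithm.

```python
import numpy as np

def select_query_top_two(posterior_sample_fn, candidate_return_fn,
                         candidates, rng, max_resamples: int = 50):
    """Top-two Thompson sampling over candidate trajectories/policies:
    draw a posterior sample, pick its best candidate; redraw until a
    different candidate wins, then query a preference between the two."""
    theta = posterior_sample_fn(rng)                    # sample reward/dynamics params
    first = int(np.argmax([candidate_return_fn(c, theta) for c in candidates]))
    second = first
    for _ in range(max_resamples):
        theta2 = posterior_sample_fn(rng)
        challenger = int(np.argmax([candidate_return_fn(c, theta2)
                                    for c in candidates]))
        if challenger != first:
            second = challenger
            break
    return first, second   # indices of the two items shown to the annotator
```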
5. Robustness to Label Noise, Feedback Structure, and Safety Constraints
Key advances in robustness include:
- Contrastive Objective Robustness:
SARA (Rajaram et al., 14 Jun 2025) empirically achieves up to +31% gain under 20% label flips compared to BT-based baselines, by mitigating the impact of noisy or mismatched preferences through subsetwise latent encoding.
- Preference Strength and Listwise Feedback:
LiRE (Choi et al., 2024), by constructing full ranked lists with ternary and second-order labels, improves reward correlation and policy success rates even under 30% noise, outperforming pairwise-only methods and remaining robust to noisy feedback.
- Safe Alignment from Heterogeneous Feedback:
PreSa (Gong et al., 23 Dec 2025) unites contrastive preference learning and binary safety classification in a constrained Lagrangian optimization, consistently matching or exceeding reward/cost tradeoffs of oracle safe RL even using only offline human signals.
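The Lagrangian treatment of safety feedback can be illustrated with a generic primal-dual update. Here `policy_loss` stands for whatever preference-alignment objective is being minimized and `expected_cost` for an estimate from a learned safety/cost model, so this is a schematic of constrained optimization rather than the PreSa algorithm itself.

```python
import torch

def lagrangian_step(policy_loss: torch.Tensor,
                    expected_cost: torch.Tensor,
                    lam: torch.Tensor,
                    cost_budget: float,
                    lambda_lr: float = 1e-3):
    """One primal-dual step of constrained policy learning."""
    # Primal: minimize alignment loss plus multiplier-weighted constraint violation.
    total_loss = policy_loss + lam.detach() * (expected_cost - cost_budget)
    # (backpropagate total_loss into the policy parameters here)
    # Dual: gradient ascent on the multiplier, projected to stay non-negative.
    with torch.no_grad():
        lam = torch.clamp(lam + lambda_lr * (expected_cost.detach() - cost_budget),
                          min=0.0)
    return total_loss, lam
```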
6. Applications and Benchmarking Domains
Offline PbRL methods are empirically validated across continuous control (MuJoCo, Meta-World, DMControl, Adroit), robotics manipulation, healthcare simulators, and text/LLM summarization. Standard metrics include normalized return, success/error rates, rank correlation with true returns, and sample complexity scaling. Several works provide comprehensive ablations and label-efficiency experiments demonstrating competitive or state-of-the-art results using dramatically reduced preference annotation budgets (Xu et al., 2024, Liu et al., 2024, Choi et al., 2024, Macuglia et al., 30 Sep 2025, Gong et al., 23 Dec 2025).
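As a point of reference for the normalized-return metric, most continuous-control benchmarks report the D4RL-style normalized score, which rescales raw episode return between random-policy and expert reference returns; the numbers in the usage comment below are illustrative placeholders.

```python
def normalized_return(raw_return: float,
                      random_score: float,
                      expert_score: float) -> float:
    """D4RL-style normalized score: 0 corresponds to a random policy,
    100 to an expert policy."""
    return 100.0 * (raw_return - random_score) / (expert_score - random_score)

# Example with illustrative reference scores for a hypothetical task:
# normalized_return(3200.0, random_score=-280.0, expert_score=12135.0) ~= 28.0
```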
| Algorithm | Label Type | Reward Model | Offline RL | Robustness Mechanism |
|---|---|---|---|---|
| PB-AIL (Zhang et al., 2024) | Pairwise | MLP-based | SAC | Virtual preference integration |
| DPPO (An et al., 2023) | Pairwise | None (contrastive) | --- | Contrastive policy metric |
| SARA (Rajaram et al., 14 Jun 2025) | Point/List | Latent similarity | IQL | Contrastive subset encoding |
| DPR (Pang et al., 3 Mar 2025) | Pairwise | Diffusion-based | CQL, IQL, TD3-BC | Score-based discrimination |
| LEASE (Liu et al., 2024) | Pairwise | Ensemble MLP | CQL, IQL | Confidence/variance filtering |
| LiRE (Choi et al., 2024) | Listwise | MLP (linear BT) | IQL | Second-order preference data |
| PreSa (Gong et al., 23 Dec 2025) | Pairwise+Safety | None | --- | Lagrangian constrained learning |
7. Open Problems and Future Directions
Current limitations include computational hardness in non-convex robust planning, optimal selection of synthetic data and pseudo-label thresholds, limited coverage in OOD regions, and feedback model expressivity—especially for noisy or multi-rater, heterogeneous feedback. Promising directions include more active data/query selection, uncertainty/robustness estimation, direct integration with model-based offline RL, theoretical analysis of diffusion and contrastive embeddings for reward alignment, and federated or multi-agent PbRL frameworks.
Offline PbRL is rapidly progressing towards sample-efficient, robust, and versatile policy learning pipelines that sidestep the need for explicit reward engineering through scalable exploitation of offline data, flexible preference structures, and conservative uncertainty-aware optimization.