
Preference As Reward (PAR)

Updated 7 January 2026
  • Preference As Reward (PAR) is a framework that uses pairwise human comparisons to define reinforcement signals instead of fixed numerical rewards.
  • PAR employs models like Bradley–Terry and latent similarity methods to robustly infer reward structures from noisy feedback.
  • Empirical results show that PAR improves sample efficiency, reward-model accuracy, and resilience against reward hacking in RL environments.

Preference As Reward (PAR) denotes a broad family of frameworks, models, and algorithms that treat human (or synthetic) preferences over behavioral trajectories as the direct source of the reinforcement signal for training autonomous agents. Instead of specifying a numerical reward function, PAR infers or directly uses preferences—most commonly pairwise comparisons, sometimes with ties or group feedback—as the basis for policy optimization and reward modeling. This paradigm addresses longstanding challenges in reinforcement learning (RL), such as reward specification, interpretability, robustness to label noise, sample efficiency, personalized alignment, and mitigation of spurious correlations (reward hacking).

1. Formal Modeling of Preferences as Reward Signals

PAR frameworks universally operate over Markov decision processes (MDPs) in which agents interact with environments defined by a state space $S$, an action space $A$, a transition kernel, and an unknown objective reward $r: S \times A \rightarrow \mathbb{R}$. Human feedback is collected as pairwise preferences over trajectory segments (or behaviors), represented as tuples $(\tau^A, \tau^B, y)$, with $y \in \{+1, -1, 0\}$ indicating which segment is preferred (or that they are tied).
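For concreteness, a preference dataset of this form can be represented with a small typed container; the field and type names below are illustrative assumptions rather than a fixed interface from the cited works.

```python
from dataclasses import dataclass
from typing import List, Tuple

State = Tuple[float, ...]   # placeholder state representation
Action = int                # placeholder discrete action

@dataclass
class Segment:
    """A length-H trajectory segment of (state, action) pairs."""
    steps: List[Tuple[State, Action]]

@dataclass
class PreferenceTuple:
    """Pairwise comparison (tau^A, tau^B, y) with y in {+1, -1, 0}."""
    seg_a: Segment
    seg_b: Segment
    y: int  # +1: A preferred, -1: B preferred, 0: tie
```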

A typical probabilistic model for preference likelihood is the Bradley–Terry (BT) model:

$$P_\psi[A \succ B] = \frac{\exp\!\left(\sum_{t=1}^{H} \hat r_\psi(s^A_t, a^A_t)\right)}{\exp\!\left(\sum_{t=1}^{H} \hat r_\psi(s^A_t, a^A_t)\right) + \exp\!\left(\sum_{t=1}^{H} \hat r_\psi(s^B_t, a^B_t)\right)}$$

where $\hat r_\psi$ is a neural or differentiable surrogate for the (unknown) true reward. This form is adapted or extended in various settings:

  • Generalized BT with ties (BTT), adding a tie-tendency hyperparameter $\theta$ to directly model ties and their effect on induced preference strengths (Liu et al., 2024).
  • Regret-based preferences, where annotators are modeled as assessing segment regret relative to optimal policies rather than simple cumulative returns (Knox et al., 2022).
  • Importance-weighted or contrastive loss formulations that bypass explicit reward heads by direct supervision over preferences or policy outputs (An et al., 2023, Ye et al., 2024).

PAR learning objectives minimize cross-entropy between model-predicted and observed preferences, optionally regularized for robustness. In settings with crowdsourced feedback, annotator reliability and consensus labels are inferred via spectral methods (SML), enabling robust aggregation and unsupervised minority detection (Chhan et al., 2024).
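A minimal PyTorch-style sketch of this cross-entropy objective under the BT model is given below; the reward-network architecture, batching, and the soft-label encoding of ties are illustrative assumptions rather than an implementation from the cited papers.

```python
import torch
import torch.nn as nn

class RewardNet(nn.Module):
    """Surrogate per-step reward r_hat_psi(s, a); architecture is an illustrative choice."""
    def __init__(self, obs_dim: int, act_dim: int, hidden: int = 128):
        super().__init__()
        self.net = nn.Sequential(
            nn.Linear(obs_dim + act_dim, hidden), nn.ReLU(),
            nn.Linear(hidden, 1),
        )

    def forward(self, states: torch.Tensor, actions: torch.Tensor) -> torch.Tensor:
        # states: (B, H, obs_dim), actions: (B, H, act_dim) -> per-step rewards (B, H)
        return self.net(torch.cat([states, actions], dim=-1)).squeeze(-1)

def bt_preference_loss(reward_net: RewardNet, seg_a, seg_b, prefs: torch.Tensor) -> torch.Tensor:
    """Bradley-Terry cross-entropy over a batch of segment pairs.

    seg_a, seg_b : (states, actions) tuples of shape (B, H, .)
    prefs        : targets in {1.0, 0.0, 0.5} for "A preferred", "B preferred", "tie"
                   (a soft-label convention for ties; BTT instead models ties explicitly).
    """
    ret_a = reward_net(*seg_a).sum(dim=-1)   # sum_t r_hat(s_t^A, a_t^A)
    ret_b = reward_net(*seg_b).sum(dim=-1)
    logits = ret_a - ret_b                   # sigmoid(logits) = P_psi[A > B] under BT
    return nn.functional.binary_cross_entropy_with_logits(logits, prefs)
```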

2. Reward Modeling and Policy Optimization under PAR

PAR frameworks instantiate reward as the agent’s training signal in several ways:

  • Scalar surrogate reward modeling: Fit a function $\hat r_\psi(s,a)$ via BT/BTT cross-entropy over the preference dataset, then maximize the expected cumulative surrogate reward using standard RL algorithms (e.g., PPO, SAC) (Chhan et al., 2024, Verma et al., 2024).
  • Contrastive/generative judge approaches: Directly optimize the policy’s likelihood of outputting preferred segments (or correct judgments with rationales) without an explicit scalar reward head (Ye et al., 2024, An et al., 2023).
  • Preference-aware shaping: Use bounded, centered shaping functions (e.g., a sigmoid of the reward difference relative to reference outputs) to ensure hack-resistance and robust learning dynamics (Fu et al., 26 Feb 2025); a sketch follows this list.
  • Latent similarity-based reward estimation: Learn encoders whose output similarity to preferred behavior sets is used as the scalar reward, offering resilience to label noise and flexible feedback integration (Rajaram et al., 14 Jun 2025).
  • Diffusion modeling and credit assignment: Model preferences at the state-action or segment level via diffusion discriminators (DPR/C-DPR) or redistribute trajectory-level returns proportional to state importance learned from world models (Pang et al., 3 Mar 2025, Verma et al., 2024, Verma et al., 2022).
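As an illustration of the bounded, centered shaping idea flagged above, a minimal sketch follows; the centering against a reference score and the temperature are assumptions for illustration, not necessarily the exact formulation of Fu et al. (26 Feb 2025).

```python
import torch

def shaped_reward(raw_reward: torch.Tensor,
                  reference_reward: torch.Tensor,
                  temperature: float = 1.0) -> torch.Tensor:
    """Bounded, centered shaping of a learned reward.

    Maps the difference between the policy sample's reward-model score and a
    reference response's score through a sigmoid, so the RL signal is bounded
    in (0, 1) and centered at 0.5 when the sample merely matches the reference.
    Large raw-score outliers (a common reward-hacking symptom) are squashed.
    """
    return torch.sigmoid((raw_reward - reference_reward) / temperature)

# Example: an inflated raw score of +30 vs. a reference score of +2 yields a
# shaped reward close to 1.0 rather than an unbounded advantage.
print(shaped_reward(torch.tensor(30.0), torch.tensor(2.0)))  # ~1.0
```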

Policy optimization is typically driven via interleaved trajectory sampling, preference querying, reward model updating, and RL steps—sometimes with off-policy data efficiency and importance sampling (Jiang et al., 2023), or with dynamic rubrics that enhance interpretability (Jian et al., 28 Oct 2025).

3. Extensions: Robustness, Crowdsourcing, Shortcut Mitigation

PAR systems are equipped with mechanisms to address:

  • Noisy annotator feedback: Robust label aggregation via spectral meta-learning (SML), which ranks annotators by reliability, reduces label-error rates, and enables accurate policy training even with large spreads in crowd error rates (Chhan et al., 2024); a simplified sketch appears after this list.
  • Reward hacking and shortcut exploitation: Regularization via group-invariant kernels, penalization or decorrelation of spurious features (verbosity, sycophancy), bounded shaping functions, and additive correction terms to designer-supplied proxy rewards (PBRR) all mitigate policy collapse into degenerate behaviors (Ye et al., 21 Oct 2025, Fu et al., 26 Feb 2025, Hatgis-Kessell et al., 14 Oct 2025).
  • Credit assignment: Structural priors leveraging world-model-derived state importances, symbolic abstractions, and attention mechanisms yield more accurate, data-efficient reward learning in sparse or delayed-feedback environments (Verma et al., 2024, Verma et al., 2022).
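The simplified sketch below illustrates the spectral-aggregation idea for binary preference labels, assuming conditionally independent annotators; the estimator used in the cited work may differ in details.

```python
import numpy as np
from typing import Tuple

def spectral_aggregate(labels: np.ndarray) -> Tuple[np.ndarray, np.ndarray]:
    """Aggregate noisy annotator preference labels without ground truth.

    labels : (n_annotators, n_queries) matrix with entries in {+1, -1}.
    Under conditional independence, the off-diagonal of the annotator covariance
    matrix is approximately rank one, and its leading eigenvector is proportional
    to each annotator's balanced accuracy. Returns (reliability_weights, consensus).
    """
    cov = np.cov(labels)                 # (m, m) annotator covariance
    np.fill_diagonal(cov, 0.0)           # discard the noisy diagonal
    _, eigvecs = np.linalg.eigh(cov)
    v = eigvecs[:, -1]                   # leading eigenvector
    v = v * np.sign(v.sum())             # fix sign: assume most annotators beat chance
    consensus = np.sign(v @ labels)      # reliability-weighted vote
    return v, consensus

# Toy example: 5 annotators, the last one mostly flips the true labels.
rng = np.random.default_rng(0)
truth = rng.choice([-1, 1], size=200)
accuracies = [0.9, 0.85, 0.8, 0.75, 0.3]
labels = np.stack([np.where(rng.random(200) < a, truth, -truth) for a in accuracies])
weights, consensus = spectral_aggregate(labels)
print((consensus == truth).mean())       # typically exceeds any single annotator's accuracy
```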

4. Algorithmic Implementations and Training Procedures

PAR algorithms are characterized by three interleaved components, combined in the sketch that follows this list:

  • Preference data acquisition: Iterative or active querying strategies maximize information gain about behaviorally relevant reward-function equivalence classes, optimizing for downstream policy performance rather than raw parameter identification (Ellis et al., 2024).
  • Reward model update: Gradient descent on negative log-likelihood of preferences, augmented by priors, regularizers, or attention-guided redistribution objectives.
  • Policy update: Use the learned reward model or direct contrastive/likelihood outputs as the RL signal for policy optimization, often via standard PPO, SAC, or off-policy importance samplers.
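Combining the three components, the outer loop typically has the shape sketched below; the callables (sample_segments, query_preference, update_reward_model, rl_step) are hypothetical placeholders standing in for environment rollouts, a human or synthetic labeler, the BT/BTT fit, and a PPO/SAC update, respectively.

```python
import random
from typing import Callable, List, Tuple

Segment = List[Tuple[list, int]]   # (state, action) steps; placeholder types

def par_training_loop(
    sample_segments: Callable[[], List[Segment]],           # roll out the current policy
    query_preference: Callable[[Segment, Segment], int],    # label in {+1, -1, 0}
    update_reward_model: Callable[[List[Tuple[Segment, Segment, int]]], None],
    rl_step: Callable[[], None],                            # policy update against the surrogate reward
    n_iterations: int = 100,
    queries_per_iter: int = 10,
) -> None:
    """Generic interleaved PAR loop: sample -> query -> fit reward -> improve policy."""
    preference_buffer: List[Tuple[Segment, Segment, int]] = []
    for _ in range(n_iterations):
        segments = sample_segments()
        for _ in range(queries_per_iter):
            seg_a, seg_b = random.sample(segments, 2)       # naive pair selection; active querying goes here
            preference_buffer.append((seg_a, seg_b, query_preference(seg_a, seg_b)))
        update_reward_model(preference_buffer)              # e.g. BT/BTT cross-entropy fit
        rl_step()                                           # e.g. PPO/SAC on the learned reward

# Toy usage with stand-in callables (purely illustrative):
par_training_loop(
    sample_segments=lambda: [[(list(range(3)), random.randint(0, 1)) for _ in range(5)] for _ in range(8)],
    query_preference=lambda a, b: random.choice([+1, -1, 0]),
    update_reward_model=lambda buf: None,
    rl_step=lambda: None,
    n_iterations=2,
)
```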

PAR frameworks support reinforcement learning from human feedback with strong empirical sample efficiency, especially when using structural or symbolic priors, targeted repair, and robust aggregation (Hatgis-Kessell et al., 14 Oct 2025, Verma et al., 2022).

5. Empirical Results and Performance Benchmarks

PAR methods consistently outperform baselines in key metrics:

  • Reward-model accuracy: Margin-aware, rubric-adaptive PAR models show +4.7% relative improvement over standard scalar RM baselines on RewardBench and RMBench (Jian et al., 28 Oct 2025).
  • Policy returns: Crowd-PrefPPO, trained with robust label aggregation, approaches oracle performance and surpasses majority-vote policies in both simulated and real crowd environments (Chhan et al., 2024).
  • Robustness to reward hacking: Bounded, centered PAR shaping shows superior stability, preventing reward hacking and premature policy collapse, matching or exceeding reference win rates (Fu et al., 26 Feb 2025).
  • Data efficiency: Structural priors and targeted repair require as few as one-tenth the preferences of full RLHF methods to recover optimal or near-optimal policies (Verma et al., 2022, Hatgis-Kessell et al., 14 Oct 2025).
  • Label noise and crowd diversity: SARA and DPR/C-DPR yield stable policy learning even with ≥20% non-expert label noise, and diffusion-based models demonstrate increased expressiveness over MLP/transformer baselines (Pang et al., 3 Mar 2025, Rajaram et al., 14 Jun 2025).

6. Theoretical Foundations and Identifiability

PAR's theoretical foundations have advanced along several lines:

  • Identifiability: Regret-based preference models guarantee recovery of the true reward function (up to policy equivalence) from infinitely many preferences, while classic partial-return models can fail identifiability in variable-horizon or stochastic regimes (Knox et al., 2022).
  • Finite-time analysis: Preference-only oracles, under mild stochastic regularity conditions, suffice to identify $\varepsilon$-optimal policies with sample complexities comparable to those in full-reward RL and dueling bandits (Xu et al., 2020).
  • Bias in preference modeling: Explicit modeling of ties (BTT) corrects systematic underestimation of preference strength, yielding truer rankings and more aligned downstream behavior (Liu et al., 2024).
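For concreteness, one classical way to add an explicit tie parameter to the BT likelihood is the Rao–Kupper formulation shown below; it is given as an illustrative form, and the BTT parameterization in Liu et al. (2024) may differ in details.

```latex
% Rao-Kupper-style tie extension of Bradley-Terry, with segment scores s_A, s_B
% and tie parameter \theta \ge 1 (\theta = 1 recovers the tie-free BT model):
\begin{align*}
  P[A \succ B] &= \frac{e^{s_A}}{e^{s_A} + \theta\, e^{s_B}}, \qquad
  P[B \succ A]  = \frac{e^{s_B}}{e^{s_B} + \theta\, e^{s_A}}, \\[4pt]
  P[A \sim B]  &= \frac{(\theta^{2} - 1)\, e^{s_A + s_B}}
                       {\left(e^{s_A} + \theta\, e^{s_B}\right)\left(e^{s_B} + \theta\, e^{s_A}\right)}.
\end{align*}
```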

7. Limitations, Generalizations, and Future Directions

Limitations persist, including requirements for informative proxy rewards in targeted repair, expertise assumptions in regret-based preference modeling, and computational costs of successor-feature or diffusion approaches.

Addressing these limitations remains an active area of ongoing work.

Preference As Reward thus constitutes a rigorous, empirically validated, and extensible paradigm for aligning agents with human objectives and values via direct optimization on structured preference data, robust aggregation and regularization, and interpretability-driven modeling.
