
Preference-Conditioned Policy Learning

Updated 4 February 2026
  • Preference-Conditioned Policy Learning is a framework that integrates user-defined preference signals into policy optimization for adaptable decision-making in complex tasks.
  • It replaces traditional scalar rewards with structured comparisons such as pairwise or vector-based preferences, enabling robust multi-objective and causal evaluations.
  • The approach employs techniques like reward model fitting, contrastive policy optimization, and Lagrangian methods to efficiently train policies that align with dynamic user criteria.

Preference-Conditioned Policy Learning (PCPL) is a framework in reinforcement learning and decision-making where policies are conditioned explicitly on preference information provided by users, evaluators, or external supervisors. The central objective is to train agents that can flexibly adapt their behavior, in real time or post hoc, to changing preference specifications, a requirement critical to applications spanning online reinforcement learning with human feedback, multi-objective decision-making, preference-based policy optimization, causal policy learning, and healthcare policy discovery.

1. Formal Frameworks of Preference-Conditioned Policy Learning

At its core, PCPL generalizes classical reinforcement learning by replacing or augmenting scalar reward supervision with preference-based signals. Depending on the problem domain, preferences may take the form of pairwise or k-wise trajectory comparisons, soft-ordinal labels, multi-channel constraints, or explicit preference vectors over objectives.

A unifying formalization is as follows:

  • State and Action Spaces: $\mathcal{S}, \mathcal{A}$ (arbitrary, possibly high-dimensional)
  • Preference Signal:
    • In single-objective tasks: Human or synthetic evaluators provide pairwise preference labels $y \in \{0, 0.5, 1\}$ on trajectory segments $(\sigma^A, \sigma^B)$, indicating “no preference,” “$A$ preferred,” or “$B$ preferred.”
    • In multi-objective or conditional treatment effect setups: A preference vector $w \in \Delta^{m-1}$ (the unit simplex in $\mathbb{R}^m$) specifies the trade-off over $m$ objectives.
  • Policy Class: A preference-conditioned policy is a mapping $\pi_\theta: \mathcal{S} \times \mathcal{W} \rightarrow \Delta(\mathcal{A})$; at each step, action selection is conditioned not only on the state but also explicitly on a preference specification $w$ (see the sketch following this list).
  • Learning Objective: Depending on context:
    • Single-objective/Reward Modeling: Learn a latent reward $r_\phi(s, a)$ or directly optimize policy parameters such that induced behaviors align with given preferences, either via indirect ranking loss (e.g., cross-entropy Bradley–Terry) or direct contrastive score maximization (An et al., 2023).
    • Multi-objective: Maximize a scalarization $g(R(w), w)$ over cumulative vector rewards $R(w)$, for any $w$, ensuring Pareto coverage (Janmohamed et al., 2024, Jiang et al., 28 Jan 2026).
    • Conditional Treatment Effects: Estimate a conditional preference-based effect $q_W(x)$ and learn a policy optimizing a value functional $V(\pi)$ under preference rule $w$ (Parnas et al., 3 Feb 2026).
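
As a concrete illustration of the policy-class item above, the sketch below conditions action selection on both the state and a preference vector sampled from the simplex. It is a minimal PyTorch sketch under assumed choices: the class name, layer sizes, concatenation-based conditioning, and discrete action head are illustrative, not a reference implementation from any of the cited works.

```python
# Minimal sketch (assumed architecture): a policy network that conditions on
# both the state s and a preference vector w from the unit simplex.
import torch
import torch.nn as nn


class PreferenceConditionedPolicy(nn.Module):
    def __init__(self, state_dim: int, num_objectives: int, num_actions: int, hidden: int = 128):
        super().__init__()
        # The preference vector is simply concatenated to the state; other
        # conditioning schemes (FiLM, hypernetworks) are equally compatible.
        self.net = nn.Sequential(
            nn.Linear(state_dim + num_objectives, hidden),
            nn.ReLU(),
            nn.Linear(hidden, hidden),
            nn.ReLU(),
            nn.Linear(hidden, num_actions),
        )

    def forward(self, state: torch.Tensor, pref: torch.Tensor) -> torch.distributions.Categorical:
        logits = self.net(torch.cat([state, pref], dim=-1))
        return torch.distributions.Categorical(logits=logits)


# Usage: sample a preference vector w ∈ Δ^{m-1} and act under it.
policy = PreferenceConditionedPolicy(state_dim=8, num_objectives=3, num_actions=4)
w = torch.distributions.Dirichlet(torch.ones(3)).sample((1,))
s = torch.randn(1, 8)
action = policy(s, w).sample()
```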

This framework encompasses scenario-specific adaptations—human-in-the-loop online RL (Ji et al., 10 Aug 2025), synthetic preference generation via LLMs (Shen et al., 2024), direct preference-driven policy search (An et al., 2023), and multi-objective optimization via preference vectors (Janmohamed et al., 2024, Jiang et al., 28 Jan 2026).

2. Preference Extraction, Conversion, and Aggregation

A critical challenge is the conversion of weak or unstructured feedback (scalar or language) into strict ordinal preference data amenable to machine learning algorithms:

  • Sliding-Window Preference Extraction: Pref-GUIDE (Ji et al., 10 Aug 2025) transforms scalar signals over agent trajectories into $O(n^2)$ local binary or no-preference comparisons within short time windows (typical window $n = 10$, margin $\delta = 5\%$), imposing temporal stationarity and filtering ambiguous feedback (see the sketch after this list).
  • Population-Level Voting: Aggregates evaluator-specific reward models into a unified consensus by computing soft votes across evaluators, yielding a population-level preference dataset for more robust reward modeling (Ji et al., 10 Aug 2025).
  • LLM-Driven Preference Synthesis: LLM4PG (Shen et al., 2024) leverages trajectory-to-language abstraction and presents pairs of summarized behaviors to LLMs, which output preference comparisons. These are then used to fit ranking-based reward models.
  • Treatment Effect Preferences: In causal inference, the Conditional Preference-based Treatment Effect (CPTE) leverages sample-based estimates via preference-comparison functions $w(y \mid y')$ and is estimated via matching, quantile, or distributional regression, fully abstracting away the need for explicit scalar rewards (Parnas et al., 3 Feb 2026).
  • Crowding and Scalarization for MO-Policy Learning: In multi-objective/quality-diversity contexts, preference vectors guide exploration along the Pareto frontier, supported by crowding metrics and appropriate scalarizations (e.g., Smooth Tchebycheff) (Janmohamed et al., 2024, Jiang et al., 28 Jan 2026).
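
To make the sliding-window conversion concrete, the sketch below turns per-step scalar feedback into local pairwise labels: within each window of $n$ steps, all pairs are compared, and only comparisons whose relative gap exceeds the margin $\delta$ survive. The function name, the exact filtering rule, and the relative-gap normalization are assumptions following the description above, not Pref-GUIDE's reference code.

```python
# Minimal sketch (assumed interface) of sliding-window preference extraction
# from scalar feedback; ambiguous (near-tie) comparisons are dropped.
from itertools import combinations
from typing import List, Tuple


def extract_window_preferences(
    feedback: List[float],
    window: int = 10,
    delta: float = 0.05,
) -> List[Tuple[int, int, float]]:
    """Return (i, j, y) triples with y = 1.0 if step i is preferred over j,
    0.0 if j is preferred over i; near-ties are filtered as ambiguous."""
    prefs = []
    for start in range(0, len(feedback) - window + 1):
        idx = range(start, start + window)
        for i, j in combinations(idx, 2):          # O(n^2) pairs per window
            gap = feedback[i] - feedback[j]
            scale = max(abs(feedback[i]), abs(feedback[j]), 1e-8)
            if abs(gap) / scale < delta:           # ambiguous: drop (or label 0.5)
                continue
            prefs.append((i, j, 1.0 if gap > 0 else 0.0))
    return prefs


# Usage on a toy feedback trace.
labels = extract_window_preferences([0.1, 0.4, 0.35, 0.9, 0.2], window=3)
```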

3. Policy Learning, Scalarization, and Optimization

Policy optimization under PCPL adapts to preference signals in several modalities:

  • Reward Model Fitting + Standard RL: Preference-labeled trajectory pairs are used to fit a reward model $r_\phi$, typically via a Bradley–Terry or cross-entropy loss, after which standard deep RL methods (DDPG, PPO) are run on the learned dense rewards (Ji et al., 10 Aug 2025, Shen et al., 2024); a fitting sketch follows this list.
  • Direct Contrastive Policy Optimization: Policies are trained directly to maximize a score measuring their consistency with the preference dataset, entirely bypassing intermediate reward modeling. DPPO implements this via a contrastive distance-based score $s_\lambda(\pi, \sigma^i, \sigma^j)$ with conservativeness regularizers to prevent degenerate solutions (An et al., 2023).
  • Preference-Conditioned Gradients in Multi-Objective RL: PCPL applies gradient-based updates conditioned on a sampled preference vector $w$:

$$J(\theta; w) = \mathbb{E}_{\pi_\theta(\cdot \mid \cdot,\, w)} \left[ \sum_{t=0}^{T} \gamma^t\, w^\top \mathbf{r}_t \right]$$

with analytic gradients and crowding mechanisms for Pareto uniformity (Janmohamed et al., 2024, Jiang et al., 28 Jan 2026).

  • Constrained Optimization via Lagrangian Methods: Multi-Preference Actor Critic (M-PAC) treats each preference as a per-step cost and enforces constraints via adaptive Lagrange multipliers within the policy gradient update, preserving multiple “feedback channel” desiderata alongside classical reward maximization (Durugkar et al., 2019).
  • Causal Policy Learning from Preference-Based Effects: For treatment policy discovery, policies are selected to maximize preference-based value $V(\pi)$ with plug-in or efficient influence-function corrections, using preference-labeled outcome models to drive weighted classification or tree-based assignment (Parnas et al., 3 Feb 2026).
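
For the reward-model-fitting route (first item above), the sketch below shows the standard Bradley–Terry / cross-entropy objective over segment pairs in PyTorch. The preference probability is the sigmoid of the difference of the two segments' summed predicted rewards, and the loss is cross-entropy against the label $y \in \{0, 0.5, 1\}$. The network, tensor shapes, and training-loop details are illustrative assumptions, not code from the cited papers; the learned $r_\phi$ would then serve as a dense reward for a standard RL algorithm such as PPO.

```python
# Minimal sketch (assumed shapes): fit a reward model r_phi on preference-
# labeled segment pairs with the Bradley–Terry cross-entropy objective.
import torch
import torch.nn as nn

reward_model = nn.Sequential(nn.Linear(8 + 2, 64), nn.ReLU(), nn.Linear(64, 1))
optimizer = torch.optim.Adam(reward_model.parameters(), lr=3e-4)


def bradley_terry_loss(seg_a, seg_b, y):
    """seg_a, seg_b: (batch, T, state_dim + action_dim); y in {0, 0.5, 1},
    with y = 1 meaning segment A is preferred and y = 0.5 "no preference"."""
    # Sum predicted per-step rewards over each segment.
    ra = reward_model(seg_a).squeeze(-1).sum(dim=1)   # (batch,)
    rb = reward_model(seg_b).squeeze(-1).sum(dim=1)
    # P(A preferred) under the Bradley–Terry model.
    p_a = torch.sigmoid(ra - rb)
    # Cross-entropy over preference probabilities.
    return -(y * torch.log(p_a + 1e-8) + (1 - y) * torch.log(1 - p_a + 1e-8)).mean()


# One illustrative update on synthetic segment data.
seg_a, seg_b = torch.randn(32, 10, 10), torch.randn(32, 10, 10)
y = torch.randint(0, 2, (32,)).float()
optimizer.zero_grad()
bradley_terry_loss(seg_a, seg_b, y).backward()
optimizer.step()
```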

4. Metrics, Benchmarks, and Evaluation Protocols

PCPL requires evaluation metrics tailored to its preference-centric structure:

| Metric / Class | Domain of Use | Definition / Remarks |
| --- | --- | --- |
| Bradley–Terry Ranking Loss | Reward modeling for pairwise preference data | Cross-entropy over preference probabilities |
| Mean Episode Return | RL, RLHF | Standard RL performance using the learned reward |
| Fraction Passing Milestones | RL with human feedback | Evaluator-level success rate |
| Hypervolume Ratio (HV) | Multi-objective RL, QD archives | Relative size with respect to the true Pareto front |
| Proportion Non-Dominated (PNDS) | Multi-objective, complex allocation | Fraction of roll-out points not dominated by others for given preferences (Jiang et al., 28 Jan 2026) |
| Ordering Score (OS) | Preference alignment in MO-RL | Mean monotonicity (rank correlation) between preference input $w_i$ and reward $J_i$ (Jiang et al., 28 Jan 2026) |

A salient development is the introduction of GraphAllocBench (Jiang et al., 28 Jan 2026), providing scalable, high-dimensional benchmark suites with real, combinatorial allocation problems and graph-based policy architectures, systematically exposing PCPL strengths and limitations.
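
As a numeric illustration of the Ordering Score row in the table above, the sketch below rank-correlates, per objective, the preference weight fed to the policy with the return achieved under that preference, then averages over objectives. This is one plausible reading of the metric's description; the exact definition used in (Jiang et al., 28 Jan 2026) may differ, and the data here are synthetic stand-ins for roll-out statistics.

```python
# Minimal sketch (assumed reading of the Ordering Score): mean Spearman rank
# correlation between preference weight w_i and achieved objective return J_i.
import numpy as np
from scipy.stats import spearmanr


def ordering_score(prefs: np.ndarray, returns: np.ndarray) -> float:
    """prefs: (K, m) preference vectors; returns: (K, m) vector returns J
    obtained by rolling out the policy conditioned on each preference."""
    corrs = []
    for i in range(prefs.shape[1]):
        rho, _ = spearmanr(prefs[:, i], returns[:, i])
        corrs.append(rho)
    return float(np.mean(corrs))


# Usage on synthetic roll-out statistics (K = 50 preferences, m = 3 objectives).
rng = np.random.default_rng(0)
prefs = rng.dirichlet(np.ones(3), size=50)
returns = prefs + 0.1 * rng.normal(size=(50, 3))   # stand-in for roll-out returns
print(ordering_score(prefs, returns))              # close to 1 when well aligned
```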

5. Theoretical Underpinnings and Policy Existence

Standard reward-based RL assumes additive, real-valued returns and Markovian structure. PCPL generalizes this with formal guarantees:

  • Direct Preference Process (DPP) Framework: Formulates decision making on arbitrary (history-dependent, non-Markovian) interactions, where preferences are modeled as total, consistent preorders on distributions of (random) trajectories rather than on aggregate rewards. Under minimal totality and consistency, one can guarantee existence of deterministic optimal policies and recursive (Bellman-like) optimality equations, even absent any reward function or convexity/interpolation properties (Carr et al., 2023).
  • Identifiability in Treatment Effect Estimation: CPTE replaces unidentifiable individual treatment contrasts (when outcomes can only be compared) with population-level contrasts between independently drawn treated and control outcomes at each $X = x$, which are identifiable under standard causal assumptions. This justifies policy learning directly on preference-discriminating estimands even in the absence of numeric rewards (Parnas et al., 3 Feb 2026).
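
To illustrate the population-level contrast described above, the sketch below bins the covariate, compares independently drawn treated and control outcomes within each bin via a preference-comparison function, and thresholds the resulting estimate to assign treatment. The binning scheme, function names, and threshold rule are assumptions for illustration only; they are not the estimator or guarantees of the cited paper.

```python
# Minimal sketch (illustrative, not the cited estimator): a plug-in estimate of
# a conditional preference-based effect q_W(x) as the probability that a treated
# outcome is preferred over an independently drawn control outcome in the same
# covariate bin, followed by a simple threshold policy.
import numpy as np


def cpte_by_bins(x, t, y, prefer, n_bins=10):
    """x: (N,) covariate; t: (N,) treatment indicator; y: (N,) outcome;
    prefer(y1, y0) -> 1.0 / 0.5 / 0.0 preference comparison between outcomes."""
    edges = np.quantile(x, np.linspace(0, 1, n_bins + 1))
    q = np.full(n_bins, np.nan)
    for b in range(n_bins):
        in_bin = (x >= edges[b]) & (x <= edges[b + 1])
        y1, y0 = y[in_bin & (t == 1)], y[in_bin & (t == 0)]
        if len(y1) and len(y0):
            # Average the preference comparison over all treated/control pairs.
            q[b] = np.mean([prefer(a, c) for a in y1 for c in y0])
    return edges, q


# Usage: prefer "higher outcome"; treat in bins where q_W(x) > 0.5.
rng = np.random.default_rng(1)
x = rng.uniform(size=500)
t = rng.integers(0, 2, 500)
y = x * t + rng.normal(scale=0.3, size=500)
edges, q = cpte_by_bins(x, t, y, prefer=lambda a, c: float(a > c))
policy = q > 0.5   # per-bin treatment assignment
```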

6. Empirical Results, Limitations, and Extensions

Across domains, PCPL demonstrates superior sample efficiency, greater flexibility, and robustness over scalar reward-based and hand-designed approaches:

  • Human-in-the-loop RL (Pref-GUIDE): Consensus preference learning with sliding window extraction and voting aggregation outperforms direct scalar-feedback regression and even heuristic dense-reward baselines in complex environments (Ji et al., 10 Aug 2025).
  • LLM-based Preference Synthesis (LLM4PG): Accelerates RL convergence and enables seamless integration of language-conditioned constraints in MiniGrid tasks, outperforming both hand-shaped and original sparse rewards (Shen et al., 2024).
  • Multi-objective PCPL (MOME-P2C, GraphAllocBench): Achieves higher coverage and smoother Pareto trade-off distribution at lower computational and storage cost relative to per-objective actor-critic alternatives (Janmohamed et al., 2024, Jiang et al., 28 Jan 2026).
  • Direct Preference Optimization (DPPO): Outperforms two-step reward modeling baselines and even true-reward offline RL on high-dimensional continuous control and LLM fine-tuning (An et al., 2023).
  • Treatment Policy Learning: Distributional preference-based policies outperform standard CATE-based policies, especially when outcomes are heavy-tailed or lexicographically ordered (Parnas et al., 3 Feb 2026).

Notable limitations include:

  • Sensitivity to hyperparameter choices (e.g., window size $n$, margin $\delta$).
  • Assumptions of sufficient density and informativeness of feedback for effective preference extraction and aggregation.
  • Inability of simple scalarizations to capture non-convex, discontinuous Pareto fronts, necessitating richer scalarization and search strategies (Jiang et al., 28 Jan 2026).
  • Possible propagation of LLM biases in synthetic preference generation (Shen et al., 2024).
  • Challenges in scaling ordinal preference-based evaluation to rare or non-Markovian event spaces (Carr et al., 2023).

7. Future Directions and Research Opportunities

Promising avenues for advancing PCPL include:

  • Dynamic, online adaptation of conversion and aggregation mechanisms (e.g., window size, margin) to feedback variance and informativeness (Ji et al., 10 Aug 2025).
  • Integration of natural language explanations and explanations-from-feedback into richer preference models (Ji et al., 10 Aug 2025, Shen et al., 2024).
  • Development of scalarization and exploration techniques capable of handling sharply non-convex, discontinuous, or sparse reward/policy landscapes (Jiang et al., 28 Jan 2026).
  • Extension to risk-sensitive, distributional PCPL, and meta-RL or hypernetwork-based rapid adaptation to entirely novel preference regimes (Jiang et al., 28 Jan 2026).
  • Theoretical and practical frameworks for policy optimality under arbitrary ordinal, possibly non-Markovian preference relations, enabling robust RLHF and preference learning in LLMs and interactive generative agents (Carr et al., 2023).
  • Online and interactive preference elicitation for real-time PCPL deployment in high-stakes domains such as healthcare policy, city management, and automated scientific discovery (Jiang et al., 28 Jan 2026, Parnas et al., 3 Feb 2026).

Preference-Conditioned Policy Learning thus synthesizes developments in reward modeling, preference propagation, constrained optimization, and multi-objective and causal inference, providing a scalable and theoretically grounded approach to learning policies that align with complex, evolving, and possibly human-driven objectives.
