DemPref: Integrating Demos & Preferences

Updated 23 November 2025
  • DemPref is a framework that combines demonstrations and preference queries to efficiently infer reward functions and policies.
  • It employs Bayesian inference and iterated correction methods to reduce query requirements and accelerate convergence in high-dimensional settings.
  • Recent extensions explicitly model latent user-type heterogeneity using EM algorithms, supporting robust and fair policy optimization.

DemPref refers to a class of machine learning frameworks that integrate demonstrations and preference information, or explicitly model demographic or user-type preference heterogeneity, with the goal of more efficiently and robustly learning reward functions or policies that align with diverse or latent user values. DemPref approaches have been developed across reward learning for robotics, preference-based policy optimization, and multi-objective decision-making. Central features include Bayesian inference from demonstrations, active preference querying, and explicit handling of preference heterogeneity via latent-type models or demographic adaptation.

1. Problem Settings and Motivation

DemPref methods address the challenge of learning reward functions or generative models that reflect the underlying objectives of human users—especially when those objectives are high-dimensional, poorly specified, or heterogeneous. Standard inverse reinforcement learning (IRL) typically relies on expert demonstrations to infer reward functions, assuming optimality or near-optimality and homogeneity across users. Conversely, preference-based learning queries user preferences between alternative trajectories or outputs, but can be inefficient or impractical if user feedback is limited or if preferences vary substantially across demographic groups or latent types.

DemPref frameworks are designed to:

  • Leverage both demonstrations (which provide strong priors on plausible behaviors) and preference queries (which refine alignment with user intent).
  • Accommodate unobserved or observed preference heterogeneity, enabling models to adapt to or robustly aggregate over subpopulations rather than assuming population-level uniformity.
  • Provide scalable and efficient learning algorithms suitable for high-dimensional or multi-objective domains (Palan et al., 2019, Chidambaram et al., 23 May 2024, Lu et al., 15 Jan 2024).

2. Bayesian Integration of Demonstrations and Preferences

DemPref was originally introduced in the context of reward learning for autonomous robots, formalizing a Bayesian two-stage process: (1) demonstrations induce a posterior (or prior) over reward parameters, and (2) active preference queries iteratively refine the posterior (Palan et al., 2019).

Given a parameterized linear reward $R(\xi; w) = w \cdot \Phi(\xi)$ with bounded $w \in \mathbb{R}^k$, a demonstration $D_i$ induces the likelihood $P(D_i \mid w) \propto \exp(\beta^D\, w \cdot \Phi(D_i))$, and a set of demonstrations $\mathcal{B}$ yields the posterior $P(w \mid \mathcal{B}) \propto 1_{\|w\|\leq 1}\, \exp\!\left(\beta^D \sum_{i=1}^n w \cdot \Phi(D_i)\right)$. Preferences over trajectory pairs $(\xi_1, \xi_2)$ are modeled by the Bradley–Terry–Luce likelihood $P(\xi_1 \succ \xi_2 \mid w) = \frac{\exp(\beta^R\, w \cdot \Phi(\xi_1))}{\exp(\beta^R\, w \cdot \Phi(\xi_1)) + \exp(\beta^R\, w \cdot \Phi(\xi_2))}$.

Iterative Bayesian updates combine these inputs, with demonstrations acting as strong priors to reduce the effective search space, and preferences focusing the learned reward function within this space.
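
The sketch below illustrates this two-stage update with a simple sample-based approximation of the posterior: weight samples on the unit ball are reweighted first by the demonstration likelihood and then by each answered preference query. It is a minimal illustration, not the authors' implementation; the feature inputs (`phi_demos`, `pref_pairs`) and hyperparameters are assumed to be supplied by the caller.

```python
import numpy as np

# Minimal sketch (not the paper's code): approximate the DemPref posterior over
# reward weights w with samples on the unit ball, reweighted by demonstration
# and preference likelihoods. phi_demos and pref_pairs hold precomputed
# trajectory features Phi(xi).

def sample_unit_ball(n_samples, dim, rng):
    """Uniform samples with ||w|| <= 1 (the prior support)."""
    w = rng.normal(size=(n_samples, dim))
    w /= np.linalg.norm(w, axis=1, keepdims=True)
    r = rng.uniform(size=(n_samples, 1)) ** (1.0 / dim)
    return w * r

def posterior_weights(phi_demos, pref_pairs, beta_d=1.0, beta_r=1.0,
                      n_samples=10_000, seed=0):
    """Return weight samples and their normalized posterior probabilities.

    phi_demos : array (n_demos, k) of demonstration features Phi(D_i)
    pref_pairs: list of (phi_preferred, phi_rejected) feature pairs
    """
    rng = np.random.default_rng(seed)
    k = phi_demos.shape[1]
    ws = sample_unit_ball(n_samples, k, rng)

    # Demonstration term: exp(beta_d * sum_i w . Phi(D_i)) in log space.
    log_p = beta_d * ws @ phi_demos.sum(axis=0)

    # Preference term: Bradley-Terry-Luce likelihood for each answered query.
    for phi_a, phi_b in pref_pairs:
        da, db = beta_r * ws @ phi_a, beta_r * ws @ phi_b
        log_p += da - np.logaddexp(da, db)   # log P(xi_a > xi_b | w)

    log_p -= log_p.max()                     # numerical stability
    p = np.exp(log_p)
    return ws, p / p.sum()
```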

3. Active Preference Querying and Iterated Correction

A core methodological advance in DemPref is the generation of preference queries that maximize information gain (maximum-volume removal) with respect to the current posterior over rewards. Demonstrations are not only used to initialize the posterior but also to ground the preference query process. The method employs “iterated correction,” where a stored trajectory (often a demonstration) is iteratively updated to the user’s most-preferred choice among new candidate trajectories, further accelerating convergence.

Queries can be pairwise or rankings over $n$ options (modeled by Plackett–Luce), with the query set optimized to maximize the minimum expected information gain across possible user responses.
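
As a concrete illustration, the following sketch scores a candidate set of trajectory pairs by maximum volume removal, reusing the sample-based posterior `(ws, p)` from the previous sketch. The fixed candidate list is an assumption for brevity; the actual method optimizes queries in trajectory space.

```python
import numpy as np

# Minimal sketch of maximum-volume-removal query selection over a candidate set
# of trajectory feature pairs, given posterior samples ws with probabilities p.

def btl_prob(ws, phi_a, phi_b, beta_r=1.0):
    """P(xi_a > xi_b | w) under Bradley-Terry-Luce, for each sampled w."""
    da, db = beta_r * ws @ phi_a, beta_r * ws @ phi_b
    return np.exp(da - np.logaddexp(da, db))

def select_query(ws, p, candidate_pairs, beta_r=1.0):
    """Pick the pair whose worst-case answer removes the most posterior mass."""
    best_pair, best_score = None, -np.inf
    for phi_a, phi_b in candidate_pairs:
        pa = btl_prob(ws, phi_a, phi_b, beta_r)
        mass_a = float(p @ pa)          # expected mass retained if user picks xi_a
        mass_b = float(p @ (1 - pa))    # expected mass retained if user picks xi_b
        # Volume removal: maximize the minimum mass removed across both answers,
        # i.e. prefer queries whose outcome is maximally uncertain under p.
        score = min(1.0 - mass_a, 1.0 - mass_b)
        if score > best_score:
            best_pair, best_score = (phi_a, phi_b), score
    return best_pair, best_score
```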

Numerical experiments in simulated domains (Driver, Lunar Lander, Fetch Reach) show substantial gains in query efficiency: initializing with demonstrations can reduce the number of required queries by a factor of 3–4, and iterated correction further halves the effective query cost for a given alignment metric (Palan et al., 2019).

4. Inference of Preferences from Demonstration in Multi-Objective Settings

In multi-objective Markov decision problems (MOMDPs), demonstration-based preference inference (DemoPI) extends DemPref to infer scalarization weights over multiple objectives (e.g., cost and comfort in energy management) (Lu et al., 15 Jan 2024).

The DWPI (Dynamic Weight-based Preference Inference) algorithm employs a dynamic-weight multi-objective RL (DWMORL) agent trained to cover the full weight simplex. Given a user demonstration $D = \{(s_t, a_t)\}_{t=1}^T$, a compact representation (such as the cumulative vector reward) $\mathbf{x} = \sum_t \mathbf{r}_t$ is mapped to a predicted weight $\hat{\mathbf{w}}$ via a supervised regression model trained on synthetic policy rollouts.
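
A minimal sketch of this pipeline is shown below. It assumes a hypothetical `dwmorl_rollout(w)` helper that rolls out the trained dynamic-weight agent under weight vector `w` and returns its cumulative vector reward; the regressor choice (`MLPRegressor`) is illustrative rather than prescribed by the paper.

```python
import numpy as np
from sklearn.neural_network import MLPRegressor

# Sketch of DWPI-style weight inference: generate synthetic (rollout summary,
# weight) pairs from the trained DWMORL agent, fit a regressor, then invert a
# user demonstration's cumulative vector reward into predicted weights.

def sample_simplex(n, d, rng):
    """Sample n weight vectors uniformly from the d-dimensional simplex."""
    return rng.dirichlet(np.ones(d), size=n)

def build_training_set(dwmorl_rollout, n=5000, n_objectives=2, seed=0):
    rng = np.random.default_rng(seed)
    W = sample_simplex(n, n_objectives, rng)
    # X: cumulative vector reward of a rollout under each sampled weight.
    X = np.stack([dwmorl_rollout(w) for w in W])
    return X, W

def fit_weight_regressor(X, W):
    """Supervised map from cumulative vector reward to scalarization weights."""
    reg = MLPRegressor(hidden_layer_sizes=(64, 64), max_iter=2000)
    reg.fit(X, W)
    return reg

# Inference on a demonstration: summarize it by its cumulative vector reward,
# then predict the weights that best explain it.
# x_user = demo_vector_rewards.sum(axis=0)
# w_hat  = reg.predict(x_user.reshape(1, -1))[0]
```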

Experimental results in energy scheduling domains show that DWPI accurately recovers user weights in all tested scenarios, with trained RL agents replicating user-style cost-comfort trade-offs over extended test periods. The method requires only a single DWMORL training plus lightweight inference, and robustly handles rule-based, heuristic (suboptimal) demonstrations (Lu et al., 15 Jan 2024).

5. Explicit Modeling of Preference Heterogeneity

Standard preference optimization (e.g., DPO) assumes a homogeneous annotator population. Recent extensions adapt DemPref to preference heterogeneity by positing $K$ latent annotator types $z_1, \ldots, z_K$ (Chidambaram et al., 23 May 2024).

Given a dataset of binary preferences grouped by annotator, a type mixture model is trained:

  • Each annotator is assigned a hidden type drawn from a prior $\eta \in \Delta^K$.
  • Each type $z_k$ has a specialized policy $\pi_{\phi, z_k}$.
  • An EM algorithm alternates between inferring type posteriors based on the Bradley–Terry–Luce likelihoods and updating type priors and per-type DPO policy parameters.

Upon EM convergence, this yields a set of type-conditional optimal policies $\{\pi^*_{z_k}\}$ and type mixture weights $\eta^*$.
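
The following sketch shows the shape of such an EM loop under simplifying assumptions: `per_type_logprob(k, prefs)` and `update_type_policy(k, ...)` are hypothetical callables standing in for the Bradley–Terry–Luce log-likelihood under the current type-$k$ policy and the responsibility-weighted per-type DPO update, respectively.

```python
import numpy as np

# Minimal sketch of the EM loop over latent annotator types (not the paper's
# full EM-DPO implementation). annotators is a list, one entry per annotator,
# holding that annotator's preference data.

def em_types(annotators, K, per_type_logprob, update_type_policy, n_iters=20):
    eta = np.full(K, 1.0 / K)                 # type prior on the simplex
    n = len(annotators)
    for _ in range(n_iters):
        # E-step: posterior responsibility of each type for each annotator.
        log_r = np.zeros((n, K))
        for i, prefs in enumerate(annotators):
            for k in range(K):
                log_r[i, k] = np.log(eta[k]) + per_type_logprob(k, prefs)
        log_r -= log_r.max(axis=1, keepdims=True)
        r = np.exp(log_r)
        r /= r.sum(axis=1, keepdims=True)

        # M-step: update the type prior and (schematically) each type's policy
        # with annotator data weighted by its responsibility for that type.
        eta = r.mean(axis=0)
        for k in range(K):
            update_type_policy(k, annotators, weights=r[:, k])
    return eta, r
```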

Further, a min–max regret ensemble is constructed to yield a single policy within the convex hull of per-type policies that minimizes worst-case subgroup regret,

$$\min_{w \in \Delta^K}\; \max_{k = 1, \dots, K}\; \sum_{\ell=1}^K w_\ell \left( \mathcal{L}_{z_k, z_k} - \mathcal{L}_{z_k, z_\ell} \right),$$

with the differences $\mathcal{L}_{z_k, z_k} - \mathcal{L}_{z_k, z_\ell}$ capturing log-likelihood discrepancies between the matched and mismatched policies for each type. This framework accommodates both latent and observed demographic clusters.
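
Because the objective is a minimum over the simplex of a maximum of linear functions of $w$, the mixture weights can be recovered with a small linear program. The sketch below assumes a precomputed `regret` matrix whose $(k, \ell)$ entry estimates $\mathcal{L}_{z_k, z_k} - \mathcal{L}_{z_k, z_\ell}$ from held-out preference data.

```python
import numpy as np
from scipy.optimize import linprog

# Sketch: solve the min-max regret mixture weights as a linear program.
# regret[k, l] ~ L_{z_k,z_k} - L_{z_k,z_l}, the assumed regret of serving
# type z_k with the type-z_l policy.

def minmax_mixture(regret):
    K = regret.shape[0]
    # Variables: x = [w_1, ..., w_K, t]; minimize the epigraph variable t.
    c = np.zeros(K + 1)
    c[-1] = 1.0
    # Constraints: (regret @ w)_k <= t for every type k.
    A_ub = np.hstack([regret, -np.ones((K, 1))])
    b_ub = np.zeros(K)
    # w lies on the simplex: sum_l w_l = 1, w_l >= 0; t is unbounded.
    A_eq = np.hstack([np.ones((1, K)), np.zeros((1, 1))])
    b_eq = np.array([1.0])
    bounds = [(0, None)] * K + [(None, None)]
    res = linprog(c, A_ub=A_ub, b_ub=b_ub, A_eq=A_eq, b_eq=b_eq, bounds=bounds)
    return res.x[:K], res.x[-1]   # mixture weights, worst-case regret
```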

If demographic group labels are available, the membership can be fixed; otherwise, EM recovers clusters that often correlate with demographics (Chidambaram et al., 23 May 2024).

6. Practical Applications and Empirical Results

DemPref methodologies have been applied in robotics, generative LLM alignment, and multi-objective decision problems, with representative settings summarized below:

| Domain | DemPref Variant | Key Outcomes |
| --- | --- | --- |
| Robotics (Fetch Arm, Driver, Lander) | Demonstration + Preference | 3–4× query efficiency; higher user preference alignment (Palan et al., 2019) |
| Energy Management (MORL) | DemoPI/DWPI | Accurate recovery of user trade-offs; behavioral alignment (Lu et al., 15 Jan 2024) |
| Generative Models (RLHF/DPO) | EM-DPO + MinMax-DPO | Equitable policy serving diverse preferences (Chidambaram et al., 23 May 2024) |

In user studies comparing DemPref to classic IRL, participants rated robots trained with DemPref-based methods as significantly better at accomplishing tasks and aligning with user intent (p≈0.02), with no evidence of increased user burden (Palan et al., 2019). In MORL, DWPI enables practical and interpretable user preference extraction, suitable for embedded systems (Lu et al., 15 Jan 2024). Experiments in RLHF/DPO demonstrate that explicit handling of preference heterogeneity improves fairness and regret guarantees across annotator subgroups (Chidambaram et al., 23 May 2024).

7. Limitations and Future Research Directions

Several limitations persist in current DemPref research:

  • Query optimization can incur significant computation time due to nonconvexity, raising latency concerns (Palan et al., 2019).
  • Cognitive load of ranking or comparing multiple trajectories can challenge annotators, especially in high-dimensional spaces.
  • Most methods depend on a feature-based (often linear) reward representation; feature misspecification can degrade performance.
  • For modeling demographic heterogeneity, label availability, subgroup data scarcity, and privacy/fairness compliance are practical challenges. Rare or intersectional subgroups may require priors or parameter sharing (Chidambaram et al., 23 May 2024).
  • Demonstrations in some settings are synthetic or rule-based; validation on fully human data is pending in certain domains (Lu et al., 15 Jan 2024).

Research directions include distributed query optimization, interpretable query generation, nonlinear reward learning (e.g., deep features), richer feedback modalities, and multi-agent or intersectional extensions. In the context of demographic preference modeling, there is an ongoing focus on robust aggregation, fairness, and privacy-preserving inference.


Overall, DemPref unifies demonstration and preference learning under a Bayesian framework, and, in its recent incarnations, extends to nuanced modeling of heterogeneous or latent human objectives, with quantitatively validated benefits in efficiency, robustness, and equitable policy alignment (Palan et al., 2019, Lu et al., 15 Jan 2024, Chidambaram et al., 23 May 2024).
