DemPref: Integrating Demos & Preferences

Updated 23 November 2025
  • DemPref is a framework that combines demonstrations and preference queries to efficiently infer reward functions and policies.
  • It employs Bayesian inference and iterated correction methods to reduce query requirements and accelerate convergence in high-dimensional settings.
  • Recent extensions explicitly model latent user-type heterogeneity using EM algorithms, supporting robust and fair policy optimization.

DemPref refers to a class of machine learning frameworks that integrate demonstrations and preference information, or explicitly model demographic or user-type preference heterogeneity, with the goal of more efficiently and robustly learning reward functions or policies that align with diverse or latent user values. DemPref approaches have been developed across reward learning for robotics, preference-based policy optimization, and multi-objective decision-making. Central features include Bayesian inference from demonstrations, active preference querying, and explicit handling of preference heterogeneity via latent-type models or demographic adaptation.

1. Problem Settings and Motivation

DemPref methods address the challenge of learning reward functions or generative models that reflect the underlying objectives of human users—especially when those objectives are high-dimensional, poorly specified, or heterogeneous. Standard inverse reinforcement learning (IRL) typically relies on expert demonstrations to infer reward functions, assuming optimality or near-optimality and homogeneity across users. Conversely, preference-based learning queries user preferences between alternative trajectories or outputs, but can be inefficient or impractical if user feedback is limited or if preferences vary substantially across demographic groups or latent types.

DemPref frameworks are designed to:

  • Leverage both demonstrations (which provide strong priors on plausible behaviors) and preference queries (which refine alignment with user intent).
  • Accommodate unobserved or observed preference heterogeneity, enabling models to adapt to or robustly aggregate over subpopulations rather than assuming population-level uniformity.
  • Provide scalable and efficient learning algorithms suitable for high-dimensional or multi-objective domains (Palan et al., 2019, Chidambaram et al., 23 May 2024, Lu et al., 15 Jan 2024).

2. Bayesian Integration of Demonstrations and Preferences

DemPref was originally introduced in the context of reward learning for autonomous robots, formalizing a Bayesian two-stage process: (1) demonstrations induce a posterior (or prior) over reward parameters, and (2) active preference queries iteratively refine the posterior (Palan et al., 2019).

Given a parameterized linear reward $R(\xi; w) = w \cdot \Phi(\xi)$ with bounded $w \in \mathbb{R}^k$, a demonstration $D_i$ induces the likelihood $P(D_i \mid w) \propto \exp(\beta^D\, w \cdot \Phi(D_i))$, and a set of demonstrations $\mathcal{B}$ yields the posterior $P(w \mid \mathcal{B}) \propto 1_{\|w\|\leq 1}\, \exp\!\left(\beta^D \sum_{i=1}^n w \cdot \Phi(D_i)\right)$. Preferences over trajectory pairs $(\xi_1, \xi_2)$ are modeled by the Bradley–Terry–Luce likelihood $P(\xi_1 \succ \xi_2 \mid w) = \frac{\exp(\beta^R\, w \cdot \Phi(\xi_1))}{\exp(\beta^R\, w \cdot \Phi(\xi_1)) + \exp(\beta^R\, w \cdot \Phi(\xi_2))}$.

Iterative Bayesian updates combine these inputs, with demonstrations acting as strong priors to reduce the effective search space, and preferences focusing the learned reward function within this space.
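
The sketch below illustrates this two-stage update with a simple sample-based approximation of the posterior: weight samples on the unit ball are reweighted first by the demonstration likelihood and then by each answered preference query. It is a minimal illustration, not the authors' implementation; the feature inputs (`phi_demos`, `pref_pairs`) and hyperparameters are assumed to be supplied by the caller.

```python
import numpy as np

# Minimal sketch (not the paper's code): approximate the DemPref posterior over
# reward weights w with samples on the unit ball, reweighted by demonstration
# and preference likelihoods. phi_demos and pref_pairs hold precomputed
# trajectory features Phi(xi).

def sample_unit_ball(n_samples, dim, rng):
    """Uniform samples with ||w|| <= 1 (the prior support)."""
    w = rng.normal(size=(n_samples, dim))
    w /= np.linalg.norm(w, axis=1, keepdims=True)
    r = rng.uniform(size=(n_samples, 1)) ** (1.0 / dim)
    return w * r

def posterior_weights(phi_demos, pref_pairs, beta_d=1.0, beta_r=1.0,
                      n_samples=10_000, seed=0):
    """Return weight samples and their normalized posterior probabilities.

    phi_demos : array (n_demos, k) of demonstration features Phi(D_i)
    pref_pairs: list of (phi_preferred, phi_rejected) feature pairs
    """
    rng = np.random.default_rng(seed)
    k = phi_demos.shape[1]
    ws = sample_unit_ball(n_samples, k, rng)

    # Demonstration term: exp(beta_d * sum_i w . Phi(D_i)) in log space.
    log_p = beta_d * ws @ phi_demos.sum(axis=0)

    # Preference term: Bradley-Terry-Luce likelihood for each answered query.
    for phi_a, phi_b in pref_pairs:
        da, db = beta_r * ws @ phi_a, beta_r * ws @ phi_b
        log_p += da - np.logaddexp(da, db)   # log P(xi_a > xi_b | w)

    log_p -= log_p.max()                     # numerical stability
    p = np.exp(log_p)
    return ws, p / p.sum()
```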

3. Active Preference Querying and Iterated Correction

A core methodological advance in DemPref is the generation of preference queries that maximize information gain (maximum-volume removal) with respect to the current posterior over rewards. Demonstrations are not only used to initialize the posterior but also to ground the preference query process. The method employs “iterated correction,” where a stored trajectory (often a demonstration) is iteratively updated to the user’s most-preferred choice among new candidate trajectories, further accelerating convergence.

Queries can be pairwise or rankings over $n$ options (modeled by Plackett–Luce), with the query set optimized to maximize the minimum expected information gain across possible user responses.
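
As a concrete illustration, the following sketch scores a candidate set of trajectory pairs by maximum volume removal, reusing the sample-based posterior `(ws, p)` from the previous sketch. The fixed candidate list is an assumption for brevity; the actual method optimizes queries in trajectory space.

```python
import numpy as np

# Minimal sketch of maximum-volume-removal query selection over a candidate set
# of trajectory feature pairs, given posterior samples ws with probabilities p.

def btl_prob(ws, phi_a, phi_b, beta_r=1.0):
    """P(xi_a > xi_b | w) under Bradley-Terry-Luce, for each sampled w."""
    da, db = beta_r * ws @ phi_a, beta_r * ws @ phi_b
    return np.exp(da - np.logaddexp(da, db))

def select_query(ws, p, candidate_pairs, beta_r=1.0):
    """Pick the pair whose worst-case answer removes the most posterior mass."""
    best_pair, best_score = None, -np.inf
    for phi_a, phi_b in candidate_pairs:
        pa = btl_prob(ws, phi_a, phi_b, beta_r)
        mass_a = float(p @ pa)          # expected mass retained if user picks xi_a
        mass_b = float(p @ (1 - pa))    # expected mass retained if user picks xi_b
        # Volume removal: maximize the minimum mass removed across both answers,
        # i.e. prefer queries whose outcome is maximally uncertain under p.
        score = min(1.0 - mass_a, 1.0 - mass_b)
        if score > best_score:
            best_pair, best_score = (phi_a, phi_b), score
    return best_pair, best_score
```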

Numerical experiments in simulated domains (Driver, Lunar Lander, Fetch Reach) show substantial gains in query efficiency: initializing with demonstrations can reduce the number of required queries by a factor of 3–4, and iterated correction further halves the effective query cost for a given alignment metric (Palan et al., 2019).

4. Inference of Preferences from Demonstration in Multi-Objective Settings

In multi-objective Markov decision problems (MOMDPs), demonstration-based preference inference (DemoPI) extends DemPref to infer scalarization weights over multiple objectives (e.g., cost and comfort in energy management) (Lu et al., 15 Jan 2024).

The DWPI (Dynamic Weight-based Preference Inference) algorithm employs a dynamic-weight multi-objective RL (DWMORL) agent trained to cover the full weight simplex. Given a user demonstration $D = \{(s_t, a_t)\}_{t=1}^T$, a compact representation (such as the cumulative vector reward) $\mathbf{x} = \sum_t \mathbf{r}_t$ is mapped to a predicted weight $\hat{\mathbf{w}}$ via a supervised regression model trained on synthetic policy rollouts.
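
A minimal sketch of this pipeline is shown below. It assumes a hypothetical `dwmorl_rollout(w)` helper that rolls out the trained dynamic-weight agent under weight vector `w` and returns its cumulative vector reward; the regressor choice (`MLPRegressor`) is illustrative rather than prescribed by the paper.

```python
import numpy as np
from sklearn.neural_network import MLPRegressor

# Sketch of DWPI-style weight inference: generate synthetic (rollout summary,
# weight) pairs from the trained DWMORL agent, fit a regressor, then invert a
# user demonstration's cumulative vector reward into predicted weights.

def sample_simplex(n, d, rng):
    """Sample n weight vectors uniformly from the d-dimensional simplex."""
    return rng.dirichlet(np.ones(d), size=n)

def build_training_set(dwmorl_rollout, n=5000, n_objectives=2, seed=0):
    rng = np.random.default_rng(seed)
    W = sample_simplex(n, n_objectives, rng)
    # X: cumulative vector reward of a rollout under each sampled weight.
    X = np.stack([dwmorl_rollout(w) for w in W])
    return X, W

def fit_weight_regressor(X, W):
    """Supervised map from cumulative vector reward to scalarization weights."""
    reg = MLPRegressor(hidden_layer_sizes=(64, 64), max_iter=2000)
    reg.fit(X, W)
    return reg

# Inference on a demonstration: summarize it by its cumulative vector reward,
# then predict the weights that best explain it.
# x_user = demo_vector_rewards.sum(axis=0)
# w_hat  = reg.predict(x_user.reshape(1, -1))[0]
```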

Experimental results in energy scheduling domains show that DWPI accurately recovers user weights in all tested scenarios, with trained RL agents replicating user-style cost-comfort trade-offs over extended test periods. The method requires only a single DWMORL training plus lightweight inference, and robustly handles rule-based, heuristic (suboptimal) demonstrations (Lu et al., 15 Jan 2024).

5. Explicit Modeling of Preference Heterogeneity

Standard preference optimization (e.g., DPO) assumes a homogeneous annotator population. Recent extensions adapt DemPref to preference heterogeneity by positing $K$ latent annotator types $z_1, \ldots, z_K$ (Chidambaram et al., 23 May 2024).

Given a dataset of binary preferences grouped by annotator, a type mixture model is trained:

  • Each annotator is assigned a hidden type drawn from a prior $\eta \in \Delta^K$.
  • Each type $z_k$ has a specialized policy $\pi_{\phi, z_k}$.
  • An EM algorithm alternates between inferring type posteriors based on the Bradley–Terry–Luce likelihoods and updating type priors and per-type DPO policy parameters.

Upon EM convergence, this yields a set of type-conditional optimal policies $\{\pi^*_{z_k}\}$ and type mixture weights $\eta^*$.
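
The following sketch shows the shape of such an EM loop under simplifying assumptions: `per_type_logprob(k, prefs)` and `update_type_policy(k, ...)` are hypothetical callables standing in for the Bradley–Terry–Luce log-likelihood under the current type-$k$ policy and the responsibility-weighted per-type DPO update, respectively.

```python
import numpy as np

# Minimal sketch of the EM loop over latent annotator types (not the paper's
# full EM-DPO implementation). annotators is a list, one entry per annotator,
# holding that annotator's preference data.

def em_types(annotators, K, per_type_logprob, update_type_policy, n_iters=20):
    eta = np.full(K, 1.0 / K)                 # type prior on the simplex
    n = len(annotators)
    for _ in range(n_iters):
        # E-step: posterior responsibility of each type for each annotator.
        log_r = np.zeros((n, K))
        for i, prefs in enumerate(annotators):
            for k in range(K):
                log_r[i, k] = np.log(eta[k]) + per_type_logprob(k, prefs)
        log_r -= log_r.max(axis=1, keepdims=True)
        r = np.exp(log_r)
        r /= r.sum(axis=1, keepdims=True)

        # M-step: update the type prior and (schematically) each type's policy
        # with annotator data weighted by its responsibility for that type.
        eta = r.mean(axis=0)
        for k in range(K):
            update_type_policy(k, annotators, weights=r[:, k])
    return eta, r
```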

Further, a min–max regret ensemble is constructed to yield a single policy within the convex hull of per-type policies that minimizes worst-case subgroup regret,

$$\min_{w \in \Delta^K}\; \max_{k = 1, \dots, K}\; \sum_{\ell=1}^K w_\ell \left( \mathcal{L}_{z_k, z_k} - \mathcal{L}_{z_k, z_\ell} \right),$$

with the differences $\mathcal{L}_{z_k, z_k} - \mathcal{L}_{z_k, z_\ell}$ capturing log-likelihood discrepancies between the matched and mismatched policies for each type. This framework accommodates both latent and observed demographic clusters.
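
Because the objective is a minimum over the simplex of a maximum of linear functions of $w$, the mixture weights can be recovered with a small linear program. The sketch below assumes a precomputed `regret` matrix whose $(k, \ell)$ entry estimates $\mathcal{L}_{z_k, z_k} - \mathcal{L}_{z_k, z_\ell}$ from held-out preference data.

```python
import numpy as np
from scipy.optimize import linprog

# Sketch: solve the min-max regret mixture weights as a linear program.
# regret[k, l] ~ L_{z_k,z_k} - L_{z_k,z_l}, the assumed regret of serving
# type z_k with the type-z_l policy.

def minmax_mixture(regret):
    K = regret.shape[0]
    # Variables: x = [w_1, ..., w_K, t]; minimize the epigraph variable t.
    c = np.zeros(K + 1)
    c[-1] = 1.0
    # Constraints: (regret @ w)_k <= t for every type k.
    A_ub = np.hstack([regret, -np.ones((K, 1))])
    b_ub = np.zeros(K)
    # w lies on the simplex: sum_l w_l = 1, w_l >= 0; t is unbounded.
    A_eq = np.hstack([np.ones((1, K)), np.zeros((1, 1))])
    b_eq = np.array([1.0])
    bounds = [(0, None)] * K + [(None, None)]
    res = linprog(c, A_ub=A_ub, b_ub=b_ub, A_eq=A_eq, b_eq=b_eq, bounds=bounds)
    return res.x[:K], res.x[-1]   # mixture weights, worst-case regret
```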

If demographic group labels are available, the membership can be fixed; otherwise, EM recovers clusters that often correlate with demographics (Chidambaram et al., 23 May 2024).

6. Practical Applications and Empirical Results

DemPref methodologies have been applied in robotics, generative LLM alignment, and multi-objective decision problems, with representative settings summarized below:

| Domain | DemPref Variant | Key Outcomes |
| --- | --- | --- |
| Robotics (Fetch Arm, Driver, Lander) | Demonstration + Preference | 3–4× query efficiency; higher user preference alignment (Palan et al., 2019) |
| Energy Management (MORL) | DemoPI/DWPI | Accurate recovery of user trade-offs; behavioral alignment (Lu et al., 15 Jan 2024) |
| Generative Models (RLHF/DPO) | EM-DPO + MinMax-DPO | Equitable policy serving diverse preferences (Chidambaram et al., 23 May 2024) |

In user studies comparing DemPref to classic IRL, participants rated robots trained with DemPref-based methods as significantly better at accomplishing tasks and aligning with user intent (p≈0.02), with no evidence of increased user burden (Palan et al., 2019). In MORL, DWPI enables practical and interpretable user preference extraction, suitable for embedded systems (Lu et al., 15 Jan 2024). Experiments in RLHF/DPO demonstrate that explicit handling of preference heterogeneity improves fairness and regret guarantees across annotator subgroups (Chidambaram et al., 23 May 2024).

7. Limitations and Future Research Directions

Several limitations persist in current DemPref research:

  • Query optimization can incur significant computation time due to nonconvexity, raising latency concerns (Palan et al., 2019).
  • Cognitive load of ranking or comparing multiple trajectories can challenge annotators, especially in high-dimensional spaces.
  • Most methods depend on a feature-based (often linear) reward representation; feature misspecification can degrade performance.
  • For modeling demographic heterogeneity, label availability, subgroup data scarcity, and privacy/fairness compliance are practical challenges. Rare or intersectional subgroups may require priors or parameter sharing (Chidambaram et al., 23 May 2024).
  • Demonstrations in some settings are synthetic or rule-based; validation on fully human data is pending in certain domains (Lu et al., 15 Jan 2024).

Research directions include distributed query optimization, interpretable query generation, nonlinear reward learning (e.g., deep features), richer feedback modalities, and multi-agent or intersectional extensions. In the context of demographic preference modeling, there is an ongoing focus on robust aggregation, fairness, and privacy-preserving inference.


Overall, DemPref unifies demonstration and preference learning under a Bayesian framework, and, in its recent incarnations, extends to nuanced modeling of heterogeneous or latent human objectives, with quantitatively validated benefits in efficiency, robustness, and equitable policy alignment (Palan et al., 2019, Lu et al., 15 Jan 2024, Chidambaram et al., 23 May 2024).
