DemPref: Integrating Demos & Preferences
- DemPref is a framework that combines demonstrations and preference queries to efficiently infer reward functions and policies.
- It employs Bayesian inference and iterated correction methods to reduce query requirements and accelerate convergence in high-dimensional settings.
- Recent extensions explicitly model latent user-type heterogeneity using EM algorithms, enabling more robust and equitable policy optimization.
DemPref refers to a class of machine learning frameworks that integrate demonstrations and preference information, or explicitly model demographic or user-type preference heterogeneity, with the goal of more efficiently and robustly learning reward functions or policies that align with diverse or latent user values. DemPref approaches have been developed across reward learning for robotics, preference-based policy optimization, and multi-objective decision-making. Central features include Bayesian inference from demonstrations, active preference querying, and explicit handling of preference heterogeneity via latent-type models or demographic adaptation.
1. Problem Settings and Motivation
DemPref methods address the challenge of learning reward functions or generative models that reflect the underlying objectives of human users—especially when those objectives are high-dimensional, poorly specified, or heterogeneous. Standard inverse reinforcement learning (IRL) typically relies on expert demonstrations to infer reward functions, assuming optimality or near-optimality and homogeneity across users. Conversely, preference-based learning queries user preferences between alternative trajectories or outputs, but can be inefficient or impractical if user feedback is limited or if preferences vary substantially across demographic groups or latent types.
DemPref frameworks are designed to:
- Leverage both demonstrations (which provide strong priors on plausible behaviors) and preference queries (which refine alignment with user intent).
- Accommodate unobserved or observed preference heterogeneity, enabling models to adapt to or robustly aggregate over subpopulations rather than assuming population-level uniformity.
- Provide scalable and efficient learning algorithms suitable for high-dimensional or multi-objective domains (Palan et al., 2019, Chidambaram et al., 23 May 2024, Lu et al., 15 Jan 2024).
2. Bayesian Integration of Demonstrations and Preferences
DemPref was originally introduced in the context of reward learning for autonomous robots, formalizing a Bayesian two-stage process: (1) demonstrations induce a posterior (or prior) over reward parameters, and (2) active preference queries iteratively refine the posterior (Palan et al., 2019).
Given a parameterized linear reward $r_w(\xi) = w \cdot \phi(\xi)$ with bounded $\lVert w \rVert \le 1$, a demonstration $\xi_D$ induces likelihood $P(\xi_D \mid w) \propto \exp\big(\beta\, w \cdot \phi(\xi_D)\big)$, and a set of demonstrations $\mathcal{D}$ yields a posterior $P(w \mid \mathcal{D}) \propto P(w) \prod_{\xi_D \in \mathcal{D}} P(\xi_D \mid w)$. Preferences over trajectory pairs $(\xi_A, \xi_B)$ are modeled by the Bradley–Terry–Luce likelihood $P(\xi_A \succ \xi_B \mid w) = \frac{\exp\big(w \cdot \phi(\xi_A)\big)}{\exp\big(w \cdot \phi(\xi_A)\big) + \exp\big(w \cdot \phi(\xi_B)\big)}$.
Iterative Bayesian updates combine these inputs, with demonstrations acting as strong priors to reduce the effective search space, and preferences focusing the learned reward function within this space.
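As a concrete illustration, the sketch below shows how a sampling-based posterior over reward weights could be updated first by a demonstration likelihood and then by a Bradley–Terry–Luce preference response. The feature vectors, the rationality parameter `beta`, and the sample counts are hypothetical stand-ins, not values from the original work.

```python
import numpy as np

rng = np.random.default_rng(0)

def sample_unit_ball(n, d):
    """Sample candidate reward weights w with ||w|| <= 1 (the prior support)."""
    x = rng.normal(size=(n, d))
    x /= np.linalg.norm(x, axis=1, keepdims=True)
    r = rng.uniform(size=(n, 1)) ** (1.0 / d)
    return x * r

def demo_loglik(W, phi_demo, beta=1.0):
    """Boltzmann-style demonstration likelihood: P(xi_D | w) ∝ exp(beta * w·phi(xi_D))."""
    return beta * W @ phi_demo

def pref_loglik(W, phi_a, phi_b):
    """Bradley-Terry-Luce log-likelihood that trajectory A is preferred to B."""
    ra, rb = W @ phi_a, W @ phi_b
    return ra - np.logaddexp(ra, rb)

# Hypothetical 3-dimensional feature vectors for one demonstration and one answered query.
phi_demo = np.array([0.8, 0.1, 0.3])
phi_a, phi_b = np.array([0.7, 0.2, 0.1]), np.array([0.2, 0.6, 0.4])

W = sample_unit_ball(10_000, 3)            # prior samples of w
logw = demo_loglik(W, phi_demo)            # update on the demonstration
logw += pref_loglik(W, phi_a, phi_b)       # update on the preference answer (A preferred)
weights = np.exp(logw - logw.max())
weights /= weights.sum()

w_mean = weights @ W                       # posterior mean reward weights
print("posterior mean w:", w_mean)
```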
3. Active Preference Querying and Iterated Correction
A core methodological advance in DemPref is the generation of preference queries that maximize information gain (maximum-volume removal) with respect to the current posterior over rewards. Demonstrations are not only used to initialize the posterior but also to ground the preference query process. The method employs “iterated correction,” where a stored trajectory (often a demonstration) is iteratively updated to the user’s most-preferred choice among new candidate trajectories, further accelerating convergence.
Queries can be pairwise or rankings over options (modeled by Plackett–Luce), with the query set optimized to maximize the minimum expected information gain across possible user responses.
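A minimal sketch of how a pairwise query could be scored under maximum-volume removal: each candidate pair is rated by the smaller of the posterior masses its two possible answers would remove, and the highest-scoring pair is asked. The uniform posterior samples and the candidate trajectory pool below are hypothetical placeholders, not the optimization used in the paper.

```python
import numpy as np

rng = np.random.default_rng(1)

def btl_prob(W, phi_a, phi_b):
    """P(A preferred to B | w) under Bradley-Terry-Luce, for each sampled w."""
    ra, rb = W @ phi_a, W @ phi_b
    return np.exp(ra - np.logaddexp(ra, rb))

def volume_removed(W, weights, phi_a, phi_b):
    """Maximum-volume-removal score of a pairwise query: the smaller of the
    posterior masses removed by the two possible answers."""
    p_a = btl_prob(W, phi_a, phi_b)
    removed_if_a = weights @ (1.0 - p_a)   # mass removed if the user answers "A"
    removed_if_b = weights @ p_a           # mass removed if the user answers "B"
    return min(removed_if_a, removed_if_b)

# Posterior samples over reward weights (a uniform stand-in here) and a
# hypothetical pool of candidate trajectory feature vectors.
W = rng.normal(size=(5000, 3))
W /= np.linalg.norm(W, axis=1, keepdims=True)
weights = np.full(len(W), 1.0 / len(W))
pool = [rng.normal(size=3) for _ in range(20)]

best_pair = max(
    ((i, j) for i in range(len(pool)) for j in range(i + 1, len(pool))),
    key=lambda ij: volume_removed(W, weights, pool[ij[0]], pool[ij[1]]),
)
print("most informative query (trajectory indices):", best_pair)
```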
Numerical experiments in simulated domains (Driver, Lunar Lander, Fetch Reach) demonstrate substantial query efficiency—demonstrations can reduce the number of required queries by a factor of 3–4, and iterated correction further halves the effective query cost for a given alignment metric (Palan et al., 2019).
4. Inference of Preferences from Demonstration in Multi-Objective Settings
In multi-objective Markov decision problems (MOMDPs), demonstration-based preference inference (DemoPI) extends DemPref to infer scalarization weights over multiple objectives (e.g., cost and comfort in energy management) (Lu et al., 15 Jan 2024).
The DWPI (Dynamic Weight-based Preference Inference) algorithm employs a dynamic-weight multi-objective RL (DWMORL) agent trained to cover the full weight simplex. Given a user demonstration $\xi_D$, a compact representation (such as its cumulative vector reward) is mapped to a predicted weight vector $\hat{\mathbf{w}}$ via a supervised regression model trained on synthetic policy rollouts.
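The sketch below illustrates the DWPI idea under simplifying assumptions: a stand-in `rollout_return` function replaces the trained DWMORL agent, and an off-the-shelf MLP regressor maps cumulative vector returns back to preference weights. All names, dimensions, and numbers are illustrative, not taken from the cited implementation.

```python
import numpy as np
from sklearn.neural_network import MLPRegressor

rng = np.random.default_rng(0)

def rollout_return(weights):
    """Stand-in for rolling out the trained dynamic-weight MORL agent under a given
    preference weight vector and returning its cumulative vector reward. A hypothetical
    smooth mapping plus noise replaces the real environment here."""
    base = np.array([[-10.0, 2.0], [3.0, -8.0]])   # hypothetical cost/comfort response
    return weights @ base + rng.normal(scale=0.1, size=2)

# Build a synthetic training set covering the 2-objective weight simplex.
train_w = rng.dirichlet(np.ones(2), size=500)
train_g = np.array([rollout_return(w) for w in train_w])

# Supervised regression from cumulative vector returns to preference weights.
model = MLPRegressor(hidden_layer_sizes=(32, 32), max_iter=2000, random_state=0)
model.fit(train_g, train_w)

# Inference: given a user demonstration's cumulative vector reward, predict the weights.
demo_return = rollout_return(np.array([0.7, 0.3]))
w_hat = model.predict(demo_return.reshape(1, -1))[0]
print("inferred preference weights:", w_hat / w_hat.sum())
```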
Experimental results in energy scheduling domains show that DWPI accurately recovers user weights in all tested scenarios, with trained RL agents replicating user-style cost-comfort trade-offs over extended test periods. The method requires only a single DWMORL training plus lightweight inference, and robustly handles rule-based, heuristic (suboptimal) demonstrations (Lu et al., 15 Jan 2024).
5. Explicit Modeling of Preference Heterogeneity
Standard preference optimization (e.g., DPO) assumes annotation population homogeneity. Recent extensions adapt DemPref to preference heterogeneity by positing latent types per annotator (Chidambaram et al., 23 May 2024).
Given a dataset of binary preferences grouped by annotator, a type mixture model is trained:
- Each annotator is assigned a hidden type $z \in \{1, \dots, K\}$ drawn from a prior $\rho = (\rho_1, \dots, \rho_K)$.
- Each type $k$ has a specialized policy $\pi_k$.
- An EM algorithm alternates between inferring type posteriors based on the Bradley–Terry–Luce likelihoods and updating type priors and per-type DPO policy parameters.
Upon EM convergence, this yields a set of type-conditional optimal policies $\{\pi_k^*\}$ and type mixture weights $\{\rho_k\}$.
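A compact sketch of the EM loop under a strong simplification: each type is represented by a linear Bradley–Terry reward model rather than a full DPO policy, so the M-step is a responsibility-weighted gradient step instead of a DPO update. The helper names and the toy data are hypothetical.

```python
import numpy as np

rng = np.random.default_rng(0)

def btl_loglik(theta, pref):
    """Log-likelihood of one preference (phi_winner, phi_loser) under type parameters theta."""
    d = theta @ (pref[0] - pref[1])
    return -np.logaddexp(0.0, -d)

def em_types(annotator_prefs, K, dim, n_iters=50, lr=0.1):
    """EM over latent annotator types. annotator_prefs[a] is a list of
    (phi_winner, phi_loser) feature pairs contributed by annotator a."""
    thetas = rng.normal(scale=0.1, size=(K, dim))   # per-type reward parameters
    rho = np.full(K, 1.0 / K)                       # type prior
    A = len(annotator_prefs)
    for _ in range(n_iters):
        # E-step: responsibility of each type for each annotator.
        logp = np.array([[np.log(rho[k]) + sum(btl_loglik(thetas[k], p) for p in prefs)
                          for k in range(K)] for prefs in annotator_prefs])
        logp -= logp.max(axis=1, keepdims=True)
        resp = np.exp(logp)
        resp /= resp.sum(axis=1, keepdims=True)
        # M-step: update the type prior and take a gradient step on each type's parameters.
        rho = resp.mean(axis=0)
        for k in range(K):
            grad = np.zeros(dim)
            for a, prefs in enumerate(annotator_prefs):
                for phi_w, phi_l in prefs:
                    d = thetas[k] @ (phi_w - phi_l)
                    grad += resp[a, k] * (1.0 - 1.0 / (1.0 + np.exp(-d))) * (phi_w - phi_l)
            thetas[k] += lr * grad / A
    return rho, thetas, resp

# Hypothetical usage: three annotators with two opposing preference patterns.
prefs_type0 = [(np.array([1.0, 0.0]), np.array([0.0, 1.0]))] * 5
prefs_type1 = [(np.array([0.0, 1.0]), np.array([1.0, 0.0]))] * 5
rho, thetas, resp = em_types([prefs_type0, prefs_type1, prefs_type0], K=2, dim=2)
print("type prior:", rho)
```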
Further, a min–max regret ensemble is constructed: a mixture weight $\alpha \in \Delta_K$ over the per-type policies is chosen to minimize the worst-case subgroup regret, $\min_{\alpha \in \Delta_K} \max_k \mathrm{Regret}_k(\alpha)$, where the regret terms are log-likelihood discrepancies between policies. The resulting policy lies within the convex hull of the per-type policies, and the framework accommodates both latent and observed demographic clusters.
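If the mixture regret is assumed linear in the mixture weights, the min–max selection reduces to a small linear program, as in the sketch below. The regret matrix and the linearity assumption are illustrative, not taken from the cited paper.

```python
import numpy as np
from scipy.optimize import linprog

# Hypothetical regret matrix: R[k, j] is the (log-likelihood based) regret that
# user type k suffers when served the type-j policy.
R = np.array([
    [0.0, 1.2, 0.9],
    [1.1, 0.0, 0.8],
    [0.7, 0.9, 0.0],
])
K, J = R.shape

# Variables: mixture weights alpha (J of them) plus the worst-case regret t.
c = np.concatenate([np.zeros(J), [1.0]])                   # minimize t
A_ub = np.hstack([R, -np.ones((K, 1))])                    # R @ alpha - t <= 0 for every type
b_ub = np.zeros(K)
A_eq = np.concatenate([np.ones(J), [0.0]]).reshape(1, -1)  # alpha sums to one
b_eq = np.array([1.0])
bounds = [(0, None)] * J + [(None, None)]

res = linprog(c, A_ub=A_ub, b_ub=b_ub, A_eq=A_eq, b_eq=b_eq, bounds=bounds)
alpha, worst_regret = res.x[:J], res.x[J]
print("mixture weights:", alpha, "worst-case regret:", worst_regret)
```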
If demographic group labels are available, the membership can be fixed; otherwise, EM recovers clusters that often correlate with demographics (Chidambaram et al., 23 May 2024).
6. Practical Applications and Empirical Results
DemPref methodologies have been applied in robotics, generative LLM alignment, and multi-objective decision problems, with demonstrated results in the following settings:
| Domain | DemPref Variant | Key Outcomes |
|---|---|---|
| Robotics (Fetch Arm, Driver, Lander) | Demonstration + Preference | 3–4× query efficiency; higher user preference alignment (Palan et al., 2019) |
| Energy Management (MORL) | DemoPI/DWPI | Accurate recovery of user trade-offs; behavioral alignment (Lu et al., 15 Jan 2024) |
| Generative Models (RLHF/DPO) | EM-DPO + MinMax-DPO | Equitable policy serving diverse preferences (Chidambaram et al., 23 May 2024) |
In user studies comparing DemPref to classic IRL, participants rated robots trained with DemPref-based methods as significantly better at accomplishing tasks and aligning with user intent (p≈0.02), with no evidence of increased user burden (Palan et al., 2019). In MORL, DWPI enables practical and interpretable user preference extraction, suitable for embedded systems (Lu et al., 15 Jan 2024). Experiments in RLHF/DPO demonstrate that explicit handling of preference heterogeneity improves fairness and regret guarantees across annotator subgroups (Chidambaram et al., 23 May 2024).
7. Limitations and Future Research Directions
Several limitations persist in current DemPref research:
- Query optimization can incur significant computation time due to nonconvexity, raising latency concerns (Palan et al., 2019).
- Cognitive load of ranking or comparing multiple trajectories can challenge annotators, especially in high-dimensional spaces.
- Most methods depend on a feature-based (often linear) reward representation; feature misspecification can degrade performance.
- For modeling demographic heterogeneity, label availability, subgroup data scarcity, and privacy/fairness compliance are practical challenges. Rare or intersectional subgroups may require informative priors or parameter sharing (Chidambaram et al., 23 May 2024).
- Demonstrations in some settings are synthetic or rule-based; validation on fully human data is pending in certain domains (Lu et al., 15 Jan 2024).
Research directions include distributed query optimization, interpretable query generation, nonlinear reward learning (e.g., deep features), richer feedback modalities, and multi-agent or intersectional extensions. In the context of demographic preference modeling, there is an ongoing focus on robust aggregation, fairness, and privacy-preserving inference.
Overall, DemPref unifies demonstration and preference learning under a Bayesian framework, and, in its recent incarnations, extends to nuanced modeling of heterogeneous or latent human objectives, with quantitatively validated benefits in efficiency, robustness, and equitable policy alignment (Palan et al., 2019, Lu et al., 15 Jan 2024, Chidambaram et al., 23 May 2024).