
Interactive Imitation Learning (HG-DAgger)

Updated 7 January 2026
  • Interactive imitation learning is a framework where expert interventions correct a novice's actions to overcome distributional shift and error compounding.
  • HG-DAgger introduces a human-controlled gating schedule with uncertainty-driven risk metrics to aggregate high-fidelity data for policy training.
  • Extensions such as LazyDAgger and ThriftyDAgger optimize query efficiency and minimize supervisory burden, enhancing real-world safety and performance.

Interactive imitation learning denotes a paradigm wherein the expert policy is queried online to generate corrective supervisory actions in response to the current state distribution induced by a learning agent. Human-Gated DAgger (HG-DAgger) is an influential instance within this framework, addressing the practical and theoretical limitations of classical DAgger when deploying human experts in real-world continuous control domains. HG-DAgger introduces a dynamic, supervisor-controlled “gating schedule” and learns a risk-based intervention metric, enabling efficient, safe data aggregation from human supervisors. This methodology has catalyzed a rich ecosystem of successor algorithms, including robot-gated, budgeted, active, and hierarchical extensions targeting human-aware supervision, cost-sensitive intervention, and query efficiency.

1. Interactive Imitation Learning and Human-Gated DAgger Fundamentals

Imitation learning aims to train a policy $\pi_N$ that replicates the behavior of an expert $\pi_H$, conventionally through a supervised learning approach called behavioral cloning (BC). BC suffers from distributional shift and compounding error: since the agent is only exposed to expert-visited states during training, errors accumulate rapidly in previously unseen regions. The DAgger algorithm mitigates this problem by iteratively rolling out the current novice, sampling states from the induced distribution, and querying the expert for corrective actions to augment the training set, thereby reducing regret from $O(H^2)$ to $O(H)$ in the episode horizon $H$ (Kelly et al., 2018).
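
For concreteness, the loop can be sketched in a few lines of Python; the `novice`, `fit`, `expert`, and `rollout_states` callables below are hypothetical placeholders, not the published implementation.

```python
# Minimal DAgger sketch: roll out the current novice, have the expert relabel
# every visited state, aggregate, and retrain on the growing dataset.
def dagger(novice, fit, expert, rollout_states, n_iters=10):
    dataset = []                                      # aggregated (state, expert_action) pairs
    for _ in range(n_iters):
        states = rollout_states(novice)               # states visited while the novice acts
        dataset += [(s, expert(s)) for s in states]   # expert relabels every visited state
        novice = fit(dataset)                         # supervised retraining on all data so far
    return novice
```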

However, standard DAgger typically uses a stochastic $\beta$-mixture schedule between expert and novice actions, which is incompatible with the constraints of human experts in physical systems. Human experts require uninterrupted control to provide high-quality corrective labels and maintain safety. HG-DAgger resolves this by introducing a gating function $g(x_t)$ under human control, so that at each state $x_t$, the system either delegates to the expert ($g(x_t) = 1$) or the novice ($g(x_t) = 0$) (Kelly et al., 2018). Data is aggregated exclusively from periods of human control, ensuring label fidelity and safe interactive learning.
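
A single gated rollout could be sketched as follows, assuming a Gym-style environment and hypothetical `novice`, `human_gate`, and `human_action` callables; the key point is that only human-controlled steps contribute labels.

```python
# HG-DAgger-style rollout sketch: the human decides when to take over, and
# (observation, action) pairs are recorded only while the human is in control.
def hg_dagger_rollout(env, novice, human_gate, human_action, horizon=1000):
    data = []
    obs = env.reset()
    for _ in range(horizon):
        if human_gate(obs):                  # g(x_t) = 1: human has the controls
            act = human_action(obs)
            data.append((obs, act))          # one-sided aggregation: human labels only
        else:                                # g(x_t) = 0: novice acts, nothing is recorded
            act = novice(obs)
        obs, _, done, _ = env.step(act)      # Gym-style step assumed
        if done:
            break
    return data
```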

2. Safety-Risk Metrics, Threshold Tuning, and Data Aggregation

A key innovation of HG-DAgger is its uncertainty-based risk metric derived from an ensemble of neural network policies. For an observation $o_t$, the novice ensemble outputs a covariance matrix $C_t$, and the "doubt" $d_N(o_t) = \|\mathrm{diag}(C_t)\|_2$ quantifies the model's epistemic uncertainty. Throughout training, $d_N$ is logged whenever the human intervenes, producing a dataset of "risk values" at intervention points.
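
Assuming each ensemble member's predicted action for the current observation is available as a row of an array, the doubt computation reduces to a few lines of NumPy (a sketch, not the reference code):

```python
import numpy as np

def doubt(ensemble_actions):
    """d_N(o_t) = ||diag(C_t)||_2, the L2 norm of the per-dimension action
    variances across the ensemble for a single observation.

    ensemble_actions: array of shape (n_models, action_dim)."""
    cov = np.atleast_2d(np.cov(ensemble_actions, rowvar=False))  # covariance C_t
    return float(np.linalg.norm(np.diag(cov)))
```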

Threshold selection for automated risk gating at deployment is handled statistically: the safety threshold $\tau$ is set to the mean of the top 25% of recorded $d_N$ values, focusing on high-risk interventions late in learning. At evaluation or in autonomous operation, the system switches to a fallback if $d_N(o_t) > \tau$, implementing an empirical risk-set boundary (Kelly et al., 2018).
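
A minimal sketch of this threshold rule, assuming the doubt values recorded at intervention points have been collected into a list (the 25% fraction follows the description above):

```python
import numpy as np

def risk_threshold(intervention_doubts, top_frac=0.25):
    """tau = mean of the top `top_frac` fraction of doubt values logged at
    human-intervention points."""
    d = np.sort(np.asarray(intervention_doubts, dtype=float))
    k = max(1, int(round(top_frac * d.size)))
    return float(d[-k:].mean())

# At deployment (hypothetical `fallback` controller):
#   act = fallback(obs) if doubt(ensemble_actions) > tau else novice(obs)
```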

All data aggregation in HG-DAgger is one-sided: labels are collected only when the human is in control. After each epoch, the novice is retrained on the aggregated dataset, and the risk metric is updated to reflect the latest policy uncertainty landscape.

3. Supervisory Burden, Human-Aware Extensions, and Query Efficiency

HG-DAgger emphasizes minimizing human burden by controlling when and for how long interventions occur. Subsequent variants such as LazyDAgger (Hoque et al., 2021), ThriftyDAgger (Hoque et al., 2021), and other budgeted schemes extend this principle by formalizing supervisor burden as a function of the context-switch cost $L$ per intervention, the total number of switches $C(\pi_{\mathrm{switch}})$, and the cumulative human action count $I(\pi_{\mathrm{switch}})$.
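
These terms can be rolled into a single scalar; the sketch below shows one plausible accounting (a hypothetical helper, not the exact objective used in the papers).

```python
def supervisor_burden(human_gate_trace, switch_cost):
    """Burden = switch_cost * (number of hand-offs) + (number of human-labeled steps).

    human_gate_trace: per-timestep 0/1 flags, 1 while the human is in control."""
    switches = sum(a != b for a, b in zip(human_gate_trace, human_gate_trace[1:]))
    human_steps = sum(human_gate_trace)
    return switch_cost * switches + human_steps
```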

ThriftyDAgger replaces human gating with automatically learned robot-gated intervention triggers, based on state novelty (ensemble action variance) and risk (the estimated probability of failing to reach the goal, as assessed by a learned critic $Q_\phi$) (Hoque et al., 2021). Quantile thresholding aligns intervention frequency with a user-specified budget $\alpha_{\mathrm{sup}}^{\mathrm{des}}$, yielding high task success rates with minimal queries in both simulated and physical-robot experiments. LazyDAgger, by imposing asymmetric "hysteresis" thresholds for entering and exiting supervisor mode, achieves a significant reduction in context switches (up to 79% in MuJoCo domains) while matching SafeDAgger and DAgger policy performance (Hoque et al., 2021).
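
The quantile-based gating can be illustrated with a simplified sketch; the real algorithm recomputes thresholds online and scores risk with a learned critic, whereas the helpers below are hypothetical and operate on pre-logged values.

```python
import numpy as np

def quantile_thresholds(novelty_log, risk_log, budget=0.1):
    """Choose thresholds so each criterion fires on roughly a `budget`
    fraction of the logged states (in the spirit of alpha_sup^des)."""
    return (float(np.quantile(novelty_log, 1.0 - budget)),
            float(np.quantile(risk_log, 1.0 - budget)))

def request_human(novelty, risk, t_novelty, t_risk):
    # Hand control to the supervisor if the state looks novel OR likely to fail.
    return novelty > t_novelty or risk > t_risk
```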

This direction has facilitated the design of query-efficient active imitation learning frameworks, including adversarial reward-model-based approaches and hierarchical decomposition, which restrict queries to the most value-informative or risky states (Hsu, 2019, Niu et al., 2020).

4. Sample Complexity, Theoretical Guarantees, and Comparative Analysis

Classic DAgger enjoys no-regret online learning guarantees, ensuring bounded regret with respect to the expert policy under mild assumptions. HG-DAgger inherits these guarantees provided that the gating prevents significant distributional drift. Recent theoretical investigations have focused on sample complexity and optimality for interactive variants:

  • One-sample-per-round DAgger and first-step mixture DAgger match or exceed the minimax-optimal rate of behavior cloning (under log loss and appropriate mixing), thereby offering sharper guarantees in settings with realizable deterministic experts (Li et al., 2024).
  • RLIF, an RL-based interactive imitation framework, replaces the action-optimality requirement of the expert with the assumption that expert interventions are correlated with poor learner actions ($P(\mathrm{intervene} \mid s) > \beta > 1/2$ for suboptimal $Q$). RLIF's regret degrades at most as $(1-\beta)/\beta$ above the stronger of the expert or reference policies, and it can outperform DAgger under suboptimal or inconsistent human correction (Luo et al., 2023).
  • Empirical studies demonstrate that human-gated and robot-gated interactive learning dramatically outperform both behavior cloning and standard DAgger in safety-critical and distribution-shifting regimes (Kelly et al., 2018, Hoque et al., 2021).

5. Hierarchical, Active, and Multi-Expert Interactive Extensions

The interactive imitation learning ecosystem now encompasses hierarchical and multi-expert variants inspired by HG-DAgger principles. Hierarchical systems—such as two-level frameworks for navigation—combine high-level meta-controllers (trained by DAgger with active query strategies) and low-level controllers (learned by RL), successfully reducing demonstration cost and cognitive burden while raising success rates (Niu et al., 2020, Bi et al., 2018).

MEGA-DAgger generalizes HG-DAgger to the case of multiple, imperfect experts, employing safety-filtered data aggregation and conflict resolution among expert-provided labels via control-barrier-based safety metrics and scenario-specific scores (Sun et al., 2023). Other research addresses the challenge of erroneous or suboptimal expert interventions by explicit intervention filtering, reward modeling, and value-sensitive policy updates, including AggreVaTe, which leverages cost-to-go estimates for cost-sensitive imitation (Ross et al., 2014).

6. Use Cases, Experimental Results, and Practical Considerations

Experiments across driving, robotic manipulation, and multi-robot fleet control domains consistently validate the advantages of interactive, human-gated imitation learning:

  • HG-DAgger delivers 0 collisions and departures in real-vehicle tests, with its risk-set boundary providing a 12× reduction in collision rate compared to out-of-set states (Kelly et al., 2018).
  • ThriftyDAgger achieves 73% autonomous success rate in MuJoCo peg-insertion (vs. 57% for HG-DAgger), and in a physical da Vinci robot cable routing task, 15/15 interactive task completions with only ~1.42 interventions per 100 steps (Hoque et al., 2021).
  • LazyDAgger reduces context switches by 60% over SafeDAgger in continuous control and physical fabric manipulation with equal or superior success, highlighting the benefit of intelligent gating for real-world supervisors (Hoque et al., 2021).
  • Active learning extensions, such as reward-based trajectory selection, improve subjective human experience and reduce required demonstrations by 30%, as measured in hierarchical navigation tasks (Niu et al., 2020).

Key practical recommendations include calibrating intervention thresholds from late-stage risk logs, tuning ensemble size and critic capacity to balance informativeness with computational tractability, and formalizing supervisor burden to integrate human factors into algorithmic design.

7. Limitations and Open Directions

Current implementations of HG-DAgger and its extensions depend on the fidelity and responsiveness of human supervisors, which may limit scalability to high-velocity or high-risk domains. Discrete switching between supervisor and learner can still be suboptimal for smooth control tasks. The safety or "risk" thresholds are empirically derived and subject to inaccuracies if the ensemble model fails to capture epistemic uncertainty. Formal guarantees beyond standard no-regret bounds are generally lacking, especially in the presence of imperfect or inconsistent experts, partial observability, and nonstationary environments (Kelly et al., 2018, Hoque et al., 2021, Sun et al., 2023).

Future work prominently targets automatic risk gating without explicit human intervention, theoretical analysis of sample complexity under human latency and error, scalable multi-fleet extensions, and integrated learning from demonstration, intervention, and reward signals for robust robot autonomy. Extensions to interactive RL, hybrid reward-imitation pipelines, and scenarios with limited or dynamically varying intervention budgets remain active research frontiers.
