HG-DAgger: Human-Gated Imitation Learning

Updated 8 January 2026
  • HG-DAgger is an interactive imitation learning algorithm that integrates human interventions with uncertainty-aware risk estimation in autonomous driving.
  • It employs a real-time gating mechanism and an ensemble-based uncertainty metric to selectively balance expert and novice actions during training.
  • Empirical evaluations show HG-DAgger reduces collision and road departure rates, offering improved safety and stability compared to BC and standard DAgger.

HG-DAgger is an interactive imitation learning algorithm specifically designed to accommodate human experts in real-world systems. It addresses inherent deficiencies in classical behavioral cloning (BC) and Dataset Aggregation (DAgger), especially in the context of high-stakes domains like autonomous driving, where compounding errors, safety, and human label quality are central concerns. HG-DAgger introduces a principled, human-gated control interface and uncertainty-aware risk estimation, yielding superior safety, sample efficiency, and human-likeness in learned policies (Kelly et al., 2018).

1. Background: Imitation Learning and DAgger

Imitation learning seeks a novice policy $\pi_N$ that closely mimics an expert $\pi_H$, usually through supervised learning on expert demonstrations $\{(x, a_H = \pi_H(x))\}$:

$$\min_{\theta}\sum_{(x,a_H)\in\mathcal D_{\rm BC}} \|\pi_N^\theta(x) - a_H\|^2.$$

Behavioral cloning trains $\pi_N$ solely on states encountered by $\pi_H$, causing distributional shift: $\pi_N$ may experience unfamiliar states at test time, leading to compounding errors. DAgger counters this by stochastically interleaving expert and novice actions during data collection, with a Bernoulli gate (probability $\beta$ for expert), aggregating states from both policies and thereby reducing the mismatch.
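As a minimal sketch of the two ideas above, the snippet below evaluates the squared-error behavioral cloning objective on a batch and draws DAgger's Bernoulli gate. The function names (`bc_loss`, `dagger_mixture_action`) and the toy values are illustrative assumptions, not taken from the paper.

```python
import numpy as np

def bc_loss(predicted_actions, expert_actions):
    """Sum of squared errors between novice predictions and expert labels."""
    diff = np.asarray(predicted_actions, float) - np.asarray(expert_actions, float)
    return float(np.sum(diff ** 2))

def dagger_mixture_action(expert_action, novice_action, beta, rng):
    """Bernoulli gate: the expert acts with probability beta, else the novice.

    Returns (executed_action, expert_acted).
    """
    expert_acted = bool(rng.random() < beta)
    return (expert_action if expert_acted else novice_action), expert_acted

rng = np.random.default_rng(0)

# Toy check of the objective: two 2-D actions, one of them off by 1 in each dim.
loss = bc_loss([[0.0, 0.0], [1.0, 1.0]], [[0.0, 0.0], [0.0, 0.0]])

# Empirical fraction of expert-controlled steps for beta = 0.5.
flags = [dagger_mixture_action(1.0, -1.0, 0.5, rng)[1] for _ in range(10_000)]
expert_fraction = float(np.mean(flags))
```

Because control switches at random from step to step, the human never holds uninterrupted authority over the system, which is the property HG-DAgger changes.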

However, traditional DAgger exposes two limitations in human-in-the-loop scenarios: (1) the expert must provide corrective actions in real time without full system control, which degrades safety and label quality; (2) actuator lag exacerbates issues in dynamic settings.

2. Algorithmic Structure of HG-DAgger

HG-DAgger modifies DAgger to ensure that the human expert can gate control at will, fully intervening in real time when the novice policy enters perceived unsafe regions. The components are:

  • Gating Rule: The human defines on the fly a permitted set of states $\mathcal P \subseteq \mathcal X$; the gating function is

$$g(x_t) = \begin{cases} 1, & x_t \notin \mathcal P \\ 0, & x_t \in \mathcal P \end{cases}$$

where $g(x_t) = 1$ denotes expert control and $g(x_t) = 0$ denotes novice control.

  • Policy Rollout: The joint control policy during data gathering is

$$\pi(x_t) = g(x_t)\,\pi_H(x_t) + \bigl(1 - g(x_t)\bigr)\,\pi_N(o_t)$$

where $o_t$ is the observation available to the novice.

  • Uncertainty (‘Doubt’) Metric: The novice $\pi_N$ is an ensemble of $E$ neural networks, each producing an action $a_t^{(e)}$ for input $o_t$. The empirical covariance of the ensemble outputs is

$$C_t = \operatorname{Cov}\bigl(\{a_t^{(e)}\}_{e=1}^{E}\bigr)$$

The scalar doubt is defined as

$$d_N(o_t) = \bigl\|\operatorname{diag}(C_t)\bigr\|_2$$

where higher $d_N(o_t)$ indicates regions of epistemic uncertainty, that is, states with a risk of poor novice performance.

  • Learning the Safety Threshold $\tau$: Doubt values at human interventions are accumulated in a log $\mathcal I$. After all data collection epochs, the safety threshold is set to the average of the top quartile of intervention-time doubts:

$$\tau = \operatorname{mean}\bigl\{\, d \in \mathcal I : d \ge Q_{0.75}(\mathcal I) \,\bigr\}$$

where $Q_{0.75}(\mathcal I)$ denotes the 75th percentile of the logged doubts.

This threshold acts as a data-driven risk certificate for the trained novice.
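The components above can be sketched in a few lines of NumPy. This is an illustration under stated assumptions rather than the authors' implementation: the doubt is taken as the l2 norm of the diagonal of the ensemble covariance, `np.cov`'s default (E − 1) normalization is used, the quartile is rounded up to at least one entry, and all function names are hypothetical.

```python
import numpy as np

def gated_action(human_in_control, expert_action, novice_action):
    """Human-gated rollout policy: g = 1 hands full control to the expert."""
    return expert_action if human_in_control else novice_action

def doubt(ensemble_actions):
    """Scalar doubt from an (E, action_dim) array of ensemble outputs.

    Assumption: l2 norm of the per-dimension variances (covariance diagonal).
    Needs E >= 2, since the (E - 1)-normalized covariance is undefined for E = 1.
    """
    a = np.asarray(ensemble_actions, dtype=float)
    cov = np.atleast_2d(np.cov(a, rowvar=False))  # columns are action dims
    return float(np.linalg.norm(np.diag(cov)))

def top_quartile_threshold(intervention_doubts):
    """Mean of the largest 25% of logged intervention-time doubts."""
    d = np.sort(np.asarray(intervention_doubts, dtype=float))
    k = max(1, int(np.ceil(len(d) / 4)))          # size of the top quartile
    return float(d[-k:].mean())

# Two members that disagree only on the first action dimension.
example_doubt = doubt([[0.0, 0.0], [2.0, 0.0]])
# Eight logged doubts: the top quartile is {7, 8}.
example_tau = top_quartile_threshold([1, 2, 3, 4, 5, 6, 7, 8])
```

Using only the covariance diagonal ignores correlations between action dimensions, which keeps the metric cheap to evaluate at every timestep.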

3. Training Loop and Data Aggregation

HG-DAgger employs an iterative learning procedure integrating supervised updates, human gating, uncertainty logging, and risk threshold inference. The core loop is:

  1. Start from an initial behavioral cloning dataset $\mathcal D_{\rm BC}$, and initialize the novice policy $\pi_N$.
  2. For each epoch $i$ over $K$ epochs and $M$ rollouts:
    • At each timestep $t$:
      • The gating function $g(x_t)$ determines whether the expert or novice acts.
      • When the expert intervenes ($g(x_t) = 1$): execute the expert action, augment the dataset, and log the current doubt in $\mathcal I$.
      • Otherwise, apply the novice action.
    • Retrain the novice on the enlarged dataset.
  3. After training, compute $\tau$ from $\mathcal I$.

This procedure preserves uninterrupted human control during interventions, capturing high-integrity expert labels and enabling the system to learn both safe state distributions and actionable risk certificates.
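The loop can be illustrated end to end on a toy problem. Everything specific here is an assumption made for the sketch: a scalar state, a simulated "human" who takes over when |x| > 1, an expert action of −tanh(x), a bootstrap ensemble of linear regressors in place of neural networks, and doubt logged at every expert-controlled step (logging only at the takeover instant would be an alternative reading).

```python
import numpy as np

rng = np.random.default_rng(0)

def expert(x):
    # Hypothetical expert: pushes the scalar state back toward 0.
    return -np.tanh(x)

def human_wants_control(x):
    # Stand-in for the human gate g(x_t): take over once |x| leaves [-1, 1].
    return abs(x) > 1.0

def fit_ensemble(X, A, n_members=5):
    """Fit linear members a = w*x + b, each on a bootstrap resample."""
    X, A = np.asarray(X, dtype=float), np.asarray(A, dtype=float)
    members = []
    for _ in range(n_members):
        idx = rng.integers(0, len(X), size=len(X))
        feats = np.column_stack([X[idx], np.ones(len(idx))])
        coef, *_ = np.linalg.lstsq(feats, A[idx], rcond=None)
        members.append(coef)
    return np.array(members)  # shape (n_members, 2)

def hg_dagger(K=3, M=5, T=30):
    X = list(rng.uniform(-0.2, 0.2, size=20))    # initial BC states
    A = [expert(x) for x in X]                   # initial expert labels
    members = fit_ensemble(X, A)
    doubts = []                                  # intervention-time doubt log
    for _ in range(K):                           # epochs
        for _ in range(M):                       # rollouts
            x = rng.uniform(-0.5, 0.5)
            for _ in range(T):                   # timesteps
                acts = members[:, 0] * x + members[:, 1]
                if human_wants_control(x):       # g(x_t) = 1: expert acts
                    a = expert(x)
                    X.append(x)
                    A.append(a)
                    # Scalar action, so the doubt reduces to the ensemble variance.
                    doubts.append(float(np.var(acts, ddof=1)))
                else:                            # g(x_t) = 0: novice acts
                    a = float(acts.mean())
                x = x + a + rng.normal(0.0, 0.6)
        members = fit_ensemble(X, A)             # retrain on the enlarged dataset
    if doubts:
        d = np.sort(doubts)
        k = max(1, int(np.ceil(len(d) / 4)))
        tau = float(d[-k:].mean())               # mean of the top quartile
    else:
        tau = float("nan")                       # no interventions were logged
    return members, tau, len(X)

members, tau, n_labels = hg_dagger()
print(f"labels: {n_labels}, tau: {tau:.4g}")
```

Note that new labels are collected only while the expert is in control, so the aggregated dataset concentrates on the recovery behavior the novice lacks.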

4. Novice Policy and Uncertainty Estimation Architecture

The novice $\pi_N$ is instantiated as an ensemble of feed-forward neural networks. Each subnetwork processes the concatenated observation vector $o_t$, whose components represent lateral/heading/yaw quantities, lane distances, and obstacle proximities. The architecture comprises two hidden layers (128–256 units, ReLU activation), with a continuous action output of (steering, speed). The ensemble mean prescribes the novice action; the ensemble covariance estimates epistemic uncertainty, serving as a scalable stand-in for Gaussian-process-style risk estimation.
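A forward pass through such an ensemble might look like the following NumPy sketch. The observation dimension (7), hidden width (128), ensemble size (10), and the He-style random initialization are assumptions chosen for illustration; the networks here are untrained, so only the shapes and the mean/covariance computation are meaningful.

```python
import numpy as np

def init_member(rng, obs_dim, hidden, act_dim=2):
    """Random weights for one MLP: obs -> hidden -> hidden -> (steering, speed)."""
    sizes = [obs_dim, hidden, hidden, act_dim]
    return [(rng.normal(0.0, np.sqrt(2.0 / m), size=(m, n)), np.zeros(n))
            for m, n in zip(sizes[:-1], sizes[1:])]

def member_forward(params, obs):
    h = obs
    for i, (W, b) in enumerate(params):
        h = h @ W + b
        if i < len(params) - 1:
            h = np.maximum(h, 0.0)   # ReLU on the two hidden layers only
    return h

def ensemble_predict(ensemble, obs):
    """Return the ensemble-mean action and the covariance across members."""
    acts = np.stack([member_forward(p, obs) for p in ensemble])  # (E, 2)
    return acts.mean(axis=0), np.cov(acts, rowvar=False)

rng = np.random.default_rng(0)
ensemble = [init_member(rng, obs_dim=7, hidden=128) for _ in range(10)]
obs = rng.normal(size=7)             # placeholder observation vector
mean_action, cov = ensemble_predict(ensemble, obs)
```

In practice the members would be trained from different initializations or data resamples so that their disagreement reflects uncertainty rather than random weights.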

5. Experimental Setup and Evaluation Metrics

HG-DAgger was evaluated on both simulated and real-world autonomous driving tasks:

  • Simulation: Two-lane road with static obstacle cars; novice navigates randomly lane-blocked sequences.
  • Real vehicle: MG-GS car equipped with LiDAR, precision localization, onboard safety driver, and an off-board human expert for HG-DAgger interventions.

Baselines included:

  • Behavioral Cloning (BC) trained on expert demonstration labels.
  • Standard DAgger with an initial expert-gating probability $\beta_0$, decayed each epoch.

Metrics:

  • Collision rate (collisions per meter traveled)
  • Road departure rate and mean duration
  • Steering-angle distribution relative to the human reference (Bhattacharyya distance)
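These metrics are simple to compute from logged rollouts. The sketch below assumes the Bhattacharyya distance is taken between normalized histograms of steering angles, $D_B = -\ln \sum_i \sqrt{p_i q_i}$; the bin count, the value range, and the function names are illustrative choices, and both sample sets are assumed to have at least one value inside the range.

```python
import numpy as np

def events_per_meter(n_events, distance_m):
    """Rate metric used for collisions and road departures."""
    return n_events / distance_m

def bhattacharyya_distance(samples_p, samples_q, bins=30, value_range=(-1.0, 1.0)):
    """Bhattacharyya distance between two histogram-estimated distributions."""
    p, _ = np.histogram(samples_p, bins=bins, range=value_range)
    q, _ = np.histogram(samples_q, bins=bins, range=value_range)
    p = p / p.sum()
    q = q / q.sum()
    bc = float(np.sum(np.sqrt(p * q)))   # Bhattacharyya coefficient in [0, 1]
    if bc == 0.0:
        return float("inf")              # no overlapping bins
    return float(-np.log(bc))

rng = np.random.default_rng(0)
human = rng.normal(0.0, 0.2, size=5_000)     # placeholder steering samples
novice = rng.normal(0.05, 0.25, size=5_000)  # placeholder steering samples
d_same = bhattacharyya_distance(human, human)
d_diff = bhattacharyya_distance(human, novice)
```

A distance of zero means the binned distributions coincide; larger values mean the novice's steering behavior departs further from the human's.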

6. Empirical Results and Analysis

In simulation, after a fixed total number of sampled states (BC + DAgger/HG-DAgger), the following rates were observed:

Method      Road Departure (m⁻¹)   Collision (m⁻¹)   Bhattacharyya Distance
BC          …                      …                 0.1173
DAgger      …                      …                 0.1057
HG-DAgger   …                      …                 0.0834

DAgger suffered from late-epoch instability, plausibly due to degraded label quality under stochastic action gating, whereas HG-DAgger achieved more stable learning. Risk-threshold validation, which partitions states into an estimated safe set $\hat{\mathcal P}$ (doubt at or below $\tau$) and its unsafe complement, indicated that $\tau$ separates high- and low-risk outcomes: collision and road-departure rates per meter were lower inside $\hat{\mathcal P}$ and rose outside it.
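A validation of this kind can be sketched as follows, assuming per-timestep logs of the doubt, the distance traveled, and an event flag (collision or road departure). The safe set is taken to be the states whose doubt is at or below the threshold; the function name and the toy log are hypothetical.

```python
import numpy as np

def rates_by_region(doubts, step_lengths_m, event_flags, tau):
    """Events per meter inside and outside the estimated safe set."""
    doubts = np.asarray(doubts, dtype=float)
    step_lengths_m = np.asarray(step_lengths_m, dtype=float)
    event_flags = np.asarray(event_flags, dtype=float)
    safe = doubts <= tau                      # estimated safe set

    def rate(mask):
        dist = step_lengths_m[mask].sum()
        # Undefined if no distance was traveled in this region.
        return float(event_flags[mask].sum() / dist) if dist > 0 else float("nan")

    return rate(safe), rate(~safe)

# Toy log: four timesteps of 1 m each, one event at a high-doubt step.
safe_rate, unsafe_rate = rates_by_region(
    doubts=[0.1, 0.2, 0.9, 0.8],
    step_lengths_m=[1.0, 1.0, 1.0, 1.0],
    event_flags=[0, 0, 1, 0],
    tau=0.5,
)
```

A large gap between the two rates is what would support using the threshold as a risk indicator at deployment time.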

On-vehicle, HG-DAgger achieved zero collisions and zero departures, outperforming both BC and DAgger.

7. Discussion, Limitations, and Prospective Directions

HG-DAgger presents a practical solution for real-world imitation learning with human experts. By preserving the expert’s ability to intervene at will, the algorithm supports higher-quality expert demonstrations and safer data collection. The risk threshold $\tau$, learned directly from intervention-time uncertainty, operates as an actionable certificate for downstream safety filtering.

Limitations include reliance on the expert’s real-time discrimination of unsafe states and the absence of formal regret or safety proofs. The ensemble-based uncertainty metric, while tractable, may underestimate epistemic uncertainty. Prospective directions include automating the gating process using the learned $\tau$ rule, employing richer uncertainty estimators such as Bayesian neural networks or MC-dropout, and extending the methodology to multi-modal or formally verifiable safety-critical domains (Kelly et al., 2018).
