
HG-DAgger: Human-Gated Imitation Learning

Updated 8 January 2026
  • HG-DAgger is an interactive imitation learning algorithm that integrates human interventions with uncertainty-aware risk estimation in autonomous driving.
  • It employs a real-time gating mechanism and an ensemble-based uncertainty metric to selectively balance expert and novice actions during training.
  • Empirical evaluations show HG-DAgger reduces collision and road departure rates, offering improved safety and stability compared to BC and standard DAgger.

HG-DAgger is an interactive imitation learning algorithm specifically designed to accommodate human experts in real-world systems. It addresses inherent deficiencies in classical behavioral cloning (BC) and Dataset Aggregation (DAgger), especially in the context of high-stakes domains like autonomous driving, where compounding errors, safety, and human label quality are central concerns. HG-DAgger introduces a principled, human-gated control interface and uncertainty-aware risk estimation, yielding superior safety, sample efficiency, and human-likeness in learned policies (Kelly et al., 2018).

1. Background: Imitation Learning and DAgger

Imitation learning seeks a novice policy $\pi_N$ that closely mimics an expert $\pi_H$, usually through supervised learning on expert demonstrations $\{(x, a_H = \pi_H(x))\}$:

$$\min_{\theta}\sum_{(x,a_H)\in\mathcal D_{\rm BC}} \|\pi_N^\theta(x) - a_H\|^2.$$

Behavioral cloning trains $\pi_N$ solely on states encountered by $\pi_H$, causing distributional shift: $\pi_N$ may encounter unfamiliar states at test time, leading to compounding errors. DAgger counters this by stochastically interleaving expert and novice actions during data collection, with a Bernoulli gate (probability $\beta$ for the expert), aggregating states from both policies and thereby reducing the mismatch.
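The stochastic gate of standard DAgger can be sketched as follows; the helper names and the geometric decay schedule are illustrative assumptions, not the paper's implementation:

```python
import random

def dagger_gate_action(x, expert_action, novice_action, beta):
    """Standard DAgger gate: with probability beta the expert's action is
    executed, otherwise the novice's; the expert label is recorded either way
    so the state can be aggregated into the training set."""
    label = expert_action(x)          # the expert is always queried for a label
    executed = label if random.random() < beta else novice_action(x)
    return executed, label

# Illustrative decay schedule across epochs: beta_i = beta_0 * d**i.
beta_0, decay = 0.85, 0.85
betas = [beta_0 * decay ** i for i in range(5)]
```

Note that even when the novice acts, the expert label is still stored; it is the *execution* of actions, not the labeling, that the Bernoulli gate randomizes.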

However, traditional DAgger exposes two limitations in human-in-the-loop scenarios: (1) the expert must provide corrective actions in real time without full system control, which degrades safety and label quality; (2) actuator lag exacerbates issues in dynamic settings.

2. Algorithmic Structure of HG-DAgger

HG-DAgger modifies DAgger to ensure that the human expert can gate control at will, fully intervening in real time when the novice policy enters perceived unsafe regions. The components are:

  • Gating Rule: The human defines, on the fly, a permitted set of states $\mathcal P \subseteq \mathcal X$; the gating function operates as

$$g(x_t) = \begin{cases} 1, & x_t \notin \mathcal P \\ 0, & x_t \in \mathcal P, \end{cases}$$

where $g(x_t) = 1$ denotes expert control and $g(x_t) = 0$ denotes novice control.

  • Policy Rollout: The joint control policy during data gathering is

$$\pi_i(x_t) = g(x_t)\,\pi_H(x_t) + [1 - g(x_t)]\,\pi_{N_i}(o_t),$$

where $o_t = \mathcal O(x_t)$ is the observation available to the novice.
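One timestep of this gated rollout can be sketched as below; all policy, gate, and observation callables are hypothetical stand-ins:

```python
def gated_step(x_t, in_permitted_set, expert_policy, novice_policy, observe):
    """One step of the gated rollout policy pi_i: the human gate is
    g(x) = 0 inside the permitted set P (the novice acts on o_t = O(x_t))
    and g(x) = 1 outside it (the expert takes full control of the state)."""
    g = 0 if in_permitted_set(x_t) else 1
    if g == 1:
        return expert_policy(x_t), g   # expert acts on the full state x_t
    return novice_policy(observe(x_t)), g
```

Unlike the Bernoulli gate of standard DAgger, this switch is deterministic given the human's current judgment of which states are permitted.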

  • Uncertainty (‘Doubt’) Metric: The novice $\pi_N$ is an ensemble of $M$ neural networks, each producing an action $a_t^{(k)}$ for input $o_t$. The empirical covariance is computed as

$$C_t = \frac{1}{M-1}\sum_{k=1}^{M} (a_t^{(k)} - \bar a_t)(a_t^{(k)} - \bar a_t)^\top, \quad \bar a_t = \frac{1}{M}\sum_{k} a_t^{(k)}.$$

The scalar doubt is defined as

$$d_N(o_t) = \|\mathrm{diag}(C_t)\|_2,$$

where higher $d_N$ indicates regions of epistemic uncertainty: states with a risk of poor novice performance.
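A NumPy sketch of the doubt computation, assuming the ensemble's actions for one observation are already collected into an array:

```python
import numpy as np

def doubt(actions):
    """d_N(o): `actions` is an (M, action_dim) array holding the M ensemble
    members' outputs for one observation. Returns the L2 norm of the diagonal
    of the (M-1)-normalized empirical covariance C_t, i.e. of the vector of
    per-dimension action variances."""
    actions = np.asarray(actions, dtype=float)
    C = np.atleast_2d(np.cov(actions, rowvar=False, ddof=1))
    return float(np.linalg.norm(np.diag(C)))
```

Perfect ensemble agreement yields zero doubt; disagreement between members raises it.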

  • Learning the Safety Threshold $\tau$: Doubt values at human interventions are accumulated in a log $\mathcal I$. After all data-collection epochs, the safety threshold is set to the mean of the top quartile of intervention-time doubts:

$$\tau = \frac{1}{N/4} \sum_{i = \lfloor 0.75N \rfloor}^{N} \mathcal I[i],$$

where $\mathcal I$ is sorted in ascending order and $N = |\mathcal I|$.

This threshold acts as a data-driven risk certificate for the trained novice.
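Assuming the intervention log is a flat sequence of doubt values, the threshold rule can be sketched as:

```python
import numpy as np

def safety_threshold(intervention_doubts):
    """tau: mean of the top quartile of the doubt values logged at the moments
    the human intervened (entries floor(0.75 N)..N of the sorted log)."""
    d = np.sort(np.asarray(intervention_doubts, dtype=float))
    top = d[int(np.floor(0.75 * len(d))):]   # highest ~25% of logged doubts
    return float(top.mean())
```

Using only the highest intervention-time doubts makes the threshold conservative: it is calibrated on the cases where the novice was most clearly out of its depth.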

3. Training Loop and Data Aggregation

HG-DAgger employs an iterative learning procedure integrating supervised updates, human gating, uncertainty logging, and risk threshold inference. The core loop is:

  1. Start from an initial behavioral cloning dataset $\mathcal D_{\rm BC}$ and initialize the policy $\pi_{N_1}$.
  2. For each epoch $i$ over $K$ epochs, with $M$ rollouts per epoch:
    • At each timestep $t$:
      • The gating function $g(x_t)$ determines whether the expert or the novice acts.
      • When the expert intervenes ($g(x_t) = 1$): execute the expert action, augment the dataset with the expert label, and log the current doubt in $\mathcal I$.
      • Otherwise, apply the novice action.
    • Retrain the novice on the enlarged dataset.
  3. After training, compute $\tau$ from $\mathcal I$.

This procedure preserves uninterrupted human control during interventions, capturing high-integrity expert labels and enabling the system to learn both safe state distributions and actionable risk certificates.
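The outer loop above can be condensed into a schematic; all callables here are illustrative stand-ins, not the paper's code:

```python
def hg_dagger(train, rollout, D_bc, epochs, rollouts_per_epoch):
    """Schematic HG-DAgger outer loop. `rollout` runs one human-gated episode
    and returns the expert-labeled samples gathered during interventions plus
    the doubt values logged at those moments; only expert labels are
    aggregated into the dataset."""
    D, I = list(D_bc), []        # aggregated dataset, intervention-doubt log
    policy = train(D)            # step 1: initialize via behavioral cloning
    for _ in range(epochs):      # step 2: K epochs of gated rollouts
        for _ in range(rollouts_per_epoch):
            labels, doubts = rollout(policy)
            D.extend(labels)
            I.extend(doubts)
        policy = train(D)        # retrain on the enlarged dataset
    return policy, I             # step 3: compute tau from I afterward
```

The key difference from standard DAgger is visible in `rollout`: only states visited under expert control contribute labels, so label quality is never degraded by partial novice control.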

4. Novice Policy and Uncertainty Estimation Architecture

The novice $\pi_N$ is instantiated as an ensemble of feed-forward neural networks. Each subnetwork processes the concatenated observation vector

$$o = [y, \theta, s, l_l, l_r, d_l, d_r],$$

where $y$, $\theta$, and $s$ capture the vehicle's lateral, heading, and yaw state, $l_l$ and $l_r$ the lane distances, and $d_l$ and $d_r$ the obstacle proximities. The architecture comprises two hidden layers (128–256 units, ReLU activation) with a continuous action output of (steering, speed). The ensemble mean prescribes the novice action; the ensemble covariance estimates the epistemic uncertainty, closely approximating a scalable Gaussian process for risk estimation.
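A minimal NumPy sketch of such an ensemble; the layer sizes, $M = 5$, and random initialization are illustrative, and no training is shown:

```python
import numpy as np

def init_mlp(rng, sizes=(7, 128, 128, 2)):
    """Random weights for one ensemble member: 7-dim observation in,
    two 128-unit ReLU hidden layers, 2-dim (steering, speed) action out."""
    return [(0.1 * rng.standard_normal((n_in, n_out)), np.zeros(n_out))
            for n_in, n_out in zip(sizes[:-1], sizes[1:])]

def mlp_forward(o, params):
    """Feed-forward pass with ReLU hidden layers and a linear action head."""
    h = o
    for W, b in params[:-1]:
        h = np.maximum(0.0, h @ W + b)
    W, b = params[-1]
    return h @ W + b

rng = np.random.default_rng(0)
ensemble = [init_mlp(rng) for _ in range(5)]            # M = 5 members
o = rng.standard_normal(7)                              # one observation
acts = np.stack([mlp_forward(o, p) for p in ensemble])  # (M, 2) member actions
novice_action = acts.mean(axis=0)                       # ensemble mean acts
```

The per-dimension spread of `acts` is exactly the quantity the doubt metric of Section 2 summarizes into a scalar.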

5. Experimental Setup and Evaluation Metrics

HG-DAgger was evaluated on both simulated and real-world autonomous driving tasks:

  • Simulation: A two-lane road with static obstacle cars; the novice navigates sequences of randomly placed lane blockages.
  • Real vehicle: MG-GS car equipped with LiDAR, precision localization, onboard safety driver, and an off-board human expert for HG-DAgger interventions.

Baselines included:

  • Behavioral Cloning (BC) with $N_0 = 10^4$ demonstration labels.
  • Standard DAgger with $\beta_0 = 0.85$, decayed by a factor of $0.85$ per epoch.

Metrics:

  • Collision rate (collisions per meter traveled)
  • Road departure rate and mean duration
  • Steering-angle distribution relative to the human reference (Bhattacharyya distance)

6. Empirical Results and Analysis

In simulation, after a total of $2 \times 10^4$ sampled states (BC + DAgger/HG-DAgger), the following rates were observed:

| Method | Road Departure (m⁻¹) | Collision (m⁻¹) | Bhattacharyya Distance |
| --- | --- | --- | --- |
| BC | $\approx 1.2\times10^{-3}$ | $\approx 0.9\times10^{-3}$ | 0.1173 |
| DAgger | $\approx 1.5\times10^{-3}$ | $\approx 1.2\times10^{-3}$ | 0.1057 |
| HG-DAgger | $\approx 0.5\times10^{-3}$ | $\approx 0.3\times10^{-3}$ | 0.0834 |

DAgger suffered from late-epoch instability, plausibly due to degraded label quality under stochastic action gating, whereas HG-DAgger achieved more stable learning. Risk-threshold validation, partitioning states into an estimated safe set $\hat{\mathcal P} = \{x : d_N(o) \leq \tau\}$ and its complement, demonstrated that $\tau$ robustly separates high- and low-risk outcomes: inside $\hat{\mathcal P}$, collision and road-departure rates were $0.607 \times 10^{-3}$ per meter; outside, they rose to $7.53 \times 10^{-3}$ and $12.09 \times 10^{-3}$ per meter, respectively.
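The safe-set partition used in this validation reduces to a one-line membership test; the `tau` value below is a placeholder, not the learned threshold:

```python
import numpy as np

def in_estimated_safe_set(doubts, tau):
    """P_hat membership: a state is estimated safe iff its doubt d_N(o) <= tau."""
    return np.asarray(doubts, dtype=float) <= tau

# Placeholder doubts and threshold for illustration only.
mask = in_estimated_safe_set([0.1, 0.4, 0.9], tau=0.5)
```

The same test is what makes $\tau$ actionable downstream, e.g. for deciding when a deployed novice should defer to a fallback controller.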

On-vehicle, HG-DAgger achieved zero collisions and zero departures, outperforming both BC and DAgger.

7. Discussion, Limitations, and Prospective Directions

HG-DAgger presents a practical solution for real-world imitation learning with human experts. By preserving the expert’s ability to intervene at will, the algorithm assures higher-quality expert demonstrations and safer data collection. The risk threshold τ\tau, learned directly from intervention-time uncertainty, operates as an actionable certificate for downstream safety filtering.

Limitations include reliance on the expert's real-time discrimination of unsafe states and the absence of formal regret or safety guarantees. The ensemble-based uncertainty metric, while tractable, may underestimate epistemic uncertainty. Prospective directions include automating the gating process using the learned $d_N(o) > \tau$ rule, employing richer uncertainty estimators such as Bayesian neural networks or MC-dropout, and extending the methodology to multi-modal or formally verifiable safety-critical domains (Kelly et al., 2018).

References

  • Kelly, M., Sidrane, C., Driggs-Campbell, K., & Kochenderfer, M. J. (2018). HG-DAgger: Interactive Imitation Learning with Human Experts.
