HG-DAgger: Human-Gated Imitation Learning
- HG-DAgger is an interactive imitation learning algorithm that integrates human interventions with uncertainty-aware risk estimation in autonomous driving.
- It employs a real-time gating mechanism and an ensemble-based uncertainty metric to selectively balance expert and novice actions during training.
- Empirical evaluations show HG-DAgger reduces collision and road departure rates, offering improved safety and stability compared to BC and standard DAgger.
HG-DAgger is an interactive imitation learning algorithm specifically designed to accommodate human experts in real-world systems. It addresses inherent deficiencies in classical behavioral cloning (BC) and Dataset Aggregation (DAgger), especially in the context of high-stakes domains like autonomous driving, where compounding errors, safety, and human label quality are central concerns. HG-DAgger introduces a principled, human-gated control interface and uncertainty-aware risk estimation, yielding superior safety, sample efficiency, and human-likeness in learned policies (Kelly et al., 2018).
1. Background: Imitation Learning and DAgger
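Imitation learning seeks a novice policy $\pi_N$ that closely mimics an expert $\pi_H$, usually through supervised learning on expert demonstrations $\mathcal{D}$:

$$\pi_N = \arg\min_{\pi} \sum_{(o, a) \in \mathcal{D}} \ell\big(\pi(o), a\big),$$

where $\ell$ is a supervised loss between the policy's output and the expert's action label $a$.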
Behavioral cloning trains solely on states encountered by $\pi_H$, causing distributional shift: $\pi_N$ may encounter unfamiliar states at test time, leading to compounding errors. DAgger counters this by stochastically interleaving expert and novice actions during data collection through a Bernoulli gate (the expert acts with probability $\beta$), aggregating states visited under both policies and thereby reducing the mismatch.
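The Bernoulli gate of standard DAgger can be illustrated with a minimal sketch. This is not the original DAgger implementation; the function name `dagger_step` and the toy scalar actions are hypothetical, chosen only to show that the executed action is mixed while the stored label always comes from the expert.

```python
import random

def dagger_step(expert_action, novice_action, beta, rng):
    """Return (executed_action, label) for one DAgger timestep.

    With probability beta the expert's action is executed, otherwise the
    novice's. The expert's action is always recorded as the training label.
    """
    use_expert = rng.random() < beta
    executed = expert_action if use_expert else novice_action
    return executed, expert_action

# Toy usage: the expert always outputs 1.0, the novice always outputs -1.0.
rng = random.Random(0)
steps = [dagger_step(1.0, -1.0, beta=0.5, rng=rng) for _ in range(1000)]
expert_fraction = sum(1 for executed, _ in steps if executed == 1.0) / len(steps)
```

Because the executed action switches at random between the two policies, the expert labels states without being in continuous control of the system.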
However, traditional DAgger exposes two limitations in human-in-the-loop scenarios: (1) the expert must provide corrective actions in real time without full system control, which degrades safety and label quality; (2) actuator lag exacerbates issues in dynamic settings.
2. Algorithmic Structure of HG-DAgger
HG-DAgger modifies DAgger to ensure that the human expert can gate control at will, fully intervening in real time when the novice policy enters perceived unsafe regions. The components are:
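- Gating Rule: The human defines, on the fly, a permitted set of states $\mathcal{P}$; the gating function operates as

$$g(x_t) = \mathbb{1}\left[x_t \notin \mathcal{P}\right],$$

where $g(x_t) = 1$ denotes expert control and $g(x_t) = 0$ denotes novice control.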
- Policy Rollout: The joint control policy during data gathering is

$$\pi_i(x_t) = g(x_t)\,\pi_H(x_t) + \big(1 - g(x_t)\big)\,\pi_{N_i}(o_t),$$

where $o_t$ is the observation available to the novice.
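- Uncertainty (‘Doubt’) Metric: The novice $\pi_N$ is an ensemble of $n$ neural networks, each producing action $a_k$ for input $o_t$. Empirical covariance is computed as

$$C_t = \frac{1}{n} \sum_{k=1}^{n} (a_k - \bar{a})(a_k - \bar{a})^\top, \qquad \bar{a} = \frac{1}{n} \sum_{k=1}^{n} a_k.$$

The scalar doubt is defined as

$$d_N(o_t) = \left\lVert \operatorname{diag}(C_t) \right\rVert_2,$$

where higher $d_N(o_t)$ indicates regions of epistemic uncertainty, that is, states with a risk of poor novice performance.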
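- Learning the Safety Threshold $\tau$: Doubt values at human interventions are accumulated in a log $\mathcal{I}$. After all data collection epochs, the safety threshold is set to the average of the top quartile of intervention-time doubts:

$$\tau = \frac{1}{|\mathcal{I}_Q|} \sum_{d \in \mathcal{I}_Q} d,$$

where $\mathcal{I}_Q \subseteq \mathcal{I}$ denotes the top quartile of logged doubts.
This threshold acts as a data-driven risk certificate for the trained novice.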
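The doubt metric and the threshold rule described above can be sketched in a few lines of plain Python. This is an illustrative sketch, not the authors' code: the function names are hypothetical, the variance uses a 1/n normalization (an assumption), and "top quartile" is read here as the largest 25% of logged values (also an assumption).

```python
import math

def doubt(ensemble_actions):
    """L2 norm of the per-dimension variances of the ensemble outputs.

    ensemble_actions: list of action vectors, one per ensemble member.
    Only the diagonal of the covariance matrix is needed, so only the
    per-dimension variances are computed (1/n normalization, an assumption).
    """
    n = len(ensemble_actions)
    dim = len(ensemble_actions[0])
    means = [sum(a[j] for a in ensemble_actions) / n for j in range(dim)]
    variances = [
        sum((a[j] - means[j]) ** 2 for a in ensemble_actions) / n
        for j in range(dim)
    ]
    return math.sqrt(sum(v * v for v in variances))

def safety_threshold(doubt_log):
    """Mean of the largest 25% of logged intervention-time doubts."""
    ranked = sorted(doubt_log)
    k = max(1, len(ranked) // 4)
    top = ranked[-k:]
    return sum(top) / len(top)

# Toy (steering, speed) outputs from a three-member ensemble.
agree = doubt([[0.1, 5.0], [0.1, 5.0], [0.1, 5.0]])
disagree = doubt([[-0.3, 4.0], [0.1, 5.0], [0.5, 6.0]])
tau = safety_threshold([0.1, 0.2, 0.3, 0.4, 0.5, 0.6, 0.7, 0.8])
```

When the members agree the doubt is essentially zero, and it grows as their outputs spread apart; in the toy log, the threshold is the mean of the two largest entries.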
3. Training Loop and Data Aggregation
HG-DAgger employs an iterative learning procedure integrating supervised updates, human gating, uncertainty logging, and risk threshold inference. The core loop is:
- Start from an initial behavioral cloning dataset $\mathcal{D}_{BC}$, and initialize policy $\pi_{N_1}$.
- For each epoch $i$ over $K$ epochs and $M$ rollouts:
  - At each timestep $t$:
    - The gating function $g$ determines whether the expert or novice acts.
    - When the expert intervenes ($g(x_t) = 1$): execute the expert action, augment the dataset, and log the current doubt in $\mathcal{I}$.
    - Otherwise, apply the novice action.
  - Retrain the novice on the enlarged dataset.
- After training, compute $\tau$ from $\mathcal{I}$.
This procedure preserves uninterrupted human control during interventions, capturing high-integrity expert labels and enabling the system to learn both safe state distributions and actionable risk certificates.
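The loop can be made concrete with a toy one-dimensional sketch. Everything here is an assumption made for illustration: the scalar "lane offset" dynamics, the scripted `expert` and `permitted` functions standing in for the human, the linear ensemble fitted on bootstrap resamples, and logging doubt on every expert-controlled step. It is not the authors' implementation.

```python
import math
import random

def expert(x):
    """Toy expert: steer back toward x = 0 (nonlinear, so members disagree)."""
    return -math.tanh(2.0 * x)

def permitted(x):
    """Stand-in for the human's judgment of the permitted set."""
    return abs(x) <= 1.0

def train_ensemble(data, n_members, rng):
    """Fit a = w * x by least squares on one bootstrap resample per member."""
    weights = []
    for _ in range(n_members):
        sample = [rng.choice(data) for _ in data]
        den = sum(x * x for x, _ in sample)
        weights.append(sum(x * a for x, a in sample) / den if den > 0 else 0.0)
    return weights

def act_and_doubt(weights, x):
    """Ensemble-mean action and its variance (the doubt for a scalar action)."""
    actions = [w * x for w in weights]
    mean = sum(actions) / len(actions)
    return mean, sum((a - mean) ** 2 for a in actions) / len(actions)

def hg_dagger(epochs=3, rollouts=5, horizon=30, n_members=5, seed=0):
    rng = random.Random(seed)
    data = [(x, expert(x)) for x in (-1.0, -0.5, 0.25, 0.5, 1.0)]  # initial BC data
    doubt_log = []
    weights = train_ensemble(data, n_members, rng)
    for _ in range(epochs):
        for _ in range(rollouts):
            x = rng.choice([-1.5, 1.5])          # start outside the permitted set
            for _ in range(horizon):
                action, d = act_and_doubt(weights, x)
                if not permitted(x):             # gate = 1: expert has control
                    action = expert(x)
                    data.append((x, action))     # only expert labels are stored
                    doubt_log.append(d)          # doubt at intervention time
                x += 0.5 * action + rng.gauss(0.0, 0.2)
        weights = train_ensemble(data, n_members, rng)  # retrain on aggregate
    ranked = sorted(doubt_log)
    top = ranked[-max(1, len(ranked) // 4):]
    return weights, sum(top) / len(top), len(data)

weights, tau, n_labels = hg_dagger()
```

Note that the dataset grows only while the expert is in control, so every stored label comes from an uninterrupted stretch of expert driving rather than from isolated corrective actions.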
4. Novice Policy and Uncertainty Estimation Architecture
The novice $\pi_N$ is instantiated as an ensemble of feed-forward neural networks. Each subnetwork processes the concatenated observation vector $o_t$, whose components represent lateral/heading/yaw, lane distances, and obstacle proximities. The architecture comprises two hidden layers (128–256 units, ReLU activation), with a continuous action output of (steering, speed). The ensemble mean prescribes the novice action; the ensemble covariance estimates epistemic uncertainty, serving as a scalable approximation to Gaussian-process uncertainty for risk estimation.
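A forward pass through such an ensemble might look like the sketch below. The observation size `OBS_DIM`, the ensemble size of five, the hidden width of 128, and the random initialization are all assumptions made for illustration; the source does not specify them beyond the ranges given above, and the networks here are untrained.

```python
import random

def init_mlp(sizes, rng):
    """One (weights, biases) pair per layer, with small random weights."""
    return [
        ([[rng.gauss(0.0, 1.0 / n_in ** 0.5) for _ in range(n_in)]
          for _ in range(n_out)],
         [0.0] * n_out)
        for n_in, n_out in zip(sizes[:-1], sizes[1:])
    ]

def forward(layers, obs):
    """Feed-forward pass with ReLU on the hidden layers only."""
    h = obs
    for i, (W, b) in enumerate(layers):
        h = [sum(w * x for w, x in zip(row, h)) + bi for row, bi in zip(W, b)]
        if i < len(layers) - 1:
            h = [max(0.0, v) for v in h]
    return h

def ensemble_mean(members, obs):
    """Average the (steering, speed) outputs of all ensemble members."""
    outputs = [forward(m, obs) for m in members]
    return [sum(col) / len(outputs) for col in zip(*outputs)]

rng = random.Random(0)
OBS_DIM = 7  # hypothetical observation size; the source does not state it
members = [init_mlp([OBS_DIM, 128, 128, 2], rng) for _ in range(5)]
steering, speed = ensemble_mean(members, [0.0] * OBS_DIM)
```

Each member shares the same architecture but starts from different random weights, which is what lets the spread of their outputs serve as an uncertainty signal.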
5. Experimental Setup and Evaluation Metrics
HG-DAgger was evaluated on both simulated and real-world autonomous driving tasks:
- Simulation: Two-lane road with static obstacle cars; the novice must navigate sequences of randomly placed lane blockages.
- Real vehicle: MG-GS car equipped with LiDAR, precision localization, onboard safety driver, and an off-board human expert for HG-DAgger interventions.
Baselines included:
- Behavioral Cloning (BC) with 7 demonstration labels.
- Standard DAgger with 8, decayed by 9 per epoch.
Metrics:
- Collision rate (collisions per meter traveled)
- Road departure rate and mean duration
- Steering-angle distribution relative to the human reference (Bhattacharyya distance)
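For the last metric, the Bhattacharyya distance between two discrete distributions $p$ and $q$ is $-\ln \sum_i \sqrt{p_i q_i}$. The sketch below computes it over histograms of steering angles; the binning scheme, the function names, and the sample values are assumptions, since the source does not say how the distributions were estimated.

```python
import math

def normalized_histogram(samples, lo, hi, bins):
    """Fraction of samples per equal-width bin on [lo, hi] (out-of-range clipped)."""
    counts = [0] * bins
    width = (hi - lo) / bins
    for s in samples:
        idx = min(bins - 1, max(0, int((s - lo) / width)))
        counts[idx] += 1
    return [c / len(samples) for c in counts]

def bhattacharyya_distance(p, q):
    """-ln(sum_i sqrt(p_i * q_i)) for two discrete distributions."""
    coefficient = sum(math.sqrt(pi * qi) for pi, qi in zip(p, q))
    return -math.log(coefficient) if coefficient > 0 else math.inf

# Identical distributions give a distance of (numerically) zero.
human = normalized_histogram([-0.2, -0.1, 0.0, 0.0, 0.1, 0.2], -0.5, 0.5, 5)
same = bhattacharyya_distance(human, human)
# Half-overlapping distributions give ln 2.
shifted = bhattacharyya_distance([0.5, 0.5, 0.0], [0.0, 0.5, 0.5])
```

A smaller distance means the policy's steering behavior is closer to the human reference.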
6. Empirical Results and Analysis
In simulation, after a total of 0 sampled states (BC + DAgger/HG-DAgger), the following rates were observed:
| Method | Road Departure (m⁻¹) | Collision (m⁻¹) | Bhattacharyya Distance |
|---|---|---|---|
| BC | 1 | 2 | 0.1173 |
| DAgger | 3 | 4 | 0.1057 |
| HG-DAgger | 5 | 6 | 0.0834 |
DAgger suffered from late-epoch instability, plausibly due to degraded label quality under stochastic action gating, whereas HG-DAgger achieved more stable learning. Risk-threshold validation, which partitions states into estimated safe (doubt at or below $\tau$) and unsafe regions, demonstrated that $\tau$ robustly separates high- and low-risk outcomes: inside the estimated safe region, collision/road-departure rates were 0; outside, rates rose to 1 and 2 per meter, respectively.
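The risk-threshold validation can be sketched as a simple partition of logged driving segments by doubt. The segment format, the function name `rates_by_region`, and the numbers below are made-up toy values for illustration only; they are not results from the paper.

```python
def rates_by_region(segments, tau):
    """Per-meter collision and departure rates inside and outside the safe set.

    segments: list of (doubt, meters, collisions, departures) tuples.
    A segment counts as "safe" when its doubt is at or below tau.
    """
    totals = {"safe": [0.0, 0, 0], "unsafe": [0.0, 0, 0]}
    for d, meters, collisions, departures in segments:
        key = "safe" if d <= tau else "unsafe"
        totals[key][0] += meters
        totals[key][1] += collisions
        totals[key][2] += departures
    return {
        key: (c / m, dep / m) if m > 0 else (None, None)
        for key, (m, c, dep) in totals.items()
    }

# Toy data only: two low-doubt segments without events, one high-doubt segment.
toy_segments = [(0.1, 100.0, 0, 0), (0.2, 50.0, 0, 0), (0.9, 20.0, 1, 2)]
rates = rates_by_region(toy_segments, tau=0.5)
```

A useful threshold is one for which the rates in the "safe" partition are markedly lower than those in the "unsafe" partition.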
On-vehicle, HG-DAgger achieved zero collisions and zero departures, outperforming both BC and DAgger.
7. Discussion, Limitations, and Prospective Directions
HG-DAgger presents a practical solution for real-world imitation learning with human experts. By preserving the expert’s ability to intervene at will, the algorithm ensures higher-quality expert demonstrations and safer data collection. The risk threshold $\tau$, learned directly from intervention-time uncertainty, operates as an actionable certificate for downstream safety filtering.
Limitations include reliance on the expert’s real-time discrimination of unsafe states and the absence of formal regret or safety proofs. The ensemble-based uncertainty metric, while tractable, may underestimate epistemic uncertainty. Prospective directions include automating the gating process using the learned threshold $\tau$, employing richer uncertainty estimators such as Bayesian neural networks or MC-dropout, and extending the methodology to multi-modal or formally verifiable safety-critical domains (Kelly et al., 2018).