HG-DAgger: Human-Gated Imitation Learning
- HG-DAgger is an interactive imitation learning algorithm that integrates human interventions with uncertainty-aware risk estimation in autonomous driving.
- It employs a human-gated, real-time control interface that lets the expert take over at will during training, together with an ensemble-based uncertainty ('doubt') metric used to learn a data-driven risk threshold.
- Empirical evaluations show HG-DAgger reduces collision and road departure rates, offering improved safety and stability compared to BC and standard DAgger.
HG-DAgger is an interactive imitation learning algorithm specifically designed to accommodate human experts in real-world systems. It addresses inherent deficiencies in classical behavioral cloning (BC) and Dataset Aggregation (DAgger), especially in the context of high-stakes domains like autonomous driving, where compounding errors, safety, and human label quality are central concerns. HG-DAgger introduces a principled, human-gated control interface and uncertainty-aware risk estimation, yielding superior safety, sample efficiency, and human-likeness in learned policies (Kelly et al., 2018).
1. Background: Imitation Learning and DAgger
Imitation learning seeks a novice policy $\pi_N$ that closely mimics an expert policy $\pi_H$, usually through supervised learning on a dataset of expert demonstrations $\mathcal{D} = \{(s_i, a_i^*)\}$:

$$\pi_N = \arg\min_{\pi}\; \mathbb{E}_{(s,\, a^*) \sim \mathcal{D}}\big[\mathcal{L}\big(\pi(s),\, a^*\big)\big]$$
Behavioral cloning trains $\pi_N$ solely on states encountered by $\pi_H$, causing distributional shift: $\pi_N$ may experience unfamiliar states at test time, leading to compounding errors. DAgger counters this by stochastically interleaving expert and novice actions during data collection with a Bernoulli gate (probability $\beta_i$ of expert control at epoch $i$), rolling out the mixture
$$\pi_i = \beta_i\, \pi_H + (1 - \beta_i)\, \pi_{N_i},$$
aggregating states from both policies and thereby reducing the mismatch.
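As a concrete illustration, the following minimal Python sketch shows the Bernoulli gate at the heart of a DAgger rollout. The environment interface (`env.reset`, `env.step`) and the `expert`/`novice` callables are hypothetical placeholders, not from the original paper:

```python
import numpy as np

def dagger_rollout(env, expert, novice, beta, horizon=500, seed=0):
    """Collect one DAgger rollout: a Bernoulli(beta) gate decides who acts,
    while the expert's action is always recorded as the training label."""
    rng = np.random.default_rng(seed)
    dataset = []
    obs = env.reset()
    for _ in range(horizon):
        expert_action = expert(obs)            # query expert for the label
        if rng.random() < beta:                # Bernoulli gate: expert drives
            action = expert_action
        else:                                  # novice drives
            action = novice(obs)
        dataset.append((obs, expert_action))   # aggregate (state, expert label)
        obs, done = env.step(action)           # assumed (obs, done) interface
        if done:
            break
    return dataset
```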
However, traditional DAgger exposes two limitations in human-in-the-loop scenarios: (1) the expert must provide corrective actions in real time without full system control, which degrades safety and label quality; (2) actuator lag exacerbates issues in dynamic settings.
2. Algorithmic Structure of HG-DAgger
HG-DAgger modifies DAgger to ensure that the human expert can gate control at will, fully intervening in real time when the novice policy enters perceived unsafe regions. The components are:
- Gating Rule: The human defines on the fly a permitted region of the state space $\mathcal{S}_{\text{perm}}$; the gating function operates as
$$g(s) = \begin{cases} 1, & s \notin \mathcal{S}_{\text{perm}} \\ 0, & s \in \mathcal{S}_{\text{perm}} \end{cases}$$
where $g(s) = 1$ denotes expert control and $g(s) = 0$ denotes novice control.
- Policy Rollout: The joint control policy during data gathering is
$$\pi_{\text{mix}} = g(s_t)\,\pi_H(s_t) + \big(1 - g(s_t)\big)\,\pi_N(o_t),$$
where $o_t$ is the observation available to the novice.
- Uncertainty (‘Doubt’) Metric: The novice is an ensemble of $M$ neural networks, the $j$-th producing action $a_j = \pi_{N_j}(o)$ for input $o$. The empirical covariance is computed as
$$\hat{\Sigma}(o) = \frac{1}{M-1} \sum_{j=1}^{M} \big(a_j - \bar{a}\big)\big(a_j - \bar{a}\big)^{\top}, \qquad \bar{a} = \frac{1}{M} \sum_{j=1}^{M} a_j.$$
The scalar doubt is a scalarization of this covariance (written here as the trace),
$$d(o) = \operatorname{tr}\big(\hat{\Sigma}(o)\big),$$
where higher $d(o)$ indicates regions of epistemic uncertainty: states with a risk of poor novice performance. (A code sketch of this computation follows the list.)
- Learning the Safety Threshold $\tau$: Doubt values at human interventions are accumulated in a log $W$. After all data-collection epochs, the safety threshold is set to the average of the top quartile of intervention-time doubts:
$$\tau = \operatorname{mean}\big\{\, d \in W \;:\; d \ge Q_{0.75}(W) \,\big\},$$
where $Q_{0.75}(W)$ is the 75th percentile of $W$. This threshold acts as a data-driven risk certificate for the trained novice.
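Both quantities are straightforward to compute. Below is a minimal NumPy sketch under the assumptions above; the trace scalarization and the function names are illustrative choices, not the paper's verbatim implementation:

```python
import numpy as np

def doubt_from_actions(ensemble_actions):
    """ensemble_actions: (M, action_dim) array, one row per ensemble member.
    Returns the scalar doubt as the trace of the empirical covariance."""
    cov = np.atleast_2d(np.cov(ensemble_actions, rowvar=False))
    return float(np.trace(cov))

def safety_threshold(intervention_doubts):
    """tau = mean of the top quartile of doubts logged at human interventions."""
    d = np.asarray(intervention_doubts, dtype=float)
    q75 = np.quantile(d, 0.75)
    return float(d[d >= q75].mean())
```

At deployment time, a state would then be flagged as risky whenever its doubt exceeds $\tau$.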
3. Training Loop and Data Aggregation
HG-DAgger employs an iterative learning procedure integrating supervised updates, human gating, uncertainty logging, and risk threshold inference. The core loop is:
- Start from an initial behavioral-cloning dataset $\mathcal{D}_0$, and initialize the policy $\pi_{N_1}$ by supervised training on it.
- For each data-collection epoch:
  - For each rollout in the epoch, at each timestep $t$:
    - The gating function $g(s_t)$ determines whether the expert or the novice acts.
    - When the expert intervenes ($g(s_t) = 1$): execute the expert action, augment the dataset with $(s_t, \pi_H(s_t))$, and log the current doubt $d(s_t)$ in $W$.
    - Otherwise, apply the novice action.
  - Retrain the novice on the enlarged dataset.
- After training, compute $\tau$ from $W$.
This procedure preserves uninterrupted human control during interventions, capturing high-integrity expert labels and enabling the system to learn both safe state distributions and actionable risk certificates.
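Putting the pieces together, here is a compact Python sketch of the loop. All interfaces (`expert.engaged`, `env.step`, `train`, `doubt`) are hypothetical stand-ins: `expert.engaged(obs)` represents the human's real-time gating decision, and `doubt(novice, obs)` evaluates the ensemble and applies the doubt metric sketched earlier:

```python
def hg_dagger(env, expert, train, doubt, bc_dataset,
              epochs=5, rollouts=10, horizon=500):
    """HG-DAgger training loop (sketch, interfaces assumed as described above)."""
    dataset = list(bc_dataset)            # D_0: initial BC demonstrations
    novice = train(dataset)               # initial novice policy
    intervention_doubts = []              # W: doubt log at interventions
    for _ in range(epochs):
        for _ in range(rollouts):
            obs = env.reset()
            for _ in range(horizon):
                if expert.engaged(obs):                   # human gates control
                    action = expert(obs)
                    dataset.append((obs, action))         # expert-labeled data only
                    intervention_doubts.append(doubt(novice, obs))
                else:
                    action = novice(obs)                  # novice keeps control
                obs, done = env.step(action)
                if done:
                    break
        novice = train(dataset)           # retrain once per epoch
    tau = safety_threshold(intervention_doubts)  # risk certificate (sketched above)
    return novice, tau
```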
4. Novice Policy and Uncertainty Estimation Architecture
The novice is instantiated as an ensemble of $M$ feed-forward neural networks. Each subnetwork processes the concatenated observation vector
$$o = \big[\ \text{lateral/heading/yaw state},\ \ \text{lane distances},\ \ \text{obstacle proximities}\ \big].$$
The architecture comprises two hidden layers (128–256 units, ReLU activation), with a continuous action output of (steering, speed). The ensemble mean prescribes the novice action; the ensemble covariance estimates the epistemic uncertainty, closely approximating a scalable Gaussian process for risk estimation.
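A minimal PyTorch sketch of such an ensemble follows; the observation dimension, hidden width, and ensemble size are illustrative assumptions, not values from the paper:

```python
import torch
import torch.nn as nn

class EnsembleNovice(nn.Module):
    """Ensemble of feed-forward nets: the mean output is the control action,
    the spread across members feeds the doubt metric. Sizes are illustrative."""
    def __init__(self, obs_dim=10, act_dim=2, hidden=128, n_members=5):
        super().__init__()
        self.members = nn.ModuleList(
            nn.Sequential(
                nn.Linear(obs_dim, hidden), nn.ReLU(),
                nn.Linear(hidden, hidden), nn.ReLU(),
                nn.Linear(hidden, act_dim),
            )
            for _ in range(n_members)
        )

    def forward(self, obs):
        # obs: (batch, obs_dim); actions: (n_members, batch, act_dim)
        actions = torch.stack([m(obs) for m in self.members])
        return actions.mean(dim=0), actions  # control action, member outputs
```

The stacked member outputs can be passed directly to the covariance-based doubt computation sketched in Section 2.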
5. Experimental Setup and Evaluation Metrics
HG-DAgger was evaluated on both simulated and real-world autonomous driving tasks:
- Simulation: A two-lane road with static obstacle cars; the novice must navigate sequences of randomly placed lane blockages.
- Real vehicle: MG-GS car equipped with LiDAR, precision localization, onboard safety driver, and an off-board human expert for HG-DAgger interventions.
Baselines included:
- Behavioral Cloning (BC) with demonstration labels.
- Standard DAgger with expert-control probability $\beta$, decayed by a factor of $0.85$ per epoch.
Metrics:
- Collision rate (collisions per meter traveled)
- Road departure rate and mean duration
- Steering-angle distribution divergence relative to a human reference (Bhattacharyya distance)
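For reference, the Bhattacharyya distance between two empirical steering-angle distributions can be estimated from normalized histograms; the binning and value range below are illustrative assumptions:

```python
import numpy as np

def bhattacharyya_distance(p_samples, q_samples, bins=50, value_range=(-1.0, 1.0)):
    """Bhattacharyya distance between two empirical steering-angle
    distributions, estimated via normalized histograms on shared bins."""
    p, _ = np.histogram(p_samples, bins=bins, range=value_range)
    q, _ = np.histogram(q_samples, bins=bins, range=value_range)
    p = p / p.sum()
    q = q / q.sum()
    bc = np.sum(np.sqrt(p * q))           # Bhattacharyya coefficient
    return float(-np.log(bc + 1e-12))     # small epsilon guards log(0)
```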
6. Empirical Results and Analysis
In simulation, after a fixed total of sampled states (BC + DAgger/HG-DAgger), the following results were observed:
| Method | Road Departure (m⁻¹) | Collision (m⁻¹) | Bhattacharyya Distance |
|---|---|---|---|
| BC | 0.1173 | | |
| DAgger | 0.1057 | | |
| HG-DAgger | 0.0834 | | |
DAgger suffered from late-epoch instability, plausibly due to degraded label quality under stochastic action gating, whereas HG-DAgger achieved more stable learning. Risk-threshold validation, partitioning visited states into estimated safe and unsafe regions by comparing their doubt to $\tau$, demonstrated that $\tau$ robustly separates high- and low-risk outcomes: states whose doubt fell below the threshold exhibited markedly lower collision and road-departure rates per meter than states above it.
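This validation amounts to partitioning visited states by the learned threshold; a minimal sketch, with the `doubt` callable assumed as in the training-loop sketch above:

```python
def partition_by_risk(states, novice, doubt, tau):
    """Split states into estimated-safe / estimated-risky sets using the
    learned doubt threshold tau; per-meter event rates can then be
    computed separately within each set."""
    safe, risky = [], []
    for s in states:
        (risky if doubt(novice, s) > tau else safe).append(s)
    return safe, risky
```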
On-vehicle, HG-DAgger achieved zero collisions and zero departures, outperforming both BC and DAgger.
7. Discussion, Limitations, and Prospective Directions
HG-DAgger presents a practical solution for real-world imitation learning with human experts. By preserving the expert's ability to intervene at will, the algorithm ensures higher-quality expert labels and safer data collection. The risk threshold $\tau$, learned directly from intervention-time uncertainty, operates as an actionable certificate for downstream safety filtering.
Limitations include reliance on the expert's real-time discrimination of unsafe states and the absence of formal regret or safety guarantees. The ensemble-based uncertainty metric, while tractable, may underestimate epistemic uncertainty. Prospective directions include automating the gating process using the learned doubt threshold, employing richer uncertainty estimators such as Bayesian neural networks or MC dropout, and extending the methodology to multi-modal or formally verifiable safety-critical domains (Kelly et al., 2018).