LazyDAgger: Efficient Imitation Learning
- LazyDAgger is an imitation learning algorithm that minimizes expert interventions by selectively querying states with high uncertainty.
- It utilizes dropout-based model ensembles to quantify uncertainty, allowing only the most ambiguous states to trigger expert labeling.
- Variants further optimize human-in-the-loop control by reducing context switches through asymmetric thresholds and noise injection during supervision.
LazyDAgger is an imitation learning (IL) algorithm designed to reduce expert intervention queries and minimize supervisor burden during interactive training of autonomous policies. Two principal algorithmic streams carry the LazyDAgger label: one based on disagreement-driven querying for expert labeling (Haridas et al., 2023), and another focused on reducing supervisor context-switching overhead via optimized human-in-the-loop control delegation (Hoque et al., 2021).
1. Overview and Motivation
Standard Dataset Aggregation (DAgger) requires the expert to provide corrective actions at every state encountered during policy rollouts, resulting in substantial supervisory cost. LazyDAgger (referred to as DADAgger in some literature) addresses this issue via selective querying: expert actions are acquired only at states where the learned policy exhibits high uncertainty. Independently, an interactive robot learning variant of LazyDAgger focuses on reducing human supervisor context switches, improving both efficiency and overall task success by lengthening intervention segments and injecting noise during expert control. Both approaches aim to conserve expert effort without sacrificing imitation quality, addressing scalability challenges in real-world or costly environments (Haridas et al., 2023; Hoque et al., 2021).
2. Core Algorithms
2.1 Disagreement-Augmented Dataset Aggregation (DADAgger / LazyDAgger)
LazyDAgger in the DADAgger formulation utilizes dropout-based model ensembles at policy evaluation time to quantify epistemic uncertainty. At each state s visited during a policy rollout, N stochastic forward passes are computed under independent random dropout masks, producing actions a_1, …, a_N. The mean action and empirical variance (the uncertainty score) are

ā(s) = (1/N) Σᵢ aᵢ,  σ²(s) = (1/N) Σᵢ ‖aᵢ − ā(s)‖².

Only the most uncertain fraction of states, ranked by σ²(s), is selected for expert querying. Dataset aggregation and policy re-training then proceed as in DAgger, but over this reduced subset (Haridas et al., 2023).
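The scoring and selection step above can be sketched as follows. This is a minimal illustration, not the authors' implementation: `dropout_policy` is a hypothetical stand-in for a learned policy evaluated with dropout active at inference time, and the 25% query fraction is an illustrative choice.

```python
import numpy as np

rng = np.random.default_rng(0)

def dropout_policy(state, rng):
    """Hypothetical stochastic forward pass of the learned policy with
    dropout active at inference time. Here, a toy 2-D action whose noise
    grows with |state|, mimicking higher uncertainty on unfamiliar states."""
    base = np.array([np.sin(state), np.cos(state)])
    return base + rng.normal(scale=0.05 + 0.1 * abs(state), size=2)

def ensemble_uncertainty(state, rng, n_passes=10):
    """Mean action and scalar variance over n stochastic forward passes."""
    actions = np.stack([dropout_policy(state, rng) for _ in range(n_passes)])
    mean_action = actions.mean(axis=0)
    variance = actions.var(axis=0).sum()  # total variance across action dims
    return mean_action, variance

# Rank visited states by uncertainty; query the expert on the top fraction.
states = np.linspace(0.0, 3.0, 20)
variances = np.array([ensemble_uncertainty(s, rng)[1] for s in states])
query_fraction = 0.25
k = int(np.ceil(query_fraction * len(states)))
query_idx = np.argsort(variances)[-k:]  # indices of the k most uncertain states
```

Only the states in `query_idx` would be sent to the expert; the rest of the rollout contributes no labeling cost.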
2.2 LazyDAgger for Reducing Context Switching
The interactive robot learning variant modifies SafeDAgger by introducing two mechanisms:
- Asymmetric switching thresholds: Two discrepancy cutoffs, an upper threshold β_H and a strictly lower threshold β_L, form a hysteresis band for supervisor/autonomous delegation: control passes to the supervisor only when the discrepancy exceeds β_H and returns to the autonomous policy only once it falls below β_L, reducing frequent context switching.
- Noise injection during supervisor control: When the supervisor acts, executed actions are drawn from a Gaussian centered on the supervisor's action rather than strictly following it, broadening the visited state distribution to counteract covariate shift.
A discrepancy classifier predicts the need for intervention. The policy is updated with data acquired predominantly from supervisor-controlled states or when entering/exiting supervisor mode, producing longer, less frequent segments of supervision (Hoque et al., 2021).
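The two mechanisms can be sketched together as below. This is a toy illustration under assumed names, not the paper's implementation: the threshold values `beta_high`/`beta_low`, the discrepancy trace, and the noise scale are all invented for the example.

```python
import numpy as np

def switch_mode(mode, discrepancy, beta_high, beta_low):
    """Hysteresis-band delegation: switch to the supervisor only when the
    discrepancy exceeds the upper cutoff, and hand control back to the
    autonomous policy only once it falls below the lower cutoff."""
    if mode == "autonomous" and discrepancy > beta_high:
        return "supervisor"
    if mode == "supervisor" and discrepancy < beta_low:
        return "autonomous"
    return mode  # inside the band: keep the current mode (no switch)

# Toy discrepancy trace; values between the two cutoffs do NOT trigger a
# switch, which lengthens each supervision segment.
trace = [0.1, 0.6, 0.4, 0.3, 0.05, 0.4]
mode, switches = "autonomous", 0
for d in trace:
    new_mode = switch_mode(mode, d, beta_high=0.5, beta_low=0.1)
    switches += int(new_mode != mode)
    mode = new_mode

# Noise injection while the supervisor is in control: execute an action
# sampled around the supervisor's command instead of the exact command.
rng = np.random.default_rng(0)
supervisor_action = np.zeros(2)  # placeholder expert command
executed_action = rng.normal(loc=supervisor_action, scale=0.1)
```

With a single symmetric threshold, the trace above would bounce between modes on every crossing; the hysteresis band collapses those bounces into two switches.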
3. Theoretical Guarantees and Limitations
No new regret or query-efficiency bounds for LazyDAgger appear in the primary sources. In the DADAgger setting, the standard DAgger no-regret guarantee is recovered when the query fraction approaches 1 (i.e., every visited state is labeled). For smaller query fractions, performance degrades in proportion to the unqueried out-of-distribution states, but this trade-off is not captured by a formal upper bound in the extant literature (Haridas et al., 2023). In the context-switching variant, the per-epoch computational cost is described as linear in the rollout horizon and the cost of a policy forward pass, with no increase relative to SafeDAgger (Hoque et al., 2021).
A plausible implication is that as the fraction of states selected for querying decreases, the learner may receive a less representative distribution of corrective actions, increasing the risk of accumulating blind spots unless uncertainty estimation is accurate and well-calibrated. Similarly, the effectiveness of reducing context switches relies on accurate estimation and tuning of the hysteresis thresholds and noise parameters.
4. Empirical Evaluation
4.1 DADAgger-style LazyDAgger
Empirical studies in Car Racing and HalfCheetah domains report that LazyDAgger achieves 95–99% of DAgger's cumulative reward while reducing expert queries by 40–60%. In comparison, random sampling at the same query fraction typically requires more queries to attain the same performance level, underscoring the importance of uncertainty-driven selection. No detailed experiment tables or plots are provided, but trends indicate that increasing the ensemble size sharpens the uncertainty signal at proportional compute cost (Haridas et al., 2023).
4.2 Context-Switch Reduction LazyDAgger
In simulated MuJoCo tasks (HalfCheetah-v2, Walker2D-v2, Ant-v2) with TD3 supervisors, LazyDAgger reduces context switches by 46–79% over SafeDAgger, maintains or improves test-time reward, and matches supervisor action counts. In a physical ABB YuMi robot fabric manipulation task, LazyDAgger reduced context switches by 60% and scored 8/10 task successes (vs. SafeDAgger’s 2/10 and Behavioral Cloning’s 0/10), with only modest growth in supervisor actions reflecting longer intervention segments (Hoque et al., 2021).
| Task | Context-Switch Reduction | Success / Reward | Supervisor Actions (LazyDAgger vs. SafeDAgger) |
|---|---|---|---|
| HalfCheetah-v2 | 79% | ≥ SafeDAgger | ≈ equal |
| Walker2D-v2 | 56% | ≥ SafeDAgger | ≈ equal |
| Ant-v2 | 46% | ≥ SafeDAgger | ≈ equal |
| YuMi Robot | 60% | 8/10 vs. 2/10 | 43 vs. 34 |
5. Practical Considerations and Extensions
For DADAgger, the number of dropout ensemble members N (10–20 typical) balances uncertainty-estimation reliability against compute cost. The query threshold can be specified as a percentile of the observed uncertainties for ease of tuning. Computing per-step uncertainty during rollouts costs N forward passes per state, i.e., N·T passes for an episode of horizon T. Extensions include employing alternative uncertainty measures (e.g., predictive entropy), adaptive adjustment of the query fraction, or adaptation to classification/discrete action spaces (Haridas et al., 2023).
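A percentile-style query threshold can be computed directly from a rollout's uncertainty scores. The variances and the 25% budget below are illustrative values, not figures from the paper:

```python
import numpy as np

# Hypothetical per-state uncertainty scores collected during one rollout.
variances = np.array([0.02, 0.30, 0.07, 0.55, 0.11, 0.04, 0.26, 0.90])

# Express the query budget as a percentile: here, the top 25% most
# uncertain states are flagged for expert labeling.
query_fraction = 0.25
threshold = np.quantile(variances, 1.0 - query_fraction)
query_mask = variances >= threshold  # True for states sent to the expert
```

Specifying the budget as a percentile keeps the number of queries per rollout roughly constant even as the scale of the uncertainty scores drifts during training.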
The injected-noise variance and the hysteresis thresholds must be tuned in the context-switching variant, with the practical benefits most compelling when context-switch latency is substantial (e.g., human teleoperation). Potential future work includes online adaptation of the thresholds, learning the noise distribution, and extensions to multi-agent or fleet settings (Hoque et al., 2021).
6. Implications and Limitations
LazyDAgger provides a direct approach to reducing sample and supervisory costs in imitation learning by focusing effort on policy uncertainty or targeted, longer interventions. Its primary empirical benefit is substantially fewer expert queries or context switches while preserving learning efficacy. However, effectiveness hinges on accurate uncertainty estimation (in DADAgger) or well-tuned switching and noise parameters (in context-switching LazyDAgger). The need for hyperparameter selection, especially in non-simulated or complex environments, remains a challenge. No formal theoretical sample complexity or regret improvements are currently established for these variants. This suggests further work on formalizing performance guarantees and automating practical parameter selection could be significant for broader adoption.
7. Relationship to Related Algorithms
LazyDAgger generalizes and extends standard DAgger (Ross et al., 2011) by introducing selective querying, and modifies SafeDAgger (Zhang et al., 2017) with asymmetric thresholds and noise injection. It is distinguished from random sample selection by leveraging model uncertainty or discrepancy classifiers to focus queries/interventions where they are most impactful. Possible future research directions include blending active learning criteria, adapting methodologies for discrete action spaces, and integrating feedback over episodic horizons.
For comprehensive technical details, see (Haridas et al., 2023) for the DADAgger/LazyDAgger method and (Hoque et al., 2021) for context-switch-minimizing LazyDAgger in interactive robotic imitation learning.