Hierarchical Deep Q-Network from Imperfect Demonstrations in Minecraft (1912.08664v4)

Published 18 Dec 2019 in cs.AI

Abstract: We present Hierarchical Deep Q-Network (HDQfD) that took first place in the MineRL competition. HDQfD works on imperfect demonstrations and utilizes the hierarchical structure of expert trajectories. We introduce the procedure of extracting an effective sequence of meta-actions and subgoals from demonstration data. We present a structured task-dependent replay buffer and adaptive prioritizing technique that allow the HDQfD agent to gradually erase poor-quality expert data from the buffer. In this paper, we present the details of the HDQfD algorithm and give the experimental results in the Minecraft domain.

Citations (4)

Summary

  • The paper introduces the HDQfD algorithm, which decomposes complex Minecraft tasks using a hierarchical structure of expert trajectories.
  • It employs an adaptive replay buffer and demonstration discretization to efficiently filter and utilize imperfect expert data.
  • Experimental results in MineRL show that HDQfD significantly outperforms baselines like DQfD and PPO, achieving a score of 61.61.

Hierarchical Deep Q-Network from Imperfect Demonstrations in Minecraft

The paper "Hierarchical Deep Q-Network from Imperfect Demonstrations in Minecraft" introduces the Hierarchical Deep Q-Network from Demonstrations (HDQfD) algorithm, which secured first place in the MineRL competition. This paper focuses on reinforcement learning (RL) from demonstrations, particularly in complex environments with inherent challenges like Minecraft. Due to the difficulties in acquiring high-quality expert demonstrations in real-world, sample-limited domains, the authors developed a system that capitalizes on imperfect demonstrations and the hierarchical structure of expert trajectories.

Algorithmic Innovation

HDQfD distinguishes itself by effectively working with imperfect demonstrations through a few core contributions:

  1. Hierarchical Structure Utilization: The algorithm exploits the hierarchical nature of expert trajectories by extracting sequences of meta-actions and subgoals, so that each subtask within a trajectory is handled by its own strategy (see the subgoal-extraction sketch after this list).
  2. Adaptive Replay Buffer: HDQfD employs a structured, task-dependent replay buffer and an adaptive prioritizing technique to manage demonstrations. The buffer lets the agent gradually discard low-quality expert data and prioritize better-quality demonstrations for learning (see the replay-buffer sketch after this list).
  3. Demonstration Data Management: The paper introduces a method for discretizing demonstrations to make them suitable for a discrete-action RL agent. In addition, frameskip and framestack preprocessing brings the demonstration data in line with what the agent observes during training (see the preprocessing sketch after this list).
  4. Training with Imperfect Data: HDQfD adapts the ratio of expert data used during training, decreasing reliance on imperfect demonstrations as learning progresses (this schedule is also illustrated in the replay-buffer sketch).
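
To make the first point concrete, subgoals along the MineRL item chain (log, planks, ..., iron pickaxe) can be recovered from the inventory changes recorded in a demonstration. The sketch below is a hypothetical illustration, not the authors' code: the trajectory format (a list of (observation, action) pairs with an obs["inventory"] count dictionary) and the subgoal ordering are assumptions made for this example.

```python
# Hypothetical sketch: split an expert trajectory into per-subgoal segments
# by watching which item first appears (or increases) in the inventory.
# Assumes each observation carries obs["inventory"]: {item_name: count}.

SUBGOAL_ORDER = ["log", "planks", "stick", "crafting_table",
                 "wooden_pickaxe", "cobblestone", "stone_pickaxe",
                 "iron_ore", "furnace", "iron_ingot", "iron_pickaxe"]

def extract_subgoal_segments(trajectory):
    """Return a list of (subgoal, steps) pairs in the order they were achieved."""
    segments, current_steps = [], []
    pending = list(SUBGOAL_ORDER)                 # subgoals not yet achieved
    prev_inventory = trajectory[0][0]["inventory"]
    for obs, action in trajectory:
        current_steps.append((obs, action))
        inv = obs["inventory"]
        # The earliest pending item whose count increased ends the current segment.
        for item in pending:
            if inv.get(item, 0) > prev_inventory.get(item, 0):
                segments.append((item, current_steps))
                current_steps = []
                pending = pending[pending.index(item) + 1:]
                break
        prev_inventory = inv
    return segments
```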
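
The replay-buffer sketch below illustrates points 2 and 4 together: a buffer that mixes expert and agent transitions and anneals the expert share toward a floor as training progresses. The class name, the linear decay schedule, and the parameter values are illustrative assumptions, not details taken from the paper.

```python
import random

class AdaptiveMixedReplayBuffer:
    """Simplified sketch: sample batches that mix agent and expert transitions,
    with the expert fraction annealed toward a floor over training."""

    def __init__(self, expert_transitions, initial_expert_ratio=0.5,
                 final_expert_ratio=0.1, anneal_steps=100_000):
        self.expert = list(expert_transitions)   # imperfect demonstrations
        self.agent = []                          # transitions collected online
        self.initial = initial_expert_ratio
        self.final = final_expert_ratio
        self.anneal_steps = anneal_steps

    def add_agent_transition(self, transition):
        self.agent.append(transition)

    def expert_ratio(self, step):
        # Linear decay of the expert share from `initial` to `final`.
        frac = min(step / self.anneal_steps, 1.0)
        return self.initial + frac * (self.final - self.initial)

    def sample(self, batch_size, step):
        n_expert = int(round(batch_size * self.expert_ratio(step)))
        n_agent = batch_size - n_expert
        batch = random.sample(self.expert, min(n_expert, len(self.expert)))
        if self.agent:
            batch += random.choices(self.agent, k=n_agent)
        return batch
```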
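
The preprocessing sketch below illustrates point 3: binning MineRL's continuous camera deltas into a few discrete turns, and applying frameskip and framestack so demonstration frames resemble what the agent sees online. The threshold values, the (pitch, yaw) camera format, and the assumption that frames are same-shape arrays are illustrative choices, not the paper's exact settings.

```python
import numpy as np

def discretize_camera(camera, threshold=2.5):
    """Map a continuous (pitch, yaw) camera delta onto one of five discrete moves."""
    pitch, yaw = camera
    if abs(pitch) < threshold and abs(yaw) < threshold:
        return "no_turn"
    if abs(pitch) >= abs(yaw):
        return "look_down" if pitch > 0 else "look_up"
    return "turn_right" if yaw > 0 else "turn_left"

def frameskip_and_stack(frames, skip=4, stack=4):
    """Keep every `skip`-th frame, then stack the last `stack` kept frames
    along a new axis so each training observation carries short-term history."""
    kept = frames[::skip]
    stacked = []
    for i in range(len(kept)):
        window = kept[max(0, i - stack + 1): i + 1]
        # Pad the start of the episode by repeating the first kept frame.
        while len(window) < stack:
            window = [window[0]] + window
        stacked.append(np.stack(window, axis=0))
    return stacked
```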

Experimental Validation

The HDQfD algorithm was validated in the Minecraft domain, a particularly challenging environment due to its 3D, first-person, open-world nature. The results demonstrate that HDQfD can efficiently learn from imperfect expert trajectories and significantly improve in-game performance over time.

In the MineRL competition, several submissions were tested, showcasing HDQfD's performance under various configurations. Notably, HDQfD reached a score of 61.61 in the evaluation phase, outperforming baseline models such as DQfD and PPO by a significant margin.

Implications

The advancements presented in HDQfD have significant implications for reinforcement learning, particularly in environments where demonstrations are imperfect or sparse. This work highlights the potential of leveraging hierarchical structures within demonstrations and adapting demonstration ratios for enhancing training efficacy.

The authors propose that future work should focus on extending their hierarchical approach to a full end-to-end architecture. Additionally, they suggest augmenting the agent with access to demonstrations of all subtasks, conditioned on the current inventory state, to further improve performance.

Conclusion

This paper provides a detailed exploration of RL from imperfect demonstrations, introducing a hierarchical approach that enhances the ability to manage and learn from complex trajectories. The success in the MineRL competition underscores the potential of HDQfD to address real-world challenges in hierarchical and sparse environments, marking a step forward in reinforcement learning strategies.
