A Formal Overview of "Go-Explore: a New Approach for Hard-Exploration Problems"
Hard-exploration problems in reinforcement learning (RL), characterized by sparse or deceptive reward landscapes, remain a significant challenge despite recent advances in the field. Traditional methods, which often rely on intrinsic motivation (IM), fall short in environments that require exploring vast state spaces before any positive reinforcement is received. This paper introduces Go-Explore, an algorithm whose framework addresses these challenges by improving both exploration efficiency and the robustness of the learned policies.
Core Methodology of Go-Explore
Go-Explore comprises two main phases: exploration in a deterministic, resettable version of the environment, followed by robustification of the resulting solutions under stochasticity. The algorithm departs from current exploration strategies by remembering promising states in an archive, returning to them without exploration, and only then exploring from them. This return-then-explore principle is what allows Go-Explore to systematically and thoroughly explore vast environments, including those where rewards are sparse, distant, or hidden behind deceptive local optima.
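To make the return-then-explore loop concrete, the following is a minimal, self-contained sketch in Python. The environment interface (get_state/set_state simulator snapshots), the toy chain environment, the random-action exploration, and the visit-count selection rule are illustrative assumptions rather than the authors' implementation; the identity cell representation used here stands in for the downscaled observations discussed under Phase 1 below.

```python
import random
from dataclasses import dataclass

@dataclass
class Cell:
    state: object      # simulator snapshot used to return to this cell
    trajectory: list   # actions that reached it
    score: float       # cumulative reward along that trajectory
    visits: int = 0

class ToyChainEnv:
    """A sparse-reward chain: reward 1 only upon reaching position 20."""
    def __init__(self):
        self.pos = 0
    def reset(self):
        self.pos = 0
        return self.pos
    def get_state(self):
        return self.pos
    def set_state(self, state):
        self.pos = state
    def step(self, action):            # action is -1 or +1
        self.pos = max(0, self.pos + action)
        done = self.pos == 20
        return self.pos, float(done), done

def phase1(env, cell_repr, iterations=200, explore_steps=10):
    obs = env.reset()
    archive = {cell_repr(obs): Cell(env.get_state(), [], 0.0)}
    for _ in range(iterations):
        # Select a cell to return to, favoring rarely visited ones
        # (a simplified stand-in for the paper's selection heuristics).
        cell = min(archive.values(), key=lambda c: (c.visits, random.random()))
        cell.visits += 1
        env.set_state(cell.state)                  # "go": return without exploring
        traj, score = list(cell.trajectory), cell.score
        for _ in range(explore_steps):             # "explore" from that state
            action = random.choice([-1, 1])
            obs, reward, done = env.step(action)
            traj.append(action)
            score += reward
            key = cell_repr(obs)
            # Archive a newly discovered cell, or replace an old entry if this
            # trajectory reaches the same cell with a higher score.
            if key not in archive or score > archive[key].score:
                archive[key] = Cell(env.get_state(), list(traj), score)
            if done:
                break
    return archive

archive = phase1(ToyChainEnv(), cell_repr=lambda obs: obs)
best = max(archive.values(), key=lambda c: c.score)
print("best score:", best.score, "trajectory length:", len(best.trajectory))
```

The point of the sketch is that exploration always resumes from states already stored in the archive, so progress accumulates rather than having to be rediscovered from scratch on every episode.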
Phase 1: Deterministic Exploration
In the first phase, Go-Explore uses a deterministic version of the environment to focus purely on exploring the state space, setting aside immediate reward maximization. It archives visited states as cells, using either simplified domain-agnostic representations (such as downscaled observations) or domain-specific features. By returning to archived states and exploring from them, Go-Explore avoids the two failure modes the authors identify in intrinsic-motivation methods: detachment, in which an algorithm loses track of promising areas it has already discovered, and derailment, in which stochastic exploration prevents the agent from reliably returning to previously reached states. This systematic process lets Go-Explore make substantial progress on hard-exploration benchmarks such as Montezuma’s Revenge and Pitfall, a clear improvement over existing approaches in acquiring exploratory trajectories.
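For the domain-agnostic setting, similar observations are aggregated into the same cell by aggressively downscaling frames. The sketch below illustrates the idea; the target resolution and number of intensity levels are illustrative assumptions (the paper tunes such parameters), and frame_to_cell is a hypothetical helper, not the authors' code.

```python
import numpy as np

def frame_to_cell(frame, width=11, height=8, levels=8):
    """Map an RGB frame of shape (H, W, 3) to a small, coarsely quantized
    grayscale grid, returned as a hashable tuple usable as an archive key."""
    gray = frame.mean(axis=2)                        # grayscale, shape (H, W)
    h, w = gray.shape
    # Average-pool the image into a (height x width) grid of blocks.
    ys = np.linspace(0, h, height + 1, dtype=int)
    xs = np.linspace(0, w, width + 1, dtype=int)
    small = np.array([[gray[ys[i]:ys[i + 1], xs[j]:xs[j + 1]].mean()
                       for j in range(width)] for i in range(height)])
    # Quantize each block mean to a handful of intensity levels (0..levels-1).
    quantized = np.floor(small / 256.0 * levels).astype(int)
    return tuple(quantized.flatten().tolist())

# An Atari-sized frame (210 x 160 RGB) collapses to just width * height values.
rng = np.random.default_rng(0)
frame = rng.integers(0, 256, size=(210, 160, 3)).astype(float)
cell = frame_to_cell(frame)
print(len(cell), cell[:11])
```

Because many raw frames map to the same cell, the archive stays small enough to enumerate and sample from, while still distinguishing meaningfully different situations.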
Phase 2: Robustification
Once promising trajectories have been found, the second phase robustifies them against environmental stochasticity to yield a stable policy. This is done via imitation learning, specifically the Backward Algorithm, which starts the agent near the end of a demonstration trajectory and gradually moves the starting point earlier as the agent learns to match the demonstration’s return from each point. This phased approach separates exploration from policy generalization, allowing Go-Explore to first solve the deterministic version of the problem before addressing the complexities introduced by a stochastic test environment.
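The lines below sketch only the start-point schedule behind this backward robustification, under stated assumptions: run_rl_episodes is a hypothetical placeholder for any RL training routine (for example a PPO loop) that trains from a given start state and reports how often the agent matched the demonstration's return-to-go from that point, and the threshold and step size are arbitrary illustrative values rather than the authors' settings.

```python
def robustify(demo_states, demo_returns, run_rl_episodes,
              success_threshold=0.8, step_back=5, max_rounds=10_000):
    """Move the episode start point backward along a demonstration as the
    policy becomes reliable at finishing from later and later points."""
    start = len(demo_states) - step_back           # begin close to the goal
    for _ in range(max_rounds):
        if start < 0:
            return True                            # demo reproduced from the initial state
        # Hypothetical trainer: runs (stochastic) episodes reset to
        # demo_states[start] and returns the fraction that achieved at least
        # demo_returns[start], the demonstration's return-to-go from there.
        success_rate = run_rl_episodes(start_state=demo_states[start],
                                       target_return=demo_returns[start])
        if success_rate >= success_threshold:
            start -= step_back                     # reliable here: move start back
    return False                                   # training budget exhausted

# Toy usage with a stand-in trainer that "succeeds" immediately.
demo_states = list(range(100))                     # placeholder states along a trajectory
demo_returns = [100 - i for i in range(100)]       # pretend return-to-go
print(robustify(demo_states, demo_returns,
                run_rl_episodes=lambda start_state, target_return: 1.0))
```

The appeal of this schedule is that later segments of the trajectory are learned first, so the agent always trains within reach of states it already knows how to finish from, which keeps imitation tractable once stochasticity is introduced.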
Impact on Hard-Exploration Benchmarks
Go-Explore sets notable records in its performance on Atari benchmark games, particularly Montezuma’s Revenge and Pitfall. The reported scores vastly outperform prior state-of-the-art algorithms, showcasing the efficacy of its core principles in environments where traditional RL approaches may stagnate or converge prematurely on suboptimal policies. Remarkably, on Montezuma’s Revenge, Go-Explore achieved a score exceeding 43,000 without domain knowledge, while scores soared beyond 650,000 with domain knowledge features, establishing a benchmark of superhuman performance. Similarly, Go-Explore was the first algorithm to score positively on Pitfall, further emphasizing the potential of its exploration strategies.
Implications and Future Directions
The methodological advancements introduced by Go-Explore carry significant implications for the development of RL algorithms aimed at solving complex real-world problems. The decoupling of exploration from robustness in policy execution provides a new perspective on tackling environments characterized by high-dimensional state spaces, sparse signals, and the necessity for systematic exploration strategies.
Looking forward, integrating goal-conditioned policies to replace deterministic state resets in Phase 1 could offer solutions for extending Go-Explore to environments lacking explicit simulator control, thereby broadening its applicability. Additionally, enhancing cell representations to exploit learned or compressed state features may further improve the efficiency and scalability of Go-Explore, particularly in environments with extreme dimensional complexity. Robotics, autonomous planning, and large-scale simulation tasks represent natural application domains where Go-Explore’s paradigm of explore-then-robustify could lead to impactful advancements.
In summary, Go-Explore presents a compelling approach in RL research for addressing hard-exploration challenges, and paves the way for future innovations in both algorithmic development and practical applications.