Go-Explore Framework in RL
- Go-Explore is a reinforcement learning framework that archives visited states and revisits promising cells to overcome sparse-reward challenges.
- It employs a two-phase approach: deterministic exploration followed by robustification via imitation learning to stabilize performance in stochastic settings.
- Extensions of the framework have broadened its impact to robotics, safety validation, and automated testing, consistently setting new benchmark records.
The Go-Explore framework is a family of algorithms in reinforcement learning (RL) that addresses the challenge of efficiently exploring environments with sparse or deceptive rewards, which historically have proven to be among the most difficult problems in RL. Go-Explore introduced systematic mechanisms for remembering, revisiting, and exploring from diverse, promising states and established new records across a range of hard-exploration tasks, notably in Atari benchmarks like Montezuma’s Revenge and Pitfall. The framework has inspired numerous extensions and variants that address its limitations, improve its state abstraction mechanisms, and broaden its applicability to domains such as robotics, safety validation, affective agent modeling, large-scale automated testing, and generalization across tasks.
1. Foundational Concepts and Algorithm Design
Go-Explore rests on three interlinked principles:
- Explicit memory of visited states ("cells"): An archive stores a diverse set of states—often reduced to compressed representations—along with the trajectories that reach them. This directly mitigates detachment, a failure mode in which agents forget promising states.
- Deterministic return to promising states: Prior to exploration, the agent deterministically returns (via simulator reset or a learned policy) to a selected promising state before continuing to explore, thereby combating derailment (the failure to reliably revisit deep or complex states).
- Separation of exploration and robustification: Exploration is conducted in an environment where determinism (instantaneous reset and action replay) is leveraged to its fullest. Once successful but potentially brittle trajectories are found, a robustification phase uses imitation learning (notably the Backward Algorithm) to train a stochastic, robust policy that generalizes under realistic environmental stochasticity.
A typical Go-Explore workflow proceeds as follows:
```python
# Phase 1: exploration with an explicit archive of cells
# Archive maps a cell key to its best (state, trajectory, score) record
Archive = {initial_cell: (initial_state, [], 0)}
while not task_solved:
    cell = select_cell(Archive)              # heuristic prioritization (see below)
    state, trajectory, score = Archive[cell]
    restore_state(state)                     # deterministically return to the chosen cell
    t = explore_from_cell(cell)              # continue exploring from that state
    for new_cell in t.new_cells:
        # keep only the best-scoring record found so far for each cell
        if new_cell not in Archive or t.score > Archive[new_cell][2]:
            Archive[new_cell] = (t.state, t.trajectory, t.score)

# Phase 2: robustification of the brittle trajectories found above
for demo in sampled_trajectories(Archive):
    train_policy(demo)                       # imitation learning, e.g., the Backward Algorithm
```
Cell selection incorporates heuristics designed to prioritize under-explored or frontier regions, typically by weighting each cell inversely to how often it has been visited or selected—for example, a count-based weight of the form W(c) = 1/√(C_seen(c) + 1), where C_seen(c) counts how often cell c has been encountered (First return, then explore, 2020).
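A minimal sketch of such count-based prioritization, assuming archive records are plain dicts that additionally track a `visits` counter (an illustrative field, a simplification of the tuple records in the workflow sketch above):

```python
import math
import random

def select_cell(archive):
    """Sample a cell to return to, favoring rarely visited cells.

    Weight = 1 / sqrt(visits + 1): cells that have been visited or selected
    less often receive higher weight, steering exploration toward the frontier.
    """
    cells = list(archive.keys())
    weights = [1.0 / math.sqrt(archive[c]["visits"] + 1) for c in cells]
    return random.choices(cells, weights=weights, k=1)[0]
```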
These mechanisms enable systematic and repeatable traversal of very large state spaces in challenging, sparse-reward settings (Go-Explore: a New Approach for Hard-Exploration Problems, 2019; First return, then explore, 2020).
2. State Representation and Variants
The definition and abstraction of "cells" is central to Go-Explore's success and flexibility. Several strategies have evolved:
- Pixel Downsampling: Aggressively downsamples high-dimensional observations (e.g., to 11×8 grayscale images with coarsely quantized intensities) for cell encoding. This is domain-agnostic but potentially lossy (a minimal sketch follows this list).
- Domain Knowledge Features: Employs concise, human-understandable descriptors such as agent position, room number, inventory, or game level (e.g., Montezuma’s Revenge and Pitfall). This variant significantly increases sample efficiency and performance by leveraging simple heuristics (Go-Explore: a New Approach for Hard-Exploration Problems, 2019).
- Latent Representation Learning: “Cell-free” approaches such as Latent Go-Explore (LGE) use encoders (e.g., inverse/forward dynamics, VQ-VAE) to map observations into a learned latent space, with density estimation guiding exploration toward low-density regions (Cell-Free Latent Go-Explore, 2022).
- Time-Myopic Encodings: A learned, temporally aware representation maps observations into an embedding space where proximity reflects temporal closeness, and novelty is based on predicted time distances (Time-Myopic Go-Explore: Learning A State Representation for the Go-Explore Paradigm, 2023).
These representations determine Go-Explore’s generality and robustness, especially for real-world, high-dimensional, or partially observable settings.
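As a concrete illustration of the pixel-downsampling encoding described in the first item above, a minimal sketch (OpenCV is used purely for convenience; the target resolution and number of intensity levels are tunable hyperparameters):

```python
import numpy as np
import cv2

def cell_key(frame, width=11, height=8, levels=8):
    """Map a raw RGB game frame to a coarse, hashable cell key.

    Convert to grayscale, downsample aggressively, and quantize pixel
    intensities so that many visually similar observations collapse into
    the same cell; the resulting bytes serve as an archive dictionary key.
    """
    gray = cv2.cvtColor(frame, cv2.COLOR_RGB2GRAY)
    small = cv2.resize(gray, (width, height), interpolation=cv2.INTER_AREA)
    quantized = (small.astype(np.int32) * levels // 256).astype(np.uint8)
    return quantized.tobytes()
```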
3. Workflow Phases and Robustification
The core Go-Explore workflow is distinguished by its two-phase structure:
Phase 1 – Exploration: Leveraging deterministic resets or goal-conditioned policies, the agent systematically revisits and explores from archived states, persistently extending its coverage of the state space.
Phase 2 – Robustification: High-performing, potentially brittle trajectories are converted into robust policies using imitation learning, commonly via the Backward Algorithm. This phase is essential for generalization to stochastic or real-world conditions.
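A hedged sketch of the Backward-Algorithm-style curriculum behind this phase, assuming a brittle Phase-1 demonstration and an RL training routine supplied by the caller (names and interfaces are illustrative, not those of a specific implementation):

```python
def robustify(demo, train_from, success_threshold=0.8):
    """Train a robust policy by starting ever earlier along a demonstration.

    `demo` is a list of (state, return_to_go) pairs from a successful but
    brittle Phase-1 trajectory. `train_from(state, target_return)` runs RL
    (e.g., PPO under sticky actions) with episodes reset to `state` and
    returns the fraction of rollouts that reach `target_return`. The start
    point moves backward only once the policy is reliable from it, so the
    final policy solves the task from the environment's true initial state.
    """
    start_idx = len(demo) - 1
    while start_idx >= 0:
        state, return_to_go = demo[start_idx]
        success_rate = train_from(state, return_to_go)  # keep training here...
        if success_rate >= success_threshold:
            start_idx -= 1                              # ...then step backward
    return start_idx == -1  # True once robust from the true start
```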
The robustification process is particularly salient in safety-critical verification, such as adaptive stress testing of autonomous systems, where the Backward Algorithm both refines failure-seeking trajectories and increases their likelihood (Adaptive Stress Testing without Domain Heuristics using Go-Explore, 2020).
4. Performance, Benchmarks, and Impact
Go-Explore established new benchmarks on classic sparse-reward Atari domains:
| Algorithm | Montezuma's Revenge | Pitfall |
|---|---|---|
| Human Expert | 34,900 | 47,821 |
| Prior SOTA (IM methods) | ~11,500 | <0 |
| Go-Explore (no domain) | 43,763 | N/A |
| Go-Explore (domain) | 666,474 | 59,494 |
| Go-Explore (best) | 18,003,200 | 107,363 |
Go-Explore, with domain knowledge, not only surpassed all algorithmic baselines but, in Montezuma’s Revenge, exceeded established human world records by more than an order of magnitude (Go-Explore: a New Approach for Hard-Exploration Problems, 2019; First return, then explore, 2020).
In robotics (e.g., pick-and-place tasks with sparse rewards), Go-Explore consistently discovered successful strategies that PPO and intrinsic motivation baselines could not, even with dramatically fewer frames (First return, then explore, 2020).
Empirical results in other domains include:
- Automated game testing: Go-Explore achieved near-complete coverage of vast, open-world maps—surfacing both expected and unexpected reachability bugs, and outperforming curiosity-based RL approaches by several orders of magnitude in coverage rate (Go-Explore Complex 3D Game Environments for Automated Reachability Testing, 2022).
- Residential energy management: Cell-based archival and systematic exploration yielded up to 19.84% improvement in cost savings over Deep Q-Network baselines (Go-Explore for Residential Energy Management, 15 Jan 2024).
- Affective agent modeling: Modified Go-Explore agents could blend behavioral and affective objectives (e.g., score and human arousal imitation), generating believable agent play styles for game testing (Go-Blend behavior and affect, 2021).
5. Extensions and Research Trajectories
Building on the original Go-Explore paradigm, multiple extensions have furthered its capabilities:
- Policy-based Go-Explore: Where deterministic resets are impossible, a goal-conditioned policy returns the agent to archived cells. These return policies are trained jointly with exploration and remain robust to environment stochasticity (First return, then explore, 2020); see the sketch after this list.
- Post-Exploration: Systematic additional exploration after reaching a goal yields improved state-space coverage, with adaptive strategies that trigger post-exploration based on frontier novelty or adaptive duration. This has been shown, in both MiniGrid and Mujoco environments, to enhance diversity and performance more than conventional exploration tuning (When to Go, and When to Explore: The Benefit of Post-Exploration in Intrinsic Motivation, 2022; First Go, then Post-Explore: the Benefits of Post-Exploration in Intrinsic Motivation, 2022).
- Cell-Free Variants (LGE): Generalization to raw observation spaces via learned representations avoids the brittleness and manual tuning of hand-crafted cell abstractions, enabling broader applicability and robust exploration, especially in continuous or high-dimensional domains (Cell-Free Latent Go-Explore, 2022).
- Intelligent Go-Explore (IGE): Recent advancements integrate foundation models such as LLMs in place of hand-crafted heuristics at all stages—selecting, filtering, and exploring archived states. This approach achieves significant sample efficiency and generalization in language- and vision-based reasoning benchmarks, surpassing both prior RL and LLM-based agents (Intelligent Go-Explore: Standing on the Shoulders of Giant Foundation Models, 24 May 2024).
- Generalization via Explore-Go: While Go-Explore excels at systematic exploration for reward-sparse tasks, Explore-Go applies a preparatory exploration phase at the start of each training episode to diversify the agent's start state, improving out-of-distribution and unreachable-task generalization in contextual MDPs and Procgen tasks (Explore-Go: Leveraging Exploration for Generalisation in Deep Reinforcement Learning, 12 Jun 2024).
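Below is a minimal sketch of the policy-based return-then-explore loop referenced in the first item above, reusing the illustrative `select_cell` and `cell_key` helpers from the earlier sketches; `goal_policy`, `explore_policy`, `update_archive`, and the gym-style `env` interface are assumptions, not a specific codebase:

```python
def return_then_explore(env, archive, goal_policy, explore_policy,
                        max_return_steps=500, explore_steps=100):
    """One iteration of policy-based Go-Explore: no simulator state restores.

    A goal-conditioned policy is asked to re-reach the selected cell from a
    fresh episode ("first return"); only then does undirected exploration
    resume from that point ("then explore").
    """
    goal = select_cell(archive)
    obs = env.reset()
    for _ in range(max_return_steps):
        if cell_key(obs) == goal:
            break
        obs, _, done, _ = env.step(goal_policy(obs, goal))
        if done:
            return  # failed to return this episode; try again next iteration
    for _ in range(explore_steps):
        obs, reward, done, _ = env.step(explore_policy(obs))
        update_archive(archive, obs, reward)  # record any newly reached cells
        if done:
            break
```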
6. Applications Across Domains
Go-Explore and its derivatives have been applied to a variety of tasks:
- Hard-exploration games: Montezuma's Revenge, Pitfall, and other sparse-reward challenges in the Atari suite, setting "superhuman" records for both mean and best scores.
- Robotics: Multi-step manipulation (pick-and-place) requiring structured exploration and memory.
- Simulation-based safety verification: Discovery of rare failure cases in complex, black-box environments, such as adaptive stress testing of autonomous vehicles, without domain heuristics (Adaptive Stress Testing without Domain Heuristics using Go-Explore, 2020).
- Game QA: Large-scale, automated reachability testing in open-world 3D game environments, surfacing critical bugs missed by human testers (Go-Explore Complex 3D Game Environments for Automated Reachability Testing, 2022).
- Multi-objective RL: Simultaneous optimization for performance and human-like affect or other behavioral objectives (Go-Blend behavior and affect, 2021).
- Energy management: Optimal scheduling for cost-saving under deceptive, sparse, and stochastic reward signals (Go-Explore for Residential Energy Management, 15 Jan 2024).
- Language and reasoning tasks: Efficient, generalist exploration by combining archive-based search with foundation model intelligence, yielding sample-efficient trajectories and solution discovery (Intelligent Go-Explore: Standing on the Shoulders of Giant Foundation Models, 24 May 2024).
7. Limitations, Challenges, and Future Prospects
Despite their significant achievements, Go-Explore methods face certain limitations:
- Reset-to-arbitrary-state requirement: The classic approach requires that the simulator support resets to arbitrary states, which may not be possible in all real-world settings. Policy-based variants address this with goal-conditioned return policies, at the cost of increased training complexity.
- Cell representation design: Poorly designed cell abstractions can cause degeneration or stagnation in exploration. Recent research addresses this with learned latent (Cell-Free Latent Go-Explore, 2022) or temporally-informed (Time-Myopic Go-Explore: Learning A State Representation for the Go-Explore Paradigm, 2023) representations.
- Scalability and generalization: Reported results demonstrate strong performance in complex simulated environments; further work is required to scale to lifelong, real-world, and autonomous-agent settings.
- Archive management: In high-dimensional or continuous domains, archive size and redundancy must be managed effectively, especially as the number of stored states grows.
Future research directions highlighted in foundational and recent works include:
- Integrating improved learned representation schemes for robust, cell-free exploration in high-dimensional observations.
- Automating all exploration heuristics by leveraging large foundation models and retrieval-augmented generation for archive management and "interestingness" judgments (Intelligent Go-Explore: Standing on the Shoulders of Giant Foundation Models, 24 May 2024).
- Developing robustification techniques suitable for transfer learning, multi-agent scenarios, and online adaptation in stochastic or open-ended environments.
- Extending principles to multi-objective RL, affect modeling, safety assurance, and open-ended generative modeling.
- Generalizing post-exploration and archive-driven paradigms for rapid adaptation and exploration across new tasks and domains, including zero-shot and out-of-distribution generalization.
Go-Explore, through continuous development and hybridization with foundation models, continues to represent a central paradigm for addressing the exploration-exploitation dilemma in both deep RL research and real-world problem domains.