Structured Exploration with Achievements (SEA)
- SEA is a multi-stage reinforcement learning algorithm for achievement-based environments that structures exploration using offline data and a determinant loss.
- It employs goal-conditioned sub-policies and a meta-controller to recover dependency graphs and systematically unlock achievements from sparse rewards.
- Empirical evaluations in Crafter and TreeMaze demonstrate SEA's ability to consistently unlock rare achievements and outperform traditional RL methods.
Structured Exploration with Achievements (SEA) is a multi-stage reinforcement learning (RL) algorithm designed for exploration in environments with a discrete internal achievement set, particularly in the presence of sparse rewards. It leverages offline data to disentangle and structure the achievement space, recovers a dependency graph of achievements, and uses goal-conditioned sub-policies coordinated by a meta-controller to drive systematic exploration and mastery of complex domains with high-dimensional observations (Zhou et al., 2023).
1. Achievement-Based Environments: Formalization
SEA is built for environments formalized as Markov Decision Processes with Achievements (MDPA), specified by the tuple

$$\mathcal{M} = (\mathcal{O}, \mathcal{A}, P, \mathcal{G}, \phi),$$

where:
- $\mathcal{O}$ is the observation space (e.g., pixel images),
- $\mathcal{A}$ is the discrete or continuous action space,
- $P(o_{t+1} \mid o_t, a_t)$ is the transition kernel,
- $\mathcal{G}$ denotes the finite set of achievement labels,
- $\phi : \mathcal{O} \times \mathcal{A} \times \mathcal{O} \to \mathcal{G} \cup \{\varnothing\}$ is the achievement-completion function, mapping transitions to unlocked achievements or “none.”

Reward at time $t$ is delivered only for the first unlock of each achievement per episode, given by

$$r_t = \mathbb{1}\!\left[\, g_t \neq \varnothing \,\wedge\, g_t \notin U_t \,\right], \qquad g_t = \phi(o_t, a_t, o_{t+1}),$$

where $U_t \subseteq \mathcal{G}$ is the set of achievements unlocked before time $t$. To restore the Markov property, the environment can be augmented with the set of achievements unlocked so far: $\tilde{o}_t = (o_t, U_t)$.
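The sketch below illustrates this first-unlock reward and state augmentation as an environment wrapper. The class, the Gym-style `step` signature, and the `info["achievement"]` convention for reporting $\phi$'s output are illustrative assumptions, not the paper's implementation:

```python
class AchievementRewardWrapper:
    """Minimal MDPA-style wrapper (illustrative): pays reward only on the
    first unlock of each achievement per episode and augments observations
    with the unlocked-set U_t to restore the Markov property."""

    def __init__(self, env):
        # Assumes a Gym-like env whose info dict reports the achievement
        # (if any) completed by the last transition.
        self.env = env
        self.unlocked = frozenset()

    def reset(self):
        obs = self.env.reset()
        self.unlocked = frozenset()
        return (obs, self.unlocked)                 # augmented obs (o_t, U_t)

    def step(self, action):
        obs, _, done, info = self.env.step(action)  # inner reward discarded
        g = info.get("achievement")                 # phi(o_t, a_t, o_{t+1}) or None
        reward = 1.0 if (g is not None and g not in self.unlocked) else 0.0
        if g is not None:
            self.unlocked = self.unlocked | {g}     # record the first unlock
        return (obs, self.unlocked), reward, done, info
```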
2. Offline Representation Learning Using Determinant Loss
SEA’s first phase uses offline trajectories to build an achievement-disentangled representation. The goal is to learn an embedding function $f_\theta : \mathcal{O} \times \mathcal{A} \times \mathcal{O} \to \mathbb{R}^d$ such that transitions unlocking distinct achievements in a single episode yield maximally separated (near-orthogonal) embeddings.

This is accomplished by the determinant loss. For each trajectory, collect the time steps $t_1 < \dots < t_k$ with $g_{t_i} \neq \varnothing$; the pairwise inner products of their embeddings form a Gram matrix:
- Build $Z = [\, z_1, \dots, z_k \,]$, where $z_i = f_\theta(o_{t_i}, a_{t_i}, o_{t_i + 1})$,
- Normalize: $\bar{z}_i = z_i / \lVert z_i \rVert_2$,
- Minimize

$$\mathcal{L}_{\det} = -\log \det\!\left( \bar{Z}^\top \bar{Z} + \tau I \right)$$

for a temperature constant $\tau > 0$. This encourages the embeddings for different achievements within episodes to be mutually orthogonal, preventing collapse or clustering. The closed-form gradient is given in the original source, facilitating direct optimization (Zhou et al., 2023).
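A minimal PyTorch sketch of this objective, assuming the ridge-regularized Gram-determinant form written above (the exact parameterization in the paper may differ); it takes all $k$ unlocking-transition embeddings of one episode at once:

```python
import torch
import torch.nn.functional as F

def determinant_loss(z: torch.Tensor, tau: float = 1e-2) -> torch.Tensor:
    """Determinant loss for one episode's achievement-unlocking transitions.

    z   : (k, d) embeddings z_i = f_theta(o_{t_i}, a_{t_i}, o_{t_i + 1})
    tau : temperature/ridge constant keeping the Gram matrix well-conditioned

    det(Z_bar Z_bar^T) is maximal exactly when the normalized embeddings are
    mutually orthogonal, so minimizing -log det pushes them apart.
    """
    z_bar = F.normalize(z, dim=1)                   # unit-norm rows
    gram = z_bar @ z_bar.T                          # (k, k), ones on diagonal
    eye = torch.eye(gram.shape[0], device=z.device, dtype=z.dtype)
    return -torch.logdet(gram + tau * eye)
```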
3. Achievement Dependency Graph Extraction
After learning the representation, clusters of achievement transitions (using an achievement classifier $h$ fit on the embeddings) provide achievement labels for transitions. Exploiting the ordering structure within logged trajectories, SEA recovers a minimal directed acyclic graph (DAG) representing achievement dependencies via a heuristic:
- For each trajectory, count the number of times each achievement occurs ($N_j$ for achievement $j$), and for each pair $(i, j)$, count the occurrences $N_{i \prec j}$ in which $i$ precedes $j$.
- An edge $i \to j$ is kept in the initial graph if

$$N_{i \prec j} \geq (1 - \epsilon)\, N_j,$$

where $\epsilon$ is a tolerance hyperparameter.
- Transitive or redundant edges are pruned: an edge $i \to j$ is discarded if its removal does not alter the earliest possible topological position of $j$ in the DAG.
- The result is a minimal dependency graph (see the pseudocode in the original source).
This produces a recovered DAG over achievements, which is then used to structure online exploration.
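A sketch of the counting-and-thresholding heuristic follows. It keeps an edge $i \to j$ when $i$ precedes $j$ in at least a $(1 - \epsilon)$ fraction of $j$'s occurrences, then applies a plain transitive reduction as a stand-in for the paper's topological-position pruning criterion; all names are illustrative:

```python
from collections import Counter, defaultdict

def extract_dependency_dag(trajectories, eps=0.05):
    """Recover candidate dependency edges from logged achievement sequences.

    trajectories : list of per-episode achievement-label sequences, ordered
                   by unlock time (labels from the achievement classifier)
    eps          : tolerance; edge i -> j kept if i precedes j in at least
                   a (1 - eps) fraction of episodes in which j occurs
    """
    occurs = Counter()    # episodes in which each achievement occurs
    precedes = Counter()  # (i, j): episodes where i is unlocked before j

    for seq in trajectories:
        first = {}
        for t, g in enumerate(seq):
            first.setdefault(g, t)        # first-unlock time in this episode
        for g in first:
            occurs[g] += 1
        for i in first:
            for j in first:
                if i != j and first[i] < first[j]:
                    precedes[(i, j)] += 1

    edges = {(i, j) for (i, j), n in precedes.items()
             if n >= (1 - eps) * occurs[j]}
    return _transitive_reduction(edges)

def _transitive_reduction(edges):
    """Prune an edge i -> j if j stays reachable from i without it."""
    reduced = set(edges)
    for edge in sorted(edges):
        reduced.discard(edge)
        if not _reachable(reduced, *edge):
            reduced.add(edge)             # edge was essential; keep it
    return reduced

def _reachable(edges, src, dst):
    adj = defaultdict(set)
    for a, b in edges:
        adj[a].add(b)
    stack, seen = [src], {src}
    while stack:
        u = stack.pop()
        if u == dst:
            return True
        nxt = adj[u] - seen
        seen |= nxt
        stack.extend(nxt)
    return False
```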
4. Hierarchical Policy and Meta-Controller
With the structured dependency graph, SEA devises a family of goal-conditioned sub-policies $\{\pi_g\}_{g \in \mathcal{G}}$, each tasked with unlocking achievement $g$. Each sub-policy trains by maximizing

$$J(\pi_g) = \mathbb{E}_{\pi_g}\!\left[ \sum_{t} \gamma^t\, r^g_t \right],$$

where $r^g_t = \mathbb{1}\!\left[ \phi(o_t, a_t, o_{t+1}) = g \right]$, i.e., reward is granted exactly when the target achievement is unlocked.
A meta-controller topologically traverses the available nodes in the dependency DAG. At the start of each episode and after each achievement is unlocked, it selects the next target achievement whose parent dependencies have all been achieved. The system can choose the next goal randomly or, with a small probability, explore alternative orders to promote broader discovery.
Once a target $g$ is chosen, control is routed to the corresponding sub-policy $\pi_g$. Upon success, the meta-controller selects a new subgoal. This hierarchical design enables systematic and data-efficient exploration, which is especially critical in sparse-reward settings.
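A goal-selection sketch for the meta-controller, assuming the DAG is given as a parent map; the `explore_prob` mechanism is one reading of the paper's "alternative orders" exploration:

```python
import random

def select_next_goal(parents, unlocked, explore_prob=0.1, rng=random):
    """Pick the next target achievement for the meta-controller.

    parents      : dict {achievement: set of parent achievements in the DAG}
    unlocked     : set of achievements already unlocked this episode
    explore_prob : small probability of targeting any locked achievement,
                   not just the feasible frontier, to broaden discovery
    """
    locked = [g for g in parents if g not in unlocked]
    if not locked:
        return None                              # every achievement unlocked
    frontier = [g for g in locked if parents[g] <= unlocked]
    if frontier and rng.random() >= explore_prob:
        return rng.choice(frontier)              # all parent dependencies met
    return rng.choice(locked)                    # occasional off-frontier goal
```

In an episode loop, the controller would call `select_next_goal` after every unlock and hand control to the matching sub-policy $\pi_g$ until it succeeds or the episode ends.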
5. Empirical Performance in Crafter and TreeMaze
Extensive evaluation was conducted in two procedurally generated domains: Crafter (2D Minecraft-like, 21 achievements) and TreeMaze (MiniGrid-style, 9–30 achievements, varying sizes) (Zhou et al., 2023). Baselines include IMPALA (vanilla and RND-augmented), PPO, DreamerV2, and HAL.
Key Quantitative Outcomes
SEA demonstrates high clustering accuracy (purity 0.99–1.00), perfectly or near-perfectly recovers achievement dependency graphs in TreeMaze, and achieves a Hamming error of 11/272 edges in full Crafter.
The table below summarizes achievement unlock rates on Crafter over the last 50M training steps (easy set: frequently unlocked achievements; hard set: rarely unlocked ones):
| Method | Easy-set % (±) | Hard-set % (±) |
|---|---|---|
| IMPALA | 93.9 (0.1) | 0.0 (0.0) |
| IMPALA+RND | 93.2 (1.1) | 0.15 (0.21) |
| PPO | 34.9 (2.2) | 0.0 (0.0) |
| DreamerV2 | 76.2 (11.2) | 0.0 (0.0) |
| HAL | 15.8 (0.8) | 0.0 (0.0) |
| SEA | 92.5 (0.5) | 49.3 (2.9) |
SEA is the only method that consistently unlocks challenging rare achievements (e.g., “collect_diamond” at ∼4% occurrence). The Crafter score, $S = \exp\!\big( \tfrac{1}{N} \sum_{i=1}^{N} \ln(1 + s_i) \big) - 1$ (a geometric mean over the $N$ per-achievement success rates $s_i$, in percent), improves from 39 (IMPALA) to 75.5 (SEA).
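For reference, this aggregate can be reproduced from per-achievement success rates in a few lines, following the Crafter benchmark's geometric-mean definition:

```python
import math

def crafter_score(success_rates_pct):
    """Crafter score: geometric mean of (1 + s_i) over the N per-achievement
    success rates s_i (in percent), minus 1."""
    n = len(success_rates_pct)
    return math.exp(sum(math.log(1.0 + s) for s in success_rates_pct) / n) - 1.0
```

Because the mean is geometric, a method that never unlocks the rare achievements is penalized heavily even if it masters all the frequent ones, which is why SEA's hard-set gains translate into a large score improvement.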
6. Strengths, Limitations, and Prospects
Strengths:
- Unsupervised recovery of invariant achievement structures in procedurally generated environments.
- Determinant-based representation learning yields highly disentangled, clusterable achievement embeddings.
- The dependency-driven meta-controller facilitates directed discovery of complex, hard-to-reach subgoals.
- Empirically, SEA is the first approach to reach the most challenging achievements in Crafter.
Limitations:
- Necessitates an initial exploration dataset containing most achievements.
- Assumes that all meaningful rewards arise exclusively from a finite set of achievements.
- The dependency graph heuristic may yield spurious edges if data coverage is insufficient or if correlations are merely accidental.
Future Directions:
- Online joint learning of representations, dependency structure, and sub-policies to unify the offline–online loop.
- Adoption of uncertainty-based methods for pruning unreliable dependencies.
- Extension to settings with partial observability or continuous achievement spaces.
- End-to-end integration with more expressive “reward machines” (e.g., temporal logic representations).
A plausible implication is that SEA’s structure discovery and hierarchical decomposition principles could extend to domains beyond classic achievement-based tasks, including those with latent or abstract compositional subgoal structure (Zhou et al., 2023).