Structured Exploration with Achievements (SEA)
- SEA is a multi-stage reinforcement learning algorithm for achievement-based environments that structures exploration using offline data and a determinant loss.
- It employs goal-conditioned sub-policies and a meta-controller to recover dependency graphs and systematically unlock achievements from sparse rewards.
- Empirical evaluations in Crafter and TreeMaze demonstrate SEA's ability to consistently unlock rare achievements and outperform traditional RL methods.
Structured Exploration with Achievements (SEA) is a multi-stage reinforcement learning (RL) algorithm designed for exploration in environments with a discrete internal achievement set, particularly in the presence of sparse rewards. It leverages offline data to disentangle and structure the achievement space, recovers a dependency graph of achievements, and uses goal-conditioned sub-policies coordinated by a meta-controller to drive systematic exploration and mastery of complex domains with high-dimensional observations (Zhou et al., 2023).
1. Achievement-Based Environments: Formalization
SEA is built for environments formalized as Markov Decision Processes with Achievements (MDPA), specified by the tuple

$$\mathcal{M} = (\mathcal{O}, \mathcal{A}, P, \mathcal{G}, \phi),$$

where:
- $\mathcal{O}$ is the observation space (e.g., pixel images),
- $\mathcal{A}$ is the discrete or continuous action space,
- $P(o_{t+1} \mid o_t, a_t)$ is the transition kernel,
- $\mathcal{G}$ denotes the finite set of achievement labels,
- $\phi : \mathcal{O} \times \mathcal{A} \times \mathcal{O} \to \mathcal{G} \cup \{\varnothing\}$ is the achievement-completion function, mapping transitions to unlocked achievements or “none.”

Reward at time $t$ is delivered only for the first unlock of each achievement per episode, given by

$$r_t = \mathbb{1}\!\left[\, g_t \neq \varnothing \,\wedge\, g_t \notin U_t \,\right], \qquad g_t = \phi(o_t, a_t, o_{t+1}),$$

where $U_t \subseteq \mathcal{G}$ is the set of achievements unlocked before time $t$. To restore the Markov property, the environment can be augmented with the set of achievements unlocked so far: $\tilde{o}_t = (o_t, U_t)$.
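The sketch below illustrates this first-unlock reward and state augmentation as an environment wrapper. The class, the Gym-style `step` signature, and the `info["achievement"]` convention for reporting $\phi$'s output are illustrative assumptions, not the paper's implementation:

```python
class AchievementRewardWrapper:
    """Minimal MDPA-style wrapper (illustrative): pays reward only on the
    first unlock of each achievement per episode and augments observations
    with the unlocked-set U_t to restore the Markov property."""

    def __init__(self, env):
        # Assumes a Gym-like env whose info dict reports the achievement
        # (if any) completed by the last transition.
        self.env = env
        self.unlocked = frozenset()

    def reset(self):
        obs = self.env.reset()
        self.unlocked = frozenset()
        return (obs, self.unlocked)                 # augmented obs (o_t, U_t)

    def step(self, action):
        obs, _, done, info = self.env.step(action)  # inner reward discarded
        g = info.get("achievement")                 # phi(o_t, a_t, o_{t+1}) or None
        reward = 1.0 if (g is not None and g not in self.unlocked) else 0.0
        if g is not None:
            self.unlocked = self.unlocked | {g}     # record the first unlock
        return (obs, self.unlocked), reward, done, info
```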
2. Offline Representation Learning Using Determinant Loss
SEA’s first phase uses offline trajectories to build an achievement-disentangled representation. The goal is to learn an embedding function $f_\theta : \mathcal{O} \times \mathcal{A} \times \mathcal{O} \to \mathbb{R}^d$ such that transitions unlocking distinct achievements in a single episode yield maximally separated (near-orthogonal) embeddings.

This is accomplished by the determinant loss. For each trajectory, collect the time steps $t_1 < \dots < t_k$ with $g_{t_i} \neq \varnothing$; the pairwise inner products of their embeddings form a Gram matrix:
- Build $Z = [\, z_1, \dots, z_k \,]$, where $z_i = f_\theta(o_{t_i}, a_{t_i}, o_{t_i + 1})$,
- Normalize: $\bar{z}_i = z_i / \lVert z_i \rVert_2$,
- Minimize

$$\mathcal{L}_{\det} = -\log \det\!\left( \bar{Z}^\top \bar{Z} + \tau I \right)$$

for a temperature constant $\tau > 0$. This encourages the embeddings for different achievements within episodes to be mutually orthogonal, preventing collapse or clustering. The closed-form gradient is given in the original source, facilitating direct optimization (Zhou et al., 2023).
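A minimal PyTorch sketch of this objective, assuming the ridge-regularized Gram-determinant form written above (the exact parameterization in the paper may differ); it takes all $k$ unlocking-transition embeddings of one episode at once:

```python
import torch
import torch.nn.functional as F

def determinant_loss(z: torch.Tensor, tau: float = 1e-2) -> torch.Tensor:
    """Determinant loss for one episode's achievement-unlocking transitions.

    z   : (k, d) embeddings z_i = f_theta(o_{t_i}, a_{t_i}, o_{t_i + 1})
    tau : temperature/ridge constant keeping the Gram matrix well-conditioned

    det(Z_bar Z_bar^T) is maximal exactly when the normalized embeddings are
    mutually orthogonal, so minimizing -log det pushes them apart.
    """
    z_bar = F.normalize(z, dim=1)                   # unit-norm rows
    gram = z_bar @ z_bar.T                          # (k, k), ones on diagonal
    eye = torch.eye(gram.shape[0], device=z.device, dtype=z.dtype)
    return -torch.logdet(gram + tau * eye)
```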
3. Achievement Dependency Graph Extraction
After learning the representation, clusters of achievement transitions (using an achievement classifier $h$ fit on the embeddings) provide achievement labels for transitions. Exploiting the ordering structure within logged trajectories, SEA recovers a minimal directed acyclic graph (DAG) representing achievement dependencies via a heuristic:
- For each trajectory, count the number of times each achievement occurs ($N_j$ for achievement $j$), and for each pair $(i, j)$, count the occurrences $N_{i \prec j}$ in which $i$ precedes $j$.
- An edge $i \to j$ is kept in the initial graph if

$$N_{i \prec j} \geq (1 - \epsilon)\, N_j,$$

where $\epsilon$ is a tolerance hyperparameter.
- Transitive or redundant edges are pruned: an edge $i \to j$ is discarded if its removal does not alter the earliest possible topological position of $j$ in the DAG.
- The result is a minimal dependency graph (see the pseudocode in the original source).
This produces a recovered DAG over achievements, which is then used to structure online exploration.
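A sketch of the counting-and-thresholding heuristic follows. It keeps an edge $i \to j$ when $i$ precedes $j$ in at least a $(1 - \epsilon)$ fraction of $j$'s occurrences, then applies a plain transitive reduction as a stand-in for the paper's topological-position pruning criterion; all names are illustrative:

```python
from collections import Counter, defaultdict

def extract_dependency_dag(trajectories, eps=0.05):
    """Recover candidate dependency edges from logged achievement sequences.

    trajectories : list of per-episode achievement-label sequences, ordered
                   by unlock time (labels from the achievement classifier)
    eps          : tolerance; edge i -> j kept if i precedes j in at least
                   a (1 - eps) fraction of episodes in which j occurs
    """
    occurs = Counter()    # episodes in which each achievement occurs
    precedes = Counter()  # (i, j): episodes where i is unlocked before j

    for seq in trajectories:
        first = {}
        for t, g in enumerate(seq):
            first.setdefault(g, t)        # first-unlock time in this episode
        for g in first:
            occurs[g] += 1
        for i in first:
            for j in first:
                if i != j and first[i] < first[j]:
                    precedes[(i, j)] += 1

    edges = {(i, j) for (i, j), n in precedes.items()
             if n >= (1 - eps) * occurs[j]}
    return _transitive_reduction(edges)

def _transitive_reduction(edges):
    """Prune an edge i -> j if j stays reachable from i without it."""
    reduced = set(edges)
    for edge in sorted(edges):
        reduced.discard(edge)
        if not _reachable(reduced, *edge):
            reduced.add(edge)             # edge was essential; keep it
    return reduced

def _reachable(edges, src, dst):
    adj = defaultdict(set)
    for a, b in edges:
        adj[a].add(b)
    stack, seen = [src], {src}
    while stack:
        u = stack.pop()
        if u == dst:
            return True
        nxt = adj[u] - seen
        seen |= nxt
        stack.extend(nxt)
    return False
```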
4. Hierarchical Policy and Meta-Controller
With the structured dependency graph, SEA devises a family of goal-conditioned sub-policies $\{\pi_g\}_{g \in \mathcal{G}}$, each tasked with unlocking achievement $g$. Each sub-policy trains by maximizing

$$J(\pi_g) = \mathbb{E}_{\pi_g}\!\left[ \sum_{t} \gamma^t\, r^g_t \right],$$

where $r^g_t = \mathbb{1}\!\left[ \phi(o_t, a_t, o_{t+1}) = g \right]$, i.e., reward is granted exactly when the target achievement is unlocked.
A meta-controller topologically traverses the available nodes in the dependency DAG. At the start of each episode and after each achievement is unlocked, it selects the next target achievement whose parent dependencies have all been achieved. The system can choose the next goal randomly or, with a small probability, explore alternative orders to promote broader discovery.
Once a target $g$ is chosen, control is routed to the corresponding sub-policy $\pi_g$. Upon success, the meta-controller selects a new subgoal. This hierarchical design enables systematic and data-efficient exploration, which is especially critical in sparse-reward settings.
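A goal-selection sketch for the meta-controller, assuming the DAG is given as a parent map; the `explore_prob` mechanism is one reading of the paper's "alternative orders" exploration:

```python
import random

def select_next_goal(parents, unlocked, explore_prob=0.1, rng=random):
    """Pick the next target achievement for the meta-controller.

    parents      : dict {achievement: set of parent achievements in the DAG}
    unlocked     : set of achievements already unlocked this episode
    explore_prob : small probability of targeting any locked achievement,
                   not just the feasible frontier, to broaden discovery
    """
    locked = [g for g in parents if g not in unlocked]
    if not locked:
        return None                              # every achievement unlocked
    frontier = [g for g in locked if parents[g] <= unlocked]
    if frontier and rng.random() >= explore_prob:
        return rng.choice(frontier)              # all parent dependencies met
    return rng.choice(locked)                    # occasional off-frontier goal
```

In an episode loop, the controller would call `select_next_goal` after every unlock and hand control to the matching sub-policy $\pi_g$ until it succeeds or the episode ends.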
5. Empirical Performance in Crafter and TreeMaze
Extensive evaluation was conducted in two procedurally generated domains: Crafter (2D Minecraft-like, 21 achievements) and TreeMaze (MiniGrid-style, 9–30 achievements, varying sizes) (Zhou et al., 2023). Baselines include IMPALA (vanilla and RND-augmented), PPO, DreamerV2, and HAL.
Key Quantitative Outcomes
SEA demonstrates high clustering accuracy (purity 0.99–1.00), perfectly or near-perfectly recovers achievement dependency graphs in TreeMaze, and achieves a Hamming error of 11/272 edges in full Crafter.
The table below summarizes achievement unlock rates on Crafter over the last 50M training steps (easy set: frequently unlocked achievements; hard set: rarely unlocked ones):
| Method | Easy-set % (±) | Hard-set % (±) |
|---|---|---|
| IMPALA | 93.9 (0.1) | 0.0 (0.0) |
| IMPALA+RND | 93.2 (1.1) | 0.15 (0.21) |
| PPO | 34.9 (2.2) | 0.0 (0.0) |
| DreamerV2 | 76.2 (11.2) | 0.0 (0.0) |
| HAL | 15.8 (0.8) | 0.0 (0.0) |
| SEA | 92.5 (0.5) | 49.3 (2.9) |
SEA is the only method that consistently unlocks challenging rare achievements (e.g., “collect_diamond” at ∼4% occurrence). The Crafter score, $S = \exp\!\big( \tfrac{1}{N} \sum_{i=1}^{N} \ln(1 + s_i) \big) - 1$ (a geometric mean over the $N$ per-achievement success rates $s_i$, in percent), improves from 39 (IMPALA) to 75.5 (SEA).
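For reference, this aggregate can be reproduced from per-achievement success rates in a few lines, following the Crafter benchmark's geometric-mean definition:

```python
import math

def crafter_score(success_rates_pct):
    """Crafter score: geometric mean of (1 + s_i) over the N per-achievement
    success rates s_i (in percent), minus 1."""
    n = len(success_rates_pct)
    return math.exp(sum(math.log(1.0 + s) for s in success_rates_pct) / n) - 1.0
```

Because the mean is geometric, a method that never unlocks the rare achievements is penalized heavily even if it masters all the frequent ones, which is why SEA's hard-set gains translate into a large score improvement.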
6. Strengths, Limitations, and Prospects
Strengths:
- Unsupervised recovery of invariant achievement structures in procedurally generated environments.
- Determinant-based representation learning yields highly disentangled, clusterable achievement embeddings.
- The dependency-driven meta-controller facilitates directed discovery of complex, hard-to-reach subgoals.
- Empirically, SEA is the first approach to reach the most challenging achievements in Crafter.
Limitations:
- Necessitates an initial exploration dataset containing most achievements.
- Assumes that all meaningful rewards arise exclusively from a finite set of achievements.
- The dependency graph heuristic may yield spurious edges if data coverage is insufficient or if correlations are merely accidental.
Future Directions:
- Online joint learning of representations, dependency structure, and sub-policies to unify the offline–online loop.
- Adoption of uncertainty-based methods for pruning unreliable dependencies.
- Extension to settings with partial observability or continuous achievement spaces.
- End-to-end integration with more expressive “reward machines” (e.g., temporal logic representations).
A plausible implication is that SEA’s structure discovery and hierarchical decomposition principles could extend to domains beyond classic achievement-based tasks, including those with latent or abstract compositional subgoal structure (Zhou et al., 2023).