
Structured Exploration with Achievements (SEA)

Updated 8 March 2026
  • SEA is a multi-stage reinforcement learning algorithm for achievement-based environments that structures exploration using offline data and a determinant loss.
  • It employs goal-conditioned sub-policies and a meta-controller to recover dependency graphs and systematically unlock achievements from sparse rewards.
  • Empirical evaluations in Crafter and TreeMaze demonstrate SEA's ability to consistently unlock rare achievements and outperform traditional RL methods.

Structured Exploration with Achievements (SEA) is a multi-stage reinforcement learning (RL) algorithm designed for exploration in environments with a discrete internal achievement set, particularly in the presence of sparse rewards. It leverages offline data to disentangle and structure the achievement space, recovers a dependency graph of achievements, and uses goal-conditioned sub-policies coordinated by a meta-controller to drive systematic exploration and mastery of complex domains with high-dimensional observations (Zhou et al., 2023).

1. Achievement-Based Environments: Formalization

SEA is built for environments formalized as Markov Decision Processes with Achievements (MDPA), specified by the tuple

$$(\,\mathcal S,\,\mathcal A,\,T,\,\Gamma,\,G\,)$$

where:

  • $\mathcal S$ is the observation space (e.g., pixel images),
  • $\mathcal A$ is the discrete or continuous action space,
  • $T:\mathcal S\times\mathcal A\to\Delta(\mathcal S)$ is the transition kernel,
  • $\Gamma=\{\gamma_1, \ldots, \gamma_N\}$ denotes the finite set of achievement labels,
  • $G:\mathcal S\times\mathcal A\times\mathcal S\to\Gamma\cup\{\emptyset\}$ is the achievement-completion function, mapping each transition to an unlocked achievement or “none.”

Reward at time $t$ is delivered only for the first unlock of each achievement per episode, given by

$$r_t = \mathbf{1}\{G(s_t, a_t, s_{t+1})\neq\emptyset\} \times \prod_{t'<t} \mathbf{1}\{G(s_{t'}, a_{t'}, s_{t'+1})\neq G(s_t, a_t, s_{t+1})\}$$

To restore the Markov property, the environment can be augmented with the set of achievements unlocked so far: $\mathcal S^* = \mathcal S\times 2^\Gamma$.
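The first-unlock reward and the augmented state $\mathcal S^*$ can be made concrete with a small wrapper. This is a minimal sketch, not the paper's implementation: the environment interface and the completion function `G` are hypothetical stand-ins.

```python
class AchievementRewardWrapper:
    """Tracks the set of achievements unlocked so far (the 2^Gamma part of
    the augmented state S* = S x 2^Gamma) and pays reward 1 only on the
    first unlock of each achievement per episode."""

    def __init__(self, env, G):
        self.env = env        # underlying environment (hypothetical interface)
        self.G = G            # completion function: (s, a, s') -> label or None
        self.unlocked = set()

    def reset(self):
        self.unlocked = set()
        s = self.env.reset()
        return (s, frozenset(self.unlocked))

    def step(self, s, a):
        s_next = self.env.step(s, a)
        label = self.G(s, a, s_next)
        # Reward only if this transition unlocks a new achievement this episode.
        r = 1.0 if (label is not None and label not in self.unlocked) else 0.0
        if label is not None:
            self.unlocked.add(label)
        return (s_next, frozenset(self.unlocked)), r
```

Because the unlocked set is part of the returned state, a policy conditioned on it sees a Markovian observation even though the raw reward depends on episode history.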

2. Offline Representation Learning Using Determinant Loss

SEA’s first phase uses offline trajectories $\{\tau\}$ to build an achievement-disentangled representation. The goal is to learn a function $\phi_\theta:\mathcal S \times \mathcal A \times \mathcal S\to\mathbb R^W$ such that transitions unlocking distinct achievements in a single episode yield maximally separated (near-orthogonal) embeddings.

This is accomplished by the determinantal loss. For each trajectory, collect the set $U_\tau$ of time steps with $r_t = 1$. For pairs $(t_i, t_j) \in U_\tau$:

  • Build $D_\tau$ where $(D_\tau)_{ij} = \|\,\phi_\theta(s_{t_i}, a_{t_i}, s_{t_i+1}) - \phi_\theta(s_{t_j}, a_{t_j}, s_{t_j+1})\|_2$,
  • Normalize: $\bar D_\tau = D_\tau / \max_{i,j}(D_\tau)_{ij}$,
  • Minimize

$$\mathcal L_{\rm achv}(\theta) = \mathbb E_{\tau}\left[-\det\left(\exp(-k \bar D_\tau)\right)\right]$$

for a temperature constant $k>0$. This encourages the embeddings for different achievements within an episode to be mutually orthogonal, preventing collapse or clustering. The closed-form gradient is given in the original source, facilitating direct optimization (Zhou et al., 2023).
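Under the definitions above, the per-trajectory loss can be sketched in a few lines of NumPy; `phi` stands in for the already-computed embeddings $\phi_\theta(\cdot)$ of the reward-carrying transitions, and the small epsilon guarding the division is an illustrative detail.

```python
import numpy as np

def determinant_loss(phi, k=1.0):
    """Determinantal loss for one trajectory.

    phi: (n, W) array, one embedding per reward-carrying transition.
    Returns a scalar: 0 when all embeddings collapse to one point,
    approaching -1 as they become maximally separated.
    """
    # Pairwise L2 distances: (D)_ij = ||phi_i - phi_j||_2
    diff = phi[:, None, :] - phi[None, :, :]
    D = np.linalg.norm(diff, axis=-1)
    # Normalize by the largest distance so entries lie in [0, 1]
    D_bar = D / (D.max() + 1e-8)
    # Minimize minus the determinant of the similarity kernel exp(-k * D_bar)
    return -np.linalg.det(np.exp(-k * D_bar))
```

Collapsed embeddings give an all-ones (singular) kernel with determinant 0, while well-separated embeddings give a near-identity kernel with determinant near 1, so minimizing the loss pushes embeddings apart.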

3. Achievement Dependency Graph Extraction

After learning the representation, clusters of achievement transitions (via an achievement classifier $AC$) provide achievement labels for transitions. Exploiting the ordering structure within logged trajectories, SEA recovers a minimal directed acyclic graph (DAG) representing achievement dependencies via a heuristic:

  • For each trajectory, count the number of times each achievement $j$ occurs ($\text{Happen}[j]$), and for each pair $i, j$, count the occurrences where $i$ precedes $j$ ($\text{Before}[i,j]$).
  • An edge $i \rightarrow j$ is kept in the initial graph if

$$\frac{\text{Before}[i,j]}{\text{Happen}[j]} > 1 - \epsilon$$

where $\epsilon$ is a tolerance hyperparameter.

  • Transitive or redundant edges are pruned: an edge $i \rightarrow j$ is discarded if its removal does not alter the earliest possible topological position of $j$ in the DAG.
  • The result is a minimal dependency graph (pseudocode appears in the original source).

This produces a recovered DAG over achievements, which is then used to structure online exploration.
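The counting heuristic above can be sketched as follows, representing each trajectory as the ordered list of achievement labels it unlocks. The names are illustrative, only first unlocks are counted here, and the transitive-edge pruning step is omitted.

```python
from collections import defaultdict

def recover_edges(trajectories, eps=0.05):
    """Return the initial edge set {(i, j)} where i -> j is kept because
    i precedes j in more than a (1 - eps) fraction of the episodes in
    which j occurs."""
    happen = defaultdict(int)    # Happen[j]: episodes where j occurs
    before = defaultdict(int)    # Before[(i, j)]: episodes where i precedes j
    for traj in trajectories:
        seen = []
        for j in traj:
            if j in seen:        # count first unlocks only
                continue
            happen[j] += 1
            for i in seen:
                before[(i, j)] += 1
            seen.append(j)
    return {(i, j) for (i, j), c in before.items()
            if c / happen[j] > 1 - eps}
```

With Crafter-style labels, an edge like `("wood", "table")` survives only when collecting wood almost always precedes crafting a table, which is exactly the dependency the heuristic is meant to capture.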

4. Hierarchical Policy and Meta-Controller

With the structured dependency graph, SEA devises a family of goal-conditioned sub-policies $\sigma_\omega(a\mid s, g)$, each tasked with unlocking achievement $g\in\{1, \ldots, N\}$. Each sub-policy trains by maximizing

$$J_g(\omega) = \mathbb E\Bigl[\sum_{t=0}^\infty \gamma^t\,r^g_t\Bigr],$$

where $r^g_t = \mathbf{1}\{AC(s_t,a_t,s_{t+1}) = g\}$.

A meta-controller $\pi_{\rm meta}(g \mid s)$ topologically traverses the available nodes in the dependency DAG. At the start of each episode and after each achievement is unlocked, it selects the next target achievement whose parent dependencies have all been achieved. The system can choose the next goal randomly or, with a small probability, explore alternative orders to promote broader discovery.

Once a target is chosen, control is routed to the corresponding $\sigma_\omega$. Upon success, the meta-controller selects a new subgoal. This hierarchical design enables systematic and data-efficient exploration, which is especially critical in sparse-reward settings.
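The goal-selection step can be sketched minimally, assuming the DAG is stored as a parent map; the `explore_p` mechanism is an illustrative stand-in for the occasional exploration of alternative orders described above.

```python
import random

def select_goal(parents, unlocked, explore_p=0.1, rng=random):
    """Pick the next target achievement.

    parents: dict mapping each achievement to the list of its parent
             achievements in the recovered DAG.
    unlocked: set of achievements already unlocked this episode.
    """
    # Candidates: not yet unlocked, with every parent already satisfied.
    available = [g for g, ps in parents.items()
                 if g not in unlocked and all(p in unlocked for p in ps)]
    if not available:
        return None                   # everything reachable is unlocked
    if rng.random() < explore_p:
        return rng.choice(available)  # occasionally try alternative orders
    return available[0]
```

Control would then be handed to the sub-policy conditioned on the returned goal until it succeeds, after which selection repeats with the enlarged unlocked set.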

5. Empirical Performance in Crafter and TreeMaze

Extensive evaluation was conducted in two procedurally generated domains: Crafter (2D Minecraft-like with 21 achievements) and TreeMaze (minigrid-style, 9–30 achievements, varying sizes) (Zhou et al., 2023). Baselines include IMPALA (vanilla and RND-augmented), PPO, DreamerV2, and HAL.

Key Quantitative Outcomes

SEA demonstrates high clustering accuracy (purity 0.99–1.00), perfectly or near-perfectly recovers achievement dependency graphs in TreeMaze, and achieves a Hamming error of 11/272 edges in full Crafter.

The table below summarizes achievement unlocking rates on Crafter (easy-set: frequent, hard-set: infrequent achievements, last 50M steps):

Method       Easy-set % (±)   Hard-set % (±)
IMPALA       93.9 (0.1)       0.0 (0.0)
IMPALA+RND   93.2 (1.1)       0.15 (0.21)
PPO          34.9 (2.2)       0.0 (0.0)
DreamerV2    76.2 (11.2)      0.0 (0.0)
HAL          15.8 (0.8)       0.0 (0.0)
SEA          92.5 (0.5)       49.3 (2.9)

SEA is the only method that consistently unlocks challenging rare achievements (e.g., “collect_diamond” at ∼4% occurrence). The Crafter score, $S=\exp\bigl(\frac1N \sum_i \ln(1+s_i)\bigr)-1$, improves from 39 (IMPALA) up to 75.5 (SEA).
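The score formula takes per-achievement success rates $s_i$ in percent, so a method that always succeeds on every achievement scores 100, while a single rare achievement drags the score down far more than an arithmetic mean would; a minimal sketch:

```python
import math

def crafter_score(success_rates):
    """Crafter score S = exp(mean(ln(1 + s_i))) - 1,
    with success rates s_i given in percent (0-100)."""
    n = len(success_rates)
    return math.exp(sum(math.log(1 + s) for s in success_rates) / n) - 1
```

For example, two achievements at 100% and 0% average to 50 arithmetically but score only about 9 here, which is why methods that never touch the hard set score poorly despite high easy-set rates.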

6. Strengths, Limitations, and Prospects

Strengths:

  • Unsupervised recovery of invariant achievement structures in procedurally generated environments.
  • Determinant-based representation learning yields highly disentangled, clusterable achievement embeddings.
  • The dependency-driven meta-controller facilitates directed discovery of complex, hard-to-reach subgoals.
  • Empirically, SEA is the first approach to reach the most challenging achievements in Crafter.

Limitations:

  • Necessitates an initial exploration dataset containing most achievements.
  • Assumes that all meaningful rewards arise exclusively from a finite set of achievements.
  • The dependency graph heuristic may yield spurious edges if data coverage is insufficient or if correlations are merely accidental.

Future Directions:

  • Online joint learning of representations, dependency structure, and sub-policies to unify the offline–online loop.
  • Adoption of uncertainty-based methods for pruning unreliable dependencies.
  • Extension to settings with partial observability or continuous achievement spaces.
  • End-to-end integration with more expressive “reward machines” (e.g., temporal logic representations).

A plausible implication is that SEA’s structure discovery and hierarchical decomposition principles could extend to domains beyond classic achievement-based tasks, including those with latent or abstract compositional subgoal structure (Zhou et al., 2023).

References

1. Zhou et al., 2023.