MiniGrid DoorKey Environment
- MiniGrid DoorKey is a 2D gridworld testbed for RL that challenges agents with sparse rewards, partial observability, and sequential tasks like key retrieval and door unlocking.
- The environment enables evaluation of advanced methods such as intrinsic motivation (e.g., DoWhaM and language-based exploration), hierarchical planning, and curriculum learning.
- Researchers leverage DoorKey to benchmark sample efficiency, novelty adaptation, and safe reward shaping in both single-agent and multi-agent reinforcement learning scenarios.
The MiniGrid DoorKey Environment is a canonical reinforcement learning (RL) testbed for sequential, goal-oriented reasoning under extreme reward sparsity and partial observability. It requires an agent to retrieve a key, unlock a door, and access a goal location within a simple gridworld, forming an archetypal benchmark for the evaluation of exploration, hierarchical planning, and curriculum learning algorithms.
1. Environment Specification and Mechanics
The MiniGrid DoorKey environment, part of the MiniGrid library, is a 2D discrete gridworld defined as a partially observable Markov decision process (POMDP) with state space $\mathcal{S}$, action space $\mathcal{A}$, deterministic or stochastic transition function $\mathcal{T}$, sparse reward function $\mathcal{R}$, and an observation function $\mathcal{O}$ yielding an egocentric, locally limited field of view (Chevalier-Boisvert et al., 2023). The agent's task is to acquire a key, use it to unlock a door (via a “toggle” action), and reach a goal tile that is inaccessible without opening the door.
A typical grid configuration includes:
- Walls demarcating rooms and hallways,
- At least one locked door positioned to separate start and goal zones,
- A matching key placed in a distinct location,
- The agent starting at a random position and orientation on the key side of the wall, with the goal tile placed beyond the locked door.
The extrinsic reward is sparse and terminal: a single positive reward when the agent reaches the goal after completing the unlocking sequence ($1$ in the binary formulation; the MiniGrid default scales it as $1 - 0.9\,t/T_{\max}$ to favor shorter episodes) and $0$ otherwise. Episodes conclude either on success or after a maximum number of steps.
The core mechanics include discrete navigation (forward, left, right), object interaction primitives (pickup, drop, toggle), and partial observability (grid cells visible within a forward-oriented window, usually 5×5 or 7×7 cells—much smaller than the full map). This supports research on memory, long-horizon credit assignment, and efficient exploration.
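To ground these mechanics, the following is a minimal interaction sketch assuming the gymnasium-based `minigrid` package (`pip install minigrid`); the environment ID and observation keys are the library's standard ones.

```python
import gymnasium as gym
import minigrid  # noqa: F401  -- importing registers the MiniGrid-* environment IDs

env = gym.make("MiniGrid-DoorKey-8x8-v0")
obs, info = env.reset(seed=0)

# Observations are dicts: an egocentric, partially observable image of the cells
# in front of the agent, the agent's heading, and a textual mission string.
print(obs["image"].shape, obs["direction"], obs["mission"])

# Discrete actions: 0 turn left, 1 turn right, 2 forward, 3 pickup, 4 drop,
# 5 toggle (opens the door when facing it while carrying the key), 6 done.
obs, reward, terminated, truncated, info = env.step(env.action_space.sample())
env.close()
```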
2. Exploration and Intrinsic Motivation Algorithms
Sparse extrinsic rewards in DoorKey render naive RL algorithms ineffective due to pronounced exploration bottlenecks. To address this, several intrinsic motivation (IM) techniques have been developed and systematically evaluated:
Count-based and Curiosity Methods
- BeBold computes an intrinsic reward for transitions from visited to novel states, $r^{\text{int}}(s_t, a_t, s_{t+1}) = \max\!\big(\tfrac{1}{N(s_{t+1})} - \tfrac{1}{N(s_t)},\, 0\big)$, optionally gated by an episodic visitation-count filter (Zhang et al., 2020). This frontier-based heuristic outperforms standard bonus functions (e.g., $1/N(s)$) by forcing systematic expansion of the explored frontier rather than short-sighted novelty-seeking; a count-based sketch follows this list.
- Random Network Distillation (RND) approximates state visitation via prediction error over a random target network, but can suffer from detachment or myopic exploration (Zhang et al., 2020).
- RC-GVF (Random Curiosity with General Value Functions) extends the curiosity paradigm by rewarding both the TD error and the across-ensemble disagreement of general value functions trained to predict long-horizon random pseudo-rewards. This focuses exploration on epistemic uncertainty rather than irreducible aleatoric noise and yields robust exploration even without access to ground-truth state counts (Ramesh et al., 2022).
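As a concrete illustration of the frontier heuristic, here is a minimal BeBold-style bonus over hashed visit counts; the `encode` helper is an assumption (a stand-in state identifier), and in practice the counts would be taken over a compact state encoding rather than raw observations.

```python
from collections import defaultdict

N = defaultdict(int)          # lifelong state-visit counts
N_episode = defaultdict(int)  # episodic counts, reset at every env.reset()


def encode(obs):
    # Hypothetical state identifier: hash of the egocentric view plus heading.
    return (obs["image"].tobytes(), int(obs["direction"]))


def frontier_bonus(obs, next_obs):
    """BeBold-style intrinsic reward: positive only when moving from a
    better-visited state to a less-visited one, and only on the first
    episodic visit to the successor state."""
    s, s_next = encode(obs), encode(next_obs)
    N[s_next] += 1
    N_episode[s_next] += 1
    bonus = max(1.0 / N[s_next] - 1.0 / max(N[s], 1), 0.0)
    return bonus * float(N_episode[s_next] == 1)
```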
Action Usefulness and Semantic Exploration
- DoWhaM targets action effectiveness rather than state novelty, granting higher intrinsic bonus to actions that rarely change the environment, such as "toggle" (door opening), with
and
This focus leads to an order-of-magnitude reduction in sample complexity when unlocking doors is a bottleneck (Seurin et al., 2021).
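The sketch below implements an action-usefulness bonus in the spirit of DoWhaM, following the description above; the decay hyperparameter `ETA` and the exact normalization are assumptions rather than the paper's precise constants.

```python
import math
from collections import defaultdict

used = defaultdict(int)       # U(a): how often action a has been taken
effective = defaultdict(int)  # E(a): how often taking a actually changed the state
ETA = 50.0                    # assumed decay hyperparameter


def action_usefulness_bonus(action, state_changed, episodic_next_state_count):
    """Large bonus for rarely-effective actions (e.g., 'toggle') on exactly the
    steps where they do change the environment; zero otherwise."""
    used[action] += 1
    if not state_changed:
        return 0.0
    effective[action] += 1
    rarity = 1.0 - effective[action] / used[action]
    bonus = (ETA ** rarity - 1.0) / (ETA - 1.0)
    return bonus / math.sqrt(max(episodic_next_state_count, 1))
```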
Language-based Exploration
- L-AMIGo and L-NovelD introduce natural language as the intermediate goal space and novelty metric. Intrinsic rewards are attributed both to state transitions and to the attainment of new language-described abstractions (such as "open the red door" or "pick up the key"), with novelty estimated via Random Network Distillation over language descriptions. These methods outperform coordinate-based goal and novelty exploration by 47–85% across a wide MiniGrid/KeyCorridor benchmark suite (Mu et al., 2022); a simplified sketch follows.
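The sketch below assumes an external annotator supplies a description string for each newly achieved abstraction and uses plain visitation counts in place of the paper's RND-based estimator; it illustrates the idea rather than the exact L-NovelD formulation.

```python
import math
from collections import defaultdict

description_counts = defaultdict(int)  # lifelong counts of achieved descriptions


def language_novelty_bonus(description: str, seen_this_episode: set) -> float:
    """Reward the first attainment of a language-described abstraction (e.g.
    "open the red door") within an episode, scaled by how rarely that
    abstraction has ever been achieved across training."""
    description_counts[description] += 1
    if description in seen_this_episode:
        return 0.0
    seen_this_episode.add(description)
    return 1.0 / math.sqrt(description_counts[description])
```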
Hybrid Intrinsic Motivation
- VSIMR (VAE-based state novelty) combined with LLM-derived semantic reward signals guides agents simultaneously towards novel states and towards states that an LLM, given the environment description, scores as progressing toward the goal. The two signals are scaled and summed into a single intrinsic reward, enabling agents to balance discovery and exploitation efficiently in DoorKey (Quadros et al., 25 Aug 2025).
3. Curriculum and Environment Generation
Addressing the combinatorial difficulty of DoorKey and similar compositional tasks, automated environment/curriculum generation methods have been developed:
Automatic Curriculum Design
- CoDE employs an autoregressive generator to compose DoorKey-like tasks from subtask primitives (in a Petri-net formalism), balancing task difficulty via a multi-objective reward that combines population-based regret with a difficulty budget. The generator agent is trained online alongside the RL population, producing agents that generalize to unseen compositions and improve over strong baselines (Gur et al., 2022).
Evolutionary Curriculum Optimization
- RHEA CL uses a Rolling Horizon Evolutionary Algorithm to optimize over curricula defined as sequences of DoorKey environment variants, scoring each candidate curriculum by the discounted return it yields for the learner. The result is an automatic, non-monotonic curriculum that outperformed rule-based alternatives in DoorKey and DynamicObstacles, particularly in early-stage training (Jiwatode et al., 12 Aug 2024); a sketch of the scoring-and-selection idea follows.
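In the minimal sketch below, the stage pool, the `train_and_eval` callback (briefly train the learner on one stage and report its return), and the single sample-and-select step are simplifications standing in for the full rolling-horizon evolutionary loop.

```python
import random

# Hypothetical pool of DoorKey curriculum stages, ordered by nominal difficulty.
STAGES = [
    "MiniGrid-DoorKey-5x5-v0",
    "MiniGrid-DoorKey-6x6-v0",
    "MiniGrid-DoorKey-8x8-v0",
    "MiniGrid-DoorKey-16x16-v0",
]


def curriculum_score(curriculum, train_and_eval, gamma=0.9):
    """Discounted sum of the returns obtained while following the curriculum."""
    return sum(gamma ** i * train_and_eval(env_id)
               for i, env_id in enumerate(curriculum))


def select_next_stage(train_and_eval, horizon=3, n_candidates=8):
    """One rolling-horizon step: sample candidate curricula of length `horizon`,
    keep the best-scoring one, and execute only its first stage before re-planning."""
    candidates = [random.choices(STAGES, k=horizon) for _ in range(n_candidates)]
    best = max(candidates, key=lambda c: curriculum_score(c, train_and_eval))
    return best[0]
```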
4. Safe Reward Shaping and Policy Optimality
Potential-based shaping is necessary to prevent reward hacking or suboptimal convergence when using intrinsic motivation in DoorKey:
- PBIM (Potential-Based Intrinsic Motivation) transforms an arbitrary IM signal $r^{\text{int}}_t$ into a potential-based shaping term of the form $F(s_t, s_{t+1}) = \gamma\,\Phi(s_{t+1}) - \Phi(s_t)$. Because such terms telescope in the return, the cumulative shaped reward differs from the unshaped return only by a quantity independent of the actions taken, so the optimal policy is invariant. Both naïve and normalized variants were shown to accelerate learning and strictly prevent reward hacking in DoorKey, even for complex non-Markovian or trainable IM functions (Forbes et al., 12 Feb 2024, Forbes et al., 16 Oct 2024). The Generalized Reward Matching (GRM) framework further extends PBIM by allowing arbitrary matching functions that tune the delay and distribution of the correction terms (Forbes et al., 16 Oct 2024); a minimal sketch of the shaping arithmetic follows.
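The guarantee rests on classical potential-based shaping. In the sketch below, the potential $\Phi$ is assumed to be given; how PBIM constructs it from the intrinsic-reward stream, and how GRM generalizes the correction schedule, is specified in the cited papers.

```python
def potential_shaped_reward(r_extrinsic, phi_s, phi_s_next, gamma=0.99):
    """Add the shaping term F(s, s') = gamma * Phi(s') - Phi(s) to the extrinsic
    reward. Summed along a trajectory these terms telescope to a quantity that
    does not depend on the actions taken, so the optimal policy is preserved."""
    return r_extrinsic + gamma * phi_s_next - phi_s
```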
5. Response to Novelty and Adaptation Metrics
The introduction of controlled novelty into DoorKey via the NovGrid framework permits rigorous benchmarking of novelty adaptation (Balloch et al., 2022). For example, novelties such as swapping which key opens the door or altering the number of doors (barrier, delta, and shortcut novelties) are straightforward to implement as OpenAI Gym wrappers (a wrapper sketch follows the metric list below). The following adaptation metrics are used:
- Resilience: Immediate post-novelty drop compared to random performance.
- One-Shot Adaptive Performance: Reward after a single post-novelty episode.
- Asymptotic Adaptive Performance: Performance after sufficient retraining in post-novelty.
- Adaptive Efficiency: Steps to adaptation.
Baseline PPO agents in DoorKey show substantial loss of performance and slow recovery after even mild object-novelty (e.g., door-key mapping changes), indicating the need for explicit novelty adaptation mechanisms.
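As an illustration of wrapper-based novelty injection, the sketch below swaps in a post-novelty environment at an episode boundary once a global step budget is passed; the actual NovGrid wrappers edit grid objects (keys, doors) in place, so this is only an approximation of their mechanism.

```python
import gymnasium as gym


class NoveltyInjectionWrapper(gym.Wrapper):
    """Switch from a pre-novelty to a post-novelty environment after
    `novelty_step` total environment steps (effective at the next reset)."""

    def __init__(self, pre_novelty_env, post_novelty_env_fn, novelty_step):
        super().__init__(pre_novelty_env)
        self.post_novelty_env_fn = post_novelty_env_fn  # zero-arg environment factory
        self.novelty_step = novelty_step
        self.total_steps = 0
        self.switched = False

    def step(self, action):
        self.total_steps += 1
        return self.env.step(action)

    def reset(self, **kwargs):
        if not self.switched and self.total_steps >= self.novelty_step:
            # e.g., a DoorKey variant with a different key-door mapping
            self.env = self.post_novelty_env_fn()
            self.switched = True
        return self.env.reset(**kwargs)
```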
6. Model-Based, Programmatic, and LLM-Based Planning Approaches
Structured policy synthesis and symbolic world modeling represent a distinct direction:
- WorldCoder creates an explicit, LLM-synthesized Python program as a world model (transition and reward functions) for DoorKey, imposing the constraints that the program be consistent with the observed transition data and optimistic that reward remains achievable. This approach achieves dramatically improved sample efficiency (orders of magnitude fewer environment interactions than deep RL) and supports knowledge transfer across gridworlds by reusing world-model components (Tang et al., 19 Feb 2024); a hand-written stand-in for such a world model is sketched after this list.
- Iterative Programmatic Planning (IPP) leverages LLMs to synthesize and iteratively refine agent policy code for DoorKey. The framework employs feedback-driven code updating: after executing on batches of DoorKey instances, the lowest-performing cases drive LLM-based code improvement. This yields average reward improvements over direct code generation on DoorKey for both GPT-o3-mini and Claude 3.7 (17% for the latter), with large amortized computational and financial advantages due to policy-code reuse (Aravindan et al., 15 May 2025).
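To make the program-as-world-model idea concrete, below is a hand-written stand-in for a simplified DoorKey abstraction; the layout constants and action encoding are assumptions, and WorldCoder synthesizes and repairs such programs from interaction data rather than relying on hand-coding.

```python
from dataclasses import dataclass, replace

# Hypothetical fixed layout for the sketch: a dividing wall with one locked door.
KEY_POS, DOOR_POS, GOAL_POS = (1, 2), (3, 2), (6, 2)
WALLS = {(3, y) for y in range(6) if (3, y) != DOOR_POS}


@dataclass(frozen=True)
class State:
    agent: tuple
    has_key: bool = False
    door_open: bool = False


def manhattan(a, b):
    return abs(a[0] - b[0]) + abs(a[1] - b[1])


def transition(s: State, action) -> State:
    """Symbolic DoorKey dynamics: pick the key up on its tile, toggle the door
    when adjacent to it while holding the key, and treat a closed door as a wall."""
    if action == "pickup" and s.agent == KEY_POS:
        return replace(s, has_key=True)
    if action == "toggle" and s.has_key and manhattan(s.agent, DOOR_POS) == 1:
        return replace(s, door_open=True)
    if isinstance(action, tuple):                        # (dx, dy) movement
        nxt = (s.agent[0] + action[0], s.agent[1] + action[1])
        blocked = nxt in WALLS or (nxt == DOOR_POS and not s.door_open)
        return s if blocked else replace(s, agent=nxt)
    return s


def reward(s: State, action, s_next: State) -> float:
    return 1.0 if s_next.agent == GOAL_POS else 0.0
```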
7. Multi-Agent and Hierarchical Models
Extensions beyond single-agent, flat policy learning include:
- Agent-Time Attention (ATA): For multi-agent DoorKey variants, ATA uses a transformer to jointly attend over agents and time, redistributing sparse team rewards into dense, agent-specific signals, which greatly benefits coordinated, temporally complex tasks (She et al., 2022).
- Hierarchical Active Inference: For navigation tasks involving multiple rooms and doors, agents employ a three-level cognitive map: context (topological), place (allocentric), and motion (egocentric). Policies are selected by minimizing expected free energy over this hierarchy, enabling efficient, aliasing-robust navigation and segmentation in maze-like DoorKey variants (Tinguy et al., 2023).
Table: Summary of Key Exploration Methods in DoorKey
| Method | Core Mechanism | Empirical Effect in DoorKey |
|---|---|---|
| BeBold | Frontier intrinsic reward | State of the art without a curriculum; efficient search |
| DoWhaM | Action-usefulness bonus | Focused, low-sample door discovery |
| L-AMIGo / L-NovelD | Language-goal / novelty bonus | +47–85% over coordinate-based novelty; interpretable curriculum |
| PBIM / GRM | Potential-based shaping of IM | Preserves optimality; faster convergence |
| RHEA CL | Evolutionary curriculum scheduling | Improved early and final DoorKey performance |
| IPP / WorldCoder | LLM-based code synthesis | High sample efficiency; transfer via code reuse |
| RC-GVF | Multi-step curiosity ensemble | Strong exploration without access to counts |
8. Challenges and Open Directions
Several limitations and points of active research are evident:
- Many count-based exploration methods require approximations or surrogates in large or partially observed state spaces; panoramic observations, episodic RND, or language annotations can help.
- Maintaining optimality under intrinsic shaping is nontrivial—PBIM/GRM supply theoretical guarantees but broader extensions (e.g., to trainable/continuous IM, large-scale language-driven rewards) are still emerging (Forbes et al., 12 Feb 2024, Forbes et al., 16 Oct 2024).
- In model-based/transfer settings, policy representations that unify code synthesis, LLM-knowledge, and powerful RL are still under exploration for general gridworld and DoorKey transfer (Tang et al., 19 Feb 2024, Aravindan et al., 15 May 2025).
The MiniGrid DoorKey environment continues to serve as a premier benchmark for advancing RL algorithms in sparse reward, compositional, and transfer settings, with state-of-the-art approaches increasingly focused on sophisticated exploration signals, curriculum learning, and principled optimization of reward shaping.