NetHack Learning Environment (NLE)
- NetHack Learning Environment is a complex RL benchmark that features a procedurally generated, stochastic setting with vast state and action spaces.
- It offers a gym-compatible API delivering multimodal observations to support research in reinforcement learning, imitation learning, and hierarchical planning.
- NLE enables exploration of sample efficiency, transfer learning, and hybrid neural-symbolic architectures while simulating long-horizon, multi-stage tasks.
The NetHack Learning Environment (NLE) is a highly complex, extensible benchmark for reinforcement learning (RL), imitation learning, skill discovery, and agent generalization research. It builds upon the open-source NetHack 3.6.6 game, providing a procedurally generated, stochastic, long-horizon environment with a combinatorially vast state and action space. NLE is used as both a scientific RL environment and a venue for benchmark competitions, such as the NeurIPS 2021 NetHack Challenge. It has catalyzed new research on sample efficiency, hierarchical planning, knowledge transfer, LLM-based skill composition, and hybrid neural-symbolic architectures.
1. Environment Architecture and Formal Specification
NLE interfaces directly with the NetHack engine, exposing a gym-compatible API and delivering rich, multimodal observations at each discrete time step. Each observation comprises:
- A 21×79 glyph matrix of per-cell glyph identifiers encoding the dungeon map,
- ASCII codes and color matrices,
- "blstats": a vector of scalar agent statistics (HP, experience level, hunger state, etc.),
- The most recent in-game message string and inventory information,
- Optionally, image-based "pixel" rendering and auxiliary encodings (e.g., egocentric crop).
The discrete action space mirrors the full NetHack keyboard, representing up to 121 atomic keystrokes (move, eat, apply, cast, engrave, etc.), depending on the configuration. High-level wrappers translate multi-step key sequences, resolve in-game prompts, and ensure deterministic reproducibility through PRNG seeding (Küttler et al., 2020, Hambro et al., 2022, Piterbarg et al., 2023).
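The interaction pattern described above can be sketched with a self-contained stub whose observation shapes mirror NLE's; the real environment comes from the `nle` package (e.g. via `gym.make("NetHackScore-v0")`), and the class, field sizes, and glyph-count bound used here are illustrative stand-ins:

```python
import numpy as np

class StubNetHackEnv:
    """Illustrative stub with NLE-shaped observations: 21x79 map layers plus blstats."""
    def __init__(self, num_actions=121, seed=0):
        self.num_actions = num_actions
        self.rng = np.random.default_rng(seed)
        self.score = 0

    def _obs(self):
        return {
            "glyphs": self.rng.integers(0, 5976, size=(21, 79), dtype=np.int64),  # glyph IDs (bound illustrative)
            "chars": np.full((21, 79), ord("."), dtype=np.uint8),   # ASCII codes
            "colors": np.zeros((21, 79), dtype=np.uint8),
            "blstats": np.zeros(25, dtype=np.int64),                # HP, XP, hunger, ... (length illustrative)
            "message": np.zeros(256, dtype=np.uint8),               # last game message
        }

    def reset(self):
        self.score = 0
        return self._obs()

    def step(self, action):
        # Toy dynamics: score occasionally ticks up; reward is the score delta.
        new_score = self.score + int(self.rng.integers(0, 3))
        reward = new_score - self.score
        self.score = new_score
        return self._obs(), reward, False, {}

env = StubNetHackEnv()
obs = env.reset()
total = 0
for _ in range(10):
    action = np.random.randint(env.num_actions)  # random policy over the discrete action space
    obs, reward, done, info = env.step(action)
    total += reward
```

The same loop runs unchanged against a real NLE task once `nle` is installed; only the environment constructor changes.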
The base reward is the difference in NetHack's internal game score between consecutive steps, $r_t = \text{score}_t - \text{score}_{t-1}$, reflecting agent progress through a reward signal that is dense in aggregate but highly delayed for individual achievements. Alternative tasks in the suite use domain-shaped rewards (staircase, pet adjacency, nutrition, explored tiles, gold collection, Oracle discovery).
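The score-delta reward has a useful telescoping property: the cumulative episode reward equals the final game score. A minimal sketch (helper name is illustrative):

```python
def score_delta_reward(prev_score: int, new_score: int) -> int:
    """Base NLE reward: r_t = score_t - score_{t-1}."""
    return new_score - prev_score

# A toy score trajectory starting from 0:
scores = [0, 0, 5, 5, 30]
rewards = [score_delta_reward(a, b) for a, b in zip(scores, scores[1:])]
# The per-step rewards telescope: their sum equals the final score.
```
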
Procedural generation stochastically samples dungeon topology, monster placement, item distribution, and special events per seed, enforcing robust generalization and preventing memorization. Episodes span tens to hundreds of thousands of time steps before termination by death or timeout.
2. Benchmark Tasks and Evaluation Protocols
NLE's initial task suite targets distinct RL challenges: local and global exploration, sparse and dense rewards, and compositional behavior.
| Task | Reward Function | Episode Terminus |
|---|---|---|
| Staircase | Positive reward on using the down staircase | 1,000 steps or goal |
| Pet | Staircase reward, conditioned on pet adjacency | 1,000 steps or goal |
| Eat | Nutrition gained | 20,000 steps or death |
| Gold | Gold collected | 20,000 steps or death |
| Scout | Positive reward per newly observed tile | 10,000 steps or death |
| Score | Change in game score | 100,000 steps or death |
| Oracle | Positive reward on reaching the Oracle level | max episode length |
Evaluation is based on cumulative reward (score), success rates, survival metrics (steps survived, tiles uncovered), and navigation or interaction criteria. Challenge formats use lexicographically ordered metrics: number of ascensions, median in-game score, and mean in-game score over a large set of randomized episodes (Hambro et al., 2022).
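The lexicographic ranking can be implemented as a tuple key, which Python compares element by element. A sketch under the assumption that each episode record carries its score and an ascension flag (field names are illustrative):

```python
import statistics

def challenge_metrics(episodes):
    """Lexicographic ranking key in the NetHack-Challenge style:
    (ascensions, median score, mean score), compared in that order."""
    scores = [e["score"] for e in episodes]
    ascensions = sum(e["ascended"] for e in episodes)
    return (ascensions, statistics.median(scores), statistics.fmean(scores))

# Toy evaluation set of four episodes, none ascending:
runs = [{"score": s, "ascended": False} for s in [800, 1200, 400, 950]]
key = challenge_metrics(runs)
# Agents sort by this tuple: more ascensions win first; ties fall
# through to median, then mean score.
```
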
3. Agent Approaches: RL, Imitation, Hierarchy, and Hybridization
Early NLE baselines implemented distributed RL (IMPALA, APPO, PPO variants), with policies ingesting embedded glyphs (via CNN or ResNet), blstats (MLP), and temporal context (LSTM). Random Network Distillation (RND) and other intrinsic motivation regimes were applied, but yielded modest gains due to the richness and novelty of the state space (Küttler et al., 2020).
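The fusion step of such a baseline encoder can be sketched framework-agnostically in NumPy: embedded glyphs and MLP-encoded blstats are concatenated into one feature vector before the recurrent core. All dimensions and weight initializations here are illustrative, and a mean-pool stands in for the CNN/ResNet:

```python
import numpy as np

rng = np.random.default_rng(0)
NUM_GLYPHS, EMB_DIM, BLSTATS_DIM, HID = 5976, 16, 25, 64  # illustrative sizes

glyph_table = rng.normal(size=(NUM_GLYPHS, EMB_DIM))  # learned embedding table
W_bl = rng.normal(size=(BLSTATS_DIM, HID))            # one blstats MLP layer

def encode(glyphs, blstats):
    """Fuse the map stream and the scalar stream into one pre-LSTM feature."""
    emb = glyph_table[glyphs]                 # (21, 79, EMB_DIM) embedding lookup
    map_feat = emb.mean(axis=(0, 1))          # stand-in for the CNN: pool to (EMB_DIM,)
    bl_feat = np.tanh(blstats @ W_bl)         # MLP on scalar statistics
    return np.concatenate([map_feat, bl_feat])  # fed to the recurrent core

feat = encode(rng.integers(0, NUM_GLYPHS, size=(21, 79)),
              rng.normal(size=BLSTATS_DIM))
```
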
Behavioral cloning from large corpora of expert demonstrations (notably the AutoAscend and HiHack datasets) became a core strategy for bootstrapping policy initialization (Tuyls et al., 2023, Piterbarg et al., 2023). Large-scale IL training (up to 150 billion state-action pairs) revealed clean power-law scaling between model/data size and achieved mean score; compute-optimal agents reached mean scores of $2,740$ (random start) to $5,218$ (fixed Human Monk start), the prior state of the art for neural agents, but still well below the best symbolic bots (Tuyls et al., 2023).
Hierarchical agents, supervised with explicit skill/option labels from symbolic experts and trained via hierarchical behavioral cloning with a two-term cross-entropy loss, achieved substantial performance gains. Hybrid agents (e.g., RAPH, LuckyMera-v1.0) composed symbolic rulesets (for survival, inventory, exploration) with RL-trained neural subskills, outperforming pure neural agents and bridging a fraction of the gap to symbolic SOTA (Quarantiello et al., 2023, Piterbarg et al., 2023, Hambro et al., 2022).
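The two-term cross-entropy named above can be sketched as follows: one term supervises the high-level skill label, the other the low-level action, with a weighting coefficient between them (function names, the weighting, and the logit sizes are illustrative):

```python
import numpy as np

def log_softmax(x):
    x = x - x.max(axis=-1, keepdims=True)
    return x - np.log(np.exp(x).sum(axis=-1, keepdims=True))

def hierarchical_bc_loss(skill_logits, action_logits,
                         skill_label, action_label, beta=1.0):
    """Two-term cross-entropy: predict the expert's skill label and,
    conditioned on it, the expert's atomic action."""
    skill_ce = -log_softmax(skill_logits)[skill_label]
    action_ce = -log_softmax(action_logits)[action_label]
    return skill_ce + beta * action_ce

# Uniform (all-zero) logits over 8 skills and 121 actions:
loss = hierarchical_bc_loss(np.zeros(8), np.zeros(121),
                            skill_label=3, action_label=42)
# loss equals log(8) + log(121) for uniform predictions
```
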
Recent approaches leverage LLMs for automatic skill and option discovery, reward shaping, and high-level policy synthesis (e.g., MaestroMotif). These systems learn intrinsic skill rewards via LLM preference elicitation, generate code for initiation/termination conditions, and allow zero-shot recombination of skills for compositional downstream tasks, exceeding traditional RL frameworks in adaptivity and sample efficiency (Klissarov et al., 2024).
4. Transfer, Fine-tuning, and Forgetting Mitigation
Transferring pre-trained neural policies in NLE is sensitive to catastrophic forgetting, particularly for skills tied to under-visited sectors of the state space. Vanilla fine-tuning with APPO quickly erodes pre-trained deep-dungeon expertise; mean score on Human Monk collapses sharply from around $5,000$. Empirical diagnostics include per-level evaluation snapshots, density plots of level vs. turns, and full return histograms (Wołczyk et al., 2024, Figure 1).
Effective forgetting mitigation augments the RL loss with knowledge retention mechanisms:
- Elastic Weight Consolidation (EWC): Quadratic penalization by pre-trained Fisher information;
- Behavioral Cloning Replay (BC): KL regularization between online student and frozen teacher on stored expert states;
- Kick-starting (KS): KL regularization on states sampled during online student exploration.
Formally, the fine-tuning objective augments the RL loss with an auxiliary retention term:

$$\mathcal{L}(\theta) = \mathcal{L}_{\mathrm{RL}}(\theta) + \alpha \, \mathcal{L}_{\mathrm{aux}}(\theta),$$

where $\mathcal{L}_{\mathrm{aux}}$ is the EWC penalty $\sum_i F_i (\theta_i - \theta^*_i)^2$, or a KL term $\mathrm{KL}\big(\pi^*(\cdot \mid s) \,\|\, \pi_\theta(\cdot \mid s)\big)$ evaluated on stored expert states (BC) or on states visited by the online student (KS).
The optimal regularization strategy depends on whether the forgetting is driven by coverage gaps or imperfect cloning. With the right auxiliary loss and hyperparameters, neural agents attain new SOTA, roughly doubling mean score with KS on Human Monk and preserving performance across all explored dungeon depths (Wołczyk et al., 2024).
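The KL-based retention terms (BC replay and kick-starting differ only in where the states are sampled) can be sketched as a regularizer added to the RL loss; the function names, the coefficient, and the 121-way logits are illustrative:

```python
import numpy as np

def softmax(x):
    z = np.exp(x - x.max())
    return z / z.sum()

def kl_to_teacher(student_logits, teacher_logits):
    """KL(teacher || student): the knowledge-retention term. BC replay
    evaluates it on stored expert states; kick-starting on the online
    student's own visited states."""
    p_t, p_s = softmax(teacher_logits), softmax(student_logits)
    return float(np.sum(p_t * (np.log(p_t) - np.log(p_s))))

def regularized_loss(rl_loss, student_logits, teacher_logits, alpha=0.1):
    return rl_loss + alpha * kl_to_teacher(student_logits, teacher_logits)

# When student and teacher agree exactly, the KL term vanishes and the
# objective reduces to the plain RL loss:
total = regularized_loss(1.5, np.zeros(121), np.zeros(121))
```
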
5. Representation Learning, Exploration, and Offline Pre-training
NLE's long-horizon, sparse-reward structure necessitates robust world models and exploration priors. Decoupled offline pre-training on human demonstration trajectories enables learning:
- Contrastive State Representations: Embeddings via InfoNCE loss, which encode future visitation structure and generalize to new regions.
- Auxiliary Progress Rewards: Predictors trained to estimate temporal distance between states serve as intrinsic exploration bonuses during online RL.
This separation of inductive biases allows orthogonal improvements: the progress predictor accelerates frontier discovery, while the contrastive representation better encodes newly reached areas. Empirically, this approach yields substantial sample-efficiency gains over tabula-rasa RL on both dense and sparse tasks in NLE (Mazoure et al., 2023). Decoupled pre-training also robustly outperforms joint objectives or ad hoc representation reuse.
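The contrastive objective can be sketched as a batched InfoNCE loss over state embeddings, where each anchor's positive is a future state from the same trajectory and the rest of the batch serves as negatives. This is a generic sketch, not the paper's exact implementation; the temperature and dimensions are illustrative:

```python
import numpy as np

def info_nce(anchors, positives, temperature=0.1):
    """InfoNCE over a batch: row i's positive is positives[i];
    all other rows act as in-batch negatives."""
    a = anchors / np.linalg.norm(anchors, axis=1, keepdims=True)
    p = positives / np.linalg.norm(positives, axis=1, keepdims=True)
    logits = a @ p.T / temperature                   # (B, B) similarity matrix
    logits = logits - logits.max(axis=1, keepdims=True)
    log_probs = logits - np.log(np.exp(logits).sum(axis=1, keepdims=True))
    return -np.mean(np.diag(log_probs))              # cross-entropy on the diagonal

rng = np.random.default_rng(0)
states = rng.normal(size=(32, 8))
loss_aligned = info_nce(states, states)              # positives identical to anchors
loss_random = info_nce(states, rng.normal(size=(32, 8)))  # unrelated "positives"
```

As expected, the loss is much lower when positives genuinely match their anchors than when they are random, which is what drives the embedding to encode future visitation structure.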
6. Hybrid, Symbolic, and Modular Agent Architectures
Symbolic, modular, and hybrid architectures are prevalent in NLE due to its structured combinatorics and high failure penalty. LuckyMera (and related frameworks) decompose agent logic into prioritized "skills," each with plan and execute functions. These skills may be entirely symbolic (domain heuristics, pathfinding, hard-coded routines) or encapsulate neural policies (e.g., trained navigation or combat modules). Hybrid agents leverage symbolic safety checks and high-level orchestration while delegating perception or low-level actuation to neural components.
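A LuckyMera-style prioritized skill loop can be sketched as follows; the `Skill` dataclass, the plan/execute split, and the example skills are illustrative stand-ins, not the framework's actual API:

```python
from dataclasses import dataclass
from typing import Callable, List, Optional

@dataclass
class Skill:
    """A prioritized skill: `plan` inspects the observation and returns a
    plan (or None if the skill does not apply); `execute` turns the plan
    into concrete keystrokes. Either function may wrap a neural policy."""
    name: str
    priority: int
    plan: Callable[[dict], Optional[str]]
    execute: Callable[[str], List[str]]

def select_and_run(skills, obs):
    """Try skills from highest to lowest priority; run the first that applies."""
    for skill in sorted(skills, key=lambda s: -s.priority):
        plan = skill.plan(obs)
        if plan is not None:
            return skill.name, skill.execute(plan)
    return "wait", ["."]  # fallback: rest for one turn

# A symbolic survival heuristic outranks a (stand-in) neural exploration module:
eat = Skill("eat", 10,
            lambda o: "food" if o["hunger"] > 0 else None,
            lambda p: ["e"])
explore = Skill("explore", 1,
                lambda o: "frontier",
                lambda p: ["j"])

name, keys = select_and_run([explore, eat], {"hunger": 1})
```

With a hungry agent, the high-priority symbolic `eat` skill fires before exploration, illustrating how safety-critical heuristics gate lower-priority neural subskills.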
LuckyMera empirically demonstrates that plugging symbolic rules into learned agent architectures yields clear mean-return improvements on MiniHack and top-6 leaderboard performance on the full NLE (mean score $1,046.96$; median $817$ across 1,000 episodes), outperforming the majority of prior submitted agents (Quarantiello et al., 2023). Ablation confirms that removal of neural subskills or core symbolic routines degrades performance, underscoring the value of compositional design and domain knowledge.
Zero-shot generalization to novel composite tasks is enabled by frameworks such as MaestroMotif, which use LLM-driven skill interfaces, code policies, and intrinsic reward definitions. These systems can recombine previously trained skills in new task sequences by prompting LLMs with goal descriptions, achieving higher success rates on complex navigation and interaction benchmarks compared to both RL and non-RL baselines (Klissarov et al., 2024).
7. Open Problems, Research Directions, and Community Insights
Despite marked progress, NLE remains an unsolved grand challenge for RL/IL, illustrating the current limits of scalable, autonomous decision-making. Key difficulty factors include:
- Sparse and multi-stage rewards: Score increases and critical events are infrequent and often delayed thousands of steps.
- Partial, multi-modal observations: State estimation and message parsing are essential for survival and planning.
- Procedural variability and combinatorial growth: No two episodes are alike; successful policies must generalize structurally, not memorize.
- Hierarchical competence requirements: Multi-step skills (food management, curse identification, puzzle solving) must be learned or supplied a priori.
Symbolic agents and hybrids built on explicit strategy stacks (AutoAscend, RAPH, LuckyMera) still outperform even the best large-scale neural policies by large factors in median score (Hambro et al., 2022, Piterbarg et al., 2023, Tuyls et al., 2023). Scaling data and model size (behavioral cloning) yields predictable returns (clean power laws), but falls short of true expert play; forecasting human-level performance implies needing billions of parameters and trillions of samples (Tuyls et al., 2023).
Recent advances in LLM-based skill learning, modularity, and reward synthesis illustrate the promise of integrating symbolic, neural, and language-centric paradigms. Efficient sample reuse (offline pre-training), compositional skill discovery, and robust knowledge transfer/retention are active research areas that directly benefit from NLE's structural challenges (Klissarov et al., 2024, Mazoure et al., 2023, Wołczyk et al., 2024).
NLE serves as a canonical, reproducible testbed for studying systematic generalization, long-horizon planning, safe reinforcement learning, and hierarchical competence—offering a powerful, computationally accessible alternative to larger-scale simulators without sacrificing scientific rigor (Küttler et al., 2020, Hambro et al., 2022).