NetHack Learning Environment

Updated 29 April 2026
  • NetHack Learning Environment is a research platform that formalizes NetHack as a complex, partially observable Markov decision process for advanced sequential decision-making.
  • It offers a Gym-compatible API with procedurally generated levels, discrete actions, and multimodal observations, enabling deep exploration and skill acquisition.
  • NLE integrates hierarchical methods and extensive offline datasets to benchmark reinforcement and imitation learning, revealing key challenges in exploration and generalization.

The NetHack Learning Environment (NLE) is a high-fidelity research platform for reinforcement learning, imitation learning, and hierarchical decision-making, built atop the complex roguelike game NetHack. NLE exposes a procedurally generated, partially observable, combinatorially rich Markov decision process with a Gym-compatible interface. It supports rapid simulation and extensive customization, enabling investigation into exploration, planning, skill acquisition, transfer, continual learning, and offline/hybrid RL. With its integration of large-scale human and bot datasets, and instrumentation for curriculum and intrinsic motivation, NLE has become a central benchmark for advanced AI in sequential decision-making domains.

1. Formal Definition and Simulation Interface

NLE formalizes NetHack as a stochastic, partially observable Markov Decision Process (POMDP)

$$\mathcal{M} = \langle S,\, A,\, O,\, P,\, \Omega,\, R,\, \gamma \rangle$$

where:

  • $S$ is the set of internal game states, encoding map, items, agent stats, monster states, and RNG.
  • $A$ is a discrete set of primitive actions (93–121 in the main benchmarks), spanning direction moves, compound commands, and menu navigation. The precise set aligns with NetHack’s native keystroke interface, covering all context-sensitive operations (Küttler et al., 2020, Hambro et al., 2022).
  • $O$ comprises observation vectors, e.g., glyph grids ($21 \times 79$), color, character codes, in-game messages, inventory representations, and hero stats.
  • $P(s' \mid s, a)$ is governed by the compiled NetHack C engine, encapsulating procedural dungeon generation, stochastic dynamics, and randomized encounters.
  • $\Omega(o \mid s)$ maps true internal states to structured observations.
  • $R$, the reward signal, is task-dependent: the default is $\Delta(\mathrm{score})$, but specialized tasks use goal, exploration (scout), or event-based rewards.
  • $\gamma$ is the discount factor, typically set close to 1 to accommodate extremely long-horizon dependencies.

NLE exposes a Gym-style API, supporting reset() for episodic environment initialization and step(a) for synchronous agent–environment interaction. Observations are supplied as structured Python dicts of arrays; actions are integer-coded. Episode termination occurs upon permadeath, ascension, or a fixed step limit (as imposed in the NetHack Challenge).
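
The interaction loop follows the familiar Gym pattern. Below is a minimal random-agent sketch, assuming the classic Gym registration (NetHackScore-v0) and the historical four-tuple step signature (newer Gym/Gymnasium releases return a five-tuple):

```python
import gym
import nle  # noqa: F401 -- importing nle registers the NetHack* environments

env = gym.make("NetHackScore-v0")

obs = env.reset()  # dict of numpy arrays: "glyphs", "blstats", "message", ...
total_reward, done, steps = 0.0, False, 0

while not done:
    action = env.action_space.sample()          # integer index into the discrete action set
    obs, reward, done, info = env.step(action)  # classic Gym 4-tuple
    total_reward += reward
    steps += 1

print(f"episode finished after {steps} steps, return = {total_reward:.1f}")
env.close()
```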

2. Procedural Generation, Observation, and Action Spaces

All dungeons in NLE are generated procedurally from a random seed: each episode dynamically constructs up to 50 levels (rooms, corridors, branches, puzzles), places hundreds of monsters and items, and varies hero race/role/alignment to enforce generalization (Küttler et al., 2020, Hambro et al., 2022). Observations consist of:

  • Glyph map ($21 \times 79$ symbol IDs), commonly cropped to an egocentric window around the hero for efficiency.
  • Color/char grids (ANSI encoded).
  • Scalar status features (blstats: position, HP, armor, gold, depth, hunger, etc.).
  • Text/ASCII windows (game messages, inventory).
  • Inventory encodings via fixed-length item glyph IDs and string representations.
  • Optionally, full VT100 terminal state (tty_chars/tty_colors/tty_cursor).
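
The exact set of keys varies per task; the richest configuration (NetHackChallenge-v0) also exposes the raw terminal. A short inspection sketch, with key names as documented by NLE:

```python
import gym
import nle  # noqa: F401

env = gym.make("NetHackChallenge-v0")  # exposes tty_chars / tty_colors / tty_cursor
obs = env.reset()

# Print each observation component with its array shape and dtype.
for key, value in obs.items():
    print(f"{key:24s} shape={value.shape} dtype={value.dtype}")

# The latest in-game message is a NUL-padded byte array.
message = bytes(obs["message"]).rstrip(b"\x00").decode("ascii", errors="ignore")
print("message:", message)
```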

Action spaces are context-sensitive but always drawn from the NetHack command grammar (move, search, zap, wear, throw, apply, menu navigation). Menu actions may not advance the in-game turn counter, decoupling agent steps from NetHack’s internal time (Matthews et al., 27 Apr 2026).

3. Task Suite, Benchmarks, and Reward Design

NLE’s task suite encompasses both full-game and modular challenges:

  • Score: Accumulate as much in-game score as possible; sparse, long-horizon metric.
  • Staircase, Pet, Eat, Gold, Scout, Oracle: Isolated subgoals with dense, event-driven rewards to enable focused study of navigation, inventory, exploration, and partial observability.
  • SkillHack: A curated set of MiniHack-based tasks with custom command subsets and sparse terminal reward, each decomposable into human-interpretable skills (PickUp, Wear, FreezeLava, etc.) (Matthews et al., 2022).

Reward functions in NLE are user-configurable: score deltas, newly revealed tiles (+1 per tile), downstair events (+100), nutrition increments, and goal event completions. Shaping and normalization (e.g., tanh or clip) are employed to stabilize learning (Küttler et al., 2020, Kurenkov et al., 2023).
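
Such shaping is typically layered on top of the environment rather than baked into it. A minimal sketch of a Gym wrapper that clips and tanh-squashes the per-step reward (the constants are illustrative, not values from the cited papers):

```python
import gym
import numpy as np


class ShapedReward(gym.Wrapper):
    """Clip, then tanh-squash, the per-step reward to stabilize value learning."""

    def __init__(self, env, clip=10.0, scale=0.1):
        super().__init__(env)
        self.clip = clip    # hard clip on the raw reward (e.g., a score delta)
        self.scale = scale  # temperature of the tanh squashing

    def step(self, action):
        obs, reward, done, info = self.env.step(action)
        shaped = float(np.tanh(self.scale * np.clip(reward, -self.clip, self.clip)))
        return obs, shaped, done, info


# Usage: env = ShapedReward(gym.make("NetHackScore-v0"))
```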

4. Imitation, Offline, and Hybrid Learning: Datasets and Methodology

The NetHack Learning Dataset (NLD) is the canonical large-scale corpus for offline learning, imitation, and hybrid RL:

  • NLD-NAO: roughly 10 billion state-only transitions across 1.5M games from 48K human demonstrators.
  • NLD-AA: roughly 3 billion state–action–score steps from AutoAscend (the symbolic winner of the NeurIPS 2021 NetHack Challenge), covering all roles/races/alignments, with episodes often spanning tens of thousands of steps (Hambro et al., 2022).
  • Katakomba: D4RL-style per-character config splits, comprehensive PyTorch/HDF5 data loaders, and normalized evaluation (Kurenkov et al., 2023).

Standard learning protocols comprise Behavioral Cloning (BC), Conservative Q-Learning (CQL), Implicit Q-Learning (IQL), APPO with and without kickstarting, and hybrid approaches that mix offline and online gradients. All methods remain far below human-level performance, often failing to escape early-game local minima; for example, 97.5% of policies die before reaching dungeon level 2, even in data-rich offline regimes.
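
To make the imitation setting concrete, here is a skeletal behavioral-cloning update over (glyph, action) pairs; GlyphPolicy and its hyperparameters are illustrative stand-ins, not the architectures used in the cited papers:

```python
import torch
import torch.nn as nn
import torch.nn.functional as F


class GlyphPolicy(nn.Module):
    """Toy policy: embed the 21x79 glyph map and predict one action logit per step."""

    def __init__(self, num_glyphs=6000, num_actions=121, embed_dim=32):
        super().__init__()
        self.embed = nn.Embedding(num_glyphs, embed_dim)  # 6000 is an illustrative bound on glyph IDs
        self.head = nn.Sequential(
            nn.Flatten(),
            nn.Linear(21 * 79 * embed_dim, 512),
            nn.ReLU(),
            nn.Linear(512, num_actions),
        )

    def forward(self, glyphs):                 # glyphs: (B, 21, 79) int64
        return self.head(self.embed(glyphs))   # logits: (B, num_actions)


def bc_step(policy, optimizer, glyphs, actions):
    """One behavioral-cloning step: cross-entropy against the demonstrator's actions."""
    loss = F.cross_entropy(policy(glyphs), actions)
    optimizer.zero_grad()
    loss.backward()
    optimizer.step()
    return loss.item()
```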

Key algorithmic refinements include per-step reward clipping, batch-offline replay integration, and bootstrapped score normalization. Evaluation metrics reflect cumulative score, depth reached, and return distributions; the median score is the primary competition metric (Hambro et al., 2022, Hambro et al., 2022, Kurenkov et al., 2023).

5. Hierarchical, Skill-Based, and Exploration-Driven Methods

NLE’s vast state–action space has catalyzed the adoption of hierarchical RL, option-based architectures, and skill transfer approaches:

  • Hierarchical Kickstarting (HKS) fuses fixed expert sub-policies (one per pre-trained skill) via a learned high-level selector, yielding a mixture-of-teachers target and a cross-entropy regularizer for action alignment (Matthews et al., 2022); a skeletal sketch appears after this list.
  • Hierarchical Behaviour Spaces (HBS) extend discrete options to continuous (or multi-bin) mixtures over a set of behaviour reward functions: the controller outputs coefficient vectors over those behaviours and optimizes a two-level semi-MDP whose low-level reward is the corresponding weighted combination. Empirically, HBS unlocks superior exploration, multi-branch navigation, and asymptotic performance versus both flat PPO and scalable option learning (SOL) (Matthews et al., 27 Apr 2026).
  • Exploration-Targeted Intrinsic Rewards: Scout reward (+1/tile revealed), RND, and count-based methods are employed to break exploration bottlenecks where in-game score is too sparse (Küttler et al., 2020, Matthews et al., 27 Apr 2026).
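
The kickstarting idea from the first bullet can be sketched as follows, assuming frozen per-skill teacher policies; the class names and loss weighting are illustrative, not the exact formulation of the cited paper:

```python
import torch
import torch.nn as nn
import torch.nn.functional as F


class SkillSelector(nn.Module):
    """Learned high-level selector: produces weights over K frozen skill teachers."""

    def __init__(self, feature_dim, num_skills):
        super().__init__()
        self.net = nn.Linear(feature_dim, num_skills)

    def forward(self, features):                           # features: (B, feature_dim)
        return torch.softmax(self.net(features), dim=-1)   # weights:  (B, K)


def kickstart_loss(student_logits, teacher_probs, skill_weights):
    """Cross-entropy between the student policy and the selector-weighted teacher mixture.

    student_logits: (B, num_actions) from the student's policy head
    teacher_probs:  (B, K, num_actions) action distributions of the frozen skill teachers
    skill_weights:  (B, K) output of SkillSelector
    """
    mixture = torch.einsum("bk,bka->ba", skill_weights, teacher_probs)  # teacher-mixture target
    log_student = F.log_softmax(student_logits, dim=-1)
    return -(mixture * log_student).sum(dim=-1).mean()

# In practice this auxiliary term is added to the usual policy-gradient loss,
# with its coefficient annealed as the student improves.
```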

Skill-centric benchmarks (SkillHack, MiniHack) use decomposable, parameterized levels and shaped skill curricula to test transfer, composition, and robust skill reuse, with curriculum generators in Python or NetHack’s des-file DSL (Samvelyan et al., 2021, Matthews et al., 2022).

6. Empirical Results and Scaling Laws

Despite the richness of the environment and datasets, all current neural agents remain distinctly subhuman:

  • Best pure RL (APPO, DQN): mean scores far below both AutoAscend and expert human play, rarely descending past dungeon level 5 (Küttler et al., 2020, Hambro et al., 2022).
  • BC from AutoAscend: improves markedly with model and data scale, yet remains far short of the demonstrator’s own scores (Tuyls et al., 2023).
  • Hybrid APPO + BC or kickstarting: substantially higher scores on specific character tasks (Human Monk), with robust knowledge retention (BC/KS/EWC) during fine-tuning (Wołczyk et al., 2024).
  • Power-law scaling: Both imitation loss and episodic return exhibit smooth power laws with compute expenditure, supporting reliable forecasting and joint optimization over model/data size without phase transitions. However, biases in offline datasets and limited coverage severely bottleneck generalization and credit assignment in deep dungeons (Tuyls et al., 2023, Hambro et al., 2022).
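
The scaling analysis above reduces to fitting a power law between compute and performance. A minimal, generic least-squares sketch in log–log space (not the exact fitting protocol of Tuyls et al., 2023):

```python
import numpy as np


def fit_power_law(compute, mean_return):
    """Fit mean_return ~ a * compute**b by least squares in log-log space.

    compute:     positive array of training-compute budgets (e.g., FLOPs)
    mean_return: positive array of corresponding mean episodic returns
    Returns (a, b) such that the prediction is a * compute**b.
    """
    b, log_a = np.polyfit(np.log(compute), np.log(mean_return), 1)  # slope = exponent
    return float(np.exp(log_a)), float(b)
```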

Symbolic bots (AutoAscend, etc.) remain the only class to routinely achieve “Beginner” and higher ranks, with AutoAscend’s mean score well beyond that of any neural agent; no neural agent has “ascended” a game under competition protocol (Hambro et al., 2022, Hambro et al., 2022).

7. Open Research Challenges and Future Directions

Major outstanding problems include:

  • Exploration and Sparse Reward: Architectural, intrinsic reward, and curriculum strategies to traverse the extreme combinatorial search space beyond surface-level play or degenerate score farming.
  • Hierarchical Planning and Skill Learning: Effective discovery, composition, and online adaptation of reusable skills/behaviours, with robust transfer across dungeon seeds and character classes (Matthews et al., 2022, Matthews et al., 27 Apr 2026).
  • Fine-tuning and Knowledge Retention: Preventing the rapid deterioration of pre-trained capabilities due to catastrophic forgetting during on-policy RL fine-tuning (forgetting of pre-trained capabilities, FPC), a phenomenon sharply amplified by NLE’s multi-level structure; BC losses, kickstarting, and EWC help overcome these deficits (Wołczyk et al., 2024).
  • Offline RL Robustness: Addressing overfitting to dataset support, generalization across procedural levels, and learning from observation-only large-scale human data (Hambro et al., 2022, Kurenkov et al., 2023).
  • Representation Learning: Harnessing NetHack’s multimodal signals—symbolic, text, inventory, events—for deep sequence modeling and credit assignment.
  • Evaluation and Metrics: Refining success metrics to track mid-game progress, multi-objective performance (e.g., waypoints visitation, branching achievements), and systematic generalization.

Potential advances include: (i) integrating unsupervised skill discovery with human-guided benchmarks, (ii) leveraging LoRA, adapters, or modular architectures for parameter-efficient adaptation in sparse/recurrent domains, (iii) hybridizing symbolic planning with deep policy learning, and (iv) expanding curriculum and UED in the MiniHack sandbox for robust zero-shot transfer (Samvelyan et al., 2021, Matthews et al., 27 Apr 2026, Wołczyk et al., 2024).

NLE continues to serve as a foundational testbed for innovations in RL, imitation learning, and complex sequential decision making, with empirical progress systematically measured against one of the hardest, most intricate environments in the public domain.
