MineRL Dataset Overview
- The MineRL dataset is a large-scale collection of annotated Minecraft gameplay trajectories, comprising state–action–reward tuples for hierarchical reinforcement learning research.
- It leverages diverse human demonstrations to address sparse-reward challenges, enhancing sample efficiency in deep RL through imitation learning and hybrid approaches.
- The modular framework, complete with filtering APIs and standardized formats, enables reproducible benchmarking and integration with OpenAI Gym and Dockerized workflows.
The MineRL dataset is a large-scale, richly annotated collection of sequential state–action–reward trajectories obtained from human gameplay in the open-world, procedurally generated, 3D environment of Minecraft. Developed to address sample inefficiency in deep reinforcement learning (RL), MineRL provides over 60 million state–action pairs—contained within a modular, extensible framework—covering a spectrum of complex, hierarchical, long-horizon tasks. This resource enables the development and benchmarking of algorithms that leverage human priors to dramatically reduce the number of environment samples necessary to solve sparse-reward decision-making problems.
1. Dataset Composition and Structure
The MineRL dataset (“MineRL-v0”) consists of more than 60 million state–action–(reward) tuples collected from human demonstrations across seven distinct Minecraft environments (Guss et al., 2019). Trajectories are sampled at a fixed rate of 20 ticks per second and encode:
- State: Each tick contains a 64×64 RGB first-person image, game-state features (inventory, collection events, objective distances, health, level, achievements), and contextual GUI information.
- Action: Actions span continuous camera movements (pitch/yaw), discrete movement commands (forward/backward/left/right), GUI interactions, mining, crafting, smelting, and block placement.
- Rewards and Annotations: Timestamped rewards are provided (dependent on the task), and extensive automatic annotations mark subtask milestones, item events, deaths/no-ops, and hierarchical progress within Minecraft’s item graph (comprising 371+ unique items).
Data is distributed as easily consumable NumPy .npz files; an API enables targeted filtering (e.g., selection by expertise, trajectory length, or subtask achievement).
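The snippet below is a minimal sketch of loading demonstrations through the minerl Python package's data API (minerl.data.make and batch_iter); argument names, iterator signatures, and array shapes vary between package versions, so treat it as illustrative rather than canonical.

```python
# Minimal sketch: iterate over MineRL demonstrations via the `minerl`
# package's data API. Exact signatures and shapes vary by package version.
import minerl

# Assumes the dataset has already been downloaded to this directory.
data = minerl.data.make("MineRLObtainDiamond-v0", data_dir="/data/minerl")

# Yields fixed-length segments of (state, action, reward, next_state, done)
# suitable for imitation learning or for seeding a replay buffer.
for obs, act, rew, next_obs, done in data.batch_iter(
        batch_size=32, seq_len=64, num_epochs=1):
    pov = obs["pov"]          # (batch, seq_len, 64, 64, 3) RGB frames
    camera = act["camera"]    # continuous (pitch, yaw) deltas per tick
    # ...feed into a behavioral-cloning or RL update here...
    break
```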
2. Hierarchical and Multimodal Task Representation
The hierarchical nature of Minecraft tasks is systematically encoded. Key challenges such as “ObtainDiamond” require agents to complete multi-stage prerequisite chains (e.g., wood→planks→sticks→crafting table→wooden pickaxe→stone→furnace→iron→iron pickaxe→diamond) (Guss et al., 2019). Hierarchical labels and precedence graphs automatically extracted from human gameplay support structured RL methodologies, credit assignment, and task decomposition. Complex state–action spaces are preserved in full richness, but alternate filtered versions (e.g., with simplified discrete primitives) are made available for benchmarking RL variants.
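To make the precedence structure concrete, the hand-written sketch below encodes the ObtainDiamond chain described above as a small prerequisite graph and derives a subtask ordering from it; the dictionary is purely illustrative and is not an artifact shipped with the dataset.

```python
# Hand-written sketch of the ObtainDiamond prerequisite chain as a precedence
# graph (illustrative only; not a file distributed with MineRL).
ITEM_PREREQS = {
    "log": [],
    "planks": ["log"],
    "stick": ["planks"],
    "crafting_table": ["planks"],
    "wooden_pickaxe": ["stick", "crafting_table"],
    "cobblestone": ["wooden_pickaxe"],
    "furnace": ["cobblestone"],
    "stone_pickaxe": ["cobblestone", "crafting_table"],
    "iron_ore": ["stone_pickaxe"],
    "iron_ingot": ["iron_ore", "furnace"],
    "iron_pickaxe": ["iron_ingot", "crafting_table"],
    "diamond": ["iron_pickaxe"],
}

def unlock_order(target, graph=ITEM_PREREQS, seen=None):
    """Depth-first topological ordering of the subtasks needed for `target`."""
    seen = set() if seen is None else seen
    order = []
    for dep in graph[target]:
        if dep not in seen:
            order += unlock_order(dep, graph, seen)
    if target not in seen:
        seen.add(target)
        order.append(target)
    return order

print(unlock_order("diamond"))  # log, planks, stick, ..., iron_pickaxe, diamond
```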
3. Human Demonstrations and Priors
MineRL relies on human demonstrations of diverse expertise, from novice to expert, forming a core “prior” that facilitates imitation learning, behavioral cloning, and hybrid RL approaches. Demonstrations were systematically gathered through a custom client plugin intercepting all low-level Minecraft communications, yielding fully re-simulatable game-state trajectories and flexible rendering (e.g., enabling texture or lighting variation) (Guss et al., 2019). The demonstrations capture substantial variability, including both optimal and suboptimal human strategies.
Empirical results with DQN variants pre-initialized with expert demonstrations (PreDQN) confirm substantial improvements in sample efficiency, with agents leveraging these priors to overcome sparse rewards and exploration bottlenecks. The breadth across hundreds of simulated hours and thousands of trajectories supports robust analysis and generalization research.
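As a concrete example of exploiting these priors, the sketch below runs a behavioral-cloning update in PyTorch on frames and actions like those yielded by the data iterator shown earlier; the tiny CNN and the reduction of MineRL's dict-valued actions to a single discrete head (N_ACTIONS) are illustrative simplifications, not a canonical pipeline.

```python
# Behavioral-cloning sketch (PyTorch). The small CNN and the single discrete
# action head are illustrative simplifications of MineRL's dict action space.
import torch
import torch.nn as nn

N_ACTIONS = 10  # hypothetical discretized action set

policy = nn.Sequential(
    nn.Conv2d(3, 32, 8, stride=4), nn.ReLU(),
    nn.Conv2d(32, 64, 4, stride=2), nn.ReLU(),
    nn.Flatten(),
    nn.Linear(64 * 6 * 6, 256), nn.ReLU(),
    nn.Linear(256, N_ACTIONS),
)
opt = torch.optim.Adam(policy.parameters(), lr=3e-4)
loss_fn = nn.CrossEntropyLoss()

def bc_step(pov_batch, action_ids):
    """One supervised update: imitate the demonstrated action at each frame."""
    # pov_batch: (B, 64, 64, 3) uint8 -> (B, 3, 64, 64) floats in [0, 1]
    x = torch.as_tensor(pov_batch).permute(0, 3, 1, 2).float() / 255.0
    logits = policy(x)
    loss = loss_fn(logits, torch.as_tensor(action_ids))
    opt.zero_grad()
    loss.backward()
    opt.step()
    return loss.item()
```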
4. Benchmarking, Competitions, and Evaluation Protocols
MineRL supports competitions designed for sample-efficient RL, including the NeurIPS 2019/2020 MineRL Challenges (Guss et al., 2019, Milani et al., 2020, Guss et al., 2021). Competitions are structured in two rounds:
- Round 1: Participants develop against paired dataset–environment versions (with varied visuals), under strict limits on environment samples (e.g., 8 million interactions) and computational resources.
- Round 2: Finalists train agents from scratch—containerized for reproducibility—on held-out, altered dataset–environment pairs (unseen textures, obfuscated action spaces).
- Evaluation: Agents are scored by summed milestone rewards over up to 500 episodes, with reward structures defined hierarchically (see Table below for reward mapping):
| Subtask | Reward value |
|---|---|
| log | 1 |
| planks | 2 |
| stick | 4 |
| crafting_table | 4 |
| wooden_pickaxe | 8 |
| stone | 16 |
| furnace / stone_pickaxe | 32 |
| iron_ore | 64 |
| iron_ingot | 128 |
| iron_pickaxe | 256 |
| diamond | 1024 |
Containerized submissions and standardized simulation are enforced using OpenAI Gym-compatible interfaces and Dockerization (via repo2docker), supporting reproducibility and fair cross-team comparison.
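The sketch below shows the kind of Gym-style interaction loop used for evaluation, assuming the minerl package's environment registration and dict-valued action space (including action_space.noop()); environment IDs and the step() return signature differ across minerl and Gym versions.

```python
# Gym-style interaction sketch. Assumes the `minerl` package's environment
# registration and dict action space; env IDs and step() signatures vary.
import gym
import minerl  # registers MineRL environments with Gym on import

env = gym.make("MineRLObtainDiamond-v0")
obs = env.reset()

total_reward, done = 0.0, False
while not done:
    action = env.action_space.noop()  # start from the no-op action dict
    action["forward"] = 1             # hold the forward key
    action["camera"] = [0.0, 3.0]     # (pitch, yaw) delta in degrees
    obs, reward, done, info = env.step(action)
    total_reward += reward            # milestone rewards as in the table above

env.close()
print("episode return:", total_reward)
```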
5. Sample Efficiency and Algorithmic Innovations
A central innovation is enforcing strict sample efficiency—a requirement for tractable real-world deployment. The dataset catalyzes RL methods integrated with imitation learning:
- PreDQN: The replay buffer is pre-populated with expert demonstrations and trained with standard Bellman updates; policies initialized with these human priors converge faster under sparse-reward conditions and long planning horizons (see the sketch at the end of this section).
- Hierarchical DQNs (HDQfD): Leverage structured buffers with adaptive prioritization, decomposing tasks into meta-actions via inventory events, and gradually shifting weight from imperfect demonstrations to agent-generated rollouts (Skrynnik et al., 2019).
- Hybrid Approaches: Engineering solutions combine behavioral cloning, reward modeling by inverse RL, and hierarchical decomposition (using subtask labels to train temporally abstracted options or meta-controllers).
Studies highlight superior sample efficiency and more stable learning compared to traditional RL methods (DQN, PPO, A3C) trained from scratch; forgetting mechanisms and adaptive demonstration reweighting mitigate the impact of suboptimal demonstrations.
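To make the PreDQN idea concrete, the sketch below seeds a replay buffer with demonstration transitions before any environment interaction and applies the standard one-step Bellman target y = r + γ · max_a' Q_target(s', a'); the buffer layout, discrete actions, and hyperparameters are illustrative assumptions, not the published implementation.

```python
# PreDQN-style sketch: pre-populate the replay buffer with demonstrations,
# then apply standard one-step Bellman updates. Transitions are assumed to be
# stored as tensors with discrete actions; details are illustrative only.
import random
import torch
import torch.nn.functional as F

GAMMA = 0.99
buffer = []  # list of (s, a, r, s_next, done) tensor tuples

def prefill_from_demonstrations(demo_iter, limit=100_000):
    """Insert human transitions before the agent takes a single env step."""
    for transition in demo_iter:
        buffer.append(transition)
        if len(buffer) >= limit:
            break

def dqn_update(q_net, q_target, optimizer, batch_size=32):
    """One Bellman update over a minibatch mixing demo and agent transitions."""
    s, a, r, s_next, done = map(torch.stack, zip(*random.sample(buffer, batch_size)))
    with torch.no_grad():
        target = r + GAMMA * (1.0 - done.float()) * q_target(s_next).max(dim=1).values
    q_sa = q_net(s).gather(1, a.long().unsqueeze(1)).squeeze(1)
    loss = F.smooth_l1_loss(q_sa, target)
    optimizer.zero_grad()
    loss.backward()
    optimizer.step()
    return loss.item()
```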
6. Technical and Practical Considerations
The dataset’s modular design supports extensibility—new tasks (e.g., survival, navigation, item acquisition) can be added dynamically, leveraging the packet-level recording for full re-simulation and synthetic variation (Guss et al., 2019). The state representation combines perceptual features (the first-person POV image) with symbolic game features (inventory, item-collection events, and related game state).
The OpenAI Gym API and formatted .npz/JSON/MP4 data packaging facilitate rapid integration with standardized RL and imitation learning frameworks. Action-space obfuscation and texture randomization inoculate agents against overfitting to demonstration specifics, promoting domain-agnostic method development.
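For illustration, the snippet below reads one raw trajectory directory directly with NumPy and json; the file names (metadata.json, rendered.npz) and per-field array keys are assumptions about the distributed per-trajectory layout and may differ by dataset version, so the data API shown earlier remains the supported access path.

```python
# Illustrative direct read of one trajectory directory. File names and array
# keys are assumptions about the distributed layout and may differ by version.
import json
import numpy as np
from pathlib import Path

traj = Path("/data/minerl/MineRLObtainDiamond-v0/some_trajectory")

meta = json.loads((traj / "metadata.json").read_text())  # per-trajectory metadata
arrays = np.load(traj / "rendered.npz")                  # tick-aligned arrays

print(sorted(meta))                 # metadata fields available for filtering
for key in arrays.files:
    # One array per action/observation/reward field, aligned tick-by-tick.
    print(key, arrays[key].shape)
```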
7. Applications, Limitations, and Future Directions
MineRL is positioned as a central resource for research in sample-efficient RL, imitation learning, hierarchical RL, and inverse RL, with additional implications for explainable AI and robotic planning. By supporting challenging, hierarchical, sparse-reward tasks, it drives research on long-horizon credit assignment and robustness under variable perceptual conditions.
Limitations include the high-dimensionality and class imbalance of actions, and the potential for distributional shift due to sequential video sampling. Future directions involve more sophisticated demonstration sampling, hybrid RL–BC integration, reward modeling from human feedback, and expandability to multi-agent and real-world transfer scenarios.
In summary, the MineRL dataset provides unprecedented scale, structure, and annotation richness for sequential decision-making research. By leveraging human demonstrations, hierarchical task encoding, and rigorous sample-efficiency benchmarks, it enables the advancement and democratization of high-sample-complexity RL research in environments exhibiting real-world challenges.