MINE-1: Long-Horizon Planning Benchmark

Updated 3 December 2025

MINE-1 is a structured benchmark that evaluates long-horizon task planning in complex, resource-rich environments modeled after Minecraft.
It consists of 45 distinct tasks across easy, medium, and hard levels, emphasizing navigation, resource management, and structured construction challenges.
The benchmark tests planners using both propositional and numeric PDDL encodings and reports translation time, search time, and plan length to expose scalability limits.

MINE-1 is a systematically structured benchmark for long-horizon task planning in complex, spatially rich environments modelled after the Minecraft world. Designed to stress-test both propositional and numeric planners in scenarios featuring high-dimensional state spaces, nested resource dependencies, and deep multi-step objectives, MINE-1 comprises 45 distinct tasks—spanning low-level navigation, resource collection, and structured construction—encoded in both PDDL propositional and numeric representations. The benchmark exposes scalability bottlenecks and solution gaps in state-of-the-art planning systems, providing a foundation for research on hierarchical, geometric, and hybrid planning algorithms (Hill et al., 2023).

1. Formal Structure and PDDL Encodings

Each MINE-1 instance is formalized as a tuple $P = \langle F, A, s_0, G, c \rangle$, where:

$F$ is a set of grounded fluents, including both Boolean predicates (e.g. block-present$(b)$, item-present$(i)$, at-x$(\text{agent},x)$) and numeric functions (e.g. $(x\,\text{agent})$, agent-num-wood$(\text{agent})$).
$A$ is the finite set of parameterized actions, each specified by preconditions $\text{Pre}(\alpha)$ and effects $\text{Eff}(\alpha)$.
$s_0 \in 2^F$ is the initial state.
$G$ is a conjunction of fluents defining the goal condition.
$c:A\rightarrow\mathbb{R}_+$ is a cost function, set globally as $c(\alpha)=1\,\forall\alpha$ in MINE-1.

A valid plan $\pi = \langle \alpha_1, \ldots, \alpha_k \rangle$ is a sequence of actions in which each $\alpha_i$ is applicable in the state produced by its predecessors and the final state satisfies $G$. Plan length $L(\pi)=k$ and search time $T_\text{search}$ are the primary performance measures; because every action has unit cost, plan cost equals plan length. The PDDL representations come in propositional form (unary predicates and successor chains) and numeric form (:functions and arithmetic effects) to accommodate different planner capabilities.
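The plan-validity condition can be illustrated with a minimal Python sketch. It covers only the propositional (add/delete) case, and the `Action` class, the `validate` helper, and the `at-i` fluent names are assumptions made for illustration; they are not part of the MINE-1 tooling.

```python
from dataclasses import dataclass

@dataclass(frozen=True)
class Action:
    name: str
    pre: frozenset      # fluents that must hold before the action
    add: frozenset      # fluents made true by the action
    delete: frozenset   # fluents made false by the action
    cost: float = 1.0   # unit cost, matching c(alpha) = 1

def validate(plan, s0, goal):
    """Return (is_valid, plan_length, total_cost) for a sequence of ground actions."""
    state = set(s0)
    for a in plan:
        if not a.pre <= state:          # precondition violated: plan is invalid
            return False, len(plan), None
        state = (state - a.delete) | a.add
    return goal <= state, len(plan), sum(a.cost for a in plan)

def move(i, j):
    """Hypothetical ground move action on a 1-D corridor."""
    return Action(f"move-{i}-{j}", frozenset({f"at-{i}"}),
                  frozenset({f"at-{j}"}), frozenset({f"at-{i}"}))

plan = [move(0, 1), move(1, 2)]
print(validate(plan, {"at-0"}, {"at-2"}))  # (True, 2, 2.0)
```

With unit costs the returned cost always equals the plan length, which is why the benchmark can report $L$ alone.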

2. Task Types and Difficulty Hierarchy

MINE-1 encompasses 15 canonical task types, each instantiated at Easy, Medium, and Hard difficulty levels, for 45 tasks in total. The tasks combine primitive manipulation (e.g. move, pickup, break-block) with compound objectives (e.g. build_cabin, collect_and_build_shape) in both 2D and 3D spatial environments. Key challenge axes include:

Spatial reasoning: navigation across $N\times N$ or higher grids, staircase climbing
Resource management: inventory constraints, multi-type block collection
Construction sequencing: strict placement orderings, structural geometry
Long-horizon planning: hundreds to thousands of required actions per instance

The general structure and objectives for each task type are summarized below.

Task Type	Objective	Key Challenge
move	Reach a distant goal cell	Very long sequential horizon
gather_wood	Collect $K$ blocks from forest	Interleaved collect, navigate
build_wall	Construct $W{\times}H$ block wall	Bulk placement actions
build_cabin	Chop, transport, and build cabin	Highly nested subtasks
collect_and_build_shape	Gather blocks and reproduce target shape	Two-phase plan, resource buildup

This systematic expansion enables robustness evaluation across both atomic and composite task classes (Hill et al., 2023).

3. Instance Generation and Encoding Methodology

Parameterized domain templates, driven by Python automation scripts, populate virtual $N \times N \times H$ worlds, randomly place trees, items, and obstacles, and bind all grounded objects (block, agent, position) in the problem files. World parameters (grid size $N$, tree height $H$, block count $K$) and difficulty flags control instance scaling.
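A generator of this kind might look roughly like the following sketch, which emits a toy numeric-encoding problem file. The function name, the `minecraft-toy` domain, the `steve` agent, and the `wood-block` type are all hypothetical; the actual MINE-1 scripts and schema are not reproduced here.

```python
import random

def generate_problem(name, n, num_trees, seed=0):
    """Emit a toy numeric-encoding PDDL problem on an n x n grid (hypothetical schema)."""
    rng = random.Random(seed)  # a fixed seed keeps instances reproducible
    cells = [(x, y) for x in range(n) for y in range(n) if (x, y) != (0, 0)]
    trees = rng.sample(cells, num_trees)  # distinct cells, agent start excluded
    objects = ["steve - agent"] + [f"tree{i} - wood-block" for i in range(num_trees)]
    init = ["(agent-alive steve)", "(= (x steve) 0)", "(= (y steve) 0)",
            "(= (agent-num-wood steve) 0)"]
    for i, (x, y) in enumerate(trees):
        init += [f"(block-present tree{i})",
                 f"(= (x tree{i}) {x})", f"(= (y tree{i}) {y})"]
    goal = f"(>= (agent-num-wood steve) {num_trees})"
    return "\n".join([
        f"(define (problem {name})",
        "  (:domain minecraft-toy)",
        "  (:objects " + " ".join(objects) + ")",
        "  (:init " + " ".join(init) + ")",
        f"  (:goal (and {goal}))",
        ")",
    ])

print(generate_problem("gather-easy", n=5, num_trees=3))
```

Scaling `n` and `num_trees` together is one simple way to derive easy, medium, and hard variants from a single template.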

Propositional encoding: All quantities are modelled as unary predicates and “are-seq” successor relations. Resource counters appear as $(\text{agent-has-n-grass\_block}\ \text{agent}\ n)$ ; spatial transitions encode adjacent coordinates via $(\text{are-seq}\ m\ n)$ .
Numeric encoding: Quantities such as positions and inventory counts are :functions within PDDL; transitions utilize (increase …), (decrease …) operators. Numeric instances are compact but require planners supporting arithmetic fluents (e.g., ENHSP-20).
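The size difference between the two encodings can be made concrete with a small sketch. The helper names and the `n0, n1, ...` count objects are assumptions for illustration; only the `are-seq` and `agent-has-n-...` predicate shapes come from the description above.

```python
def propositional_counter_init(agent, item, start, max_count):
    """Successor-chain encoding: one object per count value plus are-seq links."""
    objects = [f"n{i} - count" for i in range(max_count + 1)]
    facts = [f"(are-seq n{i} n{i + 1})" for i in range(max_count)]
    facts.append(f"(agent-has-n-{item} {agent} n{start})")
    return objects, facts

def numeric_counter_init(agent, item, start):
    """Numeric encoding: a single function assignment, independent of the bound."""
    return [], [f"(= (agent-num-{item} {agent}) {start})"]

prop_objects, prop_facts = propositional_counter_init("steve", "grass_block", 0, 64)
num_objects, num_facts = numeric_counter_init("steve", "grass_block", 0)
print(len(prop_objects), len(prop_facts))  # 65 65
print(len(num_objects), len(num_facts))    # 0 1
```

The propositional initial state grows linearly with the counter bound, and the same pattern repeats for every coordinate axis, whereas the numeric form stays constant in size.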

Representative operator definitions:

(:action move-north
 :parameters (?ag - agent)
 :precondition (and (agent-alive ?ag))
 :effect (and (decrease (y ?ag) 1))
)
; Break the grass block north of the agent and advance the inventory
; counter one step along the are-seq chain (?n_start -> ?n_end).
; ?z is the position that follows ?z_front in the successor chain.
(:action break-grass_block-north
 :parameters (?ag - agent ?b - grass_block-block ?x ?y ?z ?z_front - position ?n_start ?n_end - count)
 :precondition (and (at-x ?ag ?x) (at-y ?ag ?y) (at-z ?b ?z_front) (are-seq ?z_front ?z) (block-present ?b) (agent-has-n-grass_block ?ag ?n_start) (are-seq ?n_start ?n_end))
 :effect (and (not (block-present ?b)) (not (agent-has-n-grass_block ?ag ?n_start)) (agent-has-n-grass_block ?ag ?n_end))
)

This approach enables scalable and reproducible instance generation, critical for empirical planning system evaluation.

4. Evaluation Metrics and Experimental Protocol

Each benchmark task is evaluated on planning efficiency and solution quality (plan length). For every problem:

$T_\text{trans}$ : PDDL to intermediate representation translation/grounding time (s)
$T_\text{search}$ : Solution search time (ms)
$T_\text{total}$ : Aggregate planning time
$L$ : Plan length in actions
$M$ : Peak memory usage (collected, not tabulated)

Results are averaged across five runs: $\bar T_\text{total} = \frac{1}{5} \sum_{i=1}^5 T^{(i)}_{\text{total}}$ , with planning time variance $\sigma_T$ . A two-hour (7200 s) timeout is imposed; unsolved instances are assigned $T = 7.2{\times}10^6$ ms.
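The averaging and timeout-penalty rule can be sketched as follows. The `summarize` helper is hypothetical, `None` is an assumed marker for an unsolved run, and the population standard deviation is used here as one possible reading of $\sigma_T$.

```python
from statistics import mean, pstdev

TIMEOUT_MS = 7.2e6  # two-hour limit (7200 s) expressed in milliseconds

def summarize(run_times_ms):
    """Average total planning time over repeated runs; None marks an unsolved run."""
    # Unsolved or over-limit runs are charged the full timeout.
    penalized = [TIMEOUT_MS if t is None else min(t, TIMEOUT_MS) for t in run_times_ms]
    return {
        "mean_ms": mean(penalized),
        "std_ms": pstdev(penalized),  # assumption: sigma_T as population std. dev.
        "solved": sum(t is not None and t < TIMEOUT_MS for t in run_times_ms),
    }

print(summarize([100.0, 200.0, 300.0, 400.0, 500.0])["mean_ms"])  # 300.0
print(summarize([None] * 5)["mean_ms"])                           # 7200000.0
```

Because the penalty is three or more orders of magnitude above typical solved times, a single timed-out run dominates the mean, so the solved count is worth reporting alongside it.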

5. Planner Performance Analysis

Representative results in Table 1 (Hill et al., 2023) demonstrate scalability constraints in both propositional (Fast Downward) and numeric (ENHSP-20) planners. Key findings include:

Fast Downward is effective for small or easy tasks but often fails at medium difficulty and above, because grounding explodes as $N$ grows.
ENHSP-20 handles many easy and medium tasks but reliably times out on the largest construction and long-horizon tasks.
No planner solves hard variants of the nontrivial task classes within the timeout, underscoring the limits of current classical planning heuristics.

Task	Variant	FD Total (s)	ENHSP Total (s)
move	Easy	41.9	20.4
move	Medium	237.7	317.2
pickup	Easy	341.96	20.52
build_bridge	Easy	—	timeout (recorded as 7.2e6 ms)

These results attest to the difficulties posed by high object counts, long linear action horizons, and deeply nested objective decompositions in Minecraft-style domains.

6. Limitations and Directions for Future Research

Several structural challenges constrain the benchmark’s applicability and current planner effectiveness:

Propositional encodings suffer from combinatorial grounding explosion, typically limiting tractable grids to $N \lesssim 50$ .
Numeric planners accommodate fluent counts but cannot scale to instances with hundreds of thousands of objects.
Tasks featuring long, linear horizons (e.g., move-1000, place-500) are intractable for existing heuristics.
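The grounding explosion in the first point can be estimated with a rough upper bound: the number of naive ground instances of an action schema is the product of the object counts of its parameter types. The sketch below assumes a schema with one agent, one block, three position, and two count parameters, similar in shape to the break action shown earlier; the object counts are made up for illustration.

```python
from math import prod

def ground_action_bound(param_types, object_counts):
    """Upper bound on ground instances of one schema: product over parameter types."""
    return prod(object_counts[t] for t in param_types)

# Assumed parameter list: 1 agent, 1 block, 3 positions, 2 counts.
schema = ["agent", "block", "position", "position", "position", "count", "count"]

for n in (10, 50):
    counts = {"agent": 1, "block": 100, "position": n, "count": 65}
    print(n, ground_action_bound(schema, counts))
# 10 422500000
# 50 52812500000
```

Real grounders prune instances whose static preconditions (such as `are-seq`) cannot hold, so actual counts are lower, but the cubic dependence on the number of positions indicates why large grids quickly become intractable.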

Promising research avenues identified by MINE-1 include:

Hierarchical and subgoal decomposition to break deep tasks into tractable subtasks.
Macro-actions for sequence abstraction (e.g., “move-and-pickup-sequence”).
Geometry-aware 3D heuristics (e.g., landmark graphs).
Integration of learning-based priors from agent demonstrations.

A plausible implication is that progress beyond the limits of present systems will require scalable hybrid and hierarchical planners, potentially leveraging sampling, policy learning, or subgoal graphs.

7. Benchmark Significance and Impact

MINE-1 provides a reproducible foundation for empirical evaluation of planning systems under long-horizon, resource-rich, spatially combinatorial conditions. It highlights scalability gaps in both propositional and numeric planners and offers systematic diagnostics to guide research on encoding representations, efficient grounding, hierarchical decomposition, and integration of learning-based methods (Hill et al., 2023). It is therefore a useful reference for research groups in automated planning, reinforcement learning, and embodied AI that target open-world task synthesis and execution.

References

Hill et al. (2023). MinePlanner: A Benchmark for Long-Horizon Planning in Large Minecraft Worlds.
