
MINE-1: Long-Horizon Planning Benchmark

Updated 3 December 2025
  • MINE-1 is a structured benchmark that evaluates long-horizon task planning in complex, resource-rich environments modeled after Minecraft.
  • It consists of 45 distinct tasks across easy, medium, and hard levels, emphasizing navigation, resource management, and structured construction challenges.
  • The benchmark rigorously tests planners using both propositional and numeric PDDL encodings, highlighting scalability constraints through comprehensive performance metrics.

MINE-1 is a systematically structured benchmark for long-horizon task planning in complex, spatially rich environments modeled after the Minecraft world. It is designed to stress-test both propositional and numeric planners in scenarios with high-dimensional state spaces, nested resource dependencies, and deep multi-step objectives. The benchmark comprises 45 distinct tasks, spanning low-level navigation, resource collection, and structured construction, each encoded in both propositional and numeric PDDL. It exposes scalability bottlenecks and solution gaps in state-of-the-art planning systems, providing a foundation for research on hierarchical, geometric, and hybrid planning algorithms (Hill et al., 2023).

1. Formal Structure and PDDL Encodings

Each MINE-1 instance is formalized as a tuple $P = \langle F, A, s_0, G, c \rangle$, where:

  • $F$ is a set of grounded fluents, including both Boolean predicates (e.g. block-present$(b)$, item-present$(i)$, at-x$(\text{agent}, x)$) and numeric functions (e.g. $(x\,\text{agent})$, agent-num-wood$(\text{agent})$).
  • $A$ is the finite set of parameterized actions, each specified by preconditions Pre$(\alpha)$ and effects Eff$(\alpha)$.
  • $s_0 \in 2^F$ is the initial state.
  • $G$ is a conjunction of fluents defining the goal condition.
  • $c: A \rightarrow \mathbb{R}_+$ is a cost function, set globally as $c(\alpha) = 1\ \forall \alpha$ in MINE-1.

A valid plan $\pi = \langle \alpha_1, \ldots, \alpha_k \rangle$ is a sequence of actions that are each applicable in turn starting from $s_0$ and whose final state satisfies $G$. Plan length $L(\pi) = k$ and search time $T_\text{search}$ are the primary performance measures. The PDDL representations come in a propositional form (unary predicates and successor chains) and a numeric form (:functions and arithmetic operations) to accommodate different planner capabilities.
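The plan-validity condition above can be illustrated with a minimal sketch. It assumes a purely propositional, STRIPS-style reading in which each ground action has preconditions, add effects, and delete effects; the `Action` class, `validate_plan`, and the `move_east` helper are hypothetical names introduced for illustration and are not part of the benchmark.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Action:
    """Ground propositional action (illustrative, not the benchmark's format)."""
    name: str
    pre: frozenset
    add: frozenset
    delete: frozenset
    cost: int = 1  # MINE-1 assigns unit cost to every action


def validate_plan(s0, goal, plan):
    """Return (is_valid, accumulated_cost) for a ground plan applied from s0."""
    state = set(s0)
    total = 0
    for action in plan:
        if not action.pre <= state:  # some precondition does not hold
            return False, total
        state = (state - action.delete) | action.add
        total += action.cost
    return goal <= state, total


def move_east(x):
    """Hypothetical ground action moving the agent from cell x to cell x + 1."""
    return Action(
        name=f"move-east-{x}",
        pre=frozenset({("at-x", "agent", x)}),
        add=frozenset({("at-x", "agent", x + 1)}),
        delete=frozenset({("at-x", "agent", x)}),
    )


s0 = {("at-x", "agent", 0)}
goal = {("at-x", "agent", 3)}
plan = [move_east(0), move_east(1), move_east(2)]
print(validate_plan(s0, goal, plan))
```

With unit costs, the accumulated cost of a valid plan equals its length $L(\pi)$.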

2. Task Types and Difficulty Hierarchy

MINE-1 encompasses 15 canonical task types, each instantiated at Easy, Medium, and Hard difficulty levels, for 45 tasks in total. The tasks combine primitive manipulation (e.g. move, pickup, break-block) with compound objectives (e.g. build_cabin, collect_and_build_shape), in both 2D and 3D spatial environments. Key challenge axes include:

  • Spatial reasoning: navigation across $N \times N$ or higher-dimensional grids, staircase climbing
  • Resource management: inventory constraints, multi-type block collection
  • Construction sequencing: strict placement orderings, structural geometry
  • Long-horizon planning: hundreds to thousands of required actions per instance

The structure and objectives of five representative task types are summarized below.

| Task Type | Objective | Key Challenge |
| --- | --- | --- |
| move | Reach a distant goal cell | Very long sequential horizon |
| gather_wood | Collect $K$ blocks from forest | Interleaved collect, navigate |
| build_wall | Construct $W \times H$ block wall | Bulk placement actions |
| build_cabin | Chop, transport, and build cabin | Highly nested subtasks |
| collect_and_build_shape | Gather blocks and reproduce target shape | Two-phase plan, resource buildup |

This systematic expansion enables robustness evaluation across both atomic and composite task classes (Hill et al., 2023).

3. Instance Generation and Encoding Methodology

Parameterized domain templates and Python automation scripts populate virtual $N \times N \times H$ worlds, randomly place trees, items, and obstacles, and bind all grounded objects (block, agent, position) in the problem files. World parameters (grid size $N$, tree height $H$, block counts $K$) and difficulty flags control instance scaling.
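As a rough illustration of seeded, parameterized world generation, the sketch below places an agent and a number of trees on distinct cells of a grid. It is not the benchmark's generator: the function name `generate_world`, the block naming scheme, and the returned dictionary layout are all assumptions made for this example.

```python
import random


def generate_world(n, num_trees, tree_height, seed=0):
    """Place one agent and num_trees trees on distinct cells of an n x n grid.

    Hypothetical layout: each tree is a vertical stack of tree_height log blocks.
    """
    if num_trees + 1 > n * n:
        raise ValueError("grid too small for the requested number of trees")
    rng = random.Random(seed)  # a fixed seed makes the instance reproducible
    cells = [(x, z) for x in range(n) for z in range(n)]
    agent_cell, *tree_cells = rng.sample(cells, num_trees + 1)
    blocks = [
        (f"log-{i}-{h}", x, h, z)  # (object name, x, y, z)
        for i, (x, z) in enumerate(tree_cells)
        for h in range(tree_height)
    ]
    return {"agent": agent_cell, "trees": tree_cells, "blocks": blocks}


world = generate_world(n=10, num_trees=3, tree_height=4, seed=1)
print(len(world["blocks"]), "log blocks; agent at", world["agent"])
```

Scaling `n`, `num_trees`, and `tree_height` plays the role that $N$, $K$, and $H$ play in the description above.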

  • Propositional encoding: All quantities are modeled as unary predicates and “are-seq” successor relations. Resource counters appear as (agent-has-n-grass_block agent n); spatial transitions encode adjacent coordinates via (are-seq m n).
  • Numeric encoding: Quantities such as positions and inventory counts are :functions within PDDL; transitions utilize (increase …), (decrease …) operators. Numeric instances are compact but require planners supporting arithmetic fluents (e.g., ENHSP-20).
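The difference in size between the two encodings can be made concrete by emitting the initial-state facts each one needs for a single agent. This is a sketch under assumptions: the object names `pos0`, `n0`, and the agent name `steve` are invented, and `agent-num-grass_block` is an assumed numeric analogue of the agent-num-wood function mentioned earlier.

```python
def propositional_init(agent, x, count, max_coord, max_count, item="grass_block"):
    """Initial facts when numbers are objects linked by are-seq successor chains."""
    facts = [f"(are-seq pos{i} pos{i + 1})" for i in range(max_coord)]
    facts += [f"(are-seq n{i} n{i + 1})" for i in range(max_count)]
    facts.append(f"(at-x {agent} pos{x})")
    facts.append(f"(agent-has-n-{item} {agent} n{count})")
    return facts


def numeric_init(agent, x, count, item="grass_block"):
    """Initial facts when positions and counters are PDDL numeric functions."""
    return [
        f"(= (x {agent}) {x})",
        f"(= (agent-num-{item} {agent}) {count})",
    ]


prop = propositional_init("steve", x=2, count=0, max_coord=50, max_count=64)
num = numeric_init("steve", x=2, count=0)
print(len(prop), "propositional facts vs", len(num), "numeric facts")
```

The propositional list grows with the largest coordinate and counter value that must be representable, while the numeric list depends only on the number of quantities tracked.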

Representative operator definitions:

; Numeric encoding: the agent's coordinate is a numeric function
(:action move-north
 :parameters (?ag - agent)
 :precondition (and (agent-alive ?ag))
 :effect (and (decrease (y ?ag) 1))
)

; Propositional encoding: coordinates and counts are objects chained by are-seq
; ?z is assumed to be the agent's z coordinate (it is referenced by are-seq below)
(:action break-grass_block-north
 :parameters (?ag - agent ?b - grass_block-block ?x ?y ?z ?z_front - position ?n_start ?n_end - count)
 :precondition (and (at-x ?ag ?x) (at-y ?ag ?y) (at-z ?ag ?z) (at-z ?b ?z_front) (are-seq ?z_front ?z) (block-present ?b) (agent-has-n-grass_block ?ag ?n_start) (are-seq ?n_start ?n_end))
 :effect (and (not (block-present ?b)) (not (agent-has-n-grass_block ?ag ?n_start)) (agent-has-n-grass_block ?ag ?n_end))
)

This approach enables scalable and reproducible instance generation, critical for empirical planning system evaluation.

4. Evaluation Metrics and Experimental Protocol

Each benchmark task is evaluated on plan synthesis efficiency and solution optimality. For every problem:

  • $T_\text{trans}$: PDDL to intermediate representation translation/grounding time (s)
  • $T_\text{search}$: Solution search time (ms)
  • $T_\text{total}$: Aggregate planning time
  • $L$: Plan length in actions
  • $M$: Peak memory usage (collected, not tabulated)

Results are averaged across five runs: $\bar{T}_\text{total} = \frac{1}{5} \sum_{i=1}^{5} T^{(i)}_\text{total}$, with planning time variance $\sigma_T$. A two-hour (7200 s) timeout is imposed; unsolved instances are assigned $T = 7.2 \times 10^6$ ms.
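The averaging protocol with a timeout sentinel can be sketched as follows. The function name `aggregate_runs`, the use of `None` to mark an unsolved run, and the choice of the population standard deviation as the spread statistic are assumptions; the source does not specify how the spread is computed.

```python
from statistics import mean, pstdev

TIMEOUT_MS = 7_200 * 1_000  # two hours expressed in milliseconds (7.2e6)


def aggregate_runs(total_times_ms):
    """Average total planning time over repeated runs of one instance.

    A run that did not finish (None) or exceeded the limit is replaced
    by the timeout sentinel before averaging.
    """
    capped = [
        TIMEOUT_MS if t is None or t >= TIMEOUT_MS else t
        for t in total_times_ms
    ]
    return {
        "mean_ms": mean(capped),
        "std_ms": pstdev(capped),  # assumed spread statistic
        "solved": sum(t < TIMEOUT_MS for t in capped),
    }


print(aggregate_runs([1000, 2000, 3000, 4000, 5000]))
print(aggregate_runs([None, None, None, None, None]))
```

Because the sentinel is several orders of magnitude larger than typical solve times, a single timed-out run dominates the mean, so the solved count is worth reporting alongside it.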

5. Planner Performance Analysis

Representative results in Table 1 (Hill et al., 2023) demonstrate scalability constraints in both propositional (Fast Downward) and numeric (ENHSP-20) planners. Key findings include:

  • Fast Downward is effective for small or easy tasks but fails on medium difficulty due to grounding explosion in large $N$.
  • ENHSP-20 handles many easy and medium tasks but reliably times out on the largest construction and long-horizon tasks.
  • No planner solves hard variants of the nontrivial task classes within the timeout, underscoring the limits of current classical planning heuristics.
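
| Task | Variant | FD Total (s) | ENHSP Total (s) |
| --- | --- | --- | --- |
| move | Easy | 41.9 | 20.4 |
| move | Medium | 237.7 | 317.2 |
| pickup | Easy | 341.96 | 20.52 |
| build_bridge | Easy | 7.2e6 (timeout) | |

The build_bridge row lists only one value in the source, the timeout sentinel of 7.2e6 ms; its planner column is unclear.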

These results attest to the difficulties posed by high object counts, linear action horizons, and deeply nested objective decompositions in Minecraft-style domains.

6. Limitations and Directions for Future Research

Several structural challenges constrain the benchmark’s applicability and current planner effectiveness:

  • Propositional encodings suffer from combinatorial grounding explosion, typically limiting tractable grids to $N \lesssim 50$.
  • Numeric planners accommodate fluent counts but cannot scale to instances with hundreds of thousands of objects.
  • Tasks featuring long, linear horizons (e.g., move-1000, place-500) are intractable for existing heuristics.
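The grounding explosion noted in the first limitation can be illustrated with a back-of-the-envelope count. A naive upper bound on the ground instances of one operator is the product of the number of objects available for each parameter. The schema below is a hypothetical break-style operator with one agent, one block, four position parameters, and two count parameters; the object counts (one block per ground cell, 65 count objects) are likewise assumptions. Real grounders prune many of these combinations using static facts, so the figures are upper bounds only.

```python
from math import prod


def naive_groundings(param_types, object_counts):
    """Upper bound on ground instances: product of per-parameter object counts."""
    return prod(object_counts[t] for t in param_types)


# Hypothetical operator schema, listed by parameter type.
schema = ["agent", "block", "position", "position",
          "position", "position", "count", "count"]

for n in (10, 25, 50):
    counts = {"agent": 1, "block": n * n, "position": n, "count": 65}
    print(f"N={n}: {naive_groundings(schema, counts):.3e} candidate groundings")
```

Under these assumptions the bound grows as $N^6$ for this single operator, which gives a sense of why grid size dominates propositional tractability.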

Promising research avenues identified by MINE-1 include:

  • Hierarchical and subgoal decomposition to break deep tasks into tractable subtasks.
  • Macro-actions for sequence abstraction (e.g., “move-and-pickup-sequence”).
  • 3D geometric-sensitive heuristics (e.g., landmark graphs).
  • Integration of learning-based priors from agent demonstrations.
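To make the macro-action idea concrete, the sketch below composes two ground propositional actions into a single action with the same net effect. It assumes STRIPS-style semantics in which applying an action removes its delete effects and then adds its add effects; the `Action` tuple, `apply`, `compose`, and the example facts are hypothetical and not taken from the benchmark.

```python
from typing import NamedTuple


class Action(NamedTuple):
    name: str
    pre: frozenset
    add: frozenset
    delete: frozenset


def apply(state, action):
    """Apply a ground action to a state given as a frozenset of facts."""
    if not action.pre <= state:
        raise ValueError(f"{action.name} is not applicable")
    return (state - action.delete) | action.add


def compose(a1, a2):
    """Build a macro-action equivalent to executing a1 and then a2."""
    if a2.pre & (a1.delete - a1.add):
        raise ValueError("a1 destroys a precondition of a2")
    return Action(
        name=f"{a1.name}+{a2.name}",
        pre=a1.pre | (a2.pre - a1.add),     # what must hold before a1 starts
        add=(a1.add - a2.delete) | a2.add,  # facts true after both steps
        delete=a1.delete | a2.delete,
    )


move = Action("move-0-1", frozenset({"at-0"}), frozenset({"at-1"}), frozenset({"at-0"}))
pickup = Action("pickup-1", frozenset({"at-1", "item-at-1"}),
                frozenset({"has-item"}), frozenset({"item-at-1"}))
macro = compose(move, pickup)

start = frozenset({"at-0", "item-at-1"})
print(sorted(apply(start, macro)))
```

A planner equipped with such macros takes fewer search steps per plan, at the price of a larger action set to ground.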

A plausible implication is that progress beyond current system limits will require breakthroughs in scalable, hybrid, and hierarchical planners, potentially leveraging sampling, policy learning, or subgoal graphs.

7. Benchmark Significance and Impact

MINE-1 establishes a reproducible foundation for empirical evaluation of planning systems under long-horizon, resource-rich, spatial-combinatorial conditions. It highlights critical scalability gaps in both propositional and numeric planners and provides systematic diagnostics to guide research on encoding representations, efficient grounding, hierarchical decomposition, and integration of learning-based methods (Hill et al., 2023). As such, MINE-1 is a useful reference for research groups in automated planning, reinforcement learning, and embodied AI that target open-world task synthesis and execution.
