MINE-1: Long-Horizon Planning Benchmark
- MINE-1 is a structured benchmark that evaluates long-horizon task planning in complex, resource-rich environments modeled after Minecraft.
- It consists of 45 distinct tasks across easy, medium, and hard levels, emphasizing navigation, resource management, and structured construction challenges.
- The benchmark rigorously tests planners using both propositional and numeric PDDL encodings, highlighting scalability constraints through comprehensive performance metrics.
MINE-1 is a systematically structured benchmark for long-horizon task planning in complex, spatially rich environments modelled after the Minecraft world. Designed to stress-test both propositional and numeric planners in scenarios featuring high-dimensional state spaces, nested resource dependencies, and deep multi-step objectives, MINE-1 comprises 45 distinct tasks—spanning low-level navigation, resource collection, and structured construction—encoded in both PDDL propositional and numeric representations. The benchmark exposes scalability bottlenecks and solution gaps in state-of-the-art planning systems, providing a foundation for research on hierarchical, geometric, and hybrid planning algorithms (Hill et al., 2023).
1. Formal Structure and PDDL Encodings
Each MINE-1 instance is formalized as a tuple $\langle F, A, I, G, c \rangle$, where:
- $F$ is a set of grounded fluents, including both Boolean predicates (e.g. block-present, item-present, at-x) and numeric functions (e.g. agent-num-wood).
- $A$ is the finite set of parameterized actions, each specified by preconditions Pre and effects Eff.
- $I$ is the initial state.
- $G$ is a conjunction of fluents defining the goal condition.
- $c$ is a cost function, fixed globally in MINE-1.
A valid plan is an action sequence that, starting from the initial state, satisfies each operator's preconditions in turn and ends in a state where the goal condition holds; plan length and search time are the main performance measures. The PDDL representations come in two forms, propositional (unary predicates and successor chains) and numeric (:functions and arithmetic operations), to accommodate different planner capabilities.
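The plan-validity condition above can be sketched in a few lines of Python. This is a minimal illustration, not the benchmark's own tooling: the `Action` class, `validate_plan`, and the toy `move_north` instance are hypothetical names, and preconditions and effects are modelled as plain functions over a dictionary state rather than parsed from PDDL.

```python
from dataclasses import dataclass
from typing import Callable, Dict, List, Union

# A state maps fluent names to values: bool for predicates, numbers for numeric functions.
State = Dict[str, Union[bool, int, float]]


@dataclass(frozen=True)
class Action:
    name: str
    pre: Callable[[State], bool]   # precondition test
    eff: Callable[[State], State]  # effect: returns the successor state


def validate_plan(init: State, goal: Callable[[State], bool], plan: List[Action]) -> bool:
    """Return True if each action is applicable in turn and the final state satisfies the goal."""
    state = dict(init)
    for action in plan:
        if not action.pre(state):
            return False
        state = action.eff(state)
    return goal(state)


# Hypothetical toy instance: the agent walks north (decreasing y) until y == 0.
move_north = Action(
    name="move-north",
    pre=lambda s: bool(s["agent-alive"]),
    eff=lambda s: {**s, "y": s["y"] - 1},
)

init_state: State = {"agent-alive": True, "y": 3}


def reaches_goal(s: State) -> bool:
    return s["y"] == 0


plan = [move_north] * 3
print(validate_plan(init_state, reaches_goal, plan), len(plan))
```

Plan length is simply `len(plan)` here; a real validator would also check the plan against the parsed domain and problem files.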
2. Task Types and Difficulty Hierarchy
MINE-1 encompasses 15 canonical task types, each instantiated at Easy, Medium, and Hard difficulty levels, for 45 tasks in total. The tasks integrate primitive manipulation (e.g. move, pickup, break-block) with compound objectives (e.g. build_cabin, collect_and_build_shape), in both 2D and 3D spatial environments. Key challenge axes include:
- Spatial reasoning: navigation across large grids, staircase climbing
- Resource management: inventory constraints, multi-type block collection
- Construction sequencing: strict placement orderings, structural geometry
- Long-horizon planning: hundreds to thousands of required actions per instance
The structure and objectives of five representative task types are summarized below.
| Task Type | Objective | Key Challenge |
|---|---|---|
| move | Reach a distant goal cell | Very long sequential horizon |
| gather_wood | Collect wood blocks from a forest | Interleaved collection and navigation |
| build_wall | Construct a block wall | Bulk placement actions |
| build_cabin | Chop, transport, and build cabin | Highly nested subtasks |
| collect_and_build_shape | Gather blocks and reproduce target shape | Two-phase plan, resource buildup |
This systematic expansion enables robustness evaluation across both atomic and composite task classes (Hill et al., 2023).
3. Instance Generation and Encoding Methodology
Parameterized domain templates and Python automation scripts populate the virtual worlds, randomly place trees, items, and obstacles, and bind all grounded objects (block, agent, position) in the problem files. World parameters (grid size, tree height, block counts) and difficulty flags control instance scaling.
- Propositional encoding: All quantities are modelled as unary predicates and “are-seq” successor relations. Resource counters appear as predicates such as (agent-has-n-grass_block ?ag ?n); spatial transitions encode adjacent coordinates via (are-seq ?z_front ?z).
- Numeric encoding: Quantities such as positions and inventory counts are :functions within PDDL; transitions utilize (increase …), (decrease …) operators. Numeric instances are compact but require planners supporting arithmetic fluents (e.g., ENHSP-20).
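To make the size difference between the two encodings concrete, the sketch below emits the `:init` facts needed for a single inventory counter in each style. It is an illustrative assumption, not the benchmark's generator: the function names, the count objects `n0..nK`, the agent name `steve`, and the exact signature of `agent-num-wood` are all hypothetical.

```python
from typing import List


def propositional_counter_init(item: str, agent: str, start: int, max_count: int) -> List[str]:
    """Unary-style encoding: an are-seq successor chain over n0..nK plus one 'has-n' fact."""
    facts = [f"(are-seq n{i} n{i + 1})" for i in range(max_count)]
    facts.append(f"(agent-has-n-{item} {agent} n{start})")
    return facts


def numeric_counter_init(item: str, agent: str, start: int) -> List[str]:
    """Numeric encoding: a single function assignment, independent of the maximum count."""
    return [f"(= (agent-num-{item} {agent}) {start})"]


prop = propositional_counter_init("wood", "steve", start=0, max_count=3)
num = numeric_counter_init("wood", "steve", start=0)
print("\n".join(prop))
print("\n".join(num))
```

The propositional version grows linearly with the largest count it must represent, while the numeric version stays at one fact per counter, which is the compactness trade-off described above.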
Representative operator definitions:
```lisp
; Numeric encoding: moving north decrements the agent's y coordinate.
(:action move-north
  :parameters (?ag - agent)
  :precondition (and (agent-alive ?ag))
  :effect (and (decrease (y ?ag) 1))
)

; Propositional encoding: breaking the block to the north removes it from the
; world and advances the inventory counter along the are-seq successor chain.
(:action break-grass_block-north
  :parameters (?ag - agent ?b - grass_block-block
               ?x ?y ?z ?z_front - position
               ?n_start ?n_end - count)
  :precondition (and (at-x ?ag ?x) (at-y ?ag ?y)
                     (at-z ?b ?z_front) (are-seq ?z_front ?z)
                     (block-present ?b)
                     (agent-has-n-grass_block ?ag ?n_start)
                     (are-seq ?n_start ?n_end))
  :effect (and (not (block-present ?b))
               (not (agent-has-n-grass_block ?ag ?n_start))
               (agent-has-n-grass_block ?ag ?n_end))
)
```
4. Evaluation Metrics and Experimental Protocol
Each benchmark task is evaluated on plan synthesis efficiency and solution optimality. For every problem:
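- $T_{\text{trans}}$: PDDL to intermediate representation translation/grounding time (s)
- $T_{\text{search}}$: Solution search time (ms)
- $T_{\text{total}}$: Aggregate planning time
- $L$: Plan length in actions
- $M$: Peak memory usage (collected, not tabulated)
Results are averaged across five runs, $\bar{T} = \frac{1}{5}\sum_{i=1}^{5} T_i$, where $T_i$ is the planning time of run $i$, and the planning-time variance is computed over the same runs. A two-hour (7200 s) timeout is imposed; unsolved instances are assigned $7.2 \times 10^{6}$ ms.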
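The aggregation protocol can be sketched as follows. This is a hedged illustration rather than the authors' evaluation script: `total_time_ms` and `summarize` are hypothetical helpers, the total is assumed to be translation time plus search time, the timeout value is assumed to replace the total of an unsolved run, and population variance is used because the source does not say which variance estimator was applied.

```python
from statistics import mean, pvariance
from typing import List, Optional, Tuple

TIMEOUT_MS = 7_200 * 1_000  # two-hour limit expressed in milliseconds (7.2e6 ms)


def total_time_ms(translate_s: float, search_ms: Optional[float]) -> float:
    """Combine translation time (s) and search time (ms); None marks an unsolved run."""
    if search_ms is None:
        return float(TIMEOUT_MS)  # assumption: unsolved runs are charged the full timeout
    return translate_s * 1_000 + search_ms


def summarize(runs: List[Tuple[float, Optional[float]]]) -> Tuple[float, float]:
    """Return (mean, population variance) of total planning time in ms over the runs."""
    totals = [total_time_ms(t, s) for t, s in runs]
    return mean(totals), pvariance(totals)


# Five made-up runs of one instance: (translation seconds, search milliseconds).
runs = [(1.0, 500.0), (1.0, 700.0), (1.0, 800.0), (1.0, 200.0), (1.0, 800.0)]
avg_ms, var_ms = summarize(runs)
print(avg_ms, var_ms)
```

Keeping the units explicit in the function names avoids mixing the seconds used for translation with the milliseconds used for search.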
5. Planner Performance Analysis
Representative results in Table 1 (Hill et al., 2023) demonstrate scalability constraints in both propositional (Fast Downward) and numeric (ENHSP-20) planners. Key findings include:
- Fast Downward is effective for small or easy tasks but tends to fail from medium difficulty onward, because grounding explodes as instance size grows.
- ENHSP-20 handles many easy and medium tasks but reliably times out on the largest construction and long-horizon tasks.
- No planner solves hard variants of the nontrivial task classes within the timeout, underscoring the limits of current classical planning heuristics.
| Task | Variant | FD Total (s) | ENHSP Total (s) |
|---|---|---|---|
| move | Easy | 41.9 | 20.4 |
| move | Medium | 237.7 | 317.2 |
| pickup | Easy | 341.96 | 20.52 |
| build_bridge | Easy | — | timeout (7.2e6 ms) |
These results attest to the difficulties posed by high object counts, linear action horizons, and deeply nested objective decompositions in Minecraft-style domains.
6. Limitations and Directions for Future Research
Several structural challenges constrain the benchmark’s applicability and current planner effectiveness:
- Propositional encodings suffer from combinatorial grounding explosion, which limits the grid sizes that remain tractable.
- Numeric planners accommodate fluent counts but cannot scale to instances with hundreds of thousands of objects.
- Tasks featuring long, linear horizons (e.g., move-1000, place-500) are intractable for existing heuristics.
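The grounding explosion in the first point can be illustrated with a back-of-the-envelope count. The sketch below computes a naive upper bound on the ground instances of one action schema as the product of object counts over its typed parameters. The schema signature, object counts, and function name are hypothetical; real grounders prune many of these combinations using static facts such as successor relations, so actual numbers are lower.

```python
from math import prod
from typing import Dict, List


def naive_ground_count(param_types: List[str], objects_per_type: Dict[str, int]) -> int:
    """Upper bound on ground instances of a schema: product of object counts per parameter."""
    return prod(objects_per_type[t] for t in param_types)


# Hypothetical signature shaped like a block-breaking operator:
# one agent, one block, three positions, two inventory counts.
schema = ["agent", "block", "position", "position", "position", "count", "count"]

for side in (8, 16, 32):
    # Assumed world: side*side blocks, `side` coordinate objects, 65 count objects.
    objects = {"agent": 1, "block": side * side, "position": side, "count": 65}
    print(side, naive_ground_count(schema, objects))
```

Under these assumptions, doubling the grid side multiplies the bound by 32 (a factor of 4 from the blocks and 8 from the three position parameters), which shows why modest increases in grid size can overwhelm a propositional grounder.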
Promising research avenues identified by MINE-1 include:
- Hierarchical and subgoal decomposition to break deep tasks into tractable subtasks.
- Macro-actions for sequence abstraction (e.g., “move-and-pickup-sequence”).
- Geometry-aware 3D heuristics (e.g., landmark graphs).
- Integration of learning-based priors from agent demonstrations.
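As an illustration of the macro-action idea, the sketch below composes two propositional STRIPS actions into a single macro using the standard sequential-composition rule. It is not taken from the benchmark: the `Strips` class, `compose`, and the toy `move` and `pickup` actions are hypothetical, and numeric effects are not handled.

```python
from dataclasses import dataclass
from typing import FrozenSet, Optional


@dataclass(frozen=True)
class Strips:
    name: str
    pre: FrozenSet[str]
    add: FrozenSet[str]
    delete: FrozenSet[str]


def compose(a: Strips, b: Strips) -> Optional[Strips]:
    """Build the macro 'a then b'; return None if a always destroys a precondition of b."""
    if (b.pre & a.delete) - a.add:
        return None
    add = (a.add - b.delete) | b.add
    return Strips(
        name=f"{a.name}-then-{b.name}",
        pre=a.pre | (b.pre - a.add),          # b's needs that a does not supply itself
        add=add,
        delete=(a.delete | b.delete) - add,   # deleted by either step and not re-added
    )


# Hypothetical toy actions on two cells c0 and c1.
move = Strips("move-c0-c1", frozenset({"at-c0"}), frozenset({"at-c1"}), frozenset({"at-c0"}))
pickup = Strips(
    "pickup-c1",
    frozenset({"at-c1", "item-at-c1"}),
    frozenset({"has-item"}),
    frozenset({"item-at-c1"}),
)

macro = compose(move, pickup)
print(macro)
```

A planner that treats `macro` as one step shortens the search horizon by one action per use, at the cost of a larger action set to ground.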
A plausible implication is that progress beyond present system boundaries will require breakthroughs in scalable, hybrid, and hierarchical planners, potentially leveraging sampling, policy learning, or subgoal graphs.
7. Benchmark Significance and Impact
MINE-1 establishes a reproducible foundation for empirical evaluation of planning systems under long-horizon, resource-rich, spatial-combinatorial conditions. It highlights critical scalability gaps in both propositional and numeric planners and provides systematic diagnostics to guide research on encoding representations, efficient grounding, hierarchical decomposition, and integration of learning-based methods (Hill et al., 2023). As such, MINE-1 is a useful reference for research groups in automated planning, reinforcement learning, and embodied AI that target open-world task synthesis and execution.