A Rigorous Benchmark for Translating Text to Structured Planning Languages
The paper introduces a novel benchmark, termed \benchmarkName{}, designed to evaluate the capacity of LLMs to generate Planning Domain Definition Language (PDDL) code from natural language descriptions of planning tasks. The benchmark addresses the difficulty of accurately measuring the quality of generated PDDL code, targeting a significant gap in current evaluation methodology.
Recent advances have shown promise for using LLMs to translate natural language descriptions into structured planning languages such as PDDL. However, accurately evaluating the quality of such translations has remained challenging for two main reasons: reliance on planning validators, which can accept PDDL code that is valid and solvable yet semantically incorrect, and the lack of benchmarks with sufficiently varied natural language descriptions and adequately challenging task sets. In response, \benchmarkName{} proposes a more rigorous evaluation methodology and provides an extensive dataset of diverse and difficult text-to-PDDL translation tasks.
Key Components of \benchmarkName{}
Evaluation Framework
\benchmarkName{}'s evaluation framework includes a PDDL equivalence algorithm that ensures the correctness of generated PDDL by comparing it against a ground truth. This approach overcomes the limitations of conventional planning validators by precisely defining equivalence and implementing an efficient, automatic way of checking it. The evaluation framework operates by transforming PDDL code into scene graphs and performing isomorphism checks between these graphs. By doing so, it ensures that two PDDL problem formulations are deemed equivalent only if they represent the same underlying planning task, thereby providing a robust measure of semantic correctness.
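To give a flavor of the isomorphism check, the sketch below compares two state descriptions by encoding their ground atoms as labeled graphs and testing graph isomorphism with networkx. The atom-extraction step and the exact graph encoding are assumptions made for illustration, not the paper's actual construction.

```python
# Minimal sketch of the scene-graph equivalence idea, assuming ground atoms
# have already been extracted from each PDDL problem (e.g. from its :init
# and :goal sections); an illustration, not the paper's implementation.
import networkx as nx

def state_to_graph(atoms):
    """Encode ground atoms such as ("on", "a", "b") or ("clear", "a")
    as a labeled directed graph over the objects they mention."""
    g = nx.DiGraph()
    for pred, *args in atoms:
        for obj in args:
            g.add_node(obj)
        if len(args) == 2:        # binary predicate -> labeled edge
            g.add_edge(args[0], args[1], label=pred)
        elif len(args) == 1:      # unary predicate -> node label
            g.nodes[args[0]].setdefault("labels", set()).add(pred)
    return g

def states_equivalent(atoms_a, atoms_b):
    """Treat two state descriptions as equivalent iff their graphs are
    isomorphic, respecting node and edge labels (object names may differ)."""
    return nx.is_isomorphic(
        state_to_graph(atoms_a), state_to_graph(atoms_b),
        node_match=lambda m, n: m.get("labels", set()) == n.get("labels", set()),
        edge_match=lambda e, f: e["label"] == f["label"],
    )
```

Because the check is up to isomorphism, two problems that differ only in object naming or in the order of atoms are still recognized as the same underlying planning task.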
Dataset Composition
The dataset comprises 132,037 text-to-PDDL pairs across 13 different tasks within the Blocks World and Gripper domains. Each pair includes a natural language description and its corresponding ground truth PDDL. The dataset varies along two main dimensions: abstractness (explicit vs. abstract descriptions) and problem size (number of propositions). These variations allow for a comprehensive evaluation of a model's ability to handle a wide range of scenarios from straightforward to highly complex tasks.
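For illustration, the following hypothetical pair shows the kind of explicit Blocks World description and ground-truth PDDL problem a dataset entry might contain; the field names, wording, and predicate set are assumptions for this sketch, not the benchmark's actual schema.

```python
# Hypothetical illustration of one text-to-PDDL pair (explicit description,
# small problem size); not drawn from the actual dataset.
example_pair = {
    "description": (
        "There are two blocks, a and b. Block a is on the table, "
        "block b is on top of block a, and the hand is empty. "
        "The goal is to have block a on top of block b."
    ),
    "ground_truth_pddl": """
    (define (problem example)
      (:domain blocksworld)
      (:objects a b)
      (:init (ontable a) (on b a) (clear b) (handempty))
      (:goal (and (on a b))))
    """,
}
```

An abstract variant of the same task would describe the configuration indirectly (for example, "b sits on the only block that touches the table"), forcing the model to infer the explicit propositions.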
Numerical Results
The evaluation of several LLMs, including GPT-4o and open-weight models such as Mistral v0.3 7B Instruct and Gemma 1.1 IT 2B and 7B, revealed substantial differences in performance. GPT-4o generated syntactically parseable PDDL code in 87.6% of cases and solvable code in 82.2% of cases, yet only 35.1% of its outputs were semantically correct. This stark contrast shows that parseability and solvability are poor proxies for semantic correctness and underscores how far current LLMs remain from generating accurate PDDL representations.
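These three figures correspond to a cascade of increasingly strict checks. The sketch below shows one way such a cascade could be organized; parse_pddl, find_plan, and semantically_equivalent are hypothetical placeholders standing in for a PDDL parser, an off-the-shelf planner, and the equivalence check described above.

```python
# Hedged sketch of a three-tier evaluation cascade
# (parseable -> solvable -> semantically correct).
def evaluate(generated_pddl, ground_truth_pddl):
    results = {"parseable": False, "solvable": False, "correct": False}

    problem = parse_pddl(generated_pddl)       # syntactic check
    if problem is None:
        return results
    results["parseable"] = True

    if find_plan(problem) is not None:         # some plan exists for the problem
        results["solvable"] = True

    # strictest check: equivalence against the ground-truth problem
    results["correct"] = semantically_equivalent(generated_pddl, ground_truth_pddl)
    return results
```

The benchmark's point is precisely that a high pass rate on the first two tiers says little about the third.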
Practical and Theoretical Implications
Practically, the implications of this research are substantial for deploying LLMs in environments requiring accurate translation of natural language to structured planning languages. Currently deployed systems could generate misleading plans if they do not adopt a stringent evaluation method like that proposed in \benchmarkName{}. Theoretically, this benchmark sets a higher standard for evaluating the semantic correctness of generated PDDL, encouraging future research to focus on improving the understanding and generation capabilities of LLMs with respect to structured planning languages.
Future Developments
The paper highlights several future directions, including extending \benchmarkName{} to planning domains beyond Blocks World and Gripper and incorporating more expressive subsets of PDDL, such as those covering non-deterministic, temporal, and numeric features. Extending the benchmark in these directions would enable evaluating LLMs on more complex, real-world planning tasks, pushing the boundaries of LLM capabilities further.
Overall, \benchmarkName{} provides a comprehensive and rigorous benchmark for evaluating the translation of natural language descriptions to PDDL, offering significant advancements over existing methodologies. The research underscores the necessity of precision in evaluating the correctness of generated PDDL and sets a new standard for future work in this area.