A Rigorous Benchmark for Translating Text to Structured Planning Languages
The paper introduces a novel benchmark, termed \benchmarkName{}, designed to evaluate the capacity of LLMs to generate Planning Domain Definition Language (PDDL) code from natural language descriptions of planning tasks. The benchmark addresses the difficulty of accurately measuring the quality of generated PDDL code, targeting a significant gap in current evaluation methodology.
Recent advances have shown promise for using LLMs to translate natural language descriptions into structured planning languages such as PDDL. However, accurately evaluating the quality of such translations has remained challenging for two main reasons: reliance on planning validators, which can accept PDDL code that is valid and solvable yet semantically incorrect, and the lack of benchmarks with sufficiently varied natural language descriptions and adequately challenging task sets. In response, \benchmarkName{} proposes a more rigorous evaluation methodology and provides an extensive dataset of diverse and difficult text-to-PDDL translation tasks.
Key Components of \benchmarkName{}
Evaluation Framework
\benchmarkName{}'s evaluation framework includes a PDDL equivalence algorithm that ensures the correctness of generated PDDL by comparing it against a ground truth. This approach overcomes the limitations of conventional planning validators by precisely defining equivalence and implementing an efficient, automatic way of checking it. The evaluation framework operates by transforming PDDL code into scene graphs and performing isomorphism checks between these graphs. By doing so, it ensures that two PDDL problem formulations are deemed equivalent only if they represent the same underlying planning task, thereby providing a robust measure of semantic correctness.
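To give a flavor of the isomorphism check, the sketch below compares two state descriptions by encoding their ground atoms as labeled graphs and testing graph isomorphism with networkx. The atom-extraction step and the exact graph encoding are assumptions made for illustration, not the paper's actual construction.

```python
# Minimal sketch of the scene-graph equivalence idea, assuming ground atoms
# have already been extracted from each PDDL problem (e.g. from its :init
# and :goal sections); an illustration, not the paper's implementation.
import networkx as nx

def state_to_graph(atoms):
    """Encode ground atoms such as ("on", "a", "b") or ("clear", "a")
    as a labeled directed graph over the objects they mention."""
    g = nx.DiGraph()
    for pred, *args in atoms:
        for obj in args:
            g.add_node(obj)
        if len(args) == 2:        # binary predicate -> labeled edge
            g.add_edge(args[0], args[1], label=pred)
        elif len(args) == 1:      # unary predicate -> node label
            g.nodes[args[0]].setdefault("labels", set()).add(pred)
    return g

def states_equivalent(atoms_a, atoms_b):
    """Treat two state descriptions as equivalent iff their graphs are
    isomorphic, respecting node and edge labels (object names may differ)."""
    return nx.is_isomorphic(
        state_to_graph(atoms_a), state_to_graph(atoms_b),
        node_match=lambda m, n: m.get("labels", set()) == n.get("labels", set()),
        edge_match=lambda e, f: e["label"] == f["label"],
    )
```

Because the check is up to isomorphism, two problems that differ only in object naming or in the order of atoms are still recognized as the same underlying planning task.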
Dataset Composition
The dataset comprises 132,037 text-to-PDDL pairs across 13 different tasks within the Blocks World and Gripper domains. Each pair includes a natural language description and its corresponding ground truth PDDL. The dataset varies along two main dimensions: abstractness (explicit vs. abstract descriptions) and problem size (number of propositions). These variations allow for a comprehensive evaluation of a model's ability to handle a wide range of scenarios from straightforward to highly complex tasks.
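For illustration, the following hypothetical pair shows the kind of explicit Blocks World description and ground-truth PDDL problem a dataset entry might contain; the field names, wording, and predicate set are assumptions for this sketch, not the benchmark's actual schema.

```python
# Hypothetical illustration of one text-to-PDDL pair (explicit description,
# small problem size); not drawn from the actual dataset.
example_pair = {
    "description": (
        "There are two blocks, a and b. Block a is on the table, "
        "block b is on top of block a, and the hand is empty. "
        "The goal is to have block a on top of block b."
    ),
    "ground_truth_pddl": """
    (define (problem example)
      (:domain blocksworld)
      (:objects a b)
      (:init (ontable a) (on b a) (clear b) (handempty))
      (:goal (and (on a b))))
    """,
}
```

An abstract variant of the same task would describe the configuration indirectly (for example, "b sits on the only block that touches the table"), forcing the model to infer the explicit propositions.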
Numerical Results
The evaluation of several LLMs, including GPT-4o and open-weight models such as Mistral v0.3 7B Instruct and Gemma 1.1 IT 2B and 7B, revealed substantial differences in performance. GPT-4o generated syntactically parseable PDDL code in 87.6% of cases and solvable code in 82.2% of cases, yet only 35.1% of its outputs were semantically correct. This stark contrast shows that parseability and solvability are poor proxies for semantic correctness and underscores how far current LLMs remain from generating accurate PDDL representations.
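These three figures correspond to a cascade of increasingly strict checks. The sketch below shows one way such a cascade could be organized; parse_pddl, find_plan, and semantically_equivalent are hypothetical placeholders standing in for a PDDL parser, an off-the-shelf planner, and the equivalence check described above.

```python
# Hedged sketch of a three-tier evaluation cascade
# (parseable -> solvable -> semantically correct).
def evaluate(generated_pddl, ground_truth_pddl):
    results = {"parseable": False, "solvable": False, "correct": False}

    problem = parse_pddl(generated_pddl)       # syntactic check
    if problem is None:
        return results
    results["parseable"] = True

    if find_plan(problem) is not None:         # some plan exists for the problem
        results["solvable"] = True

    # strictest check: equivalence against the ground-truth problem
    results["correct"] = semantically_equivalent(generated_pddl, ground_truth_pddl)
    return results
```

The benchmark's point is precisely that a high pass rate on the first two tiers says little about the third.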
Practical and Theoretical Implications
Practically, the implications of this research are substantial for deploying LLMs in environments requiring accurate translation of natural language to structured planning languages. Currently deployed systems could generate misleading plans if they do not adopt a stringent evaluation method like that proposed in \benchmarkName{}. Theoretically, this benchmark sets a higher standard for evaluating the semantic correctness of generated PDDL, encouraging future research to focus on improving the understanding and generation capabilities of LLMs with respect to structured planning languages.
Future Developments
The paper highlights several future directions, including extending \benchmarkName{} to planning domains beyond Blocks World and Gripper and incorporating more expressive subsets of PDDL, such as those covering non-deterministic, temporal, and numeric features. Extending the benchmark in these directions would enable evaluating LLMs on more complex, real-world planning tasks, pushing the boundaries of LLM capabilities further.
Overall, \benchmarkName{} provides a comprehensive and rigorous benchmark for evaluating the translation of natural language descriptions to PDDL, offering significant advancements over existing methodologies. The research underscores the necessity of precision in evaluating the correctness of generated PDDL and sets a new standard for future work in this area.