- The paper introduces a task for reconstructing PDDL domains from natural language and establishes an evaluation framework based on ground truth models.
- It presents two automated metrics, Action Reconstruction Error and Heuristic Domain Equivalence, to objectively assess generated domain quality.
- Empirical evaluations over 9 domains and 7 LLMs reveal that larger models such as LLaMA-2-70b can generate syntactically and semantically valid PDDL constructs in a significant portion of cases.
LLMs as Planning Domain Generators
The paper under review explores the potential of LLMs to generate planning domain models from natural language descriptions. Planning domain generation is a key activity within AI planning that traditionally demands substantial human input. Automating this task can enhance the accessibility and application of AI planning frameworks. The authors propose a novel framework that automates the evaluation of LLM-generated domains against a set of ground-truth domain models.
Key Contributions and Methodology
The authors provide four main contributions to the field:
- Definition of PDDL Domain Reconstruction Task: The authors delineate a task involving the reconstruction of Planning Domain Definition Language (PDDL) domains from natural language, relying on a reference "ground truth" for evaluation. This task aims at high-quality domain reconstruction that aligns closely with established domain models.
- Metrics for Evaluation: The authors introduce two automated metrics that assess the quality of generated domains without the need for subjective human evaluation:
  - Action Reconstruction Error (ARE): measures the differences between the predicates of corresponding actions in the original and generated domains.
  - Heuristic Domain Equivalence: checks whether plans generated within the original domain remain applicable in the reconstructed domain, as evidence of equivalence.
- Classes of Natural Language Descriptions: The paper investigates the effect of different types of natural language descriptions on the quality of the generated domains, ranging from base descriptions to more detailed ones including specific predicates.
- Empirical Evaluation: A comprehensive empirical analysis involving 7 LLMs (including coding and chat models) evaluated over 9 distinct planning domains, with each domain described using three classes of natural language description.
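To make the first metric concrete, here is a minimal sketch of an ARE-style measure. The paper's exact formula is not reproduced in this review; the version below is an assumption that treats ARE as a normalized symmetric difference over an action's predicate set, and the function name and example predicates are illustrative:

```python
# Hypothetical sketch of an Action Reconstruction Error (ARE) style metric.
# Assumption: ARE is the symmetric set difference between the predicates of
# an original action and its generated counterpart, normalized by the size
# of their union. This is illustrative, not the paper's exact definition.

def action_reconstruction_error(original_preds: set, generated_preds: set) -> float:
    """Fraction of predicates that differ between two action definitions."""
    if not original_preds and not generated_preds:
        return 0.0  # two empty actions are trivially identical
    diff = original_preds ^ generated_preds          # predicates in one but not both
    return len(diff) / len(original_preds | generated_preds)

# Example: the generated action drops one precondition and adds a spurious one.
original = {"(holding ?x)", "(clear ?y)"}
generated = {"(holding ?x)", "(ontable ?y)"}
print(action_reconstruction_error(original, generated))  # 2 differing out of 3 total, ~0.667
```

A set-based formulation like this is order-insensitive and yields 0.0 for a perfect reconstruction and 1.0 for fully disjoint predicate sets, which matches the intuition of an "error" that the paper's metric is meant to capture.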
Results and Observations
The authors observed that LLMs, particularly those with larger parameter counts, demonstrate moderate proficiency in generating planning domains from natural language descriptions. Models like LLaMA-2-70b have shown promising results, generating syntactically and semantically valid PDDL constructs in a significant portion of cases. However, translating natural language into domain models remains inherently challenging, as evidenced by variations in reconstruction quality across different LLMs and description types.
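The heuristic domain-equivalence metric described earlier can be illustrated with a small STRIPS-style replay check: plans valid in the ground-truth domain are executed against the reconstructed domain, and the fraction that remain applicable is reported. The data model and scoring function below are illustrative assumptions, not the paper's implementation:

```python
# Hedged sketch of a heuristic domain-equivalence check. Assumption: a
# STRIPS-style domain where states are sets of ground facts and actions have
# precondition/add/delete sets. The paper's actual validation machinery differs.

from dataclasses import dataclass

@dataclass(frozen=True)
class Action:
    name: str
    preconditions: frozenset
    add_effects: frozenset
    del_effects: frozenset

def applicable(plan, actions, init_state):
    """Return True if every step's preconditions hold when it is executed."""
    state = set(init_state)
    for step in plan:
        act = actions.get(step)
        if act is None or not act.preconditions <= state:
            return False  # missing action or unmet precondition
        state = (state - act.del_effects) | act.add_effects
    return True

def equivalence_score(plans, reconstructed_actions, init_state):
    """Fraction of ground-truth plans still executable in the reconstruction."""
    ok = sum(applicable(p, reconstructed_actions, init_state) for p in plans)
    return ok / len(plans)

# Example: a single 'pickup' action; a one-step plan succeeds, so the score is 1.0.
pickup = Action("pickup", frozenset({"clear-b", "handempty"}),
                frozenset({"holding-b"}), frozenset({"clear-b", "handempty"}))
print(equivalence_score([["pickup"]], {"pickup": pickup}, {"clear-b", "handempty"}))
```

Replaying ground-truth plans is only a necessary condition for equivalence (hence "heuristic"): a reconstructed domain could accept all sampled plans yet still admit behaviors the original forbids.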
Implications and Future Directions
This investigation into LLMs for domain generation has both practical and theoretical implications. Practically, it suggests a promising avenue for reducing dependency on technical expertise in domain modeling, potentially enabling more widespread application of AI planning across varied industries. Furthermore, the study underscores the importance of model selection, tuning, and handling of natural language prompts in enhancing domain generation processes.
Theoretically, these findings contribute to the understanding of how LLMs can bridge the gap between natural language understanding and symbolic AI, suggesting a feasible path forward for hybrid AI systems that leverage strengths from both paradigms. Future work could focus on refining evaluation strategies, exploring model-tuning approaches, and improving natural language prompts to further raise domain generation quality. Additionally, re-prompting and corrective mechanisms could be employed to iteratively enhance the accuracy of generated domain models.
In summary, the research presents a compelling case for the integration of LLMs into AI planning tasks, offering insights into both the capabilities of LLMs and the nuances involved in translating open-ended natural language into structured domain models.