- The paper introduces AutoToS, a methodology that automates the extraction of search components by guiding LLMs with iterative unit-test feedback.
- It details a process of goal-soundness and successor-function validation, demonstrating significant accuracy improvements across five varied search problems.
- Experimental results show high accuracy with only a few LLM calls per domain, paving the way for more autonomous and reliable planning systems.
Automating Thought of Search: A Journey Towards Soundness and Completeness
Overview
The paper "Automating Thought of Search: A Journey Towards Soundness and Completeness" by Daniel Cao et al. presents a significant step forward in leveraging LLMs for automating planning tasks by proposing a methodology that removes human involvement from the loop of generating sound and complete search components. This work builds upon the previous concept of Thought of Search (ToS), which utilized LLMs to define search spaces. However, ToS required human intervention to ensure the soundness and completeness of the search components.
Methodology
The authors introduce AutoToS to automate the extraction of search components with minimal human interaction. AutoToS guides LLMs step by step toward producing these components through structured feedback derived from both generic and domain-specific unit tests. The primary stages of AutoToS (a minimal sketch of the resulting components follows the list) are:
- Initial Prompting: Obtaining raw versions of the search components (successor function and goal test) from the LLMs.
- Unit Testing for Goal Soundness: Checking the goal test against known goal and non-goal states and feeding failures back to the model until goal states are correctly identified, typically within a few iterations.
- Successor Function Soundness Check: Verifying the soundness of the successor function by running breadth-first or depth-first search extended with timeout checks and per-transition validity tests.
- (Optional) Successor Function Completeness Check: Enhancing completeness by checking generated successors against a known set of correct transitions.
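To make these stages concrete, here is a minimal sketch, not taken from the paper, of the kind of components an LLM might produce for the 24 Game, where a state is a list of remaining numbers. The function names and the trailing goal-soundness checks are illustrative assumptions.

```python
from itertools import combinations

def succ(state):
    """Successor function for the 24 Game: pick two numbers, combine them
    with an arithmetic operation, and keep the remaining numbers."""
    successors = []
    for i, j in combinations(range(len(state)), 2):
        a, b = state[i], state[j]
        rest = [state[k] for k in range(len(state)) if k not in (i, j)]
        candidates = {a + b, a - b, b - a, a * b}
        if b != 0:
            candidates.add(a / b)
        if a != 0:
            candidates.add(b / a)
        for value in candidates:
            successors.append(rest + [value])
    return successors

def isgoal(state):
    """Goal test: a single number remains and it equals 24."""
    return len(state) == 1 and abs(state[0] - 24) < 1e-6

# Goal-soundness checks in the AutoToS spirit: the goal test must accept
# known goal states and reject known non-goal states.
assert isgoal([24])
assert not isgoal([23])
assert not isgoal([24, 24])
```

In AutoToS, failing checks like these are turned into feedback messages and sent back to the model rather than raised as errors.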
The approach leverages both generic and domain-specific tests to provide feedback to the LLM, correcting the generated code iteratively until all unit tests pass or a maximum number of iterations is reached; the overall loop is sketched below.
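The following sketch, under the same illustrative assumptions, shows the shape of that loop: a BFS-based partial soundness check that validates every generated transition and enforces a timeout, plus an outer loop that re-prompts the model with test feedback. `query_llm`, `run_tests`, and `is_valid_transition` are hypothetical caller-supplied callables, not the paper's actual harness.

```python
import time
from collections import deque

def bfs_soundness_check(succ, isgoal, start, is_valid_transition, timeout=60.0):
    """BFS from `start` that flags invalid transitions and enforces a timeout.
    Returns (solved, feedback); feedback describes the first failure found."""
    deadline = time.time() + timeout
    frontier = deque([start])
    visited = {tuple(start)}
    while frontier:
        if time.time() > deadline:
            return False, "Timeout: the successor function may loop or over-branch."
        state = frontier.popleft()
        if isgoal(state):
            return True, ""
        for child in succ(state):
            if not is_valid_transition(state, child):
                return False, f"Invalid transition from {state} to {child}."
            key = tuple(child)
            if key not in visited:
                visited.add(key)
                frontier.append(child)
    return False, "Search exhausted without reaching a goal state."

def auto_tos_loop(query_llm, run_tests, initial_prompt, max_iterations=10):
    """Iteratively prompt the model, test the returned code, and feed
    failures back until the tests pass or the iteration budget runs out."""
    prompt = initial_prompt
    for _ in range(max_iterations):
        code = query_llm(prompt)        # caller-supplied LLM call
        ok, feedback = run_tests(code)  # caller-supplied test harness
        if ok:
            return code
        prompt = feedback               # send failure details back to the model
    return None                         # give up after the iteration budget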
Experimental Evaluation
The authors tested AutoToS across five distinct search problems: BlocksWorld, PrOntoQA, Mini Crossword, 24 Game, and Sokoban, using LLMs from several families, including GPT-4o, GPT-4o-Mini, Llama3.1-70b, Llama3.1-405b, and DeepSeek-CoderV2. Performance was evaluated on several benchmarks, with at most 19 calls to the LLM per domain.
The authors examined whether the soundness and completeness unit tests actually improve the code generated by the LLMs. The results indicate substantial improvements, particularly when the partial soundness test was employed; unit tests for goal soundness and successor completeness produced a marked increase in accuracy across all tested domains.
Numerical Results
The experiments showed that the average number of calls to the LLM per domain ranged from 2.0 to 10.0 across the models tested (see Table 1). Each experiment was repeated five times, and performance was consistent across runs. Accuracy improved markedly when models moved from raw function generation to unit-test-driven iteration (see Figure 1), with the partial soundness checks in particular substantially increasing final accuracy.
Error Analysis and Future Directions
An analysis of the errors in the generated code revealed that different models exhibit distinct error patterns (see Figure 2). The paper also discusses "bloopers", interesting error phenomena such as misunderstood state representations in the 24 Game and BlocksWorld, which point to areas for further refinement.
Future directions proposed in the paper include automating the generation of the unit tests and partial soundness tests themselves, possibly by leveraging LLMs. Another direction is using LLMs to derive invariants, a concept central to planning problems, to further strengthen the soundness and completeness of the generated code.
Implications
This research has implications for both theoretical understanding and practical applications in AI planning and search. By reducing the need for human expertise in the feedback loop, the approach promises to democratize access to advanced planning techniques and to speed up the development and deployment of planning solutions. In future AI systems, automated generation and validation of search components can lead to more reliable and efficient problem solving, with potential applications spanning domains from robotics to complex decision-making.