- The paper introduces AutoToS, a methodology that automates the extraction of search components by guiding LLMs with iterative unit-test feedback.
- It details a process of goal-soundness and successor-function validation, demonstrating significant accuracy improvements across five varied search problems.
- Experimental results show high accuracy with only a few LLM calls per domain, paving the way for more autonomous and reliable planning systems.
Automating Thought of Search: A Journey Towards Soundness and Completeness
Overview
The paper "Automating Thought of Search: A Journey Towards Soundness and Completeness" by Daniel Cao et al. presents a significant step forward in leveraging LLMs for automating planning tasks by proposing a methodology that removes human involvement from the loop of generating sound and complete search components. This work builds upon the previous concept of Thought of Search (ToS), which utilized LLMs to define search spaces. However, ToS required human intervention to ensure the soundness and completeness of the search components.
Methodology
The authors introduce AutoToS to automate the extraction of search components with minimal human interaction. AutoToS guides LLMs step by step toward producing these components through structured feedback derived from both generic and domain-specific unit tests. The primary stages of AutoToS (a minimal sketch of the resulting components follows the list) are:
- Initial Prompting: Obtaining raw versions of the search components (successor function and goal test) from the LLMs.
- Unit Testing for Goal Soundness: Checking the goal test against known goal and non-goal states and feeding failures back to the model until goal states are correctly identified, typically within a few iterations.
- Successor Function Soundness Check: Verifying the soundness of the successor function by running breadth-first or depth-first search extended with timeout checks and per-transition validity tests.
- (Optional) Successor Function Completeness Check: Enhancing completeness by checking generated successors against a known set of correct transitions.
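To make these stages concrete, here is a minimal sketch, not taken from the paper, of the kind of components an LLM might produce for the 24 Game, where a state is a list of remaining numbers. The function names and the trailing goal-soundness checks are illustrative assumptions.

```python
from itertools import combinations

def succ(state):
    """Successor function for the 24 Game: pick two numbers, combine them
    with an arithmetic operation, and keep the remaining numbers."""
    successors = []
    for i, j in combinations(range(len(state)), 2):
        a, b = state[i], state[j]
        rest = [state[k] for k in range(len(state)) if k not in (i, j)]
        candidates = {a + b, a - b, b - a, a * b}
        if b != 0:
            candidates.add(a / b)
        if a != 0:
            candidates.add(b / a)
        for value in candidates:
            successors.append(rest + [value])
    return successors

def isgoal(state):
    """Goal test: a single number remains and it equals 24."""
    return len(state) == 1 and abs(state[0] - 24) < 1e-6

# Goal-soundness checks in the AutoToS spirit: the goal test must accept
# known goal states and reject known non-goal states.
assert isgoal([24])
assert not isgoal([23])
assert not isgoal([24, 24])
```

In AutoToS, failing checks like these are turned into feedback messages and sent back to the model rather than raised as errors.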
The approach leverages both generic and domain-specific tests to provide feedback to the LLM, correcting the generated code iteratively until all unit tests pass or a maximum number of iterations is reached; the overall loop is sketched below.
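The following sketch, under the same illustrative assumptions, shows the shape of that loop: a BFS-based partial soundness check that validates every generated transition and enforces a timeout, plus an outer loop that re-prompts the model with test feedback. `query_llm`, `run_tests`, and `is_valid_transition` are hypothetical caller-supplied callables, not the paper's actual harness.

```python
import time
from collections import deque

def bfs_soundness_check(succ, isgoal, start, is_valid_transition, timeout=60.0):
    """BFS from `start` that flags invalid transitions and enforces a timeout.
    Returns (solved, feedback); feedback describes the first failure found."""
    deadline = time.time() + timeout
    frontier = deque([start])
    visited = {tuple(start)}
    while frontier:
        if time.time() > deadline:
            return False, "Timeout: the successor function may loop or over-branch."
        state = frontier.popleft()
        if isgoal(state):
            return True, ""
        for child in succ(state):
            if not is_valid_transition(state, child):
                return False, f"Invalid transition from {state} to {child}."
            key = tuple(child)
            if key not in visited:
                visited.add(key)
                frontier.append(child)
    return False, "Search exhausted without reaching a goal state."

def auto_tos_loop(query_llm, run_tests, initial_prompt, max_iterations=10):
    """Iteratively prompt the model, test the returned code, and feed
    failures back until the tests pass or the iteration budget runs out."""
    prompt = initial_prompt
    for _ in range(max_iterations):
        code = query_llm(prompt)        # caller-supplied LLM call
        ok, feedback = run_tests(code)  # caller-supplied test harness
        if ok:
            return code
        prompt = feedback               # send failure details back to the model
    return None                         # give up after the iteration budget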
Experimental Evaluation
The authors tested AutoToS across five distinct search problems: BlocksWorld, PrOntoQA, Mini Crossword, 24 Game, and Sokoban, using LLMs from several families, including GPT-4o, GPT-4o-Mini, Llama3.1-70b, Llama3.1-405b, and DeepSeek-CoderV2. Performance was evaluated on several benchmarks, with at most 19 calls to the LLM per domain.
The authors examined whether the soundness and completeness unit tests actually improve the code generated by the LLMs. The results indicate substantial improvements, particularly when the partial soundness test was employed; unit tests for goal soundness and successor completeness produced a marked increase in accuracy across all tested domains.
Numerical Results
The experiments showed that the average number of calls to the LLM per domain ranged from 2.0 to 10.0 across the models tested (see Table 1). Each experiment was repeated five times, and performance was consistent across runs. Accuracy improved markedly when models moved from raw function generation to unit-test-driven iteration (see Figure 1), with the partial soundness checks in particular substantially increasing final accuracy.
Error Analysis and Future Directions
An analysis of the errors in the generated code revealed that different models exhibit distinct error patterns (see Figure 2). The paper also discusses "bloopers", interesting error phenomena such as misunderstood state representations in the 24 Game and BlocksWorld, which point to areas for further refinement.
Future directions proposed in the paper include automating the generation of the unit tests and partial soundness tests themselves, possibly by leveraging LLMs. Another direction is using LLMs to derive invariants, a concept central to planning problems, to further strengthen the soundness and completeness of the generated code.
Implications
This research has implications for both theoretical understanding and practical applications in AI planning and search. By reducing the need for human expertise in the feedback loop, the approach promises to democratize access to advanced planning techniques and to speed up the development and deployment of planning solutions. In future AI systems, automated generation and validation of search components can lead to more reliable and efficient problem solving, with potential applications spanning domains from robotics to complex decision-making.