
Automating Thought of Search: A Journey Towards Soundness and Completeness (2408.11326v2)

Published 21 Aug 2024 in cs.AI

Abstract: LLMs are being used to solve planning problems that require search. Most of the literature uses LLMs as world models to define the search space, forgoing soundness for the sake of flexibility. A recent work, Thought of Search (ToS), proposed defining the search space with code, having LLMs produce that code. ToS requires a human in the loop, collaboratively producing a sound successor function and goal test. The result, however, is worth the effort: all the tested datasets were solved with 100% accuracy. Consequently, there is great potential to automate the ToS process. We take a first major step towards automating ToS (AutoToS), taking the human out of the loop of interactions with the LLM. AutoToS guides the LLM step by step towards the generation of sound and complete search components, through feedback from both generic and domain specific unit tests. We show that AutoToS is able to achieve 100% accuracy on all the evaluated domains with a small number of LLM calls.

Summary

  • The paper introduces AutoToS, a methodology that automates the extraction of search components using LLMs and iterative unit tests.
  • It details a process involving goal soundness and successor function validation, demonstrating significant accuracy improvements across five varied search problems.
  • Experimental results reveal reduced LLM calls and enhanced performance, paving the way for more autonomous and reliable planning systems.

Automating Thought of Search: A Journey Towards Soundness and Completeness

Overview

The paper "Automating Thought of Search: A Journey Towards Soundness and Completeness" by Daniel Cao et al. presents a significant step toward leveraging LLMs for automated planning: it proposes a methodology that removes human involvement from the loop of generating sound and complete search components. The work builds on Thought of Search (ToS), which used LLMs to define search spaces with code but required human intervention to ensure the soundness and completeness of those components.

Methodology

The authors introduce AutoToS to automate the extraction of search components with minimal human interaction. AutoToS guides LLMs step-by-step toward producing these components through structured feedback derived from both generic and domain-specific unit tests. The primary stages of AutoToS implementation include:

  1. Initial Prompting: Obtaining raw versions of the search components (successor function and goal test) from the LLMs.
  2. Unit Testing for Goal Soundness: Ensuring that goal states are correctly identified with minimal iterations.
  3. Successor Function Soundness Check: Verifying the soundness of successor functions using a breadth-first or depth-first search extended with timeout and transition-validity checks.
  4. (Optional) Successor Function Completeness Check: Enhancing completeness by checking generated successors against a known set of correct transitions.
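The optional completeness check (step 4) can be sketched as a comparison of the generated successor function against a small set of known-correct transitions. The helper below is illustrative only; the function name and feedback format are assumptions, not code from the paper.

```python
# Hypothetical sketch of the successor-completeness check: every
# known-correct transition must be reproduced by the generated function.

def completeness_check(successor_fn, known_transitions):
    """known_transitions: iterable of (state, expected_successor) pairs.
    Returns feedback messages for any expected successor that the
    generated function fails to produce."""
    feedback = []
    for state, expected in known_transitions:
        produced = successor_fn(state)
        if expected not in produced:
            feedback.append(
                f"Successor function is incomplete: from state {state!r} "
                f"it should also produce {expected!r}."
            )
    return feedback

# Toy counter domain: successors of n should be n + 1 and n + 2.
incomplete = lambda n: [n + 1]          # deliberately misses n + 2
feedback = completeness_check(incomplete, [(0, 1), (0, 2)])
# feedback now flags the missing transition 0 -> 2
```

Any messages collected this way would be fed back to the LLM as part of the next correction prompt.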

The approach leverages both generic and domain-specific tests to provide feedback to the LLMs, iteratively correcting the generated code until all unit tests pass or a maximum number of iterations is reached.
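The overall loop can be sketched as follows. This is a minimal sketch, not the authors' implementation: `query_llm`, the problem dictionary, and the iteration limit are hypothetical stand-ins. The BFS routine corresponds to the generic soundness check extended with a timeout and a transition-validity test.

```python
# Minimal sketch of an AutoToS-style feedback loop (assumptions labeled below).
import time
from collections import deque

def bfs_soundness_check(successor_fn, goal_fn, initial, is_valid, timeout=5.0):
    """Generic BFS extended with a timeout and transition-validity checks.
    Returns (solved, feedback): feedback is None on success, else a message
    suitable for prompting the LLM to correct its code."""
    start = time.monotonic()
    frontier, seen = deque([initial]), {initial}
    while frontier:
        if time.monotonic() - start > timeout:
            return False, "Timeout: search space may be unbounded or too large."
        state = frontier.popleft()
        if goal_fn(state):
            return True, None
        for succ in successor_fn(state):
            if not is_valid(state, succ):
                return False, f"Invalid transition from {state!r} to {succ!r}."
            if succ not in seen:
                seen.add(succ)
                frontier.append(succ)
    return False, "Search exhausted without reaching a goal."

def auto_tos(query_llm, problem, max_iterations=10):
    """Iteratively request search components from the LLM, feeding unit-test
    failures back until the components pass.  `query_llm` is a hypothetical
    callable returning a (successor_fn, goal_fn) pair."""
    feedback = None
    for _ in range(max_iterations):
        successor_fn, goal_fn = query_llm(feedback)
        ok, feedback = bfs_soundness_check(
            successor_fn, goal_fn, problem["initial"], problem["is_valid"])
        if ok:
            return successor_fn, goal_fn
    raise RuntimeError("Max iterations reached without sound components.")
```

In this sketch the validity test `is_valid` plays the role of a domain-specific unit test, while the timeout and goal check are generic.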

Experimental Evaluation

The authors tested AutoToS across five distinct search problems: BlocksWorld, PrOntoQA, Mini Crossword, 24 Game, and Sokoban. Various LLMs from different families were used, including GPT-4o, GPT-4o-Mini, Llama3.1-70b, Llama3.1-405b, and DeepSeek-CoderV2. The performance was evaluated on several benchmarks with up to 19 calls to the LLM per domain.
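To illustrate the kind of search components the LLM is asked to produce, here is a sketch of a successor function and goal test for the 24 Game, one of the evaluated domains. The state representation (a tuple of the remaining numbers) is an assumption for the sketch, not code taken from the paper.

```python
# Illustrative 24 Game components (state representation is an assumption):
# a state is a tuple of remaining numbers; each move combines two of them
# with +, -, *, or / and keeps the rest.
from itertools import combinations

def successors(state):
    """All states reachable by combining two numbers with one operation."""
    result = []
    for i, j in combinations(range(len(state)), 2):
        a, b = state[i], state[j]
        rest = [state[k] for k in range(len(state)) if k not in (i, j)]
        candidates = {a + b, a - b, b - a, a * b}
        if b != 0:
            candidates.add(a / b)
        if a != 0:
            candidates.add(b / a)
        for value in candidates:
            result.append(tuple(sorted(rest + [value])))
    return result

def is_goal(state):
    """Goal: a single remaining number equal to 24 (within float tolerance)."""
    return len(state) == 1 and abs(state[0] - 24) < 1e-6
```

Plugged into a generic BFS, these two functions define the whole search problem, which is the separation of concerns that ToS and AutoToS exploit.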

The authors examined whether unit tests for soundness and completeness improved the code generated by the LLMs; the results indicate substantial improvements, particularly when the partial soundness test was employed. Notably, unit tests for goal soundness and successor completeness led to a marked increase in accuracy across all tested domains.

Numerical Results

The experiments showed that the average number of calls to the LLM per domain ranged from 2.0 to 10.0 across the models tested (see Table 1). Each experiment was repeated five times, confirming consistency in performance. Notably, the accuracy improved significantly when the models transitioned from raw function generation to unit-tested iterations (see Figure 1). The paper also highlighted the importance of integrating partial soundness checks, which substantially increased final accuracy levels.

Error Analysis and Future Directions

An analysis of the errors in the generated code revealed that different models exhibited distinct error patterns (see Figure 2). The paper also discusses "bloopers", or notable error phenomena, such as misunderstandings of the state representation in the 24 Game and BlocksWorld, which point to areas for further refinement.

Potential future directions proposed in the paper include automating the generation of unit tests and partial soundness tests, possibly by leveraging LLMs themselves. Another direction is using LLMs to derive invariants, a concept central to planning problems, to further strengthen the soundness and completeness of the generated code.

Implications

This research has implications for both theoretical understanding and practical applications in AI planning and search. By reducing the need for human expertise in the feedback loop, the approach promises to democratize access to advanced planning techniques and to accelerate the development and deployment of planning solutions. In future AI systems, automated generation and validation of search components can enable more reliable and efficient problem solving across domains ranging from robotics to complex decision-making systems.
