Chain of Thoughtlessness? An Analysis of CoT in Planning (2405.04776v2)

Published 8 May 2024 in cs.AI

Abstract: LLM performance on reasoning problems typically does not generalize out of distribution. Previous work has claimed that this can be mitigated with chain of thought prompting, a method of demonstrating solution procedures, with the intuition that it is possible to in-context teach an LLM an algorithm for solving the problem. This paper presents a case study of chain of thought on problems from Blocksworld, a classical planning domain, and examines the performance of two state-of-the-art LLMs across two axes: generality of examples given in prompt, and complexity of problems queried with each prompt. While our problems are very simple, we only find meaningful performance improvements from chain of thought prompts when those prompts are exceedingly specific to their problem class, and that those improvements quickly deteriorate as the size n of the query-specified stack grows past the size of stacks shown in the examples. We also create scalable variants of three domains commonly studied in previous CoT papers and demonstrate the existence of similar failure modes. Our results hint that, contrary to previous claims in the literature, CoT's performance improvements do not stem from the model learning general algorithmic procedures via demonstrations but depend on carefully engineering highly problem specific prompts. This spotlights drawbacks of chain of thought, especially the sharp tradeoff between possible performance gains and the amount of human labor necessary to generate examples with correct reasoning traces.

Exploring the Limits of Chain of Thought Prompting in Blocksworld with LLMs

Introduction to Chain of Thought Prompting

The idea behind Chain of Thought (CoT) prompting in LLMs is appealing to both practitioners and researchers in the AI field. By inserting intermediate reasoning steps into prompts, the so-called chains of thought, the goal is to guide LLMs to perform better on complex reasoning tasks without retraining. The hope is that an LLM can be 'taught' in context to solve a problem through worked demonstrations, much as a person might learn from solved examples. How this plays out in practice, especially in a planning domain like Blocksworld, is where the challenges and surprises emerge.
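
To make the mechanism concrete, the sketch below contrasts a standard few-shot prompt with a chain-of-thought prompt on a toy counting question. The task and wording are illustrative assumptions, not prompts from the paper; either string would simply be sent to an LLM API of choice.

```python
# Minimal sketch contrasting a standard few-shot prompt with a chain-of-thought
# prompt. The example task and wording are illustrative, not taken from the paper.

QUESTION = "I have 3 stacks of 4 blocks each and remove 5 blocks. How many blocks remain?"

# Standard few-shot prompt: the demonstration shows only the final answer.
standard_prompt = (
    "Q: I have 2 stacks of 3 blocks each and remove 1 block. How many blocks remain?\n"
    "A: 5\n\n"
    f"Q: {QUESTION}\n"
    "A:"
)

# Chain-of-thought prompt: the demonstration spells out intermediate reasoning,
# in the hope that the model imitates the procedure on the new query.
cot_prompt = (
    "Q: I have 2 stacks of 3 blocks each and remove 1 block. How many blocks remain?\n"
    "A: There are 2 * 3 = 6 blocks in total. Removing 1 leaves 6 - 1 = 5. The answer is 5.\n\n"
    f"Q: {QUESTION}\n"
    "A: Let's think step by step."
)

print(cot_prompt)  # Either prompt string would be passed to the model as-is.
```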

The Setup: Experiments in Blocksworld

In a nutshell, Blocksworld entails rearranging blocks to achieve a specific goal configuration. It's a classic problem domain in AI planning precisely because its structure is so clear-cut. Here it serves as a testbed for whether an LLM can actually apply demonstrated reasoning steps to unseen but similar tasks. Five types of CoT setups were examined (a simplified sketch of two of these prompt styles follows the list):

  1. Zero-shot CoT: The most general approach where the model is merely prompted to "think step by step."
  2. Progression Proof CoT: Introduces planning domain knowledge tied to the PDDL specification used in Blocksworld while staying relatively generic.
  3. Blocksworld Universal Algorithm: Offers a specific algorithmic approach tailored for any Blocksworld problem.
  4. Stacking Prompt: Directly focuses on problems where blocks begin on the table and must be assembled into a single stack.
  5. Lexicographic Stacking: Targets only a subset of stacking problems where blocks must be stacked in a specific sequence.
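
To give a feel for how sharply prompt specificity varies across these setups, the sketch below builds a table-to-stack query and pairs it with the generic zero-shot instruction (setup 1) and a simplified stacking-style demonstration (setup 4). The helper `make_stacking_instance` and the prompt wording are illustrative assumptions, not the paper's actual prompts.

```python
# Illustrative contrast between a generic zero-shot CoT instruction and a
# problem-specific stacking demonstration (simplified; not the paper's prompts).

def make_stacking_instance(n: int) -> tuple[list[str], str]:
    """All n blocks start on the table; the goal is one stack b1 on b2 on ... on bn."""
    blocks = [f"b{i}" for i in range(1, n + 1)]
    goal = " on ".join(blocks)
    return blocks, goal

blocks, goal = make_stacking_instance(4)
query = f"Blocks {', '.join(blocks)} are all on the table. Goal: stack them as {goal}."

# Setup 1, zero-shot CoT: a single generic instruction appended to the query.
zero_shot_prompt = f"{query}\nLet's think step by step."

# Setup 4, stacking prompt: demonstrates the full procedure on a smaller instance.
demo_blocks, demo_goal = make_stacking_instance(3)
stacking_demo = (
    f"Example: blocks {', '.join(demo_blocks)} are on the table. Goal: {demo_goal}.\n"
    "Plan: pick up b2, stack b2 on b3, pick up b1, stack b1 on b2. Done.\n\n"
)
stacking_prompt = stacking_demo + query + "\nPlan:"
```

The more specific the demonstration, the less it transfers: as the abstract notes, gains from such prompts deteriorate once the queried stack size n grows past the size shown in the examples.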

Observations and Results

The results paint a sobering picture of CoT's effectiveness. Performance improvements were noticeable only when the prompts were exceedingly specific to the problem class, and even those gains diminished rapidly as the problems deviated slightly from the provided examples. Here’s a brief rundown:

  • Zero-shot and Progression Proof CoT showed minimal improvements, proving insufficient for even moderately complex planning tasks.
  • Blocksworld Universal Algorithm prompted better responses from the LLMs, yet struggled significantly as the complexity of Blocksworld scenarios increased.
  • Stacking and Lexicographic Prompts yielded high performance on narrowly defined tasks but failed to generalize across slightly broader problem sets despite still being within the stack-assembly category.
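
A note on how such results are scored: in a planning domain, "performance" means the generated plan actually achieves the goal, not that the reasoning text merely looks plausible. Below is a minimal sketch of how one might check a table-to-stack plan programmatically; it is a hypothetical helper for illustration, not the validation machinery used in the paper.

```python
# Minimal sketch of validating a table-to-stack Blocksworld plan.
# Hypothetical helper for illustration, not the paper's validation machinery.

def validate_stacking_plan(blocks: list[str],
                           goal: list[str],
                           plan: list[tuple[str, ...]]) -> bool:
    """Check that `plan` builds `goal` (listed bottom-to-top) from all blocks on the table."""
    on = {b: "table" for b in blocks}   # what each block currently rests on
    clear = set(blocks)                 # blocks with nothing on top of them
    holding = None                      # block currently held by the arm

    for action, *args in plan:
        if action == "pickup" and holding is None and args[0] in clear and on[args[0]] == "table":
            holding = args[0]
            clear.discard(args[0])
        elif action == "stack" and holding == args[0] and args[1] in clear and args[0] != args[1]:
            on[args[0]] = args[1]
            clear.discard(args[1])
            clear.add(args[0])
            holding = None
        else:
            return False  # an action's preconditions were violated

    # Goal holds if the bottom block is on the table and each block rests on its predecessor.
    return (holding is None
            and on[goal[0]] == "table"
            and all(on[top] == below for below, top in zip(goal, goal[1:])))

# Example: build the stack b3 (bottom), b2, b1 (top) from three blocks on the table.
plan = [("pickup", "b2"), ("stack", "b2", "b3"), ("pickup", "b1"), ("stack", "b1", "b2")]
print(validate_stacking_plan(["b1", "b2", "b3"], ["b3", "b2", "b1"], plan))  # True
```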

Implications and Speculations on Future Developments

These findings suggest that while CoT can nudge LLMs toward better task-specific performance, the scope of such enhancements is heavily tethered to the prompt’s specificity relative to the problem. This poses significant implications:

  • Scalability: Broad application of CoT is limited. As the problem's complexity grows, so does the need for incredibly precise prompts, making this strategy less scalable across diverse tasks without substantial human input.
  • Practical Utility: The reliance on detailed, problem-specific prompts diminishes the utility of CoT for general problem-solving using LLMs. For real-world applications, generating such detailed prompts could become an overhead that outweighs the benefits.
  • Future Research: Continual improvement in CoT methodologies might focus on finding a balance between prompt generality and performance, potentially through more sophisticated techniques of teaching LLMs to better abstract and generalize from examples.

Conclusion

Investigating CoT performance across setups ranging from highly specialized to fully general in Blocksworld yields a clear verdict: performance is closely tied to how specific the CoT prompt is and how well it aligns with the problem class. LLMs, in their current form, excel at tasks that mimic the given examples but show diminished aptitude when required to generalize the demonstrated strategy to broader scenarios. The dream of leveraging LLMs for general reasoning through CoT remains, for now, a meticulously crafted prompt away. Further explorations could illuminate pathways to enhance the flexibility and learning capacity of these models.

Authors (3)
  1. Kaya Stechly (9 papers)
  2. Karthik Valmeekam (17 papers)
  3. Subbarao Kambhampati (126 papers)
Citations (17)