
Plan-and-Solve Prompting: Improving Zero-Shot Chain-of-Thought Reasoning by Large Language Models (2305.04091v3)

Published 6 May 2023 in cs.CL

Abstract: LLMs have recently been shown to deliver impressive performance in various NLP tasks. To tackle multi-step reasoning tasks, few-shot chain-of-thought (CoT) prompting includes a few manually crafted step-by-step reasoning demonstrations which enable LLMs to explicitly generate reasoning steps and improve their reasoning task accuracy. To eliminate the manual effort, Zero-shot-CoT concatenates the target problem statement with "Let's think step by step" as an input prompt to LLMs. Despite the success of Zero-shot-CoT, it still suffers from three pitfalls: calculation errors, missing-step errors, and semantic misunderstanding errors. To address the missing-step errors, we propose Plan-and-Solve (PS) Prompting. It consists of two components: first, devising a plan to divide the entire task into smaller subtasks, and then carrying out the subtasks according to the plan. To address the calculation errors and improve the quality of generated reasoning steps, we extend PS prompting with more detailed instructions and derive PS+ prompting. We evaluate our proposed prompting strategy on ten datasets across three reasoning problems. The experimental results over GPT-3 show that our proposed zero-shot prompting consistently outperforms Zero-shot-CoT across all datasets by a large margin, is comparable to or exceeds Zero-shot-Program-of-Thought Prompting, and has comparable performance with 8-shot CoT prompting on the math reasoning problem. The code can be found at https://github.com/AGI-Edgerunners/Plan-and-Solve-Prompting.

Citations (230)

Summary

  • The paper proposes a novel Plan-and-Solve prompting technique that reduces calculation errors from 7% to 5% and missing-step errors from 12% to 7%.
  • The method leverages explicit problem decomposition, variable extraction, and stepwise execution to enhance reasoning across arithmetic, commonsense, and symbolic tasks.
  • Empirical results show that PS+ prompting nearly matches few-shot CoT performance without needing in-context exemplars, improving overall efficiency.

Plan-and-Solve Prompting: Enhancing Zero-Shot CoT Reasoning in LLMs

Motivation and Critical Review of Zero-shot CoT Reasoning

The development of Chain-of-Thought (CoT) prompting has played a key role in eliciting multi-step reasoning ability from LLMs. Few-shot CoT prompting demonstrates that explicit demonstration of reasoning chains increases LLM performance on complex tasks. Zero-shot-CoT further simplifies prompting by adding a trigger ("Let's think step by step") to the query, removing the need for exemplars. Despite its efficacy, error analysis reveals substantial limitations: 7% calculation errors, 12% missing-step errors, and 27% semantic misunderstanding errors on GSM8K (Figure 1).

Figure 1: Distribution of calculation, missing-step, and semantic errors found in GSM8K problems answered incorrectly by Zero-shot-CoT with GPT-3.
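For concreteness, the Zero-shot-CoT baseline discussed above can be sketched as a two-stage completion pipeline: a reasoning pass elicited by the "Let's think step by step" trigger, followed by an answer-extraction pass. The sketch below is illustrative only; llm_complete is a hypothetical placeholder for whatever completion API is used, and the answer-extraction wording paraphrases the trigger reported in the Zero-shot-CoT literature rather than quoting the paper's released code.

```python
# Minimal sketch of the two-stage Zero-shot-CoT pipeline described above.
# `llm_complete` is a hypothetical placeholder for any text-completion call
# (e.g. a GPT-3 completion endpoint); it is not part of the paper's code.

def llm_complete(prompt: str) -> str:
    """Placeholder: send `prompt` to an LLM and return its text completion."""
    raise NotImplementedError("plug in your model or API call here")


def zero_shot_cot(question: str) -> str:
    # Stage 1: elicit a free-form reasoning chain with the CoT trigger.
    reasoning_prompt = f"Q: {question}\nA: Let's think step by step."
    reasoning = llm_complete(reasoning_prompt)

    # Stage 2: append an answer-extraction trigger to pull out the final answer.
    extraction_prompt = (
        reasoning_prompt
        + " "
        + reasoning
        + "\nTherefore, the answer (arabic numerals) is"
    )
    return llm_complete(extraction_prompt).strip()
```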

Missing-step and calculation errors stem partly from the lack of explicit problem decomposition and from insufficient attention to intermediate computation. These error modes highlight structural limitations of existing zero-shot prompt designs and motivate the introduction of alternative prompting paradigms.

Plan-and-Solve (PS) and PS+ Prompting: Methodological Innovations

Plan-and-Solve (PS) prompting replaces the canonical Zero-shot-CoT trigger with structured instructions: the model is asked first to understand the problem and devise a solution plan, then to execute the plan step by step. The PS+ variant further augments this approach by explicitly instructing the LLM to extract relevant variables, assign numerals, and verify calculations with attention to commonsense and correctness. These instructions operationalize a process of explicit subtask decomposition and reduce opportunities for both computational and logical omissions (Figure 2).

Figure 2: Prompting and output comparison of Zero-shot-CoT (a), PS prompting (b), and answer-extraction prompting (c). PS prompting explicitly introduces plan decomposition, outperforming basic CoT in complex scenarios.
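A minimal sketch of how the PS and PS+ triggers slot into the same two-stage pipeline is shown below. The trigger strings paraphrase the templates described in the paper and may differ slightly from the exact wording in the released repository; plan_and_solve and the reuse of llm_complete from the previous sketch are illustrative assumptions, not the paper's code.

```python
# Sketch of the PS and PS+ triggers dropped into the same two-stage pipeline
# as the Zero-shot-CoT sketch above (reusing `llm_complete`). The trigger
# wording paraphrases the paper's templates; check the released repository
# for the exact strings.

PS_TRIGGER = (
    "Let's first understand the problem and devise a plan to solve it. "
    "Then, let's carry out the plan and solve the problem step by step."
)

PS_PLUS_TRIGGER = (
    "Let's first understand the problem, extract relevant variables and their "
    "corresponding numerals, and devise a plan. Then, let's carry out the plan, "
    "calculate intermediate variables (pay attention to correct numerical "
    "calculation and commonsense), solve the problem step by step, and show "
    "the answer."
)


def plan_and_solve(question: str, trigger: str = PS_PLUS_TRIGGER) -> str:
    # Stage 1: planning and execution in a single completion.
    reasoning_prompt = f"Q: {question}\nA: {trigger}"
    reasoning = llm_complete(reasoning_prompt)

    # Stage 2: same answer-extraction pass as the Zero-shot-CoT baseline.
    extraction_prompt = (
        reasoning_prompt
        + " "
        + reasoning
        + "\nTherefore, the answer (arabic numerals) is"
    )
    return llm_complete(extraction_prompt).strip()
```

In this sketch, moving from PS to PS+ is purely a change of trigger string; the surrounding pipeline stays identical, which is what makes the template ablations discussed below straightforward to run.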

The new prompt template yields output chains with distinct planning and execution phases. Extensive template ablations confirm that variable extraction, explicit planning, and intermediate-calculation directives each contribute incrementally to overall accuracy, with PS+ prompting achieving the highest performance in the template study.

Empirical Results: Comparative and Error-Type Analysis

Benchmarking across ten datasets (six arithmetic, two commonsense, two symbolic) demonstrates that PS+ consistently surpasses Zero-shot-CoT on all arithmetic tasks—often by margins exceeding 5%. PS+ performs comparably to or exceeds Zero-shot-Program-of-Thought (PoT) prompting and approaches the performance of few-shot manual CoT prompting, notably without the need for exemplars.

Error analysis shows that PS+ reduces calculation errors to 5% and missing-step errors to 7%, compared to 7% and 12% for Zero-shot-CoT, respectively, while semantic errors remain unchanged (27%). Correlation analysis of the reasoning chain content demonstrates that explicit variable extraction and planning are negatively correlated with calculation and missing-step errors, confirming that prompt specificity is pivotal for error reduction.

Self-consistency decoding further boosts performance, with PS+ achieving 84.4% accuracy on SVAMP and 73.7% on GSM8K under SC voting, outperforming Zero-shot-CoT in both cases. These improvements demonstrate better reliability and robustness of the PS(+)-prompted solutions.
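The self-consistency result can be read as a majority vote over independently sampled PS+ chains. The sketch below assumes plan_and_solve from the earlier sketch samples with nonzero temperature so that chains differ; the number of samples and the numeric answer parsing are illustrative choices, not the paper's exact configuration.

```python
# Illustrative self-consistency (SC) decoding over PS+ chains: sample several
# reasoning chains, extract a numeric answer from each, and majority-vote.
# Sampling count and answer parsing are assumptions for illustration; the
# sampling temperature must be nonzero so that the chains actually differ.
import re
from collections import Counter
from typing import Optional


def parse_number(text: str) -> Optional[str]:
    """Grab the first number in the model's extracted answer, if any."""
    match = re.search(r"-?\d+(?:\.\d+)?", text.replace(",", ""))
    return match.group(0) if match else None


def self_consistent_answer(question: str, num_samples: int = 10) -> Optional[str]:
    answers = []
    for _ in range(num_samples):
        answer_text = plan_and_solve(question)  # from the sketch above
        number = parse_number(answer_text)
        if number is not None:
            answers.append(number)
    if not answers:
        return None
    # Majority vote over the extracted numeric answers.
    return Counter(answers).most_common(1)[0][0]
```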

Theoretical and Practical Implications

Plan-and-Solve strategies instantiate a modular approach to zero-shot reasoning, aligning closely with hierarchical task decompositions studied in classical AI. The methodology validates that LLMs, when prompted for explicit problem decomposition and intermediate variable extraction, can realize emergent planning capabilities—evidenced by the high rate (90%) of plan presence in generated chains.

Practically, PS+ prompting significantly reduces dependence on model fine-tuning and corpus-specific exemplars, enabling immediate deployment in resource-constrained or proprietary contexts. The approach generalizes across arithmetic, symbolic, and commonsense reasoning and remains agnostic to step type and task structure.

Theoretically, the findings support the hypothesis that the reasoning limitations of LLMs are in part a function of prompt underspecification rather than of intrinsic architectural limits. PS+ demonstrates the degree to which reasoning fidelity and error rate can be influenced by explicit prompt design, setting the foundation for further systematic prompting research.

Speculation on Future Directions

Refinement of Plan-and-Solve prompting can extend in several directions:

  1. Adaptive and dynamic prompt generation: Automating the crafting of PS+ templates based on semantic parsing of task instructions may further reduce error rates.
  2. Hierarchical reasoning and multi-agent composition: PS+ provides a basis for recursive or decentralized reasoning workflows, potentially leveraging self-improvement or verification loops.
  3. Generalization to non-reasoning tasks: The methodology indicates promise for generalized instruction following, content planning, and operational control in LLMs.
  4. Addressing semantic misunderstanding errors: Beyond structural refinements, integrating external symbolic or commonsense priors may be necessary to reduce this persistent class of errors.

Conclusion

Plan-and-Solve prompting and its enhanced variant PS+ mark a substantive advance in zero-shot CoT reasoning for LLMs. By demanding explicit planning, variable extraction, and stepwise execution, PS+ nearly matches few-shot CoT approaches without demonstration examples and substantially mitigates the calculation and missing-step limitations of earlier methods. The approach generalizes across arithmetic, commonsense, and symbolic domains, illustrating both the prompt sensitivity of current LLMs and the emergent planning ability accessible via carefully crafted instructions. This paradigm opens avenues for further research in adaptive prompting, hierarchical reasoning, and broadening the applicability of large-scale zero-shot LLMs to complex task domains.
