
Least-to-Most Prompting Enables Complex Reasoning in Large Language Models (2205.10625v3)

Published 21 May 2022 in cs.AI and cs.CL

Abstract: Chain-of-thought prompting has demonstrated remarkable performance on various natural language reasoning tasks. However, it tends to perform poorly on tasks that require solving problems harder than the exemplars shown in the prompts. To overcome this challenge of easy-to-hard generalization, we propose a novel prompting strategy, least-to-most prompting. The key idea in this strategy is to break down a complex problem into a series of simpler subproblems and then solve them in sequence. Solving each subproblem is facilitated by the answers to previously solved subproblems. Our experimental results on tasks related to symbolic manipulation, compositional generalization, and math reasoning reveal that least-to-most prompting is capable of generalizing to more difficult problems than those seen in the prompts. A notable finding is that when the GPT-3 code-davinci-002 model is used with least-to-most prompting, it can solve the compositional generalization benchmark SCAN in any split (including length split) with an accuracy of at least 99% using just 14 exemplars, compared to only 16% accuracy with chain-of-thought prompting. This is particularly noteworthy because neural-symbolic models in the literature that specialize in solving SCAN are trained on the entire training set containing over 15,000 examples. We have included prompts for all the tasks in the Appendix.

Least-to-Most Prompting Enables Complex Reasoning in LLMs

In "Least-to-Most Prompting Enables Complex Reasoning in LLMs," Zhou et al. delve into addressing the limitations of standard chain-of-thought (CoT) prompting in LLMs. While CoT prompting has shown commendable performance improvements over conventional few-shot prompting, it struggles with tasks that require generalization to more complex problems than those seen in training examples. The authors propose least-to-most (L2M) prompting as a novel strategy to overcome this easy-to-hard generalization issue by segmenting complex problems into a sequence of simpler subproblems and solving them incrementally.

Methodology

Least-to-most prompting operates in a two-stage process: decomposition followed by sequential problem solving. The decomposition stage involves breaking down a complex problem into simpler subproblems. In the sequential problem-solving stage, these subproblems are tackled one at a time, where each subproblem solution informs the subsequent one. This strategy is implemented via few-shot prompting without any need for additional training or fine-tuning.
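
The sketch below outlines how the two stages could be wired together in code. It is a minimal illustration, not the paper's implementation: `complete` is a hypothetical stand-in for any LLM completion call, and the few-shot exemplars are placeholders (the paper's actual prompts are given in its Appendix).

```python
def complete(prompt: str) -> str:
    """Hypothetical LLM call; wire this to your provider's completion API."""
    raise NotImplementedError

# Placeholder few-shot exemplars, not the paper's prompts.
DECOMPOSE_EXEMPLARS = (
    "Q: <complex problem>\n"
    "A: To solve this, we first need to solve: <subproblem 1>; <subproblem 2>\n\n"
)
SOLVE_EXEMPLARS = "Q: <subproblem>\nA: <worked answer>\n\n"

def least_to_most(problem: str) -> str:
    # Stage 1: decomposition -- ask the model to list the subproblems.
    decomposition = complete(DECOMPOSE_EXEMPLARS + f"Q: {problem}\nA:")
    subproblems = [s.strip() for s in decomposition.split(";") if s.strip()]

    # Stage 2: sequential solving -- answer each subproblem in order,
    # appending every solved Q/A pair so later subproblems can build on
    # earlier answers; the original problem is asked last.
    context = SOLVE_EXEMPLARS + problem + "\n\n"
    answer = ""
    for sub in subproblems + [problem]:
        answer = complete(context + f"Q: {sub}\nA:")
        context += f"Q: {sub}\nA: {answer}\n\n"
    return answer
```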

Results

The authors present experimental validation of the L2M approach across three domains: symbolic manipulation, compositional generalization, and mathematical reasoning tasks. Each domain illustrates the significant advantages of L2M over standard CoT prompting.

Symbolic Manipulation

The last-letter-concatenation task serves as the benchmark for symbolic manipulation. CoT prompting performs well only when test lists are no longer than those in the prompt exemplars; its accuracy degrades sharply on longer lists. L2M delivers substantial improvements in length generalization, retaining much higher accuracy precisely where CoT falls off.
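
To make the benchmark concrete, the snippet below computes the ground-truth answer for a short word list (the words are illustrative) and notes how least-to-most handles longer lists.

```python
# Reference computation for the last-letter-concatenation task. The task is
# trivial in code; the point is that CoT prompts with short exemplar lists do
# not generalize to longer test lists, while least-to-most does.

def last_letter_concat(words):
    return "".join(w[-1] for w in words)

print(last_letter_concat(["think", "machine", "learning"]))  # -> "keg"

# Least-to-most decomposes a long list into progressively longer prefixes
# (["think"], ["think", "machine"], ...), so each step only needs to append
# one letter to the previously computed answer.
```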

Compositional Generalization

The SCAN dataset is used to assess compositional generalization. SCAN requires translating natural language commands into action sequences and poses stringent length-generalization demands. While CoT prompting yields limited success, L2M achieves near-perfect accuracy (99.7%) with the GPT-3 code-davinci-002 model using only 14 demonstration examples. This performance holds across all task splits, including the length split, and significantly outperforms specialized neural-symbolic models that are trained on the full dataset of over 15,000 examples.
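
As a concrete illustration of the task itself (not of the paper's prompting setup), the toy interpreter below handles a small fragment of the SCAN grammar with simplified action tokens; the full benchmark also covers directions and modifiers such as "opposite" and "around".

```python
# Toy interpreter for a small fragment of SCAN-style commands, for intuition
# only. Action tokens are simplified relative to the official benchmark.

PRIMITIVES = {"jump": "JUMP", "walk": "WALK", "run": "RUN", "look": "LOOK"}

def interpret(command: str) -> str:
    # "X after Y" executes Y first, then X; "X and Y" executes X then Y.
    if " after " in command:
        left, right = command.split(" after ", 1)
        return interpret(right) + " " + interpret(left)
    if " and " in command:
        left, right = command.split(" and ", 1)
        return interpret(left) + " " + interpret(right)
    if command.endswith(" twice"):
        return " ".join([interpret(command[: -len(" twice")])] * 2)
    if command.endswith(" thrice"):
        return " ".join([interpret(command[: -len(" thrice")])] * 3)
    return PRIMITIVES[command]

print(interpret("jump twice after walk"))  # -> "WALK JUMP JUMP"
```

Least-to-most prompting approaches such commands by first reducing a long command to its shorter constituents and then mapping each constituent to actions, reusing earlier translations.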

Mathematical Reasoning

On mathematical reasoning tasks drawn from the GSM8K and DROP datasets, L2M improves over CoT prompting, especially on multi-step problems. For instance, on GSM8K problems requiring five or more steps, L2M's accuracy markedly surpasses CoT's. On DROP, a dataset consisting largely of problems that decompose into simple steps, L2M outperforms CoT by a notable margin, reflecting its broader applicability and robustness in numerical reasoning.
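
For intuition, here is the arithmetic behind a GSM8K-style decomposition in the spirit of the paper's water-slide example: the first subproblem computes the duration of one trip, and the final question reuses that intermediate answer.

```python
# Worked check of a least-to-most decomposition for a simple word problem:
# climbing takes 4 minutes, sliding down takes 1 minute, the slide closes
# in 15 minutes; how many times can she slide before it closes?

climb_minutes = 4
slide_minutes = 1
time_left = 15

# Subproblem 1: how long does each trip take?
trip_minutes = climb_minutes + slide_minutes   # 5 minutes

# Subproblem 2 (the original question): how many trips fit before closing?
trips = time_left // trip_minutes              # 3 trips
print(trip_minutes, trips)
```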

Implications and Future Directions

The implications of this novel prompting method are multifaceted:

  • Practical: The L2M technique can dramatically boost the performance of state-of-the-art LLMs on tasks requiring extended reasoning capabilities—facilitating more accurate and deeper natural language understanding.
  • Theoretical: The research offers insights into improving generalization in machine learning models, especially for tasks where training examples span a wide range of complexity.
  • Future Work: The research opens avenues for further exploration into more advanced decomposition strategies and their applications across diverse problem domains. It also points to the potential development of hybrid models that can seamlessly integrate L2M with other advanced prompting techniques.

Conclusion

Least-to-most prompting represents a significant advancement in prompting techniques for LLMs, demonstrating enhanced capabilities in handling complex reasoning tasks. By adopting a hierarchical problem-solving approach, it overcomes the inherent limitations of CoT prompting, showing superior performance across symbolic manipulation, compositional generalization, and math reasoning tasks. This dual-phase strategy of problem decomposition and sequential resolution establishes a new paradigm in the quest for AI systems capable of sophisticated, human-like reasoning.

Authors (11)
  1. Denny Zhou (65 papers)
  2. Nathanael Schärli (8 papers)
  3. Le Hou (36 papers)
  4. Jason Wei (49 papers)
  5. Nathan Scales (8 papers)
  6. Xuezhi Wang (64 papers)
  7. Dale Schuurmans (112 papers)
  8. Claire Cui (14 papers)
  9. Olivier Bousquet (33 papers)
  10. Quoc Le (39 papers)
  11. Ed Chi (24 papers)
Citations (891)