Complexity-Based Prompting for Multi-step Reasoning: A Methodological Examination
The paper "Complexity-Based Prompting for Multi-step Reasoning" explores an area critically relevant to the functionality and enhancement of LLMs, namely their performance on tasks requiring multi-step reasoning. As foundational work within this space, it appraises and advances the methodology of chain-of-thought (CoT) prompting which involves utilizing coherent intermediate reasoning steps that guide the LLM to conclusions rather than direct answers to inputs.
The authors introduce a novel approach termed "complexity-based prompting," which selects in-context examples with higher reasoning complexity, defined by the number of reasoning steps in the annotated chain, as the prompt for reasoning tasks. The paper posits that including such complex, multi-step exemplars in the prompt improves the LLM's ability to tackle intricate reasoning problems, yielding substantial gains over conventional example-selection methods.
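A rough sketch of this selection rule appears below, under the assumption that a pool of examples with annotated reasoning chains is available and that one non-empty line of a chain approximates one reasoning step; the data structure and parameter names are illustrative, not the paper's implementation.

```python
# Sketch of complexity-based exemplar selection: rank annotated examples by the
# number of reasoning steps in their chains and keep the most complex ones as
# the few-shot prompt. A "step" is approximated here as one non-empty line of
# the reasoning chain; AnnotatedExample and k=8 are assumptions for illustration.

from dataclasses import dataclass

@dataclass
class AnnotatedExample:
    question: str
    chain: str   # multi-line reasoning annotation
    answer: str

def count_steps(chain: str) -> int:
    """Approximate reasoning complexity as the number of non-empty lines in the chain."""
    return sum(1 for line in chain.splitlines() if line.strip())

def select_complex_exemplars(pool: list[AnnotatedExample], k: int = 8) -> list[AnnotatedExample]:
    """Pick the k examples with the longest reasoning chains to use as the prompt."""
    return sorted(pool, key=lambda ex: count_steps(ex.chain), reverse=True)[:k]
```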
Key experimental results indicate that this method yields a mean accuracy improvement of +5.3 points, with gains of up to +18 points on individual benchmarks, when applied to LLMs such as GPT-3 and Codex. The experiments span five datasets: the math benchmarks GSM8K, MultiArith, and MathQA, plus the Date Understanding and Penguins tasks from BIG-Bench Hard. The research also extends the complexity criterion from prompt selection to decoding: the model samples multiple reasoning chains, and the final answer is chosen by majority vote among the most complex of those chains.
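The decoding-side counterpart can be sketched as follows: sample several reasoning chains, keep the most complex ones, and take a majority vote over their final answers. The function name, the `(chain, answer)` pair representation, and the default `top_k` value are assumptions for illustration rather than details fixed by the paper.

```python
# Sketch of complexity-based selection at decoding time: among sampled reasoning
# chains, keep the ones with the most reasoning steps and majority-vote over
# their final answers. Sampling from the LLM is left abstract; `sampled_chains`
# is assumed to be a list of (chain_text, final_answer) pairs.

from collections import Counter

def vote_over_complex_chains(sampled_chains: list[tuple[str, str]], top_k: int = 5) -> str:
    """Majority vote over the answers of the top_k most complex sampled chains."""
    def steps(chain: str) -> int:
        return sum(1 for line in chain.splitlines() if line.strip())

    most_complex = sorted(sampled_chains, key=lambda ca: steps(ca[0]), reverse=True)[:top_k]
    answers = Counter(answer for _, answer in most_complex)
    return answers.most_common(1)[0][0]
```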
Compared with existing approaches such as manual example tuning or heuristic-based selection schemes, complexity-based selection stands out for its minimal annotation requirement and ease of implementation. Notably, the results show that the approach remains robust under distribution shift and variations in prompt format. This consistency and broad applicability suggest the strategy could transfer to a wider class of challenges faced by LLMs beyond those explored here.
From a theoretical perspective, this work builds on the understanding of emergent abilities in LLMs, in particular their propensity for sophisticated reasoning at larger parameter counts (e.g., >100B parameters, as outlined in previous studies). It also implicitly interrogates, and affirms, the task-specific gains available from in-context learning strategies relative to more computationally expensive and less flexible fine-tuning procedures.
The implications of complexity-based prompting extend to future work on LLM robustness and reasoning granularity, suggesting a heuristic that embraces complexity rather than minimizing it as a path to more refined knowledge extraction and application. This insight challenges researchers to reexamine the latent capabilities of LLMs, pointing toward a paradigm in which complexity becomes an ally in eliciting computational reasoning.
While the proposed methodology demonstrably advances the state of the art in multi-step reasoning, especially where reasoning annotations are sparse or expensive, future research will need to explore how complexity metrics can be integrated dynamically with more nuanced aspects of context, language variability, and computational efficiency. As with any frontier result in machine learning, these questions pave the way for continued inquiry into making LLMs not only more capable of reasoning but also more broadly adaptable to the diverse cognitive tasks indicative of human-like understanding.