Chain of Thought Prompting Elicits Reasoning in Large Language Models

Published 28 Jan 2022 in cs.CL and cs.AI | (2201.11903v1)

Abstract: Although scaling up LLM size has reliably improved performance on a range of NLP tasks, even the largest models currently struggle with certain reasoning tasks such as math word problems, symbolic manipulation, and commonsense reasoning. This paper explores the ability of LLMs to generate a coherent chain of thought -- a series of short sentences that mimic the reasoning process a person might have when responding to a question. Experiments show that inducing a chain of thought via prompting can enable sufficiently large LLMs to better perform reasoning tasks that otherwise have flat scaling curves.

Citations (6,541)

Summary

  • The paper introduces chain-of-thought prompting, which guides LLMs through intermediate reasoning steps to improve multi-step problem solving.
  • It demonstrates substantial performance gains on arithmetic, commonsense, and symbolic reasoning tasks with models over 100B parameters.
  • The approach requires no fine-tuning, offering enhanced interpretability and practical insights into the model’s reasoning process.

This paper, "Chain-of-Thought Prompting Elicits Reasoning in Large Language Models" (2201.11903), introduces a simple prompting technique called chain-of-thought (CoT) prompting that significantly enhances the reasoning abilities of LLMs. The core idea is to include a sequence of intermediate reasoning steps—a "chain of thought"—in the few-shot exemplars provided in the prompt, guiding the model to generate similar intermediate steps before producing the final answer.

Core Idea and Motivation

Standard few-shot prompting, where the model is given input-output pairs, has been successful for many tasks but often falls short on those requiring multi-step reasoning, like arithmetic word problems or complex commonsense questions. Prior work addressed this by training or finetuning models to generate intermediate steps or rationales, but creating large datasets of high-quality rationales is costly. Chain-of-thought prompting (2201.11903) combines the benefits of generating intermediate steps with the advantages of few-shot prompting. Instead of just input -> output examples, CoT prompting uses input -> chain of thought -> output examples. This approach requires no model finetuning, allowing a single LLM to perform various reasoning tasks using only few-shot prompting.
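
For concreteness, the sketch below assembles a standard few-shot exemplar and a CoT few-shot exemplar side by side and builds a prompt from them. The exemplar wording paraphrases the paper's tennis-ball example; the `build_prompt` helper and the new question are illustrative stand-ins rather than the authors' released prompts.

```python
# Minimal sketch of standard vs. chain-of-thought few-shot prompt construction.
# Exemplar wording paraphrases the paper's tennis-ball example; the helper and
# the new question are illustrative, not the authors' released prompts.

standard_exemplar = (
    "Q: Roger has 5 tennis balls. He buys 2 cans of tennis balls, each with 3 balls. "
    "How many tennis balls does he have now?\n"
    "A: The answer is 11.\n\n"
)

cot_exemplar = (
    "Q: Roger has 5 tennis balls. He buys 2 cans of tennis balls, each with 3 balls. "
    "How many tennis balls does he have now?\n"
    "A: Roger started with 5 balls. 2 cans of 3 tennis balls each is 6 balls. "
    "5 + 6 = 11. The answer is 11.\n\n"
)

def build_prompt(exemplars: list[str], question: str) -> str:
    """Concatenate few-shot exemplars and append the new question for the model to answer."""
    return "".join(exemplars) + f"Q: {question}\nA:"

question = "A juggler has 16 balls, and half of them are golf balls. How many golf balls are there?"
print(build_prompt([cot_exemplar], question))  # the model is expected to continue with its own
                                               # chain of thought followed by "The answer is 8."
```

In the paper's setup, roughly eight such exemplars are concatenated per prompt, and the standard-prompting baseline differs only in dropping the rationale from each exemplar's answer.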

Experimental Setup

The researchers evaluated CoT prompting on a diverse set of reasoning tasks:

  1. Arithmetic Reasoning: Math word problems from benchmarks like GSM8K, SVAMP, ASDiv, AQuA, and MAWPS.
  2. Commonsense Reasoning: Tasks including CSQA, StrategyQA, Date Understanding, Sports Understanding, and SayCan robot planning.
  3. Symbolic Reasoning: Toy tasks like last letter concatenation and coin flip, designed to test the model's ability to manipulate symbols and track state (a toy generator for both tasks is sketched below).
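
To make the symbolic tasks concrete, the following rough sketch generates instances of both toy tasks along with their ground-truth answers. The names, question templates, and helper functions are illustrative assumptions, not the paper's exact data-generation code.

```python
import random

def last_letter_concatenation(name: str) -> tuple[str, str]:
    """Question/answer pair: concatenate the last letter of each word in a name."""
    question = f'Take the last letters of the words in "{name}" and concatenate them.'
    answer = "".join(word[-1] for word in name.split())
    return question, answer

def coin_flip(num_flips: int, rng: random.Random) -> tuple[str, str]:
    """Question/answer pair: is a coin still heads up after several possible flips?"""
    people = ["Alice", "Bob", "Carol", "Dave"]
    heads_up = True
    steps = []
    for person in rng.sample(people, num_flips):
        does_flip = rng.random() < 0.5
        if does_flip:
            heads_up = not heads_up
        steps.append(f"{person} {'flips' if does_flip else 'does not flip'} the coin.")
    question = "A coin is heads up. " + " ".join(steps) + " Is the coin still heads up?"
    return question, "yes" if heads_up else "no"

rng = random.Random(0)
print(last_letter_concatenation("Ada Lovelace"))  # answer: "ae"
print(coin_flip(2, rng))                          # in-domain length (2 potential flips)
print(coin_flip(4, rng))                          # longer, out-of-domain length for the same task
```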

Experiments were conducted using various LLMs, including LaMDA, GPT-3 (InstructGPT variants), PaLM, UL2, and Codex. For each task, a small number of few-shot exemplars (typically 8, manually composed) were used. The standard prompting baseline used the same exemplars but excluded the intermediate chain-of-thought steps. Greedy decoding was primarily used for generation. For arithmetic tasks, the authors also investigated the effect of using an external Python calculator to evaluate the mathematical expressions generated within the chain of thought, demonstrating that errors can stem from either reasoning logic or arithmetic computation itself.
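
A minimal sketch of that kind of post-processing is shown below, assuming a simple regex-based pipeline; the function names, patterns, and answer format are assumptions, since the paper does not specify its exact implementation. The final answer is parsed from the generation, and for the calculator variant any "a op b = c" equations in the chain of thought are re-evaluated.

```python
import re

def extract_final_answer(generation: str) -> str | None:
    """Pull the number following the last 'answer is' phrase in a generated chain of thought."""
    matches = re.findall(r"answer is\s*\$?(-?[\d,\.]+)", generation, flags=re.IGNORECASE)
    return matches[-1].rstrip(".").replace(",", "") if matches else None

def apply_external_calculator(generation: str) -> str:
    """Re-evaluate simple 'a op b = c' equations in the text, correcting arithmetic slips
    while leaving the surrounding reasoning untouched."""
    pattern = r"(-?\d+\.?\d*)\s*([+\-*/])\s*(-?\d+\.?\d*)\s*=\s*-?\d+\.?\d*"

    def recompute(match: re.Match) -> str:
        a, op, b = float(match.group(1)), match.group(2), float(match.group(3))
        if op == "/" and b == 0:
            return match.group(0)  # leave degenerate divisions untouched
        result = {"+": a + b, "-": a - b, "*": a * b, "/": a / b if b else 0.0}[op]
        result_str = str(int(result)) if result == int(result) else f"{result:g}"
        return f"{match.group(1)} {op} {match.group(3)} = {result_str}"

    return re.sub(pattern, recompute, generation)

sample = "Roger started with 5 balls. 2 cans of 3 balls is 6 balls. 5 + 6 = 12. The answer is 12."
print(apply_external_calculator(sample))  # the arithmetic slip '5 + 6 = 12' becomes '5 + 6 = 11'
print(extract_final_answer(sample))       # '12' (extraction alone does not fix the arithmetic)
```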

Key Findings

The experiments revealed several significant findings:

  • Emergent Ability: Chain-of-thought reasoning was found to be an emergent ability of model scale (2201.11903). It did not consistently improve performance, and sometimes even hurt it, for models smaller than approximately 100 billion parameters. Only with sufficiently large models (e.g., GPT-3 175B, PaLM 540B) did CoT prompting consistently and significantly improve performance on reasoning tasks compared to standard prompting. Smaller models tended to produce fluent but often illogical or incoherent chains of thought.
  • Performance Gains: CoT prompting yielded substantial performance improvements across the tested benchmarks.
    • On GSM8K (math word problems), PaLM 540B with CoT achieved a solve rate of 56.9%, a significant jump from 17.9% with standard prompting, surpassing prior state-of-the-art results.
    • Similar large gains were observed on other math datasets like SVAMP and MAWPS, particularly on the more complex multi-step subsets.
    • For commonsense tasks like StrategyQA and Date Understanding, CoT prompting also improved performance, demonstrating its applicability beyond purely numerical problems.
    • In symbolic reasoning tasks (last letter concatenation, coin flip), CoT enabled impressive performance, often approaching 100% accuracy for in-domain examples on large models.
  • Generalization to Length: CoT prompting facilitated generalization to out-of-domain examples with more steps than seen in the few-shot prompt (e.g., longer names for concatenation, more flips for coin tracking), a capability largely absent in standard prompting.
  • Ablation Studies: Experiments compared CoT prompting against variants (illustrative exemplar formats for each are sketched after this list):
    • Equation only: Prompting the model to output only a mathematical equation before the answer provided some benefit for simpler arithmetic tasks but was less effective than full CoT on complex problems like GSM8K, suggesting the natural language steps are crucial for semantic understanding and decomposition (2201.11903).
    • Variable compute only: Prompting the model to output a series of dots equivalent to the computation length showed little improvement, indicating that simply spending more tokens is not the key; the content of the intermediate steps matters (2201.11903).
    • Reasoning after answer: Placing the chain of thought after the final answer did not improve performance, suggesting that the sequential generation of reasoning steps leading to the answer is essential for deriving the solution (2201.11903).
  • Robustness: While exemplar-based prompting can be sensitive, CoT prompting showed robustness across different annotators who wrote the chains of thought, different sets of exemplars (including those from a separate dataset), and variations in the number and order of exemplars (2201.11903).
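
To illustrate the ablations referenced above, the sketch below writes out one exemplar in each prompt style. The wording is a stand-in for the paper's prompts, and the dot count in the variable-compute variant is an arbitrary choice.

```python
# Illustrative exemplar formats for the CoT prompt and the three ablations.
# The wording is a stand-in, not the paper's exact prompt text.

question = ("Q: Roger has 5 tennis balls. He buys 2 cans of tennis balls, each with 3 balls. "
            "How many tennis balls does he have now?\n")

chain_of_thought = ("A: Roger started with 5 balls. 2 cans of 3 tennis balls each is 6 balls. "
                    "5 + 6 = 11. The answer is 11.\n")

# Equation only: the rationale is compressed into a single equation.
equation_only = "A: 5 + 2 * 3 = 11. The answer is 11.\n"

# Variable compute only: filler dots of roughly comparable length replace the reasoning,
# so the model spends extra tokens without any meaningful intermediate content.
variable_compute_only = "A: " + "." * 60 + " The answer is 11.\n"

# Reasoning after answer: the same rationale, but emitted after the answer,
# so it cannot help the model derive the answer while generating it.
reasoning_after_answer = ("A: The answer is 11. Roger started with 5 balls. "
                          "2 cans of 3 tennis balls each is 6 balls. 5 + 6 = 11.\n")

for name, exemplar in [("chain of thought", chain_of_thought),
                       ("equation only", equation_only),
                       ("variable compute only", variable_compute_only),
                       ("reasoning after answer", reasoning_after_answer)]:
    print(f"--- {name} ---\n{question}{exemplar}")
```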

Manual Analysis

A manual analysis of generated chains of thought for LaMDA 137B on GSM8K provided insight into why CoT works and where models still fail. For correct answers, the generated chains of thought were mostly logically and mathematically sound. For incorrect answers, errors were categorized:

  • Minor errors (calculator errors, symbol mapping errors, one step missing) accounted for a significant portion of mistakes (46%). Scaling PaLM from 62B to 540B was observed to fix many of these types of errors, suggesting improved semantic understanding and logical flow with scale.
  • Major errors (semantic understanding errors, incoherent reasoning) constituted the remaining mistakes (54%).

This analysis suggests that improvements in foundational abilities like semantic understanding and the ability to maintain coherent, step-by-step logic contribute to the emergence of CoT reasoning at scale (2201.11903).

Practical Implications and Limitations

CoT prompting offers a powerful way to unlock the reasoning capabilities of existing LLMs without needing expensive task-specific finetuning datasets. It provides a degree of interpretability by showing the steps the model took.

However, the approach has limitations:

  • It is most effective only on very large models, which are costly to train and serve.
  • There is no guarantee that the generated chains of thought are factually correct or logically sound, even if they lead to a correct answer, particularly for non-arithmetic tasks. Ensuring the factuality and coherence of generated reasoning remains an open challenge.
  • While few-shot annotation cost is minimal, creating extensive CoT data for potential finetuning applications would be expensive, although synthetic data generation could be explored.
  • Chain of thought may not be beneficial for all tasks, particularly simple ones where standard prompting already performs well or tasks that don't naturally decompose into sequential steps.

The paper concludes that CoT prompting demonstrates that standard prompting may only show a lower bound of LLMs' capabilities and highlights the potential for further exploration of language-based reasoning methods (2201.11903).


Explain it Like I'm 14

Overview

This paper explores a simple idea to help large AI language models (LLMs) think better: ask them to “show their work.” The authors call this chain-of-thought prompting. Instead of just giving the model a question and expecting an answer, they give a few examples where each example includes the question, a step-by-step explanation, and the final answer. Doing this helps the model solve harder problems in math, common sense, and puzzles.

What questions does the paper try to answer?

The paper looks at three main questions:

  • If we ask AI models to think step by step, do they solve complex problems more accurately?
  • Does this “show your work” approach only help very large models, or does it also help smaller ones?
  • Is the improvement really from the step-by-step reasoning, or could it just be from other effects (like more text or revealing extra clues)?

How did the researchers test their idea?

The approach is a lot like teaching someone math by showing examples with all the steps.

  • Few-shot prompting: The researchers gave the model a small set of example problems. Each example had three parts: the input (the question), a chain of thought (the step-by-step reasoning in plain language), and the output (the final answer). Then the model had to solve new problems on its own.
  • Models tested: They tried this with several big models, including GPT-3, LaMDA, and PaLM (ranging up to 540 billion parameters—think of “parameters” as the model’s brain cells; more parameters generally means a bigger, smarter model).
  • Tasks: They tested three types of tasks:
    • Math word problems (like the GSM8K benchmark, which has lots of everyday math questions).
    • Commonsense questions (like StrategyQA and tasks that test understanding of dates or sports).
    • Symbolic puzzles (like taking the last letters of names and stringing them together, or tracking whether a coin stays heads after several flips).
  • Ablation (fair tests): To check that the step-by-step reasoning is the key ingredient, they also tried variations:
    • Equation-only: Just write the math equation before the answer, no explanation.
    • Variable compute: Add filler (like dots “…”) to use more text without real reasoning.
    • Reasoning after the answer: Put the explanation after the answer to see if the answer was really produced by the reasoning.

What did they find, and why is it important?

Here are the main takeaways:

  • Showing your work makes a big difference—but mostly for very large models. Smaller models can produce fluent explanations that sound nice but don’t actually help; big models improve a lot when prompted to reason step by step.
  • Big performance jumps on hard math. For example, PaLM 540B asked to use chain-of-thought got a new best score on GSM8K (math word problems), beating earlier systems that were specially trained for the task. In the paper’s figure, standard prompting with PaLM 540B solved about 18% of GSM8K questions, while chain-of-thought prompting solved about 57%, surpassing the previous best at 55%.
  • Better commonsense reasoning. On StrategyQA, chain-of-thought with PaLM 540B reached about 78%, topping the prior best (around 69%). On a sports understanding test, it reached about 95%, even higher than an unaided sports enthusiast (about 84%).
  • Strong on symbolic puzzles and generalization. With chain-of-thought, large models handled puzzles nearly perfectly when the test problems were similar to the examples. Even when the test problems were longer or had more steps than the examples, performance still improved—showing the model learned the pattern and could stretch it to tougher cases.
  • The boost comes from real reasoning, not tricks. The ablations showed that:
    • Just writing equations didn’t help on hard multi-step problems like GSM8K.
    • Adding extra tokens (like “…”) without reasoning didn’t help.
    • Putting explanations after the answer didn’t help—models did better when they reasoned before answering.
  • Robust and flexible. Different people wrote the step-by-step examples in different styles, and the model still improved. Using other sets of examples also worked. This suggests the method doesn’t rely on one “magic prompt,” but on the general idea of step-by-step thinking.

What does this mean for the future?

  • Easier, faster setup: Instead of training a new model for every task (which needs lots of labeled data), you can often get strong results by carefully prompting a single large model with a few step-by-step examples.
  • More trustworthy reasoning: Because the model writes out its thought process, we can inspect where it goes wrong and fix prompts or catch mistakes. It’s not perfect “transparency,” but it’s a helpful window into the model’s thinking.
  • Wider reach: Math, commonsense, and puzzle-like tasks all benefit. This hints that any problem humans can solve by explaining steps in language might be helped by chain-of-thought prompting.
  • Scale matters: The biggest gains appear in very large models. As models continue to grow and improve, chain-of-thought prompting may become even more powerful.
  • Better generalization: The method helps models tackle longer, more complex problems than they saw in the examples, which is key for solving real-world tasks that aren’t fixed in length or form.

In short, getting AI to “show its work” is a simple but effective way to unlock deeper reasoning—especially in very large models—and it can make them more capable, more explainable, and more broadly useful.
