Chain of Thought Prompting Elicits Reasoning in Large Language Models

Published 28 Jan 2022 in cs.CL and cs.AI | (2201.11903v1)

Abstract: Although scaling up LLM size has reliably improved performance on a range of NLP tasks, even the largest models currently struggle with certain reasoning tasks such as math word problems, symbolic manipulation, and commonsense reasoning. This paper explores the ability of LLMs to generate a coherent chain of thought -- a series of short sentences that mimic the reasoning process a person might have when responding to a question. Experiments show that inducing a chain of thought via prompting can enable sufficiently large LLMs to better perform reasoning tasks that otherwise have flat scaling curves.

Citations (6,541)

Summary

  • The paper introduces chain-of-thought prompting, which guides LLMs through intermediate reasoning steps to improve multi-step problem solving.
  • It demonstrates substantial performance gains on arithmetic, commonsense, and symbolic reasoning tasks with models over 100B parameters.
  • The approach requires no fine-tuning, offering enhanced interpretability and practical insights into the model’s reasoning process.

This paper, "Chain-of-Thought Prompting Elicits Reasoning in Large Language Models" (2201.11903), introduces a simple prompting technique called chain-of-thought (CoT) prompting that significantly enhances the reasoning abilities of LLMs. The core idea is to include a sequence of intermediate reasoning steps—a "chain of thought"—in the few-shot exemplars provided in the prompt, guiding the model to generate similar intermediate steps before producing the final answer.

Core Idea and Motivation

Standard few-shot prompting, where the model is given input-output pairs, has been successful for many tasks but often falls short on those requiring multi-step reasoning, like arithmetic word problems or complex commonsense questions. Prior work addressed this by training or finetuning models to generate intermediate steps or rationales, but creating large datasets of high-quality rationales is costly. Chain-of-thought prompting (2201.11903) combines the benefits of generating intermediate steps with the advantages of few-shot prompting. Instead of just input -> output examples, CoT prompting uses input -> chain of thought -> output examples. This approach requires no model finetuning, allowing a single LLM to perform various reasoning tasks using only few-shot prompting.
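
For concreteness, the sketch below assembles a standard few-shot exemplar and a CoT few-shot exemplar side by side and builds a prompt from them. The exemplar wording paraphrases the paper's tennis-ball example; the `build_prompt` helper and the new question are illustrative stand-ins rather than the authors' released prompts.

```python
# Minimal sketch of standard vs. chain-of-thought few-shot prompt construction.
# Exemplar wording paraphrases the paper's tennis-ball example; the helper and
# the new question are illustrative, not the authors' released prompts.

standard_exemplar = (
    "Q: Roger has 5 tennis balls. He buys 2 cans of tennis balls, each with 3 balls. "
    "How many tennis balls does he have now?\n"
    "A: The answer is 11.\n\n"
)

cot_exemplar = (
    "Q: Roger has 5 tennis balls. He buys 2 cans of tennis balls, each with 3 balls. "
    "How many tennis balls does he have now?\n"
    "A: Roger started with 5 balls. 2 cans of 3 tennis balls each is 6 balls. "
    "5 + 6 = 11. The answer is 11.\n\n"
)

def build_prompt(exemplars: list[str], question: str) -> str:
    """Concatenate few-shot exemplars and append the new question for the model to answer."""
    return "".join(exemplars) + f"Q: {question}\nA:"

question = "A juggler has 16 balls, and half of them are golf balls. How many golf balls are there?"
print(build_prompt([cot_exemplar], question))  # the model is expected to continue with its own
                                               # chain of thought followed by "The answer is 8."
```

In the paper's setup, roughly eight such exemplars are concatenated per prompt, and the standard-prompting baseline differs only in dropping the rationale from each exemplar's answer.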

Experimental Setup

The researchers evaluated CoT prompting on a diverse set of reasoning tasks:

  1. Arithmetic Reasoning: Math word problems from benchmarks like GSM8K, SVAMP, ASDiv, AQuA, and MAWPS.
  2. Commonsense Reasoning: Tasks including CSQA, StrategyQA, Date Understanding, Sports Understanding, and SayCan robot planning.
  3. Symbolic Reasoning: Toy tasks like last letter concatenation and coin flip, designed to test the model's ability to manipulate symbols and track state (a toy generator for both tasks is sketched below).
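
To make the symbolic tasks concrete, the following rough sketch generates instances of both toy tasks along with their ground-truth answers. The names, question templates, and helper functions are illustrative assumptions, not the paper's exact data-generation code.

```python
import random

def last_letter_concatenation(name: str) -> tuple[str, str]:
    """Question/answer pair: concatenate the last letter of each word in a name."""
    question = f'Take the last letters of the words in "{name}" and concatenate them.'
    answer = "".join(word[-1] for word in name.split())
    return question, answer

def coin_flip(num_flips: int, rng: random.Random) -> tuple[str, str]:
    """Question/answer pair: is a coin still heads up after several possible flips?"""
    people = ["Alice", "Bob", "Carol", "Dave"]
    heads_up = True
    steps = []
    for person in rng.sample(people, num_flips):
        does_flip = rng.random() < 0.5
        if does_flip:
            heads_up = not heads_up
        steps.append(f"{person} {'flips' if does_flip else 'does not flip'} the coin.")
    question = "A coin is heads up. " + " ".join(steps) + " Is the coin still heads up?"
    return question, "yes" if heads_up else "no"

rng = random.Random(0)
print(last_letter_concatenation("Ada Lovelace"))  # answer: "ae"
print(coin_flip(2, rng))                          # in-domain length (2 potential flips)
print(coin_flip(4, rng))                          # longer, out-of-domain length for the same task
```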

Experiments were conducted using various LLMs, including LaMDA, GPT-3 (InstructGPT variants), PaLM, UL2, and Codex. For each task, a small number of few-shot exemplars (typically 8, manually composed) were used. The standard prompting baseline used the same exemplars but excluded the intermediate chain-of-thought steps. Greedy decoding was primarily used for generation. For arithmetic tasks, the authors also investigated the effect of using an external Python calculator to evaluate the mathematical expressions generated within the chain of thought, demonstrating that errors can stem from either reasoning logic or arithmetic computation itself.
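
A minimal sketch of that kind of post-processing is shown below, assuming a simple regex-based pipeline; the function names, patterns, and answer format are assumptions, since the paper does not specify its exact implementation. The final answer is parsed from the generation, and for the calculator variant any "a op b = c" equations in the chain of thought are re-evaluated.

```python
import re

def extract_final_answer(generation: str) -> str | None:
    """Pull the number following the last 'answer is' phrase in a generated chain of thought."""
    matches = re.findall(r"answer is\s*\$?(-?[\d,\.]+)", generation, flags=re.IGNORECASE)
    return matches[-1].rstrip(".").replace(",", "") if matches else None

def apply_external_calculator(generation: str) -> str:
    """Re-evaluate simple 'a op b = c' equations in the text, correcting arithmetic slips
    while leaving the surrounding reasoning untouched."""
    pattern = r"(-?\d+\.?\d*)\s*([+\-*/])\s*(-?\d+\.?\d*)\s*=\s*-?\d+\.?\d*"

    def recompute(match: re.Match) -> str:
        a, op, b = float(match.group(1)), match.group(2), float(match.group(3))
        if op == "/" and b == 0:
            return match.group(0)  # leave degenerate divisions untouched
        result = {"+": a + b, "-": a - b, "*": a * b, "/": a / b if b else 0.0}[op]
        result_str = str(int(result)) if result == int(result) else f"{result:g}"
        return f"{match.group(1)} {op} {match.group(3)} = {result_str}"

    return re.sub(pattern, recompute, generation)

sample = "Roger started with 5 balls. 2 cans of 3 balls is 6 balls. 5 + 6 = 12. The answer is 12."
print(apply_external_calculator(sample))  # the arithmetic slip '5 + 6 = 12' becomes '5 + 6 = 11'
print(extract_final_answer(sample))       # '12' (extraction alone does not fix the arithmetic)
```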

Key Findings

The experiments revealed several significant findings:

  • Emergent Ability: Chain-of-thought reasoning was found to be an emergent ability of model scale (2201.11903). It did not consistently improve performance, and sometimes even hurt it, for models smaller than approximately 100 billion parameters. Only with sufficiently large models (e.g., GPT-3 175B, PaLM 540B) did CoT prompting consistently and significantly improve performance on reasoning tasks compared to standard prompting. Smaller models tended to produce fluent but often illogical or incoherent chains of thought.
  • Performance Gains: CoT prompting yielded substantial performance improvements across the tested benchmarks.
    • On GSM8K (math word problems), PaLM 540B with CoT achieved a solve rate of 56.9%, a significant jump from 17.9% with standard prompting, surpassing prior state-of-the-art results.
    • Similar large gains were observed on other math datasets like SVAMP and MAWPS, particularly on the more complex multi-step subsets.
    • For commonsense tasks like StrategyQA and Date Understanding, CoT prompting also improved performance, demonstrating its applicability beyond purely numerical problems.
    • In symbolic reasoning tasks (last letter concatenation, coin flip), CoT enabled impressive performance, often approaching 100% accuracy for in-domain examples on large models.
  • Generalization to Length: CoT prompting facilitated generalization to out-of-domain examples with more steps than seen in the few-shot prompt (e.g., longer names for concatenation, more flips for coin tracking), a capability largely absent in standard prompting.
  • Ablation Studies: Experiments compared CoT prompting against variants (illustrative exemplar formats for each are sketched after this list):
    • Equation only: Prompting the model to output only a mathematical equation before the answer provided some benefit for simpler arithmetic tasks but was less effective than full CoT on complex problems like GSM8K, suggesting the natural language steps are crucial for semantic understanding and decomposition (2201.11903).
    • Variable compute only: Prompting the model to output a series of dots equivalent to the computation length showed little improvement, indicating that simply spending more tokens is not the key; the content of the intermediate steps matters (2201.11903).
    • Reasoning after answer: Placing the chain of thought after the final answer did not improve performance, suggesting that the sequential generation of reasoning steps leading to the answer is essential for deriving the solution (2201.11903).
  • Robustness: While exemplar-based prompting can be sensitive, CoT prompting showed robustness across different annotators who wrote the chains of thought, different sets of exemplars (including those from a separate dataset), and variations in the number and order of exemplars (2201.11903).
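
To illustrate the ablations referenced above, the sketch below writes out one exemplar in each prompt style. The wording is a stand-in for the paper's prompts, and the dot count in the variable-compute variant is an arbitrary choice.

```python
# Illustrative exemplar formats for the CoT prompt and the three ablations.
# The wording is a stand-in, not the paper's exact prompt text.

question = ("Q: Roger has 5 tennis balls. He buys 2 cans of tennis balls, each with 3 balls. "
            "How many tennis balls does he have now?\n")

chain_of_thought = ("A: Roger started with 5 balls. 2 cans of 3 tennis balls each is 6 balls. "
                    "5 + 6 = 11. The answer is 11.\n")

# Equation only: the rationale is compressed into a single equation.
equation_only = "A: 5 + 2 * 3 = 11. The answer is 11.\n"

# Variable compute only: filler dots of roughly comparable length replace the reasoning,
# so the model spends extra tokens without any meaningful intermediate content.
variable_compute_only = "A: " + "." * 60 + " The answer is 11.\n"

# Reasoning after answer: the same rationale, but emitted after the answer,
# so it cannot help the model derive the answer while generating it.
reasoning_after_answer = ("A: The answer is 11. Roger started with 5 balls. "
                          "2 cans of 3 tennis balls each is 6 balls. 5 + 6 = 11.\n")

for name, exemplar in [("chain of thought", chain_of_thought),
                       ("equation only", equation_only),
                       ("variable compute only", variable_compute_only),
                       ("reasoning after answer", reasoning_after_answer)]:
    print(f"--- {name} ---\n{question}{exemplar}")
```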

Manual Analysis

A manual analysis of generated chains of thought for LaMDA 137B on GSM8K provided insight into why CoT works and where models still fail. For correct answers, the generated chains of thought were mostly logically and mathematically sound. For incorrect answers, errors were categorized:

  • Minor errors (calculator errors, symbol mapping errors, one step missing) accounted for a significant portion of mistakes (46%). Scaling PaLM from 62B to 540B was observed to fix many of these types of errors, suggesting improved semantic understanding and logical flow with scale.
  • Major errors (semantic understanding errors, incoherent reasoning) constituted the remaining mistakes (54%).

This analysis suggests that improvements in foundational abilities like semantic understanding and the ability to maintain coherent, step-by-step logic contribute to the emergence of CoT reasoning at scale (2201.11903).

Practical Implications and Limitations

CoT prompting offers a powerful way to unlock the reasoning capabilities of existing LLMs without needing expensive task-specific finetuning datasets. It provides a degree of interpretability by showing the steps the model took.

However, the approach has limitations:

  • It is most effective only on very large models, which are costly to train and serve.
  • There is no guarantee that the generated chains of thought are factually correct or logically sound, even if they lead to a correct answer, particularly for non-arithmetic tasks. Ensuring the factuality and coherence of generated reasoning remains an open challenge.
  • While few-shot annotation cost is minimal, creating extensive CoT data for potential finetuning applications would be expensive, although synthetic data generation could be explored.
  • Chain of thought may not be beneficial for all tasks, particularly simple ones where standard prompting already performs well or tasks that don't naturally decompose into sequential steps.

The paper concludes that CoT prompting demonstrates that standard prompting may only show a lower bound of LLMs' capabilities and highlights the potential for further exploration of language-based reasoning methods (2201.11903).


Explain it Like I'm 14

Overview

This paper explores a simple idea to help large AI language models (LLMs) think better: ask them to “show their work.” The authors call this chain-of-thought prompting. Instead of just giving the model a question and expecting an answer, they give a few examples where each example includes the question, a step-by-step explanation, and the final answer. Doing this helps the model solve harder problems in math, common sense, and puzzles.

What questions does the paper try to answer?

The paper looks at three main questions:

  • If we ask AI models to think step by step, do they solve complex problems more accurately?
  • Does this “show your work” approach only help very large models, or does it also help smaller ones?
  • Is the improvement really from the step-by-step reasoning, or could it just be from other effects (like more text or revealing extra clues)?

How did the researchers test their idea?

The approach is a lot like teaching someone math by showing examples with all the steps.

  • Few-shot prompting: The researchers gave the model a small set of example problems. Each example had three parts: the input (the question), a chain of thought (the step-by-step reasoning in plain language), and the output (the final answer). Then the model had to solve new problems on its own.
  • Models tested: They tried this with several big models, including GPT-3, LaMDA, and PaLM (ranging up to 540 billion parameters—think of “parameters” as the model’s brain cells; more parameters generally means a bigger, smarter model).
  • Tasks: They tested three types of tasks:
    • Math word problems (like the GSM8K benchmark, which has lots of everyday math questions).
    • Commonsense questions (like StrategyQA and tasks that test understanding of dates or sports).
    • Symbolic puzzles (like taking the last letters of names and stringing them together, or tracking whether a coin stays heads after several flips).
  • Ablation (fair tests): To check that the step-by-step reasoning is the key ingredient, they also tried variations:
    • Equation-only: Just write the math equation before the answer, no explanation.
    • Variable compute: Add filler (like dots “…”) to use more text without real reasoning.
    • Reasoning after the answer: Put the explanation after the answer to see if the answer was really produced by the reasoning.

What did they find, and why is it important?

Here are the main takeaways:

  • Showing your work makes a big difference—but mostly for very large models. Smaller models can produce fluent explanations that sound nice but don’t actually help; big models improve a lot when prompted to reason step by step.
  • Big performance jumps on hard math. For example, PaLM 540B asked to use chain-of-thought got a new best score on GSM8K (math word problems), beating earlier systems that were specially trained for the task. In the paper’s figure, standard prompting with PaLM 540B solved about 18% of GSM8K questions, while chain-of-thought prompting solved about 57%, surpassing the previous best at 55%.
  • Better commonsense reasoning. On StrategyQA, chain-of-thought with PaLM 540B reached about 78%, topping the prior best (around 69%). On a sports understanding test, it reached about 95%, even higher than an unaided sports enthusiast (about 84%).
  • Strong on symbolic puzzles and generalization. With chain-of-thought, large models handled puzzles nearly perfectly when the test problems were similar to the examples. Even when the test problems were longer or had more steps than the examples, performance still improved—showing the model learned the pattern and could stretch it to tougher cases.
  • The boost comes from real reasoning, not tricks. The ablations showed that:
    • Just writing equations didn’t help on hard multi-step problems like GSM8K.
    • Adding extra tokens (like “…”) without reasoning didn’t help.
    • Putting explanations after the answer didn’t help—models did better when they reasoned before answering.
  • Robust and flexible. Different people wrote the step-by-step examples in different styles, and the model still improved. Using other sets of examples also worked. This suggests the method doesn’t rely on one “magic prompt,” but on the general idea of step-by-step thinking.

What does this mean for the future?

  • Easier, faster setup: Instead of training a new model for every task (which needs lots of labeled data), you can often get strong results by carefully prompting a single large model with a few step-by-step examples.
  • More trustworthy reasoning: Because the model writes out its thought process, we can inspect where it goes wrong and fix prompts or catch mistakes. It’s not perfect “transparency,” but it’s a helpful window into the model’s thinking.
  • Wider reach: Math, commonsense, and puzzle-like tasks all benefit. This hints that any problem humans can solve by explaining steps in language might be helped by chain-of-thought prompting.
  • Scale matters: The biggest gains appear in very large models. As models continue to grow and improve, chain-of-thought prompting may become even more powerful.
  • Better generalization: The method helps models tackle longer, more complex problems than they saw in the examples, which is key for solving real-world tasks that aren’t fixed in length or form.

In short, getting AI to “show its work” is a simple but effective way to unlock deeper reasoning—especially in very large models—and it can make them more capable, more explainable, and more broadly useful.
