Chain of Thought Prompting Elicits Reasoning in Large Language Models
Abstract: Although scaling up LLM size has reliably improved performance on a range of NLP tasks, even the largest models currently struggle with certain reasoning tasks such as math word problems, symbolic manipulation, and commonsense reasoning. This paper explores the ability of LLMs to generate a coherent chain of thought -- a series of short sentences that mimic the reasoning process a person might have when responding to a question. Experiments show that inducing a chain of thought via prompting can enable sufficiently large LLMs to better perform reasoning tasks that otherwise have flat scaling curves.
Explain it Like I'm 14
Overview
This paper explores a simple idea to help large language models (LLMs) think better: ask them to “show their work.” The authors call this chain-of-thought prompting. Instead of just giving the model a question and expecting an answer, they give a few examples where each example includes the question, a step-by-step explanation, and the final answer. Doing this helps the model solve harder problems in math, common sense, and puzzles.
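To make the idea concrete, here is a minimal sketch in Python of the two prompting styles. The worked tennis-ball problem and the unsolved juggler question follow the example shown in the paper; the surrounding code (the `build_prompt` helper) is illustrative scaffolding, not something from the paper.

```python
# Illustrative sketch of the two prompting styles (not code from the paper).

# Standard prompting: the exemplar maps a question directly to an answer.
standard_exemplar = (
    "Q: Roger has 5 tennis balls. He buys 2 more cans of tennis balls. "
    "Each can has 3 tennis balls. How many tennis balls does he have now?\n"
    "A: The answer is 11.\n"
)

# Chain-of-thought prompting: the same exemplar, but the answer is preceded
# by the intermediate reasoning steps written out in plain language.
cot_exemplar = (
    "Q: Roger has 5 tennis balls. He buys 2 more cans of tennis balls. "
    "Each can has 3 tennis balls. How many tennis balls does he have now?\n"
    "A: Roger started with 5 balls. 2 cans of 3 tennis balls each is "
    "6 tennis balls. 5 + 6 = 11. The answer is 11.\n"
)

def build_prompt(exemplars, new_question):
    """Concatenate a few solved exemplars and append the unsolved question."""
    return "\n".join(exemplars) + f"\nQ: {new_question}\nA:"

print(build_prompt(
    [cot_exemplar],
    "A juggler can juggle 16 balls. Half of the balls are golf balls, and "
    "half of the golf balls are blue. How many blue golf balls are there?",
))
```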
What questions does the paper try to answer?
The paper looks at three main questions:
- If we ask AI models to think step by step, do they solve complex problems more accurately?
- Does this “show your work” approach only help very large models, or does it also help smaller ones?
- Is the improvement really from the step-by-step reasoning, or could it just be from other effects (like more text or revealing extra clues)?
How did the researchers test their idea?
The approach is a lot like teaching someone math by showing examples with all the steps.
- Few-shot prompting: The researchers gave the model a small set of example problems. Each example had three parts: the input (the question), a chain of thought (the step-by-step reasoning in plain language), and the output (the final answer). Then the model had to solve new problems on its own.
- Models tested: They tried this with several big models, including GPT-3, LaMDA, and PaLM (ranging up to 540 billion parameters—think of “parameters” as the model’s brain cells; more parameters generally means a bigger, smarter model).
- Tasks: They tested three types of tasks:
- Math word problems (like the GSM8K benchmark, which has lots of everyday math questions).
- Commonsense questions (like StrategyQA and tasks that test understanding of dates or sports).
- Symbolic puzzles (like taking the last letters of names and stringing them together, or tracking whether a coin stays heads after several flips).
- Ablation (fair tests): To check that the step-by-step reasoning is the key ingredient, they also tried variations (illustrated in the sketch after this list):
- Equation-only: Just write the math equation before the answer, no explanation.
- Variable compute: Add filler (like dots “…”) to use more text without real reasoning.
- Reasoning after the answer: Put the explanation after the answer to see if the answer was really produced by the reasoning.
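To show how these ablations differ only in what surrounds the answer, here is a hypothetical sketch built on the same worked problem as above. The exact wording of the paper's ablation prompts may differ; this only illustrates the three variants.

```python
# Hypothetical sketch of the three ablation variants (wording is
# illustrative, not the paper's exact prompts).
question = (
    "Q: Roger has 5 tennis balls. He buys 2 more cans of tennis balls. "
    "Each can has 3 tennis balls. How many tennis balls does he have now?\n"
)

# 1. Equation-only: just the math, no natural-language explanation.
equation_only = question + "A: 5 + 2 * 3 = 11. The answer is 11.\n"

# 2. Variable compute: filler tokens that add length but no reasoning.
variable_compute = question + "A: " + "." * 30 + " The answer is 11.\n"

# 3. Reasoning after the answer: the explanation arrives too late to
#    influence the answer the model produces.
reasoning_after = (
    question
    + "A: The answer is 11. Roger started with 5 balls. 2 cans of 3 "
    "tennis balls each is 6 tennis balls. 5 + 6 = 11.\n"
)
```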
What did they find, and why is it important?
Here are the main takeaways:
- Showing your work makes a big difference—but mostly for very large models. Smaller models can produce fluent explanations that sound nice but don’t actually help; big models improve a lot when prompted to reason step by step.
- Big performance jumps on hard math. For example, PaLM 540B asked to use chain-of-thought got a new best score on GSM8K (math word problems), beating earlier systems that were specially trained for the task. In the paper’s figure, standard prompting with PaLM 540B solved about 18% of GSM8K questions, while chain-of-thought prompting solved about 57%, surpassing the previous best at 55%.
- Better commonsense reasoning. On StrategyQA, chain-of-thought with PaLM 540B reached about 78%, topping the prior best (around 69%). On a sports understanding test, it reached about 95%, even higher than an unaided sports enthusiast (about 84%).
- Strong on symbolic puzzles and generalization. With chain-of-thought, large models handled puzzles nearly perfectly when the test problems were similar to the examples. Even when the test problems were longer or had more steps than the examples, performance still improved—showing the model learned the pattern and could stretch it to tougher cases (see the sketch after this list).
- The boost comes from real reasoning, not tricks. The ablations showed that:
- Just writing equations didn’t help on hard multi-step problems like GSM8K.
- Adding extra tokens (like “…”) without reasoning didn’t help.
- Putting explanations after the answer didn’t help—models did better when they reasoned before answering.
- Robust and flexible. Different people wrote the step-by-step examples in different styles, and the model still improved. Using other sets of examples also worked. This suggests the method doesn’t rely on one “magic prompt,” but on the general idea of step-by-step thinking.
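For the symbolic puzzles, the correct answers are easy to compute in code, which is also how test problems longer than the prompt exemplars can be generated. The sketch below is only illustrative; the function names and the name pool are hypothetical, not taken from the paper.

```python
import random

# Ground truth for the two symbolic tasks described earlier
# (helper names are hypothetical, for illustration only).

def last_letter_concatenation(names):
    """e.g. ['Amy', 'Brad'] -> 'yd'."""
    return "".join(name[-1] for name in names)

def coin_still_heads(flips):
    """The coin starts heads-up; each True in flips turns it over."""
    return sum(flips) % 2 == 0

# Length generalization: the prompt exemplars might use 2 names, but test
# problems can be built with more names than anything shown in the prompt.
def make_longer_problem(name_pool, n_names):
    names = random.sample(name_pool, n_names)
    question = (
        f"Take the last letters of the words in '{' '.join(names)}' "
        "and concatenate them."
    )
    return question, last_letter_concatenation(names)

pool = ["Amy", "Brad", "Carla", "Devon", "Elena", "Frank"]
print(make_longer_problem(pool, 4))  # harder than a 2-name exemplar
```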
What does this mean for the future?
- Easier, faster setup: Instead of training a new model for every task (which needs lots of labeled data), you can often get strong results by carefully prompting a single large model with a few step-by-step examples.
- More trustworthy reasoning: Because the model writes out its thought process, we can inspect where it goes wrong and fix prompts or catch mistakes. It’s not perfect “transparency,” but it’s a helpful window into the model’s thinking.
- Wider reach: Math, commonsense, and puzzle-like tasks all benefit. This hints that any problem humans can solve by explaining steps in language might be helped by chain-of-thought prompting.
- Scale matters: The biggest gains appear in very large models. As models continue to grow and improve, chain-of-thought prompting may become even more powerful.
- Better generalization: The method helps models tackle longer, more complex problems than they saw in the examples, which is key for solving real-world tasks that aren’t fixed in length or form.
In short, getting AI to “show its work” is a simple but effective way to unlock deeper reasoning, especially in very large models, and it can make them more capable, more explainable, and more broadly useful.