This paper introduces self-consistency, a simple yet effective decoding strategy designed to enhance the reasoning capabilities of LLMs when used with chain-of-thought (CoT) prompting (Wei et al., 2022). While CoT prompting improves reasoning by guiding models to generate step-by-step thought processes, it typically relies on greedy decoding, which commits to the single most probable reasoning path and can therefore produce suboptimal or incorrect answers.
Self-consistency addresses this limitation by leveraging the intuition that complex reasoning problems often admit multiple valid paths to the correct solution, whereas incorrect reasoning processes, even if locally plausible, are unlikely to converge on the same wrong answer.
The self-consistency method involves three main steps:
- Prompting: An LLM is prompted with a question using standard few-shot CoT exemplars, which include questions, reasoning steps, and answers.
- Sampling Diverse Paths: Instead of greedily decoding a single reasoning path, the method samples multiple (e.g., 40) diverse candidate reasoning paths from the LLM's decoder, using standard techniques such as temperature sampling, top-k sampling, or nucleus sampling. Each sampled output contains both a reasoning path and a final answer.
- Aggregation and Selection: The final answers derived from the diverse set of reasoning paths are aggregated. The method then marginalizes out the reasoning paths by selecting the answer that appears most frequently (i.e., the most consistent answer) via a majority vote.
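The sketch below puts these three steps together in Python. `model.generate` is a hypothetical stand-in for an LLM API call (not any particular library's interface), and the answer parser must be adapted to the task at hand.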
```python
import re


def self_consistency(prompt, question, model, num_paths, sampling_params):
    """
    Implements the self-consistency decoding strategy.

    Args:
        prompt: Few-shot CoT prompt examples.
        question: The input question to answer.
        model: The LLM.
        num_paths: The number of diverse reasoning paths to sample.
        sampling_params: Parameters for the sampling strategy
            (e.g., temperature, top_k).

    Returns:
        The most consistent final answer, or None if no answer was parsed.
    """
    full_prompt = prompt + "\nQ: " + question + "\nA:"
    generated_outputs = []

    # 1. Sample diverse reasoning paths
    for _ in range(num_paths):
        # Generate a complete reasoning path and answer
        output = model.generate(full_prompt, **sampling_params)
        generated_outputs.append(output)

    # 2. Extract final answers
    final_answers = []
    for output in generated_outputs:
        # Task-specific parser to extract the final answer from the generated text
        answer = parse_final_answer(output)
        if answer is not None:
            final_answers.append(answer)

    if not final_answers:
        return None  # Or handle appropriately

    # 3. Aggregate and find the most consistent answer (majority vote)
    answer_counts = {}
    for answer in final_answers:
        answer_counts[answer] = answer_counts.get(answer, 0) + 1
    most_consistent_answer = max(answer_counts, key=answer_counts.get)

    return most_consistent_answer


def parse_final_answer(text):
    # Example parser: find "The answer is X" and extract X.
    # This needs to be adapted to the prompt format and task.
    # re.MULTILINE lets $ match at the end of any line, not just the string.
    match = re.search(r"The answer is (.*?)\.?$", text, re.MULTILINE)
    if match:
        return match.group(1).strip()
    return None  # Or handle cases where the format isn't matched
```
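A hypothetical invocation might look like the following; `cot_exemplars` and `my_llm` are assumed to be defined elsewhere, and the sampling parameters echo the values discussed later in this summary:

```python
# Hypothetical usage; cot_exemplars and my_llm are assumed stand-ins.
answer = self_consistency(
    prompt=cot_exemplars,   # few-shot CoT examples (questions, steps, answers)
    question="Janet has 3 apples and buys 5 more. How many does she have now?",
    model=my_llm,           # any wrapper exposing generate(prompt, **params)
    num_paths=40,
    sampling_params={"temperature": 0.7, "top_k": 40},
)
print(answer)  # e.g. "8", if most sampled paths agree
```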
Key advantages of self-consistency include:
- Simplicity: It's easy to implement and requires no additional training, fine-tuning, auxiliary models, or human annotations.
- Effectiveness: It works off-the-shelf with pre-trained LLMs.
- Unsupervised: It doesn't rely on labeled data beyond the initial CoT prompts.
- Self-Ensemble: It acts like an ensemble method but uses only a single model.
Extensive experiments were conducted across four LLMs (UL2-20B, GPT-3 175B, LaMDA-137B, PaLM-540B) on various arithmetic (GSM8K, SVAMP, AQuA), commonsense (StrategyQA, ARC-c), and symbolic reasoning benchmarks.
Key Findings:
- Self-consistency significantly outperforms standard CoT prompting with greedy decoding across all tested models and benchmarks. For example, it achieved absolute accuracy gains of +17.9% on GSM8K, +11.0% on SVAMP, and +12.2% on AQuA with PaLM-540B or GPT-3 (code-davinci-002).
- The performance gains generally increase with the scale of the LLM.
- Sampling more reasoning paths (e.g., up to 40) consistently improves accuracy, though with diminishing returns.
- Self-consistency is robust to different sampling strategies (temperature, top-k, nucleus) and parameters.
- It outperforms alternative approaches like sample-and-rank, beam search decoding, and prompt ensembling techniques (e.g., permuting prompt examples).
- It can improve performance even on tasks where standard CoT prompting might hurt accuracy compared to direct prompting.
- The degree of consistency (the percentage of sampled paths agreeing on the final answer) correlates with accuracy, potentially offering an uncertainty measure (a sketch of such a score follows this list).
- It shows robustness even when prompts contain imperfect reasoning steps or use non-natural language (e.g., equations) as intermediate steps.
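To make the uncertainty point concrete, here is a minimal sketch of a consistency score. The helper name and interface are assumptions for illustration, not code from the paper:

```python
from collections import Counter

def consistency_score(final_answers):
    """Return the majority answer and the fraction of paths that agree with it.

    Hypothetical helper: since higher agreement among sampled paths
    correlates with accuracy, a low score can flag low-confidence
    predictions for review or abstention.
    """
    counts = Counter(final_answers)
    answer, votes = counts.most_common(1)[0]
    return answer, votes / len(final_answers)

# Example: 3 of 4 sampled paths agree on "18"
print(consistency_score(["18", "18", "20", "18"]))  # ('18', 0.75)
```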
Implementation Considerations:
- Computational Cost: The primary drawback is the increased computational cost due to generating multiple reasoning paths. However, significant gains can often be achieved with a smaller number of paths (e.g., 5-10).
- Answer Parsing: A reliable parser is needed to extract the final answer from the generated text, tailored to the specific prompt format and task; answers should also be normalized so that surface variants vote together (see the sketch after this list).
- Sampling Parameters: Appropriate sampling parameters (temperature, top-k) need to be chosen to encourage diversity without generating nonsensical paths. The paper suggests values like T=0.5-0.7 and k=40.
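Majority voting only works if answers that differ in surface form map to the same vote key. Below is a minimal normalizer for arithmetic tasks; the helper name and normalization rules are illustrative assumptions, not from the paper:

```python
def normalize_numeric_answer(raw):
    """Hypothetical normalizer: map "$18", "18.0", and "18" to the same
    vote key so the majority count isn't split by formatting."""
    cleaned = raw.replace("$", "").replace(",", "").strip()
    try:
        value = float(cleaned)
    except ValueError:
        return raw  # leave non-numeric answers untouched
    # Render integer-valued floats without a decimal point ("18.0" -> "18")
    return str(int(value)) if value.is_integer() else str(value)
```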
In conclusion, self-consistency provides a practical and robust method to significantly improve the reasoning performance of LLMs by sampling diverse reasoning paths and selecting the most frequent answer, requiring only minor changes to the decoding process compared to standard CoT prompting.