This paper introduces self-consistency, a simple yet effective decoding strategy designed to enhance the reasoning capabilities of LLMs when used with chain-of-thought (CoT) prompting (Wei et al., 2022). While CoT prompting improves reasoning by guiding models to generate step-by-step thought processes, it typically relies on greedy decoding, which commits to the single most probable reasoning path and can therefore produce suboptimal or incorrect answers.
Self-consistency addresses this limitation by leveraging the intuition that complex reasoning problems often admit multiple valid paths to the correct solution, whereas incorrect reasoning processes, even if locally plausible, are unlikely to converge on the same wrong answer.
The self-consistency method involves three main steps:
- Prompting: An LLM is prompted with a question using standard few-shot CoT exemplars, which include questions, reasoning steps, and answers.
- Sampling Diverse Paths: Instead of greedily decoding a single reasoning path, the method samples multiple (e.g., 40) diverse candidate reasoning paths from the LLM's decoder, using standard techniques such as temperature sampling, top-k sampling, or nucleus sampling. Each sampled output contains both a reasoning path and a final answer.
- Aggregation and Selection: The final answers derived from the diverse set of reasoning paths are aggregated. The method then marginalizes out the reasoning paths by selecting the answer that appears most frequently (i.e., the most consistent answer) via a majority vote.
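The sketch below puts these three steps together in Python. `model.generate` is a hypothetical stand-in for an LLM API call (not any particular library's interface), and the answer parser must be adapted to the task at hand.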
```python
import re


def self_consistency(prompt, question, model, num_paths, sampling_params):
    """
    Implements the self-consistency decoding strategy.

    Args:
        prompt: Few-shot CoT prompt examples.
        question: The input question to answer.
        model: The LLM.
        num_paths: The number of diverse reasoning paths to sample.
        sampling_params: Parameters for the sampling strategy
            (e.g., temperature, top_k).

    Returns:
        The most consistent final answer, or None if no answer was parsed.
    """
    full_prompt = prompt + "\nQ: " + question + "\nA:"
    generated_outputs = []

    # 1. Sample diverse reasoning paths
    for _ in range(num_paths):
        # Generate a complete reasoning path and answer
        output = model.generate(full_prompt, **sampling_params)
        generated_outputs.append(output)

    # 2. Extract final answers
    final_answers = []
    for output in generated_outputs:
        # Task-specific parser to extract the final answer from the generated text
        answer = parse_final_answer(output)
        if answer is not None:
            final_answers.append(answer)

    if not final_answers:
        return None  # Or handle appropriately

    # 3. Aggregate and find the most consistent answer (majority vote)
    answer_counts = {}
    for answer in final_answers:
        answer_counts[answer] = answer_counts.get(answer, 0) + 1
    most_consistent_answer = max(answer_counts, key=answer_counts.get)

    return most_consistent_answer


def parse_final_answer(text):
    # Example parser: find "The answer is X" and extract X.
    # This needs to be adapted to the prompt format and task.
    # re.MULTILINE lets $ match at the end of any line, not just the string.
    match = re.search(r"The answer is (.*?)\.?$", text, re.MULTILINE)
    if match:
        return match.group(1).strip()
    return None  # Or handle cases where the format isn't matched
```
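A hypothetical invocation might look like the following; `cot_exemplars` and `my_llm` are assumed to be defined elsewhere, and the sampling parameters echo the values discussed later in this summary:

```python
# Hypothetical usage; cot_exemplars and my_llm are assumed stand-ins.
answer = self_consistency(
    prompt=cot_exemplars,   # few-shot CoT examples (questions, steps, answers)
    question="Janet has 3 apples and buys 5 more. How many does she have now?",
    model=my_llm,           # any wrapper exposing generate(prompt, **params)
    num_paths=40,
    sampling_params={"temperature": 0.7, "top_k": 40},
)
print(answer)  # e.g. "8", if most sampled paths agree
```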
Key advantages of self-consistency include:
- Simplicity: It's easy to implement and requires no additional training, fine-tuning, auxiliary models, or human annotations.
- Effectiveness: It works off-the-shelf with pre-trained LLMs.
- Unsupervised: It doesn't rely on labeled data beyond the initial CoT prompts.
- Self-Ensemble: It acts like an ensemble method but uses only a single model.
Extensive experiments were conducted across four LLMs (UL2-20B, GPT-3 175B, LaMDA-137B, PaLM-540B) on various arithmetic (GSM8K, SVAMP, AQuA), commonsense (StrategyQA, ARC-c), and symbolic reasoning benchmarks.
Key Findings:
- Self-consistency significantly outperforms standard CoT prompting with greedy decoding across all tested models and benchmarks. For example, it achieved absolute accuracy gains of +17.9% on GSM8K, +11.0% on SVAMP, and +12.2% on AQuA with PaLM-540B or GPT-3 (code-davinci-002).
- The performance gains generally increase with the scale of the LLM.
- Sampling more reasoning paths (e.g., up to 40) consistently improves accuracy, though with diminishing returns.
- Self-consistency is robust to different sampling strategies (temperature, top-k, nucleus) and parameters.
- It outperforms alternative approaches like sample-and-rank, beam search decoding, and prompt ensembling techniques (e.g., permuting prompt examples).
- It can improve performance even on tasks where standard CoT prompting might hurt accuracy compared to direct prompting.
- The degree of consistency (the percentage of sampled paths agreeing on the final answer) correlates with accuracy, potentially offering an uncertainty measure (a sketch of such a score follows this list).
- It shows robustness even when prompts contain imperfect reasoning steps or use non-natural language (e.g., equations) as intermediate steps.
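To make the uncertainty point concrete, here is a minimal sketch of a consistency score. The helper name and interface are assumptions for illustration, not code from the paper:

```python
from collections import Counter

def consistency_score(final_answers):
    """Return the majority answer and the fraction of paths that agree with it.

    Hypothetical helper: since higher agreement among sampled paths
    correlates with accuracy, a low score can flag low-confidence
    predictions for review or abstention.
    """
    counts = Counter(final_answers)
    answer, votes = counts.most_common(1)[0]
    return answer, votes / len(final_answers)

# Example: 3 of 4 sampled paths agree on "18"
print(consistency_score(["18", "18", "20", "18"]))  # ('18', 0.75)
```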
Implementation Considerations:
- Computational Cost: The primary drawback is the increased computational cost due to generating multiple reasoning paths. However, significant gains can often be achieved with a smaller number of paths (e.g., 5-10).
- Answer Parsing: A reliable parser is needed to extract the final answer from the generated text, tailored to the specific prompt format and task; answers should also be normalized so that surface variants vote together (see the sketch after this list).
- Sampling Parameters: Appropriate sampling parameters (temperature, top-k) need to be chosen to encourage diversity without generating nonsensical paths. The paper suggests values like T=0.5-0.7 and k=40.
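Majority voting only works if answers that differ in surface form map to the same vote key. Below is a minimal normalizer for arithmetic tasks; the helper name and normalization rules are illustrative assumptions, not from the paper:

```python
def normalize_numeric_answer(raw):
    """Hypothetical normalizer: map "$18", "18.0", and "18" to the same
    vote key so the majority count isn't split by formatting."""
    cleaned = raw.replace("$", "").replace(",", "").strip()
    try:
        value = float(cleaned)
    except ValueError:
        return raw  # leave non-numeric answers untouched
    # Render integer-valued floats without a decimal point ("18.0" -> "18")
    return str(int(value)) if value.is_integer() else str(value)
```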
In conclusion, self-consistency provides a practical and robust method to significantly improve the reasoning performance of LLMs by sampling diverse reasoning paths and selecting the most frequent answer, requiring only minor changes to the decoding process compared to standard CoT prompting.