
SelfCheck: Using LLMs to Zero-Shot Check Their Own Step-by-Step Reasoning

Published 1 Aug 2023 in cs.AI, cs.CL, and cs.LG | arXiv:2308.00436v3

Abstract: The recent progress in LLMs, especially the invention of chain-of-thought prompting, has made it possible to automatically answer questions by stepwise reasoning. However, when faced with more complicated problems that require non-linear thinking, even the strongest LLMs make mistakes. To address this, we explore whether LLMs are able to recognize errors in their own step-by-step reasoning, without resorting to external resources. To this end, we propose SelfCheck, a general-purpose zero-shot verification schema for recognizing such errors. We then use the results of these checks to improve question-answering performance by conducting weighted voting on multiple solutions to the question. We test SelfCheck on three datasets (GSM8K, MathQA, and MATH) and find that it successfully recognizes errors and, in turn, increases final answer accuracies.

Citations (84)

Summary

  • The paper introduces SelfCheck, a novel zero-shot verification method enabling LLMs to detect and correct their own reasoning errors.
  • It employs a structured four-step pipeline—target extraction, information collection, step regeneration, and result comparison—to enhance step accuracy.
  • Evaluation on GSM8K, MathQA, and MATH datasets demonstrates improved performance over standard voting schemes with more reliable confidence measures.

Overview

In the domain of automated reasoning and verification, the paper "SelfCheck: Using LLMs to Zero-Shot Check Their Own Step-by-Step Reasoning" proposes a novel methodology for enhancing the accuracy and reliability of LLMs by using them to verify their own outputs. The paper introduces a verification scheme named SelfCheck, designed to identify errors in step-by-step reasoning without relying on external datasets or specialized training.

Introduction to SelfCheck

SelfCheck addresses the shortcomings of LLMs on complex problems that require multi-step reasoning. Existing approaches such as Chain-of-Thought (CoT) prompting are effective at generating stepwise solutions, but they provide no mechanism for detecting and correcting mistakes within individual reasoning steps. SelfCheck fills this gap with a zero-shot verification mechanism that leverages the LLM itself for error detection and correction.

The primary innovation of SelfCheck lies in its structured pipeline, which decomposes verification into manageable stages. Instead of asking for a direct, single-pass verification, which LLMs struggle to perform reliably, SelfCheck applies four distinct steps: target extraction, information collection, step regeneration, and result comparison (Figure 1).

Figure 1: Example of using SelfCheck, focusing on the checking of a particular step (Step 5), including extraction, collection, regeneration, and comparison stages.

Step Verification Methodology

Target Extraction and Information Collection

Target extraction isolates the goal of each reasoning step, prompting the LLM to state succinctly what the step is meant to achieve. Information collection then narrows the context, selecting only the parts of the question and the preceding steps that the current step directly depends on. A sketch of both stages follows.
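
To make these two stages concrete, here is a minimal sketch of how the corresponding prompts might be assembled. The `call_llm` wrapper and the prompt wording are illustrative assumptions, not the paper's exact zero-shot templates.

```python
# Sketch of the first two SelfCheck stages. `call_llm` is a hypothetical
# wrapper around any chat-completion API; the prompt wording is
# illustrative rather than the paper's exact template.

def call_llm(prompt: str) -> str:
    """Placeholder for a single LLM call (OpenAI, Anthropic, a local model, ...)."""
    raise NotImplementedError

def extract_target(question: str, steps: list[str], i: int) -> str:
    """Ask the model to state, in one sentence, what step i tries to achieve."""
    prompt = (
        f"Question: {question}\n"
        "Reasoning so far:\n" + "\n".join(steps[: i + 1]) + "\n\n"
        f"In one sentence, what is the goal of step {i + 1}?"
    )
    return call_llm(prompt)

def collect_information(question: str, steps: list[str], i: int, target: str) -> str:
    """Ask the model to keep only the context the current step depends on."""
    prompt = (
        f"Question: {question}\n"
        "Previous steps:\n" + "\n".join(steps[:i]) + "\n\n"
        f"Goal of the current step: {target}\n"
        "List only the facts from the question and the previous steps that "
        "are needed to achieve this goal."
    )
    return call_llm(prompt)
```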

Step Regeneration

In the step regeneration phase, the LLM is tasked with achieving the extracted target independently, using only the collected context, and thereby re-deriving the step's conclusion. This stage plays to the model's generative strengths and keeps the regenerated result free of anchoring on the original output.
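
A matching sketch for this stage follows, building on the helpers above. Note that the original step is deliberately withheld from the prompt, which is what keeps the regeneration independent; again, the wording is an assumption.

```python
def regenerate_step(question: str, target: str, collected_info: str) -> str:
    """Re-derive the step's conclusion from the target and filtered context only.

    The original step is deliberately withheld so the model cannot simply
    copy or rationalize it.
    """
    prompt = (
        f"Question: {question}\n"
        f"Relevant information: {collected_info}\n"
        f"Goal: {target}\n"
        "Carry out this goal and state the resulting conclusion."
    )
    return call_llm(prompt)
```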

Result Comparison

The final stage checks the consistency between the regenerated result and the original step. This comparison determines whether the original step is supported by, contradicted by, or unrelated to the regenerated output.
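
The comparison can be phrased as a three-way classification, and the four stages chained into a per-step check, as sketched below using the helpers from the previous stages. The labels follow the paper's support/contradict/unrelated taxonomy; the numeric encoding is an illustrative choice.

```python
def compare_results(original_step: str, regenerated: str) -> int:
    """Three-way comparison of the original step against the regenerated one.

    Returns +1 (supported), -1 (contradicted), or 0 (unrelated). The
    numeric encoding is an illustrative choice, not the paper's.
    """
    prompt = (
        f"Original step: {original_step}\n"
        f"Independently derived result: {regenerated}\n"
        "Does the derived result support, contradict, or say nothing about "
        "the original step? Answer with one word: support, contradict, or unrelated."
    )
    answer = call_llm(prompt).strip().lower()
    return {"support": 1, "contradict": -1}.get(answer, 0)

def check_step(question: str, steps: list[str], i: int) -> int:
    """Run the four SelfCheck stages on step i."""
    target = extract_target(question, steps, i)
    info = collect_information(question, steps, i, target)
    regenerated = regenerate_step(question, target, info)
    return compare_results(steps[i], regenerated)
```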

Evaluation and Results

The effectiveness of SelfCheck was validated on the GSM8K, MathQA, and MATH datasets. Using the per-solution confidence scores for weighted voting over multiple sampled solutions, SelfCheck achieved higher final-answer accuracy than simple majority voting across difficulty levels (Figure 2).

Figure 2: The upper plots show the final-answer accuracies of SelfCheck and majority voting with GPT-3.5 for varying numbers of sampled solutions; the lower plots show the corresponding accuracy differences.
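
In outline, each solution's per-step check results are integrated into a solution-level confidence, and the final answer is chosen by summing the confidences of the solutions that reach it. The multiplicative integration below is a simplifying assumption; the paper defines its own aggregation of step results, and the penalty constants here are made up for illustration.

```python
from collections import defaultdict

def solution_confidence(step_results: list[int],
                        contradiction_penalty: float = 0.3,
                        unrelated_penalty: float = 0.9) -> float:
    """Integrate per-step check results (+1/0/-1) into one confidence score.

    This multiplicative form, where every contradicted or unclear step
    shrinks the score, is a simplifying assumption rather than the
    paper's exact integration function.
    """
    confidence = 1.0
    for r in step_results:
        if r == -1:
            confidence *= contradiction_penalty
        elif r == 0:
            confidence *= unrelated_penalty
    return confidence

def weighted_vote(solutions: list[tuple[str, float]]) -> str:
    """Choose the answer whose solutions carry the highest total confidence.

    `solutions` holds (final_answer, confidence) pairs from independently
    sampled chain-of-thought solutions.
    """
    totals: defaultdict[str, float] = defaultdict(float)
    for answer, conf in solutions:
        totals[answer] += conf
    return max(totals, key=totals.get)
```

With all confidences equal this reduces to majority voting; SelfCheck's gains come from down-weighting solutions whose steps fail the check.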

SelfCheck's verification quality was further assessed through threshold-based confidence classification: filtering out solutions below a confidence threshold substantially increases the proportion of correct solutions among those retained (Figure 3).

Figure 3: Proportion of truly correct solutions among those predicted correct on GSM8K, MathQA, and MATH as the confidence threshold is raised.
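
This kind of analysis amounts to measuring precision at a threshold: keep only solutions whose confidence exceeds a cutoff and compute the fraction that are actually correct. The following is a generic sketch of that measurement, not code from the paper.

```python
def precision_at_threshold(solutions: list[tuple[bool, float]],
                           threshold: float) -> float:
    """Fraction of truly correct solutions among those kept by the filter.

    `solutions` pairs a ground-truth correctness flag with a SelfCheck
    confidence score; raising `threshold` trades coverage for precision.
    """
    kept = [is_correct for is_correct, conf in solutions if conf >= threshold]
    return sum(kept) / len(kept) if kept else 0.0
```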

Comparative Analysis and Ablations

SelfCheck's multi-stage verification design was validated through ablation studies showing its superiority over single-stage and direct error-checking prompts. These ablations confirm that decomposing verification into well-defined stages yields better results than attempting a holistic evaluation of each step.

Conclusion

SelfCheck introduces an effective methodology for self-contained reasoning verification within LLMs. By avoiding dependence on external data or additional training, it remains applicable across domains. The paper's experiments confirm that the approach not only increases final-answer accuracy but also provides more reliable confidence measures for individual solutions.
