Large Language Models Cannot Self-Correct Reasoning Yet (2310.01798v2)

Published 3 Oct 2023 in cs.CL and cs.AI

Abstract: LLMs have emerged as a groundbreaking technology with their unparalleled text generation capabilities across various applications. Nevertheless, concerns persist regarding the accuracy and appropriateness of their generated content. A contemporary methodology, self-correction, has been proposed as a remedy to these issues. Building upon this premise, this paper critically examines the role and efficacy of self-correction within LLMs, shedding light on its true potential and limitations. Central to our investigation is the notion of intrinsic self-correction, whereby an LLM attempts to correct its initial responses based solely on its inherent capabilities, without the crutch of external feedback. In the context of reasoning, our research indicates that LLMs struggle to self-correct their responses without external feedback, and at times, their performance even degrades after self-correction. Drawing from these insights, we offer suggestions for future research and practical applications in this field.

Overview

The paper "LLMs Cannot Self-Correct Reasoning Yet" authored by Jie Huang et al. presents an in-depth investigation into the self-correction capability of LLMs in the context of reasoning tasks. This research critically evaluates whether LLMs can intrinsically correct their own outputs without external feedback, addressing an important question in the field of artificial intelligence.

Key Findings

The authors define intrinsic self-correction as a model's ability to identify and rectify its erroneous outputs based solely on its internal mechanisms, without relying on external input or labels. They conducted extensive experiments with prominent LLMs, including GPT-3.5, GPT-4, GPT-4-Turbo, and Llama-2, evaluating performance on several reasoning benchmarks: GSM8K, CommonSenseQA, and HotpotQA.
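To make the two evaluation settings concrete, here is a minimal sketch of the feedback-then-refine loop the paper studies. The `query_llm(messages)` helper is a hypothetical placeholder for any chat-model API, and the critique/improve prompts paraphrase the style reported in the paper rather than quoting it; passing an oracle checker reproduces the gold-label setting the paper argues is unrealistic, while omitting it yields the intrinsic setting.

```python
def self_correct(question, is_correct=None, rounds=2):
    """Answer `question`, then run up to `rounds` of self-correction.

    is_correct: optional oracle checker built from the gold label. If given,
    correction stops as soon as the answer is right (the oracle setting);
    if None, the model must judge its own answer with no external signal
    (the intrinsic setting).
    """
    messages = [{"role": "user", "content": question}]
    answer = query_llm(messages)  # hypothetical chat-completion helper
    for _ in range(rounds):
        if is_correct is not None and is_correct(answer):
            return answer  # oracle feedback: keep the already-correct answer
        messages += [
            {"role": "assistant", "content": answer},
            {"role": "user",
             "content": "Review your previous answer and find problems with it."},
        ]
        critique = query_llm(messages)
        messages += [
            {"role": "assistant", "content": critique},
            {"role": "user",
             "content": "Based on the problems you found, improve your answer."},
        ]
        answer = query_llm(messages)
    return answer
```

The paper's key findings are: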

  1. Intrinsic Self-Correction Fails in Reasoning Tasks:
    • LLMs often fail to improve their answers in the intrinsic self-correction setting; in many cases, performance actually deteriorates after self-correction attempts.
    • For instance, without external feedback, the accuracy of GPT-3.5 and GPT-4 consistently dropped across all tested benchmarks, supporting the paper's central conclusion that LLMs cannot reliably judge the correctness of their own reasoning.
  2. Oracle Labels Skew Results:
    • Prior work reported significant improvements when oracle labels guide self-correction. However, oracle labels are impractical in real-world scenarios, since they presuppose access to the correct answers.
    • The results highlight a crucial distinction: the improvements seen in some studies are not due to the models' intrinsic abilities but rather the availability of oracle labels guiding the correction process.
  3. Multi-Agent Debate is Not Superior to Self-Consistency:
    • The paper compared the multi-agent debate approach with self-consistency on GSM8K and found that, for the same number of model responses, multi-agent debate offers no significant advantage.
    • In fact, self-consistency with a simple majority-voting mechanism often outperformed multi-agent debate, suggesting that consensus over independent samples is the more effective baseline; a minimal sketch of this baseline follows the list.
  4. Prompt Design Issues:
    • The paper points out that some reported improvements in self-correction might be artifacts of suboptimal prompt design for generating initial responses. When initial prompts were more detailed and comprehensive, the purported benefits of self-correction significantly diminished.
    • For instance, in the Constrained Generation task, providing a clear and complete initial prompt led to better performance than adding details only in the self-correction phase.
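Since the comparison in item 3 hinges on matching the number of model responses, it is worth seeing how little machinery the self-consistency baseline (reference 38) actually requires: sample several reasoning paths independently and majority-vote the final answer. In this sketch, `sample_llm` and `extract_answer` are hypothetical placeholders for drawing one chain-of-thought response and parsing its final answer.

```python
from collections import Counter

def self_consistency(question, n=5, temperature=0.7):
    """Sample n independent reasoning paths and majority-vote the answer."""
    answers = [extract_answer(sample_llm(question, temperature))  # hypothetical helpers
               for _ in range(n)]
    return Counter(answers).most_common(1)[0][0]  # most frequent answer wins
```

Comparing any self-correction method against this baseline at an equal number of model calls is exactly the evaluation discipline the paper calls for.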

Implications and Future Directions

The findings have several implications:

  1. Refined Evaluation Metrics:
    • Future research should rigorously evaluate self-correction methods against robust baselines like self-consistency, ensuring a fair comparison with equivalent inference costs.
  2. External Feedback Utilization:
    • Given the challenges of intrinsic self-correction, leveraging external feedback sources could offer more practical improvements. Future methods might integrate interactive components with external tools or human inputs to provide effective correction mechanisms.
  3. Training Verifiers:
    • Developing specialized verifier models trained on high-quality annotated datasets could help LLMs assess the correctness of their outputs more accurately and provide meaningful correction feedback; a sketch of this idea follows the list.
  4. Comprehensive Prompts:
    • Ensuring that initial prompts are as informative and detailed as possible is crucial for fair comparisons. Future studies should carefully design prompts to encapsulate the entire task requirements from the start.
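The simplest way to plug a trained verifier into generation is best-of-n reranking in the spirit of reference 9: sample several candidate solutions and keep the one the verifier scores highest. In this sketch, `sample_llm` and `verifier_score` are hypothetical placeholders, the latter standing in for a separately trained model that estimates the probability a solution is correct.

```python
def best_of_n_with_verifier(question, n=10):
    """Sample n candidate solutions and return the one the verifier trusts most."""
    candidates = [sample_llm(question) for _ in range(n)]  # hypothetical sampler
    return max(candidates, key=lambda s: verifier_score(question, s))
```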

In summary, while intrinsic self-correction remains an elusive goal for contemporary LLMs, this paper underscores the importance of realistic evaluation settings and the potential benefits of external feedback mechanisms. The community is encouraged to continue exploring innovative ways to enhance the self-correction capabilities of LLMs, keeping in mind the current limitations and practical considerations highlighted by this research.

References (49)
  1. Artificial hallucinations in ChatGPT: Implications in scientific writing. Cureus, 15(2), 2023.
  2. PaLM 2 technical report. arXiv preprint arXiv:2305.10403, 2023.
  3. Constitutional AI: Harmlessness from AI feedback. arXiv preprint arXiv:2212.08073, 2022.
  4. A multitask, multilingual, multimodal evaluation of ChatGPT on reasoning, hallucination, and interactivity. arXiv preprint arXiv:2302.04023, 2023.
  5. Sparks of artificial general intelligence: Early experiments with GPT-4. arXiv preprint arXiv:2303.12712, 2023.
  6. Extracting training data from large language models. In USENIX Security Symposium, volume 6, 2021.
  7. ReConcile: Round-table conference improves reasoning via consensus among diverse LLMs. arXiv preprint arXiv:2309.13007, 2023.
  8. Teaching large language models to self-debug. arXiv preprint arXiv:2304.05128, 2023.
  9. Training verifiers to solve math word problems. arXiv preprint arXiv:2110.14168, 2021.
  10. Improving factuality and reasoning in language models through multiagent debate. arXiv preprint arXiv:2305.14325, 2023.
  11. The capacity for moral self-correction in large language models. arXiv preprint arXiv:2302.07459, 2023.
  12. RARR: Researching and revising what language models say, using language models. In Proceedings of the 61st Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers), pp. 16477–16508, 2023.
  13. CRITIC: Large language models can self-correct with tool-interactive critiquing. arXiv preprint arXiv:2305.11738, 2023.
  14. Jie Huang and Kevin Chen-Chuan Chang. Towards reasoning in large language models: A survey. In Findings of the Association for Computational Linguistics: ACL 2023. Association for Computational Linguistics, 2023.
  15. Are large pre-trained language models leaking your personal information? In Findings of the Association for Computational Linguistics: EMNLP 2022, pp. 2038–2047, Abu Dhabi, United Arab Emirates, 2022. Association for Computational Linguistics.
  16. Language models (mostly) know what they know. arXiv preprint arXiv:2207.05221, 2022.
  17. Language models can solve computer tasks. The ICML Workshop on Artificial Intelligence & Human Computer Interaction, 2023.
  18. Large language models are zero-shot reasoners. Advances in Neural Information Processing Systems, 35:22199–22213, 2022.
  19. Gradient-based learning applied to document recognition. Proceedings of the IEEE, 86(11):2278–2324, 1998.
  20. Multi-step jailbreaking privacy attacks on ChatGPT. arXiv preprint arXiv:2304.05197, 2023.
  21. Encouraging divergent thinking in large language models through multi-agent debate. arXiv preprint arXiv:2305.19118, 2023.
  22. Let’s verify step by step. arXiv preprint arXiv:2305.20050, 2023.
  23. Self-Refine: Iterative refinement with self-feedback. Advances in Neural Information Processing Systems, 2023.
  24. Demystifying GPT self-repair for code generation. arXiv preprint arXiv:2306.09896, 2023.
  25. OpenAI. GPT-4 technical report, 2023.
  26. Training language models to follow instructions with human feedback. Advances in Neural Information Processing Systems, 35:27730–27744, 2022.
  27. Automatically correcting large language models: Surveying the landscape of diverse self-correction strategies. arXiv preprint arXiv:2308.03188, 2023.
  28. REFINER: Reasoning feedback on intermediate representations. arXiv preprint arXiv:2304.01904, 2023.
  29. True few-shot learning with language models. Advances in Neural Information Processing Systems, 34:11054–11070, 2021.
  30. Is ChatGPT a general-purpose natural language processing task solver? arXiv preprint arXiv:2302.06476, 2023.
  31. Learning representations by back-propagating errors. Nature, 323(6088):533–536, 1986.
  32. Quantifying association capabilities of large language models and its implications on privacy leakage. arXiv preprint arXiv:2305.12707, 2023.
  33. Large language models can be easily distracted by irrelevant context. In International Conference on Machine Learning, pp. 31210–31227. PMLR, 2023.
  34. Reflexion: Language agents with verbal reinforcement learning. Advances in Neural Information Processing Systems, 2023.
  35. Reinforcement Learning: An Introduction. MIT Press, 2018.
  36. CommonsenseQA: A question answering challenge targeting commonsense knowledge. In Proceedings of the 2019 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies, Volume 1 (Long and Short Papers), pp. 4149–4158, 2019.
  37. Shepherd: A critic for language model generation. arXiv preprint arXiv:2308.04592, 2023.
  38. Self-consistency improves chain of thought reasoning in language models. In The Eleventh International Conference on Learning Representations, 2023.
  39. Jailbroken: How does LLM safety training fail? arXiv preprint arXiv:2307.02483, 2023.
  40. Chain-of-thought prompting elicits reasoning in large language models. Advances in Neural Information Processing Systems, 35:24824–24837, 2022.
  41. Generating sequences by learning to self-correct. In The Eleventh International Conference on Learning Representations, 2023.
  42. Large language models as optimizers. arXiv preprint arXiv:2309.03409, 2023.
  43. HotpotQA: A dataset for diverse, explainable multi-hop question answering. In Proceedings of the 2018 Conference on Empirical Methods in Natural Language Processing. Association for Computational Linguistics, 2018.
  44. Tree of Thoughts: Deliberate problem solving with large language models. arXiv preprint arXiv:2305.10601, 2023.
  45. Why does ChatGPT fall short in providing truthful answers? arXiv preprint arXiv:2304.10513, 2023.
  46. Solving challenging math word problems using GPT-4 Code Interpreter with code-based self-verification. arXiv preprint arXiv:2308.07921, 2023.
  47. Least-to-most prompting enables complex reasoning in large language models. In The Eleventh International Conference on Learning Representations, 2023.
  48. Large language models are human-level prompt engineers. In The Eleventh International Conference on Learning Representations, 2023.
  49. Universal and transferable adversarial attacks on aligned language models. arXiv preprint arXiv:2307.15043, 2023.
Authors (7)
  1. Jie Huang (155 papers)
  2. Xinyun Chen (80 papers)
  3. Swaroop Mishra (60 papers)
  4. Huaixiu Steven Zheng (11 papers)
  5. Adams Wei Yu (23 papers)
  6. Xinying Song (15 papers)
  7. Denny Zhou (65 papers)
Citations (285)