Large Language Models Cannot Self-Correct Reasoning Yet (2310.01798v2)

Published 3 Oct 2023 in cs.CL and cs.AI

Abstract: LLMs have emerged as a groundbreaking technology with their unparalleled text generation capabilities across various applications. Nevertheless, concerns persist regarding the accuracy and appropriateness of their generated content. A contemporary methodology, self-correction, has been proposed as a remedy to these issues. Building upon this premise, this paper critically examines the role and efficacy of self-correction within LLMs, shedding light on its true potential and limitations. Central to our investigation is the notion of intrinsic self-correction, whereby an LLM attempts to correct its initial responses based solely on its inherent capabilities, without the crutch of external feedback. In the context of reasoning, our research indicates that LLMs struggle to self-correct their responses without external feedback, and at times, their performance even degrades after self-correction. Drawing from these insights, we offer suggestions for future research and practical applications in this field.

Overview

The paper "LLMs Cannot Self-Correct Reasoning Yet" authored by Jie Huang et al. presents an in-depth investigation into the self-correction capability of LLMs in the context of reasoning tasks. This research critically evaluates whether LLMs can intrinsically correct their own outputs without external feedback, addressing an important question in the field of artificial intelligence.

Key Findings

The authors define intrinsic self-correction as a model's ability to identify and rectify its erroneous outputs based solely on its internal mechanisms, without relying on external input or labels. They conducted extensive experiments with prominent LLMs, including GPT-3.5, GPT-4, GPT-4-Turbo, and Llama-2, evaluating performance on several reasoning benchmarks: GSM8K, CommonSenseQA, and HotpotQA.
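To make the two evaluation settings concrete, here is a minimal sketch of the feedback-then-refine loop the paper studies. The `query_llm(messages)` helper is a hypothetical placeholder for any chat-model API, and the critique/improve prompts paraphrase the style reported in the paper rather than quoting it; passing an oracle checker reproduces the gold-label setting the paper argues is unrealistic, while omitting it yields the intrinsic setting.

```python
def self_correct(question, is_correct=None, rounds=2):
    """Answer `question`, then run up to `rounds` of self-correction.

    is_correct: optional oracle checker built from the gold label. If given,
    correction stops as soon as the answer is right (the oracle setting);
    if None, the model must judge its own answer with no external signal
    (the intrinsic setting).
    """
    messages = [{"role": "user", "content": question}]
    answer = query_llm(messages)  # hypothetical chat-completion helper
    for _ in range(rounds):
        if is_correct is not None and is_correct(answer):
            return answer  # oracle feedback: keep the already-correct answer
        messages += [
            {"role": "assistant", "content": answer},
            {"role": "user",
             "content": "Review your previous answer and find problems with it."},
        ]
        critique = query_llm(messages)
        messages += [
            {"role": "assistant", "content": critique},
            {"role": "user",
             "content": "Based on the problems you found, improve your answer."},
        ]
        answer = query_llm(messages)
    return answer
```

The paper's key findings are: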

  1. Intrinsic Self-Correction Fails in Reasoning Tasks:
    • LLMs often fail to improve their answers in the intrinsic self-correction setting; in many cases, performance actually deteriorates after self-correction attempts.
    • For instance, without external feedback, the accuracy of GPT-3.5 and GPT-4 consistently dropped across all tested benchmarks, supporting the paper's central conclusion that LLMs cannot reliably judge the correctness of their own reasoning.
  2. Oracle Labels Skew Results:
    • Prior work reported significant improvements when oracle labels guide self-correction. However, oracle labels are impractical in real-world scenarios, since they presuppose access to the correct answers.
    • The results highlight a crucial distinction: the improvements seen in some studies are not due to the models' intrinsic abilities but rather the availability of oracle labels guiding the correction process.
  3. Multi-Agent Debate is Not Superior to Self-Consistency:
    • The paper compared the multi-agent debate approach with self-consistency on GSM8K and found that, for the same number of model responses, multi-agent debate offers no significant advantage.
    • In fact, self-consistency with a simple majority-voting mechanism often outperformed multi-agent debate, suggesting that consensus over independent samples is the more effective baseline; a minimal sketch of this baseline follows the list.
  4. Prompt Design Issues:
    • The paper points out that some reported improvements in self-correction might be artifacts of suboptimal prompt design for generating initial responses. When initial prompts were more detailed and comprehensive, the purported benefits of self-correction significantly diminished.
    • For instance, in the Constrained Generation task, providing a clear and complete initial prompt led to better performance than adding details only in the self-correction phase.
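Since the comparison in item 3 hinges on matching the number of model responses, it is worth seeing how little machinery the self-consistency baseline (reference 38) actually requires: sample several reasoning paths independently and majority-vote the final answer. In this sketch, `sample_llm` and `extract_answer` are hypothetical placeholders for drawing one chain-of-thought response and parsing its final answer.

```python
from collections import Counter

def self_consistency(question, n=5, temperature=0.7):
    """Sample n independent reasoning paths and majority-vote the answer."""
    answers = [extract_answer(sample_llm(question, temperature))  # hypothetical helpers
               for _ in range(n)]
    return Counter(answers).most_common(1)[0][0]  # most frequent answer wins
```

Comparing any self-correction method against this baseline at an equal number of model calls is exactly the evaluation discipline the paper calls for.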

Implications and Future Directions

The findings have several implications:

  1. Refined Evaluation Metrics:
    • Future research should rigorously evaluate self-correction methods against robust baselines like self-consistency, ensuring a fair comparison with equivalent inference costs.
  2. External Feedback Utilization:
    • Given the challenges of intrinsic self-correction, leveraging external feedback sources could offer more practical improvements. Future methods might integrate interactive components with external tools or human inputs to provide effective correction mechanisms.
  3. Training Verifiers:
    • Developing specialized verifier models trained on high-quality annotated datasets could help LLMs assess the correctness of their outputs more accurately and provide meaningful correction feedback; a sketch of this idea follows the list.
  4. Comprehensive Prompts:
    • Ensuring that initial prompts are as informative and detailed as possible is crucial for fair comparisons. Future studies should carefully design prompts to encapsulate the entire task requirements from the start.
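The simplest way to plug a trained verifier into generation is best-of-n reranking in the spirit of reference 9: sample several candidate solutions and keep the one the verifier scores highest. In this sketch, `sample_llm` and `verifier_score` are hypothetical placeholders, the latter standing in for a separately trained model that estimates the probability a solution is correct.

```python
def best_of_n_with_verifier(question, n=10):
    """Sample n candidate solutions and return the one the verifier trusts most."""
    candidates = [sample_llm(question) for _ in range(n)]  # hypothetical sampler
    return max(candidates, key=lambda s: verifier_score(question, s))
```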

In summary, while intrinsic self-correction remains an elusive goal for contemporary LLMs, this paper underscores the importance of realistic evaluation settings and the potential benefits of external feedback mechanisms. The community is encouraged to continue exploring innovative ways to enhance the self-correction capabilities of LLMs, keeping in mind the current limitations and practical considerations highlighted by this research.

References (49)
  1. Artificial hallucinations in ChatGPT: Implications in scientific writing. Cureus, 15(2), 2023.
  2. PaLM 2 technical report. arXiv preprint arXiv:2305.10403, 2023.
  3. Constitutional AI: Harmlessness from AI feedback. arXiv preprint arXiv:2212.08073, 2022.
  4. A multitask, multilingual, multimodal evaluation of ChatGPT on reasoning, hallucination, and interactivity. arXiv preprint arXiv:2302.04023, 2023.
  5. Sparks of artificial general intelligence: Early experiments with GPT-4. arXiv preprint arXiv:2303.12712, 2023.
  6. Extracting training data from large language models. In USENIX Security Symposium, volume 6, 2021.
  7. ReConcile: Round-table conference improves reasoning via consensus among diverse LLMs. arXiv preprint arXiv:2309.13007, 2023.
  8. Teaching large language models to self-debug. arXiv preprint arXiv:2304.05128, 2023.
  9. Training verifiers to solve math word problems. arXiv preprint arXiv:2110.14168, 2021.
  10. Improving factuality and reasoning in language models through multiagent debate. arXiv preprint arXiv:2305.14325, 2023.
  11. The capacity for moral self-correction in large language models. arXiv preprint arXiv:2302.07459, 2023.
  12. RARR: Researching and revising what language models say, using language models. In Proceedings of the 61st Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers), pp. 16477–16508, 2023.
  13. CRITIC: Large language models can self-correct with tool-interactive critiquing. arXiv preprint arXiv:2305.11738, 2023.
  14. Jie Huang and Kevin Chen-Chuan Chang. Towards reasoning in large language models: A survey. In Findings of the Association for Computational Linguistics: ACL 2023. Association for Computational Linguistics, 2023.
  15. Are large pre-trained language models leaking your personal information? In Findings of the Association for Computational Linguistics: EMNLP 2022, pp. 2038–2047, Abu Dhabi, United Arab Emirates, 2022. Association for Computational Linguistics.
  16. Language models (mostly) know what they know. arXiv preprint arXiv:2207.05221, 2022.
  17. Language models can solve computer tasks. The ICML Workshop on Artificial Intelligence & Human Computer Interaction, 2023.
  18. Large language models are zero-shot reasoners. Advances in Neural Information Processing Systems, 35:22199–22213, 2022.
  19. Gradient-based learning applied to document recognition. Proceedings of the IEEE, 86(11):2278–2324, 1998.
  20. Multi-step jailbreaking privacy attacks on ChatGPT. arXiv preprint arXiv:2304.05197, 2023.
  21. Encouraging divergent thinking in large language models through multi-agent debate. arXiv preprint arXiv:2305.19118, 2023.
  22. Let’s verify step by step. arXiv preprint arXiv:2305.20050, 2023.
  23. Self-Refine: Iterative refinement with self-feedback. Advances in Neural Information Processing Systems, 2023.
  24. Demystifying GPT self-repair for code generation. arXiv preprint arXiv:2306.09896, 2023.
  25. OpenAI. GPT-4 technical report, 2023.
  26. Training language models to follow instructions with human feedback. Advances in Neural Information Processing Systems, 35:27730–27744, 2022.
  27. Automatically correcting large language models: Surveying the landscape of diverse self-correction strategies. arXiv preprint arXiv:2308.03188, 2023.
  28. REFINER: Reasoning feedback on intermediate representations. arXiv preprint arXiv:2304.01904, 2023.
  29. True few-shot learning with language models. Advances in Neural Information Processing Systems, 34:11054–11070, 2021.
  30. Is ChatGPT a general-purpose natural language processing task solver? arXiv preprint arXiv:2302.06476, 2023.
  31. Learning representations by back-propagating errors. Nature, 323(6088):533–536, 1986.
  32. Quantifying association capabilities of large language models and its implications on privacy leakage. arXiv preprint arXiv:2305.12707, 2023.
  33. Large language models can be easily distracted by irrelevant context. In International Conference on Machine Learning, pp. 31210–31227. PMLR, 2023.
  34. Reflexion: Language agents with verbal reinforcement learning. Advances in Neural Information Processing Systems, 2023.
  35. Reinforcement Learning: An Introduction. MIT Press, 2018.
  36. CommonsenseQA: A question answering challenge targeting commonsense knowledge. In Proceedings of the 2019 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies, Volume 1 (Long and Short Papers), pp. 4149–4158, 2019.
  37. Shepherd: A critic for language model generation. arXiv preprint arXiv:2308.04592, 2023.
  38. Self-consistency improves chain of thought reasoning in language models. In The Eleventh International Conference on Learning Representations, 2023.
  39. Jailbroken: How does LLM safety training fail? arXiv preprint arXiv:2307.02483, 2023.
  40. Chain-of-thought prompting elicits reasoning in large language models. Advances in Neural Information Processing Systems, 35:24824–24837, 2022.
  41. Generating sequences by learning to self-correct. In The Eleventh International Conference on Learning Representations, 2023.
  42. Large language models as optimizers. arXiv preprint arXiv:2309.03409, 2023.
  43. HotpotQA: A dataset for diverse, explainable multi-hop question answering. In Proceedings of the 2018 Conference on Empirical Methods in Natural Language Processing. Association for Computational Linguistics, 2018.
  44. Tree of Thoughts: Deliberate problem solving with large language models. arXiv preprint arXiv:2305.10601, 2023.
  45. Why does ChatGPT fall short in providing truthful answers? arXiv preprint arXiv:2304.10513, 2023.
  46. Solving challenging math word problems using GPT-4 Code Interpreter with code-based self-verification. arXiv preprint arXiv:2308.07921, 2023.
  47. Least-to-most prompting enables complex reasoning in large language models. In The Eleventh International Conference on Learning Representations, 2023.
  48. Large language models are human-level prompt engineers. In The Eleventh International Conference on Learning Representations, 2023.
  49. Universal and transferable adversarial attacks on aligned language models. arXiv preprint arXiv:2307.15043, 2023.
Authors (7)
  1. Jie Huang (155 papers)
  2. Xinyun Chen (80 papers)
  3. Swaroop Mishra (60 papers)
  4. Huaixiu Steven Zheng (11 papers)
  5. Adams Wei Yu (23 papers)
  6. Xinying Song (15 papers)
  7. Denny Zhou (65 papers)
Citations (285)