LLMs cannot find reasoning errors, but can correct them given the error location (2311.08516v3)

Published 14 Nov 2023 in cs.AI, cs.CL, and cs.LG

Abstract: While self-correction has shown promise in improving LLM outputs in terms of style and quality (e.g. Chen et al., 2023b; Madaan et al., 2023), recent attempts to self-correct logical or reasoning errors often cause correct answers to become incorrect, resulting in worse performances overall (Huang et al., 2023). In this paper, we show that poor self-correction performance stems from LLMs' inability to find logical mistakes, rather than their ability to correct a known mistake. Firstly, we benchmark several state-of-the-art LLMs on their mistake-finding ability and demonstrate that they generally struggle with the task, even in highly objective, unambiguous cases. Secondly, we test the correction abilities of LLMs -- separately from mistake finding -- using a backtracking setup that feeds ground truth mistake location information to the model. We show that this boosts downstream task performance across our 5 reasoning tasks, indicating that LLMs' correction abilities are robust. Finally, we show that it is possible to obtain mistake location information without ground truth labels or in-domain training data. We train a small classifier with out-of-domain data, which exhibits stronger mistake-finding performance than prompting a large model. We release our dataset of LLM-generated logical mistakes, BIG-Bench Mistake, to enable further research into locating LLM reasoning mistakes.

Overview of "LLMs cannot find reasoning errors, but can correct them!"

The paper, "LLMs cannot find reasoning errors, but can correct them!" presents a nuanced examination of the self-correction capabilities of LLMs in the context of logical and reasoning tasks. The authors demonstrate that although LLMs exhibit capacities to improve outputs related to style and quality, their proficiency in identifying and correcting logical errors is limited without explicit feedback. The research underscores the dual components of self-correction: mistake finding and output correction, exploring these facets through empirical evaluation and a constructive proposition for future methodology.

Methodology and Data

The authors introduce the BIG-Bench Mistake dataset, designed to evaluate the mistake-finding abilities of LLMs. The dataset comprises 2,186 Chain-of-Thought (CoT) traces across five tasks: word sorting, tracking shuffled objects, logical deduction, multistep arithmetic, and Dyck languages. Each trace is annotated with the location of the first logical mistake, providing an objective benchmark for assessing mistake-finding ability. Benchmarking several state-of-the-art LLMs, including GPT-series models, on this dataset reveals a general struggle to reliably identify mistakes, even in objective, unambiguous cases.
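The dataset's actual schema is not reproduced in this overview; the sketch below only illustrates how such annotated traces could be represented and loaded in Python, assuming a JSON Lines file and hypothetical field names (task, question, steps, mistake_index) rather than the released format.

```python
import json
from dataclasses import dataclass
from typing import List, Optional


@dataclass
class CoTTrace:
    """One CoT trace plus the annotated location of its first logical mistake."""
    task: str                     # e.g. "word_sorting" or "dyck_languages" (illustrative names)
    question: str                 # the original task prompt
    steps: List[str]              # generated reasoning steps, in order
    mistake_index: Optional[int]  # 0-based index of the first bad step; None if the trace is correct


def load_traces(path: str) -> List[CoTTrace]:
    """Read traces from a JSON Lines file, one record per line (hypothetical layout)."""
    traces = []
    with open(path, encoding="utf-8") as f:
        for line in f:
            record = json.loads(line)
            traces.append(CoTTrace(
                task=record["task"],
                question=record["question"],
                steps=record["steps"],
                mistake_index=record.get("mistake_index"),
            ))
    return traces
```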

Numerical Findings

The paper emphasizes that, despite LLMs' high proficiency in text generation, the models show limited ability to find mistakes. Human annotators reach high inter-rater agreement on mistake locations (measured with Krippendorff's alpha), whereas the accuracy of models like GPT-4 does not exceed 52.87% when directly tasked with locating logical errors, indicating a tangible gap relative to human performance.
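Concretely, the headline number is an exact-match score over traces: a prediction counts as correct only if it names the same first-mistake step as the annotation, or correctly declares the trace mistake-free. The snippet below is a minimal sketch of that scoring rule, assuming locations are represented as step indices with None standing for "no mistake"; it is not the authors' evaluation code.

```python
from typing import Optional, Sequence


def mistake_location_accuracy(predicted: Sequence[Optional[int]],
                              gold: Sequence[Optional[int]]) -> float:
    """Exact-match accuracy over traces: the predicted index of the first
    mistake must equal the annotated index, and None ('no mistake') only
    counts as correct on traces annotated as correct."""
    assert len(predicted) == len(gold)
    hits = sum(p == g for p, g in zip(predicted, gold))
    return hits / len(gold)


# Toy example: 3 of 4 traces scored correctly -> 0.75
print(mistake_location_accuracy([2, None, 0, 1], [2, None, 0, 3]))
```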

Proposed Solution: Backtracking

To remedy the identified deficiency in mistake finding, the paper proposes a backtracking approach that exploits mistake-location information to correct outputs. Adapting the "verbal reinforcement learning" framing, backtracking uses a lightweight reward model to flag the first erroneous step and re-generates the trace from that point, without modifying the generating LLM's weights. The method yields gains across the benchmarked tasks even when the reward model operates at only 60-70% accuracy, an advance over prior methods that rely heavily on oracle feedback.
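The paper's exact implementation is not reproduced here, but the core of backtracking can be sketched as a single resampling step: keep everything before the flagged step, then let the model continue from there at a higher temperature. In the sketch below, locate_first_mistake and generate_steps are hypothetical placeholders for the mistake-locating reward model and the frozen generator LLM.

```python
from typing import Callable, List, Optional

# Two components the method combines (both placeholders here):
# a mistake locator (reward model) and the frozen generator LLM.
LocateFn = Callable[[str, List[str]], Optional[int]]
GenerateFn = Callable[[str, List[str], float], List[str]]


def backtrack(question: str,
              steps: List[str],
              locate_first_mistake: LocateFn,
              generate_steps: GenerateFn,
              temperature: float = 1.0) -> List[str]:
    """Re-generate a CoT trace from the first step flagged as a mistake.

    The generator's weights are never touched: correction happens purely by
    keeping the prefix before the flagged step and re-sampling a continuation
    at a non-zero temperature, so the new attempt can differ from the
    original greedy decoding.
    """
    mistake = locate_first_mistake(question, steps)
    if mistake is None:
        return steps              # nothing flagged; keep the original trace
    prefix = steps[:mistake]      # keep every step before the flagged one
    continuation = generate_steps(question, prefix, temperature)
    return prefix + continuation
```

In this simple form, the cost is at most one extra generation per flagged trace, which is what keeps the approach lightweight; the same procedure could in principle be applied again if the regenerated trace is flagged once more.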

Implications and Future Directions

The findings carry both practical and theoretical implications. Practically, the proposed method offers a scalable, less resource-intensive way to improve LLM outputs in settings that lack external feedback mechanisms. Theoretically, the research highlights an area where LLMs remain below human-level performance, namely the detection of logical mistakes, and thereby marks a critical path for future model improvement.

The paper invites the research community to pursue enhanced methods in mistake finding and suggests potential cross-model evaluations and further refinement of backtracking using learned reward models. Moreover, the exploration of more realistic tasks and more comprehensive datasets could provide broader insights into LLMs’ reasoning capabilities and the generalizability of self-correction methods.

In conclusion, while LLMs have not yet mastered self-correction of reasoning errors, the paper lays out a practical route to substantial improvements through backtracking guided by learned mistake-location models. The insights from this work invite further investigation and position BIG-Bench Mistake as a useful resource for studying and improving LLM reasoning.

References (19)
  1. PaLM 2 technical report. arXiv preprint arXiv:2305.10403.
  2. Iterative translation refinement with large language models. arXiv preprint arXiv:2306.03856.
  3. Andrew F Hayes and Klaus Krippendorff. 2007. Answering the call for a standard reliability measure for coding data. Communication methods and measures, 1(1):77–89.
  4. Large language models can self-improve. arXiv preprint arXiv:2210.11610.
  5. Large language models cannot self-correct reasoning yet. arXiv preprint arXiv:2310.01798.
  6. Language models can solve computer tasks. arXiv preprint arXiv:2303.17491.
  7. Let’s verify step by step. arXiv preprint arXiv:2305.20050.
  8. Self-refine: Iterative refinement with self-feedback. arXiv preprint arXiv:2303.17651.
  9. SelfCheck: Using LLMs to zero-shot check their own step-by-step reasoning. arXiv preprint arXiv:2308.00436.
  10. Training language models to follow instructions with human feedback. Advances in Neural Information Processing Systems, 35:27730–27744.
  11. Automatically correcting large language models: Surveying the landscape of diverse self-correction strategies. arXiv preprint arXiv:2308.03188.
  12. Self-critiquing models for assisting human evaluators. arXiv preprint arXiv:2206.05802.
  13. Reflexion: Language agents with verbal reinforcement learning.
  14. Beyond the imitation game: Quantifying and extrapolating the capabilities of language models. arXiv preprint arXiv:2206.04615.
  15. Beyond the imitation game: Quantifying and extrapolating the capabilities of language models. Transactions on Machine Learning Research.
  16. BIG-Bench-Hard/cot-prompts. https://github.com/suzgunmirac/BIG-Bench-Hard/tree/main/cot-prompts. Accessed: 2023-10-31.
  17. Self-consistency improves chain of thought reasoning in language models. In ICLR 2023.
  18. Chain-of-thought prompting elicits reasoning in large language models. In Advances in Neural Information Processing Systems, volume 35, pages 24824–24837. Curran Associates, Inc.
  19. ReAct: Synergizing reasoning and acting in language models. arXiv preprint arXiv:2210.03629.
Authors (5)
  1. Gladys Tyen
  2. Hassan Mansoor
  3. Peter Chen
  4. Tony Mak
  5. Victor Cărbune
Citations (48)