LLMs cannot find reasoning errors, but can correct them given the error location (2311.08516v3)

Published 14 Nov 2023 in cs.AI, cs.CL, and cs.LG

Abstract: While self-correction has shown promise in improving LLM outputs in terms of style and quality (e.g. Chen et al., 2023b; Madaan et al., 2023), recent attempts to self-correct logical or reasoning errors often cause correct answers to become incorrect, resulting in worse performances overall (Huang et al., 2023). In this paper, we show that poor self-correction performance stems from LLMs' inability to find logical mistakes, rather than their ability to correct a known mistake. Firstly, we benchmark several state-of-the-art LLMs on their mistake-finding ability and demonstrate that they generally struggle with the task, even in highly objective, unambiguous cases. Secondly, we test the correction abilities of LLMs -- separately from mistake finding -- using a backtracking setup that feeds ground truth mistake location information to the model. We show that this boosts downstream task performance across our 5 reasoning tasks, indicating that LLMs' correction abilities are robust. Finally, we show that it is possible to obtain mistake location information without ground truth labels or in-domain training data. We train a small classifier with out-of-domain data, which exhibits stronger mistake-finding performance than prompting a large model. We release our dataset of LLM-generated logical mistakes, BIG-Bench Mistake, to enable further research into locating LLM reasoning mistakes.

Overview of "LLMs cannot find reasoning errors, but can correct them!"

The paper, "LLMs cannot find reasoning errors, but can correct them!" presents a nuanced examination of the self-correction capabilities of LLMs in the context of logical and reasoning tasks. The authors demonstrate that although LLMs exhibit capacities to improve outputs related to style and quality, their proficiency in identifying and correcting logical errors is limited without explicit feedback. The research underscores the dual components of self-correction: mistake finding and output correction, exploring these facets through empirical evaluation and a constructive proposition for future methodology.

Methodology and Data

The authors introduce the BIG-Bench Mistake dataset, designed to evaluate the mistake-finding abilities of LLMs. The dataset comprises 2,186 Chain-of-Thought (CoT) traces across five tasks: word sorting, tracking shuffled objects, logical deduction, multistep arithmetic, and Dyck languages. Each trace is annotated with the location of the first logical mistake, providing an objective benchmark for assessing mistake-finding ability. Benchmarking several state-of-the-art LLMs, including GPT-series models, on this dataset reveals a general struggle to reliably identify mistakes, even in objective, unambiguous cases.
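The dataset's actual schema is not reproduced in this overview; the sketch below only illustrates how such annotated traces could be represented and loaded in Python, assuming a JSON Lines file and hypothetical field names (task, question, steps, mistake_index) rather than the released format.

```python
import json
from dataclasses import dataclass
from typing import List, Optional


@dataclass
class CoTTrace:
    """One CoT trace plus the annotated location of its first logical mistake."""
    task: str                     # e.g. "word_sorting" or "dyck_languages" (illustrative names)
    question: str                 # the original task prompt
    steps: List[str]              # generated reasoning steps, in order
    mistake_index: Optional[int]  # 0-based index of the first bad step; None if the trace is correct


def load_traces(path: str) -> List[CoTTrace]:
    """Read traces from a JSON Lines file, one record per line (hypothetical layout)."""
    traces = []
    with open(path, encoding="utf-8") as f:
        for line in f:
            record = json.loads(line)
            traces.append(CoTTrace(
                task=record["task"],
                question=record["question"],
                steps=record["steps"],
                mistake_index=record.get("mistake_index"),
            ))
    return traces
```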

Numerical Findings

The paper emphasizes that, despite LLMs' high proficiency in text generation, the models show limited ability to find mistakes. Human annotators reach high inter-rater agreement on mistake locations (measured with Krippendorff's alpha), whereas the accuracy of models like GPT-4 does not exceed 52.87% when directly tasked with locating logical errors, indicating a tangible gap relative to human performance.
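Concretely, the headline number is an exact-match score over traces: a prediction counts as correct only if it names the same first-mistake step as the annotation, or correctly declares the trace mistake-free. The snippet below is a minimal sketch of that scoring rule, assuming locations are represented as step indices with None standing for "no mistake"; it is not the authors' evaluation code.

```python
from typing import Optional, Sequence


def mistake_location_accuracy(predicted: Sequence[Optional[int]],
                              gold: Sequence[Optional[int]]) -> float:
    """Exact-match accuracy over traces: the predicted index of the first
    mistake must equal the annotated index, and None ('no mistake') only
    counts as correct on traces annotated as correct."""
    assert len(predicted) == len(gold)
    hits = sum(p == g for p, g in zip(predicted, gold))
    return hits / len(gold)


# Toy example: 3 of 4 traces scored correctly -> 0.75
print(mistake_location_accuracy([2, None, 0, 1], [2, None, 0, 3]))
```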

Proposed Solution: Backtracking

To remedy the identified deficiency in mistake finding, the paper proposes a backtracking approach that exploits mistake-location information to correct outputs. Adapting the "verbal reinforcement learning" framing, backtracking uses a lightweight reward model to flag the first erroneous step and re-generates the trace from that point, without modifying the generating LLM's weights. The method yields gains across the benchmarked tasks even when the reward model operates at only 60-70% accuracy, an advance over prior methods that rely heavily on oracle feedback.
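The paper's exact implementation is not reproduced here, but the core of backtracking can be sketched as a single resampling step: keep everything before the flagged step, then let the model continue from there at a higher temperature. In the sketch below, locate_first_mistake and generate_steps are hypothetical placeholders for the mistake-locating reward model and the frozen generator LLM.

```python
from typing import Callable, List, Optional

# Two components the method combines (both placeholders here):
# a mistake locator (reward model) and the frozen generator LLM.
LocateFn = Callable[[str, List[str]], Optional[int]]
GenerateFn = Callable[[str, List[str], float], List[str]]


def backtrack(question: str,
              steps: List[str],
              locate_first_mistake: LocateFn,
              generate_steps: GenerateFn,
              temperature: float = 1.0) -> List[str]:
    """Re-generate a CoT trace from the first step flagged as a mistake.

    The generator's weights are never touched: correction happens purely by
    keeping the prefix before the flagged step and re-sampling a continuation
    at a non-zero temperature, so the new attempt can differ from the
    original greedy decoding.
    """
    mistake = locate_first_mistake(question, steps)
    if mistake is None:
        return steps              # nothing flagged; keep the original trace
    prefix = steps[:mistake]      # keep every step before the flagged one
    continuation = generate_steps(question, prefix, temperature)
    return prefix + continuation
```

In this simple form, the cost is at most one extra generation per flagged trace, which is what keeps the approach lightweight; the same procedure could in principle be applied again if the regenerated trace is flagged once more.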

Implications and Future Directions

The findings carry both practical and theoretical implications. Practically, the proposed method offers a scalable, less resource-intensive way to improve LLM outputs in settings that lack external feedback mechanisms. Theoretically, the research highlights an area where LLMs remain below human-level performance, namely the detection of logical mistakes, and thereby marks a critical path for future model improvement.

The paper invites the research community to pursue enhanced methods in mistake finding and suggests potential cross-model evaluations and further refinement of backtracking using learned reward models. Moreover, the exploration of more realistic tasks and more comprehensive datasets could provide broader insights into LLMs’ reasoning capabilities and the generalizability of self-correction methods.

In conclusion, while LLMs have not yet mastered self-correction of reasoning errors, the paper lays out a practical route to substantial improvements through backtracking guided by learned mistake-location models. The insights from this work invite further investigation and position BIG-Bench Mistake as a useful resource for studying and improving LLM reasoning.

References (19)
  1. PaLM 2 technical report. arXiv preprint arXiv:2305.10403.
  2. Iterative translation refinement with large language models. arXiv preprint arXiv:2306.03856.
  3. Andrew F Hayes and Klaus Krippendorff. 2007. Answering the call for a standard reliability measure for coding data. Communication methods and measures, 1(1):77–89.
  4. Large language models can self-improve. arXiv preprint arXiv:2210.11610.
  5. Large language models cannot self-correct reasoning yet. arXiv preprint arXiv:2310.01798.
  6. Language models can solve computer tasks. arXiv preprint arXiv:2303.17491.
  7. Let’s verify step by step. arXiv preprint arXiv:2305.20050.
  8. Self-refine: Iterative refinement with self-feedback. arXiv preprint arXiv:2303.17651.
  9. SelfCheck: Using LLMs to zero-shot check their own step-by-step reasoning. arXiv preprint arXiv:2308.00436.
  10. Training language models to follow instructions with human feedback. Advances in Neural Information Processing Systems, 35:27730–27744.
  11. Automatically correcting large language models: Surveying the landscape of diverse self-correction strategies. arXiv preprint arXiv:2308.03188.
  12. Self-critiquing models for assisting human evaluators. arXiv preprint arXiv:2206.05802.
  13. Reflexion: Language agents with verbal reinforcement learning.
  14. Beyond the imitation game: Quantifying and extrapolating the capabilities of language models. arXiv preprint arXiv:2206.04615.
  15. Beyond the imitation game: Quantifying and extrapolating the capabilities of language models. Transactions on Machine Learning Research.
  16. BIG-Bench-Hard/cot-prompts. https://github.com/suzgunmirac/BIG-Bench-Hard/tree/main/cot-prompts. Accessed: 2023-10-31.
  17. Self-consistency improves chain of thought reasoning in language models. In ICLR 2023.
  18. Chain-of-thought prompting elicits reasoning in large language models. In Advances in Neural Information Processing Systems, volume 35, pages 24824–24837. Curran Associates, Inc.
  19. ReAct: Synergizing reasoning and acting in language models. arXiv preprint arXiv:2210.03629.
Authors (5)
  1. Gladys Tyen
  2. Hassan Mansoor
  3. Peter Chen
  4. Tony Mak
  5. Victor Cărbune
Citations (48)