On the Self-Verification Limitations of Large Language Models on Reasoning and Planning Tasks
The paper presents a systematic empirical analysis of self-verification in LLMs on reasoning and planning tasks. Although LLMs have demonstrated proficiency in natural language generation, their reasoning capabilities, particularly on mathematical, logical, and planning problems, remain a subject of debate. The authors address this debate by empirically evaluating the effectiveness of self-critiquing strategies with GPT-4 across three domains: Game of 24, Graph Coloring, and STRIPS planning.
Methodology and Findings
The paper critically examines the widespread assumption that LLMs can enhance their outputs through iterative self-verification, along with the complexity-theoretic intuition that verifying a solution should be easier than generating one. In a structured empirical framework, each domain is evaluated both with LLM self-critique and with an external, sound verifier supplying the feedback, and the two setups show markedly different performance.
- Game of 24: Given four numbers, the task is to combine each of them exactly once, using the four basic arithmetic operations, into an expression that evaluates to 24 (for instance, 4, 7, 8, 8 admits (7 - 8 / 8) * 4). Despite the task's simplicity, self-critique failed to improve on the model's initial attempts, largely because the model's feedback was hallucinated and did not meaningfully guide solution refinement.
- Graph Coloring: In this classic combinatorial problem, colors must be assigned to a graph's vertices so that no two adjacent vertices share a color. Here the self-verification step exhibited a high false negative rate: the model frequently declared valid colorings invalid, so self-critiquing was ineffective at recognizing correct solutions.
- STRIPS Planning: The task is to find a sequence of actions that transforms an initial state into one satisfying the goal. In this domain, switching to self-verification caused performance to collapse, with the model frequently misjudging whether candidate plans were executable and goal-reaching. (A sketch of what a sound external check looks like in each of these domains follows this list.)
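To make the contrast concrete, the sketch below shows what a sound, external verifier looks like in each of the three domains. This is an illustrative reconstruction in Python, not the authors' code: the function names, input encodings, and the representation of ground STRIPS actions as (precondition, add, delete) fact sets are our own assumptions.

```python
# Minimal, sound verifiers for the three domains (illustrative sketch).
import re


def verify_game_of_24(expression: str, numbers: list) -> bool:
    """Accept iff `expression` uses each given number exactly once,
    contains only +, -, *, / and parentheses, and evaluates to 24."""
    if not re.fullmatch(r"[\d+\-*/() ]+", expression):
        return False  # reject any character outside the allowed alphabet
    if sorted(int(tok) for tok in re.findall(r"\d+", expression)) != sorted(numbers):
        return False  # wrong multiset of numbers
    try:
        value = eval(expression)  # safe here: input was filtered above
    except (SyntaxError, ZeroDivisionError, TypeError):
        return False
    return abs(value - 24) < 1e-6


def verify_coloring(edges: list, coloring: dict, k: int) -> bool:
    """Accept iff every vertex gets one of k colors and no edge joins
    two vertices of the same color -- a proper k-coloring."""
    if any(c not in range(k) for c in coloring.values()):
        return False
    return all(coloring[u] != coloring[v] for u, v in edges)


def verify_strips_plan(init: set, goal: set, plan: list, actions: dict) -> bool:
    """Simulate `plan` from `init`, where actions[name] is a triple of
    fact sets (preconditions, add effects, delete effects). Accept iff
    the plan is executable and its final state satisfies `goal`."""
    state = set(init)
    for name in plan:
        pre, add, delete = actions[name]
        if not pre <= state:            # a precondition is unmet
            return False
        state = (state - delete) | add  # apply the action's effects
    return goal <= state
```

Each check is a few deterministic lines and never hallucinates a constraint, which is exactly the property the paper found lacking in GPT-4's self-critique.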
In contrast to self-critiquing by the LLM, the inclusion of an external, sound verifier significantly enhanced performance across all three tasks. This finding challenges the assumption that LLMs inherently benefit from iterative self-feedback, and it suggests that external, principled critique is crucial in domains that demand precise reasoning.
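Operationally, this comparison amounts to a generate-test loop. The following minimal sketch reflects our own reading rather than the paper's code: `llm_propose` stands in for a GPT-4 call, and `verify` and `critique` are backed by a sound checker such as the ones above.

```python
def solve_with_external_verifier(problem, llm_propose, verify, critique,
                                 max_rounds=10):
    """Generate-test loop: the LLM proposes candidates, a sound external
    verifier accepts or rejects them, and verifier-derived feedback (not
    LLM self-critique) drives the next prompt.

    llm_propose(problem, feedback) -> a candidate solution
    verify(problem, candidate)     -> bool (sound check)
    critique(problem, candidate)   -> str (grounded error message)
    """
    feedback = None
    for _ in range(max_rounds):
        candidate = llm_propose(problem, feedback)
        if verify(problem, candidate):
            return candidate  # certified by the sound verifier
        feedback = critique(problem, candidate)
    return None  # budget exhausted without a verified solution
```

Replacing `verify` and `critique` with further LLM calls turns this into the self-critique condition; the paper's central finding is that doing so erases the gains.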
Implications and Future Directions
The paper posits that the marginal performance gains attributed to self-critique are likely misattributed: what actually helps is giving the LLM multiple opportunities to generate candidate solutions while sound validation is applied externally. This insight points toward hybrid systems in which LLMs collaborate with deterministic verifiers, forming what the authors describe as LLM-Modulo systems.
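As a purely hypothetical illustration of that division of labor, the sketches above compose directly; the lambda standing in for the LLM is a stub, and the instance (4, 7, 8, 8) is our own example.

```python
# Hypothetical wiring of the earlier sketches for Game of 24.
numbers = [4, 7, 8, 8]
solution = solve_with_external_verifier(
    problem=numbers,
    llm_propose=lambda nums, feedback: "(7 - 8 / 8) * 4",  # stub for a GPT-4 call
    verify=lambda nums, expr: verify_game_of_24(expr, nums),
    critique=lambda nums, expr: "The expression does not evaluate to 24.",
)
print(solution)  # "(7 - 8 / 8) * 4", certified by the sound verifier
```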
The implications for future AI research and applications are significant. As models continue to grow in scale and complexity, understanding their limitations and optimizing their integration with other systems will be crucial. Future work could build more robust synergies between LLMs and symbolic or rule-based reasoners, potentially enabling AI to handle more intricate reasoning tasks reliably.
This work contributes to the ongoing debate over the efficacy and future trajectory of LLMs in complex problem solving. By providing an empirical basis for reevaluating self-critiquing methodologies, it encourages the exploration of more reliable architectures that go beyond mere iterative re-prompting, paving the way for dependable AI applications in reasoning-intensive domains.