Evaluating Graph Reasoning Abilities of LLMs
The paper "Can LLMs Solve Graph Problems in Natural Language?" presents a comprehensive analysis of LLMs on graph-based reasoning tasks expressed in natural language. The central aim of the work is to assess whether LLMs can parse textual descriptions of graphs, build internal representations of their structure, and execute structured operations on them. To facilitate this evaluation, the authors introduce "NLGraph," a benchmark consisting of a series of graph reasoning tasks of varying complexity, all posed in natural language.
NLGraph encompasses 29,370 problems across eight algorithmic tasks: connectivity, cycle detection, topological sort, shortest path, maximum flow, bipartite matching, Hamiltonian path, and simulation of graph neural networks. These tasks are designed to rigorously evaluate the ability of LLMs to perform structured reasoning grounded in graph algorithms. The benchmark stratifies problems into easy, medium, and hard difficulty levels to enable a fine-grained analysis of model capabilities across complexities. In addition, evaluation goes beyond exact-match accuracy: partial credit is awarded to capture gradations in solution quality that a binary success metric would miss.
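To make the task setup concrete, the connectivity task asks whether two named nodes are linked in a graph described as an edge list in text. A classical solver for such an instance is a breadth-first search; the sketch below illustrates the underlying problem (the exact textual wording NLGraph uses is an assumption here, not quoted from the benchmark):

```python
from collections import deque

def is_connected(n, edges, src, dst):
    """Decide whether src and dst are connected in an undirected
    graph on nodes 0..n-1 given as an edge list, via BFS.
    Mirrors the decision posed by NLGraph's connectivity task."""
    adj = {i: [] for i in range(n)}
    for u, v in edges:
        adj[u].append(v)
        adj[v].append(u)
    seen = {src}
    queue = deque([src])
    while queue:
        node = queue.popleft()
        if node == dst:
            return True
        for nxt in adj[node]:
            if nxt not in seen:
                seen.add(nxt)
                queue.append(nxt)
    return False

# "Edges: (0,1), (1,2), (3,4). Is node 0 connected to node 2?"
print(is_connected(5, [(0, 1), (1, 2), (3, 4)], 0, 2))  # True
print(is_connected(5, [(0, 1), (1, 2), (3, 4)], 0, 3))  # False
```

An LLM answering the same question must implicitly perform this kind of traversal from the prose description alone, which is what the benchmark probes.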
Empirical evaluations of prominent LLMs, particularly GPT-3/4 models, are undertaken using various prompting methods, including zero-shot, few-shot, chain-of-thought (CoT), and self-consistency (SC) approaches. The experimental results reveal several insights:
- Preliminary Graph Reasoning Abilities: LLMs exhibit a reasonable degree of success on simpler graph reasoning tasks, performing 37.33% to 57.82% above the random baseline.
- Effectiveness of Prompting Techniques: While prompting strategies such as CoT and self-consistency provide gains on simpler tasks, their advantages diminish on more complex graph reasoning challenges, suggesting that LLMs struggle to generate sound intermediate steps and to learn effectively from exemplars in these settings.
- Brittleness to Spurious Correlations: The models perform markedly worse on deliberately structured special cases within the connectivity task, indicating a reliance on spurious correlations rather than robust logical inference.
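Of the prompting strategies above, self-consistency is the most mechanical: sample several chain-of-thought completions and majority-vote over their final answers. A minimal sketch, where `query_llm` is a hypothetical callable standing in for one sampled model completion:

```python
from collections import Counter

def self_consistency(query_llm, prompt, n_samples=5):
    """Self-consistency prompting: draw several sampled answers
    for the same prompt and return the majority vote.
    `query_llm` is a hypothetical stub for an LLM call that
    returns the final answer string of one sampled completion."""
    answers = [query_llm(prompt) for _ in range(n_samples)]
    winner, _ = Counter(answers).most_common(1)[0]
    return winner

# Deterministic stub simulating five sampled answers (illustrative only).
samples = iter(["yes", "no", "yes", "yes", "no"])
fake_llm = lambda prompt: next(samples)
print(self_consistency(fake_llm, "Is node 0 connected to node 2?"))  # "yes" (3 of 5 votes)
```

The paper's finding is that this voting helps on easy instances but adds little once the underlying reasoning chains themselves become unreliable.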
To address these limitations and enhance the reasoning capability of LLMs on graph-related tasks, the authors propose two new instruction-based prompting methodologies: Build-a-Graph Prompting and Algorithmic Prompting. These approaches guide LLMs to construct an explicit representation of the graph structure before solving the problem, and to recall the relevant algorithmic procedure before answering. Experiments show that these methods yield modest gains, with improvements ranging from 3.07% to 16.85% on the easier graph tasks.
The findings underscore both the potential and the limitations of current LLMs on graph-based reasoning problems expressed in natural language. The authors concede that the most complex graph reasoning tasks remain unsolved, inviting further research into solutions beyond prompting adjustments. The NLGraph benchmark and its evaluation metrics provide a valuable resource for the ongoing exploration and enhancement of LLMs' capabilities in structured, algorithmic reasoning over graphs. The benchmark serves as a critical tool for researchers probing the boundaries of AI reasoning, particularly in translating natural language descriptions into formal graph operations.
Overall, the paper furnishes foundational insights and resources for advancing the application of LLMs to more structured, algorithmically intensive tasks, advocating improvements both in prompting protocols and in the underlying reasoning frameworks. Such work is instrumental for the future development of AI systems capable of more nuanced, reliable, and robust decision-making on tasks that inherently involve complex logical structure.