Evaluating Graph Reasoning Abilities of LLMs
The paper "Can LLMs Solve Graph Problems in Natural Language?" presents a comprehensive analysis of LLMs on graph-based reasoning tasks expressed in natural language. The central aim of the work is to assess whether LLMs can parse textual descriptions of graphs, build internal representations of their structure, and execute structured operations on them. To facilitate this evaluation, the authors introduce "NLGraph," a benchmark consisting of a series of graph reasoning tasks of varying complexity, all posed in natural language.
NLGraph encompasses 29,370 problems across eight algorithmic tasks: connectivity, cycle detection, topological sort, shortest path, maximum flow, bipartite matching, Hamiltonian path, and simulation of graph neural networks. These tasks are designed to rigorously evaluate the ability of LLMs to perform structured reasoning grounded in graph algorithms. The benchmark stratifies problems into easy, medium, and hard difficulty levels to enable a fine-grained analysis of model capabilities across complexities. In addition, evaluation goes beyond exact-match accuracy: partial credit is awarded to capture gradations in solution quality that a binary success metric would miss.
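To make the task setup concrete, the connectivity task asks whether two named nodes are linked in a graph described as an edge list in text. A classical solver for such an instance is a breadth-first search; the sketch below illustrates the underlying problem (the exact textual wording NLGraph uses is an assumption here, not quoted from the benchmark):

```python
from collections import deque

def is_connected(n, edges, src, dst):
    """Decide whether src and dst are connected in an undirected
    graph on nodes 0..n-1 given as an edge list, via BFS.
    Mirrors the decision posed by NLGraph's connectivity task."""
    adj = {i: [] for i in range(n)}
    for u, v in edges:
        adj[u].append(v)
        adj[v].append(u)
    seen = {src}
    queue = deque([src])
    while queue:
        node = queue.popleft()
        if node == dst:
            return True
        for nxt in adj[node]:
            if nxt not in seen:
                seen.add(nxt)
                queue.append(nxt)
    return False

# "Edges: (0,1), (1,2), (3,4). Is node 0 connected to node 2?"
print(is_connected(5, [(0, 1), (1, 2), (3, 4)], 0, 2))  # True
print(is_connected(5, [(0, 1), (1, 2), (3, 4)], 0, 3))  # False
```

An LLM answering the same question must implicitly perform this kind of traversal from the prose description alone, which is what the benchmark probes.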
Empirical evaluations of prominent LLMs, particularly GPT-3/4 models, are undertaken using various prompting methods, including zero-shot, few-shot, chain-of-thought (CoT), and self-consistency (SC) approaches. The experimental results reveal several insights:
- Preliminary Graph Reasoning Abilities: LLMs exhibit a reasonable degree of success on simpler graph reasoning tasks, performing 37.33% to 57.82% above the random baseline.
- Effectiveness of Prompting Techniques: While prompting strategies such as CoT and self-consistency provide gains on simpler tasks, their advantages diminish on more complex graph reasoning challenges, suggesting that LLMs struggle to generate sound intermediate steps and to learn effectively from exemplars in these settings.
- Brittleness to Spurious Correlations: The models perform markedly worse on deliberately structured special cases within the connectivity task, indicating a reliance on spurious correlations rather than robust logical inference.
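Of the prompting strategies above, self-consistency is the most mechanical: sample several chain-of-thought completions and majority-vote over their final answers. A minimal sketch, where `query_llm` is a hypothetical callable standing in for one sampled model completion:

```python
from collections import Counter

def self_consistency(query_llm, prompt, n_samples=5):
    """Self-consistency prompting: draw several sampled answers
    for the same prompt and return the majority vote.
    `query_llm` is a hypothetical stub for an LLM call that
    returns the final answer string of one sampled completion."""
    answers = [query_llm(prompt) for _ in range(n_samples)]
    winner, _ = Counter(answers).most_common(1)[0]
    return winner

# Deterministic stub simulating five sampled answers (illustrative only).
samples = iter(["yes", "no", "yes", "yes", "no"])
fake_llm = lambda prompt: next(samples)
print(self_consistency(fake_llm, "Is node 0 connected to node 2?"))  # "yes" (3 of 5 votes)
```

The paper's finding is that this voting helps on easy instances but adds little once the underlying reasoning chains themselves become unreliable.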
To address these limitations and enhance the reasoning capability of LLMs on graph-related tasks, the authors propose two new instruction-based prompting methodologies: Build-a-Graph Prompting and Algorithmic Prompting. These approaches guide LLMs to construct an explicit representation of the graph structure before solving the problem, and to recall the relevant algorithmic procedure before answering. Experiments show that these methods yield modest gains, with improvements ranging from 3.07% to 16.85% on the easier graph tasks.
The findings underscore both the potential and the limitations of current LLMs on graph-based reasoning problems expressed in natural language. The authors concede that the most complex graph reasoning tasks remain unsolved, inviting further research into solutions beyond prompting adjustments. The NLGraph benchmark and its evaluation metrics provide a valuable resource for the ongoing exploration and enhancement of LLMs' capabilities in structured, algorithmic reasoning over graphs. The benchmark serves as a critical tool for researchers probing the boundaries of AI reasoning, particularly in translating natural language descriptions into formal graph operations.
Overall, the paper furnishes foundational insights and resources for advancing the application of LLMs to more structured, algorithmically intensive tasks, advocating improvements both in prompting protocols and in the underlying reasoning frameworks. Such work is instrumental for the future development of AI systems capable of more nuanced, reliable, and robust decision-making on tasks that inherently involve complex logical structure.