Evaluating LLMs on Counterfactual Analogical Reasoning Tasks
Introduction to the Study
The scalability and generalization capabilities of LLMs such as GPT have drawn both admiration and scrutiny within the AI research community. This paper asks how general LLMs' analogical reasoning really is by comparing their performance with that of humans on standard and counterfactual tasks. Specifically, it evaluates several GPT models on letter-string analogy problems posed both in the familiar alphabet and in unfamiliar (counterfactual) alphabets.
Methodology and Experiment Design
The researchers tested human participants and three of OpenAI's GPT models (GPT-3, GPT-3.5, and GPT-4) on a series of letter-string analogy problems. The original problems replicated those used in previous studies, while the counterfactual variants posed the same kinds of analogies over permuted alphabets and alphabets containing non-letter symbols. The goal was to determine whether LLMs can extend their analogical reasoning beyond patterns likely absorbed during training to novel, unseen formats.
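To make the counterfactual setup concrete, below is a minimal Python sketch of how such problems can be generated. The helper names (make_permuted_alphabet, successor_analogy) and the specific "successor" transformation rule are illustrative assumptions, not the authors' code or exact problem set.

```python
# Minimal sketch of generating a counterfactual letter-string analogy over a
# permuted alphabet. Helper names and the "successor" rule are illustrative
# assumptions, not the authors' code.
import random
import string

def make_permuted_alphabet(seed: int = 0) -> list[str]:
    """Return a randomly shuffled copy of the standard 26-letter alphabet."""
    letters = list(string.ascii_lowercase)
    random.Random(seed).shuffle(letters)
    return letters

def successor_analogy(alphabet: list[str], src_start: int = 0,
                      tgt_start: int = 8, length: int = 3):
    """Build an analogy whose rule is "replace the last letter with its
    successor" under the given (possibly permuted) alphabet ordering."""
    src = alphabet[src_start:src_start + length]
    src_changed = src[:-1] + [alphabet[src_start + length]]
    tgt = alphabet[tgt_start:tgt_start + length]
    expected = tgt[:-1] + [alphabet[tgt_start + length]]
    prompt = (f"If {' '.join(src)} changes to {' '.join(src_changed)}, "
              f"what does {' '.join(tgt)} change to?")
    return prompt, " ".join(expected)

if __name__ == "__main__":
    alphabet = make_permuted_alphabet(seed=42)
    question, answer = successor_analogy(alphabet)
    print(question)
    print("Expected answer:", answer)
```

The same generator works for the standard alphabet by skipping the shuffle, which is what makes the standard and counterfactual conditions directly comparable.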
Human Participants
The paper recruited 136 participants with diverse linguistic backgrounds. Each participant solved a selection of analogy problems, some posed in the standard alphabet and others in permuted or symbolic alphabets. These conditions mirrored those given to the LLMs, allowing a direct comparison of human and machine analogical reasoning.
LLMs
Three versions of the Generative Pre-trained Transformer models were evaluated: GPT-3, GPT-3.5, and GPT-4. Each model received the same sets of original and counterfactual analogy problems, with adjustments made to fit the models' input requirements. The models were also given comprehension checks, particularly for the unfamiliar alphabets, to verify that they understood the task setup before their analogy answers were scored.
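As a rough illustration of this evaluation setup, the sketch below poses a comprehension check and a counterfactual analogy to a chat model via the OpenAI Python SDK. The permuted alphabet, prompt wording, model name, and decoding settings are assumptions rather than the paper's exact protocol.

```python
# Rough sketch of querying a chat model on a counterfactual analogy, preceded
# by a comprehension check. Prompt wording, alphabet, model name, and decoding
# settings are assumptions, not the paper's exact protocol.
from openai import OpenAI

client = OpenAI()  # expects OPENAI_API_KEY in the environment

# A hypothetical permuted ordering standing in for the standard alphabet.
PERMUTED_ALPHABET = "d w k r t b s f m c z h q g x l n p y o a v u e j i"

COMPREHENSION_CHECK = (
    f"We are using a fictional alphabet with this ordering:\n{PERMUTED_ALPHABET}\n"
    "In this alphabet, which letter comes immediately after k?"
)

ANALOGY_PROBLEM = (
    f"Use the fictional alphabet ordering:\n{PERMUTED_ALPHABET}\n"
    "If d w k changes to d w r, what does f m c change to?"
)

def ask(prompt: str, model: str = "gpt-4") -> str:
    """Send a single zero-shot prompt and return the model's text reply."""
    response = client.chat.completions.create(
        model=model,
        messages=[{"role": "user", "content": prompt}],
        temperature=0,  # reduce sampling variance for evaluation
    )
    return response.choices[0].message.content

if __name__ == "__main__":
    print("Comprehension check:", ask(COMPREHENSION_CHECK))
    print("Analogy answer:", ask(ANALOGY_PROBLEM))
```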
Results
The results reveal a clear gap between human and LLM performance. Humans maintained consistent accuracy across both standard and counterfactual tasks, adapting to unfamiliar alphabets and symbols. In contrast, the LLMs showed a marked drop in performance on the counterfactual tasks, struggling with sequences that deviated from the patterns dominant in their training data.
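For readers reproducing this kind of comparison, a tabulation like the following is one way to summarize accuracy by solver and condition; the rows here are hypothetical placeholders, not the paper's data.

```python
# One way to tabulate accuracy by solver and task condition; the rows below
# are hypothetical placeholders, not the paper's data.
import pandas as pd

results = pd.DataFrame([
    {"solver": "human", "condition": "standard",       "correct": True},
    {"solver": "human", "condition": "counterfactual", "correct": True},
    {"solver": "gpt-4", "condition": "standard",       "correct": True},
    {"solver": "gpt-4", "condition": "counterfactual", "correct": False},
    # ... in a real analysis, one row per problem attempt
])

# Mean accuracy for each solver within each condition, as a solver-by-condition table.
accuracy = results.groupby(["solver", "condition"])["correct"].mean().unstack("condition")
print(accuracy)
```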
Discussion
The paper categorizes the errors made by both humans and LLMs and analyzes what those mistakes imply about each group's analogical reasoning. The pattern is that humans apply a range of strategies and reason flexibly, while the LLMs tend to falter in contexts that likely fall outside the scope of their training data.
Conclusions and Future Work
The analysis underscores the limitations of current LLMs in adapting their reasoning to novel contexts. Despite remarkable progress, GPT-class models still fall short of human-like analogical reasoning on tasks that require generalization beyond familiar patterns. The paper calls for future research into how humans and machines form their responses, which could open new avenues for improving the reasoning abilities of LLMs.
The implications span both theory and practice, informing the development of LLMs capable of more robust abstract reasoning. The counterfactual tasks also provide a benchmark for assessing the generality of analogical reasoning in machine intelligence, motivating continued exploration within the AI community.