
Using Counterfactual Tasks to Evaluate the Generality of Analogical Reasoning in Large Language Models (2402.08955v1)

Published 14 Feb 2024 in cs.AI and cs.CL

Abstract: LLMs have performed well on several reasoning benchmarks, including ones that test analogical reasoning abilities. However, it has been debated whether they are actually performing humanlike abstract reasoning or instead employing less general processes that rely on similarity to what has been seen in their training data. Here we investigate the generality of analogy-making abilities previously claimed for LLMs (Webb, Holyoak, & Lu, 2023). We take one set of analogy problems used to evaluate LLMs and create a set of "counterfactual" variants: versions that test the same abstract reasoning abilities but that are likely dissimilar from any pre-training data. We test humans and three GPT models on both the original and counterfactual problems, and show that, while the performance of humans remains high for all the problems, the GPT models' performance declines sharply on the counterfactual set. This work provides evidence that, despite previously reported successes of LLMs on analogical reasoning, these models lack the robustness and generality of human analogy-making.

Evaluating LLMs on Counterfactual Analogical Reasoning Tasks

Introduction to the Study

The scalability and generalization capabilities of LLMs such as GPT have drawn both admiration and scrutiny within the AI research community. This paper probes the generality of LLMs' analogical reasoning by comparing their performance on standard and counterfactual analogy tasks against that of humans. Specifically, it examines how several GPT models handle letter-string analogy problems in both familiar and unfamiliar (counterfactual) contexts.

Methodology and Experiment Design

The researchers tested human participants and three of OpenAI's GPT models (GPT-3, GPT-3.5, and GPT-4) on a series of letter-string analogy problems. The original set of problems replicated those used in previous studies, while the counterfactual variants used permuted alphabets and non-letter symbols. This design aimed to determine whether LLMs can extend their analogical reasoning beyond patterns likely absorbed during training to novel, unseen formats.
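The paper's actual problem generator and rule set are not reproduced here, but the following Python sketch illustrates the core idea of a counterfactual letter-string analogy: a "successor" transformation defined over a randomly permuted alphabet rather than the standard a–z ordering. The function name, rule choice, and prompt wording are illustrative assumptions, not the paper's materials.

```python
import random
import string

def make_counterfactual_problem(seed: int = 0):
    """Build one letter-string analogy problem over a permuted alphabet.

    Standard-alphabet analogue of the rule: if 'a b c' -> 'a b d',
    then 'i j k' -> 'i j l'. Here, "successor" is defined by the
    permuted ordering, so the same abstract rule must be applied
    to an unfamiliar alphabet.
    """
    rng = random.Random(seed)
    permuted = list(string.ascii_lowercase)
    rng.shuffle(permuted)  # counterfactual ordering of the 26 letters

    def successor(ch: str) -> str:
        idx = permuted.index(ch)
        return permuted[(idx + 1) % len(permuted)]

    # Source pair: a three-letter run and the same run with its last
    # letter replaced by its successor (in the permuted ordering).
    start = rng.randrange(len(permuted) - 3)
    source = permuted[start:start + 3]
    source_after = source[:-1] + [successor(source[-1])]

    # Probe: a different three-letter run; the answer applies the same rule.
    probe_start = rng.randrange(len(permuted) - 3)
    probe = permuted[probe_start:probe_start + 3]
    answer = probe[:-1] + [successor(probe[-1])]

    prompt = (
        f"Consider this alphabet: {' '.join(permuted)}\n"
        f"If {' '.join(source)} changes to {' '.join(source_after)}, "
        f"what does {' '.join(probe)} change to?"
    )
    return prompt, " ".join(answer)

if __name__ == "__main__":
    prompt, answer = make_counterfactual_problem(seed=1)
    print(prompt)
    print("Expected:", answer)
```

Because the transformation is defined purely by position in the permuted ordering, a solver cannot rely on memorized facts about the standard alphabet; it must apply the abstract rule to the ordering given in the prompt.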

Human Participants

The study recruited 136 participants with diverse linguistic backgrounds. Each participant solved a selection of analogy problems, some using the standard alphabet and others using permuted or symbolic sequences. These conditions mirrored those given to the LLMs, allowing a direct comparison of human and machine analogical reasoning.

LLMs

Three versions of the Generative Pre-trained Transformer models were evaluated: GPT-3, GPT-3.5, and GPT-4. Each model received the same original and counterfactual analogy problems, with prompts adjusted to the models' input requirements. Comprehension checks verified that the models understood the task setup, particularly in scenarios involving unfamiliar alphabets.
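As a rough illustration of how such an evaluation can be run programmatically, the sketch below sends a single analogy prompt to a chat model through the OpenAI Python SDK and scores the reply with a lenient string match. The model name, decoding settings, prompt format, and scoring rule are assumptions for illustration; the paper's actual protocol, including its comprehension checks, may differ.

```python
from openai import OpenAI  # assumes the official OpenAI Python SDK is installed

client = OpenAI()  # reads OPENAI_API_KEY from the environment

def query_model(prompt: str, model: str = "gpt-4") -> str:
    """Send one zero-shot analogy prompt and return the raw completion."""
    response = client.chat.completions.create(
        model=model,
        messages=[{"role": "user", "content": prompt}],
        temperature=0,   # deterministic decoding for evaluation
        max_tokens=20,   # the expected answer is a short letter string
    )
    return response.choices[0].message.content.strip()

def is_correct(model_answer: str, expected: str) -> bool:
    """Lenient check: the expected letter string appears in the reply."""
    strip = lambda s: s.replace(" ", "").lower()
    return strip(expected) in strip(model_answer)
```

A lenient substring match is used here because chat models often echo parts of the prompt or add explanatory text around the answer; stricter scoring rules are possible and would likely lower measured accuracy.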

Results

The results reveal clear differences between human and LLM performance. Humans maintained high accuracy across both standard and counterfactual tasks, adapting readily to unfamiliar alphabets and symbols. In contrast, the GPT models showed a marked drop in performance on the counterfactual tasks, struggling with sequences that deviated from patterns likely present in their training data.

Discussion

The paper categorizes the types of errors made by both humans and LLMs, providing insight into the nature of the mistakes and their implications for analogical reasoning. While humans apply a range of strategies and show flexibility in their reasoning, the LLMs tend to falter in contexts that likely lie outside the scope of their training data.

Conclusions and Future Work

This comprehensive analysis underscores the limitations of current LLMs in adapting their reasoning abilities to novel contexts. The findings suggest that despite remarkable advancements, LLMs such as GPT still fall short of human-like analogical reasoning when presented with tasks that require generalization beyond familiar patterns. The paper advocates for future research that explores the mechanisms of response formation in both humans and machines, potentially unlocking new avenues for enhancing the cognitive abilities of LLMs.

The implications of this research span both theoretical and practical realms, offering critical insights for the development of LLMs capable of nuanced reasoning and abstract thought. It sets a foundational benchmark for assessing the generality of analogical reasoning in machine intelligence, motivating ongoing exploration within the AI community.

References (16)
  1. (2023). Leaping across the mental canyon: Higher-order long-distance analogical retrieval. Journal of Cognitive Psychology, 35(8), 856–875.
  2. (2023). Faith and fate: Limits of transformers on compositionality. In Proceedings of the Thirty-seventh Annual Conference on Neural Information Processing Systems (NeurIPS).
  3. (2010). Functional neural correlates of fluid and crystallized analogizing. NeuroImage, 49(4), 3489–3497.
  4. (2023). Response: Emergent analogical reasoning in large language models. arXiv preprint arXiv:2308.16118.
  5. Hofstadter, D. R. (1985). Metamagical Themas: Questing for the Essence of Mind and Pattern (Chap. 24). New York, NY: Basic Books.
  6. (1994). The Copycat project: A model of mental fluidity and analogy-making. In K. J. Holyoak & J. A. Barnden (Eds.), Advances in Connectionist and Neural Computation Theory (Vol. 2, pp. 31–112). Norwood, NJ: Ablex.
  7. (2022). Towards reasoning in large language models: A survey. arXiv preprint arXiv:2212.10403.
  8. Kambhampati, S.  (2023). Can LLMs really reason and plan? Communications of the ACM. (https://cacm.acm.org/blogs/blog-cacm/276268-can-llms-really-reason-and-plan/fulltext)
  9. (2015). Event-related potential responses to letter-string comparison analogies. Experimental Brain Research, 233, 1563–1573.
  10. (2023). Embers of autoregression: Understanding large language models through the problem they are trained to solve. arXiv preprint arXiv:2309.13638.
  11. Mitchell, M. (1993). Analogy-Making As Perception: A Computer Model (Chap. 5). Cambridge, MA: MIT Press.
  12. (2022). Impact of pretraining term frequencies on few-shot numerical reasoning. In Findings of the Association for Computational Linguistics: EMNLP 2022 (pp. 840–854).
  13. (2023). Emergent analogical reasoning in large language models. Nature Human Behaviour, 7(9), 1526–1541.
  14. (2022). Emergent abilities of large language models. Transactions on Machine Learning Research.
  15. (2022). Chain-of-thought prompting elicits reasoning in large language models. Advances in Neural Information Processing Systems, 35, 24824–24837.
  16. (2023). Reasoning or reciting? Exploring the capabilities and limitations of language models through counterfactual tasks. arXiv preprint arXiv:2307.02477.
Authors (2)
  1. Martha Lewis (31 papers)
  2. Melanie Mitchell (28 papers)