Comparing Humans, GPT-4, and GPT-4V On Abstraction and Reasoning Tasks
Abstract: We explore the abstract reasoning abilities of text-only and multimodal versions of GPT-4, using the ConceptARC benchmark [10], which is designed to evaluate robust understanding and reasoning with core-knowledge concepts. We extend the work of Moskvichev et al. [10] by evaluating GPT-4 with more detailed, one-shot prompts (rather than the simple, zero-shot prompts used previously) on text versions of ConceptARC tasks, and by evaluating GPT-4V, the multimodal version of GPT-4, with zero- and one-shot prompts on image versions of the simplest tasks. Our experimental results support the conclusion that neither version of GPT-4 has developed robust abstraction abilities at humanlike levels.
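As context for the "text versions of ConceptARC tasks" mentioned above: ConceptARC tasks are grids of colored cells given as a few input-output demonstration pairs plus a test input. The sketch below shows one plausible way to serialize such a task into a one-shot text prompt. The exact serialization and prompt wording used by the authors are not given here; the digit-per-cell format and the `grid_to_text` / `make_prompt` helpers are illustrative assumptions.

```python
# Illustrative sketch (not the paper's actual prompt format): serialize
# ARC-style grid tasks, where each cell holds a color index 0-9, into a
# plain-text prompt for a text-only language model.

def grid_to_text(grid):
    """Render a 2D grid of color indices as space-separated digits,
    one row per line."""
    return "\n".join(" ".join(str(cell) for cell in row) for row in grid)

def make_prompt(demonstrations, test_input):
    """Build a few-shot text prompt from (input_grid, output_grid)
    demonstration pairs followed by the test input."""
    parts = []
    for i, (inp, out) in enumerate(demonstrations, start=1):
        parts.append(f"Example {i} input:\n{grid_to_text(inp)}")
        parts.append(f"Example {i} output:\n{grid_to_text(out)}")
    parts.append(f"Test input:\n{grid_to_text(test_input)}")
    parts.append("Test output:")
    return "\n\n".join(parts)

# Example: a toy "invert the two colors" task with one demonstration.
prompt = make_prompt(
    demonstrations=[([[0, 1], [1, 0]], [[1, 0], [0, 1]])],
    test_input=[[1, 1], [0, 0]],
)
```

A serialization like this makes the abstraction problem purely textual; the paper's image-based GPT-4V experiments instead present the grids visually.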
- F. Chollet. On the measure of intelligence. arXiv preprint arXiv:1911.01547, 2019.
- F. Chollet. The Abstraction and Reasoning Corpus (ARC). https://github.com/fchollet/ARC, 2023. Accessed 2023-11-09.
- A. de Miquel Bleier. Finishing 2nd in Kaggle’s Abstraction and Reasoning Challenge. https://blog.jovian.com/finishing-2nd-in-kaggles-abstraction-and-reasoning-challenge-24e59c07b50a, 2020. Accessed 2023-11-09.
- Large language models are not strong abstract reasoners. arXiv preprint arXiv:2305.19555, 2023.
- Fast and flexible: Human program induction in abstract reasoning tasks. arXiv preprint arXiv:2103.05823, 2021.
- Kaggle.com. Kaggle Abstraction and Reasoning Challenge. https://www.kaggle.com/c/abstraction-and-reasoning-challenge, 2020. Accessed 2023-11-09.
- S. Kambhampati. Can LLMs really reason and plan? Communications of the ACM, September 12, 2023.
- R. T. McCoy, S. Yao, D. Friedman, M. Hardy, and T. L. Griffiths. Embers of autoregression: Understanding large language models through the problem they are trained to solve. arXiv preprint arXiv:2309.13638, 2023.
- Large language models as general pattern machines. In Seventh Conference on Robot Learning (CoRL 2023), 2023.
- A. Moskvichev, V. V. Odouard, and M. Mitchell. The ConceptARC benchmark: Evaluating understanding and generalization in the ARC domain. Transactions on Machine Learning Research, 2023.
- Impact of pretraining term frequencies on few-shot numerical reasoning. In Findings of the Association for Computational Linguistics: EMNLP 2022, pages 840–854, 2022.
- E. S. Spelke and K. D. Kinzler. Core knowledge. Developmental Science, 10(1):89–96, 2007.
- C. M. Walker and A. Gopnik. Toddlers infer higher-order relational principles in causal learning. Psychological Science, 25(1):161–169, 2014.
- Hypothesis search: Inductive reasoning with language models. arXiv preprint arXiv:2309.05660, 2023.
- T. Webb, K. J. Holyoak, and H. Lu. Emergent analogical reasoning in large language models. Nature Human Behaviour, 7(9):1526–1541, 2023.
- Emergent abilities of large language models. Transactions on Machine Learning Research, 2022.
- J. S. Wind. 1st place solution + code and official documentation. https://www.kaggle.com/competitions/abstraction-and-reasoning-challenge/discussion/154597, 2020. Accessed 2023-11-09.
- Reasoning or reciting? Exploring the capabilities and limitations of language models through counterfactual tasks. arXiv preprint arXiv:2307.02477, 2023.
- LLMs and the Abstraction and Reasoning Corpus: Successes, failures, and the importance of object-based representations. arXiv preprint arXiv:2305.18354, 2023.