
The Cognitive Capabilities of Generative AI: A Comparative Analysis with Human Benchmarks (2410.07391v1)

Published 9 Oct 2024 in cs.AI

Abstract: There is increasing interest in tracking the capabilities of general intelligence foundation models. This study benchmarks leading LLMs and vision LLMs against human performance on the Wechsler Adult Intelligence Scale (WAIS-IV), a comprehensive, population-normed assessment of underlying human cognition and intellectual abilities, with a focus on the domains of Verbal Comprehension (VCI), Working Memory (WMI), and Perceptual Reasoning (PRI). Most models demonstrated exceptional capabilities in the storage, retrieval, and manipulation of tokens such as arbitrary sequences of letters and numbers, with performance on the Working Memory Index (WMI) greater or equal to the 99.5th percentile when compared to human population normative ability. Performance on the Verbal Comprehension Index (VCI) which measures retrieval of acquired information, and linguistic understanding about the meaning of words and their relationships to each other, also demonstrated consistent performance at or above the 98th percentile. Despite these broad strengths, we observed consistently poor performance on the Perceptual Reasoning Index (PRI; range 0.1-10th percentile) from multimodal models indicating profound inability to interpret and reason on visual information. Smaller and older model versions consistently performed worse, indicating that training data, parameter count and advances in tuning are resulting in significant advances in cognitive ability.

Cognitive Capabilities of Generative AI: A Comparative Analysis

The paper "The Cognitive Capabilities of Generative AI: A Comparative Analysis with Human Benchmarks" presents a detailed examination of generative AI models benchmarked against human cognitive assessments using the Wechsler Adult Intelligence Scale-Fourth Edition (WAIS-IV). The study evaluates LLMs and vision-language models (VLMs) across several cognitive domains, focusing on Verbal Comprehension (VCI), Working Memory (WMI), and Perceptual Reasoning (PRI).

Overview of Findings

The research demonstrates that current AI models achieve exceptional performance on specific cognitive tasks relative to human benchmarks, particularly in verbal comprehension and working memory, where they consistently scored at or above the 98th percentile. Perceptual reasoning stands in stark contrast: there, performance lagged significantly, with models scoring between the 0.1st and 10th percentiles.
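WAIS-IV index scores are population-normed to a mean of 100 with a standard deviation of 15, so the percentiles reported above follow directly from the normal CDF. A minimal sketch of that conversion (the function names are illustrative, not from the paper):

```python
from statistics import NormalDist

# WAIS-IV index scores are normed to mean 100, SD 15 in the human population.
WAIS_NORM = NormalDist(mu=100, sigma=15)

def index_to_percentile(index_score: float) -> float:
    """Approximate population percentile for a WAIS-IV index score."""
    return 100 * WAIS_NORM.cdf(index_score)

def percentile_to_index(percentile: float) -> float:
    """Inverse mapping: the index score at a given population percentile."""
    return WAIS_NORM.inv_cdf(percentile / 100)
```

For example, an index score of 130 corresponds to roughly the 97.7th percentile, and the 99.5th-percentile WMI performance cited in the abstract implies an index score of about 139.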

Detailed Results

  1. Working Memory Index (WMI):
    • Most AI models showed remarkable abilities in storing, retrieving, and manipulating tokenized information. This capability translated into performance at or above the 99.5th percentile.
    • A consistent deficit appeared in mathematical reasoning relative to the rote storage and manipulation of numerical data, a discrepancy observed across models and developers.
  2. Verbal Comprehension Index (VCI):
    • Exceptional performance was noted, with many models scoring in the very superior range.
    • A notable strength was observed in the retrieval of stored information, akin to "crystallized knowledge" in human cognitive studies.
    • Relative weaknesses appeared in linguistic reasoning and understanding, with smaller models generally performing worse.
  3. Perceptual Reasoning Index (PRI):
    • Here, the AI models demonstrated significant deficits, indicating difficulty in processing visual stimuli.
    • The best-performing model, Claude 3.5 Sonnet, managed a noteworthy improvement over its predecessor yet remained in the borderline performance range for human benchmarks.
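As a concrete illustration of the token storage-and-retrieval ability the WMI measures, a forward digit-span trial can be generated and scored programmatically. This is a hedged sketch, not the paper's administration protocol; the prompt wording and the exact-match scoring rule are assumptions:

```python
import random

def make_digit_span_trial(length: int, rng: random.Random) -> tuple[str, str]:
    """Build a forward digit-span prompt and its expected answer string."""
    digits = [str(rng.randint(0, 9)) for _ in range(length)]
    prompt = ("Repeat the following digits in the same order, "
              "separated by spaces: " + " ".join(digits))
    return prompt, " ".join(digits)

def score_response(response: str, expected: str) -> bool:
    """A trial passes only if the digits match exactly and in order."""
    return response.strip().split() == expected.split()
```

In a WAIS-style administration, span length increases trial by trial until the subject fails, and the longest passed span contributes to the scaled score.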

Implications and Future Directions

The authors suggest that while generative AI models exhibit human-like proficiency in specific linguistic tasks, their comparative inability in perceptual tasks highlights a substantial area for improvement. This disparity implies that visual processing in AI may require unique architectural adjustments distinct from those managing auditory or linguistic data. The steady progress evidenced in new model iterations, notably in Claude 3.5 Sonnet, suggests a positive trajectory that might continue to bridge these gaps.

Practically, these findings influence numerous applications in fields such as natural language processing, autonomous systems, and human-AI interaction design. Understanding the models’ cognitive strengths and limitations could allow developers to tailor AI applications to exploit areas where they perform best, while concurrently pursuing enhancements in weaker areas.

Conclusion

Overall, the paper provides a comprehensive evaluation of generative AI, uncovering both its robust capabilities and clear limitations. The authors effectively establish a foundation for utilizing human cognitive benchmarks to assess and enhance AI systems. As AI continues to evolve, the insights gained from such analyses will be instrumental in guiding future research and development efforts, steering them towards the attainment of more balanced and versatile cognitive abilities across all domains.

Authors (9)
  1. Isaac R. Galatzer-Levy
  2. David Munday
  3. Jed McGiffin
  4. Xin Liu
  5. Danny Karmon
  6. Ilia Labzovsky
  7. Rivka Moroshko
  8. Amir Zait
  9. Daniel McDuff