Visual Large Language Models Exhibit Human-Level Cognitive Flexibility in the Wisconsin Card Sorting Test (2505.22112v1)

Published 28 May 2025 in cs.AI and q-bio.NC

Abstract: Cognitive flexibility has been extensively studied in human cognition but remains relatively unexplored in the context of Visual LLMs (VLLMs). This study assesses the cognitive flexibility of state-of-the-art VLLMs (GPT-4o, Gemini-1.5 Pro, and Claude-3.5 Sonnet) using the Wisconsin Card Sorting Test (WCST), a classic measure of set-shifting ability. Our results reveal that VLLMs achieve or surpass human-level set-shifting capabilities under chain-of-thought prompting with text-based inputs. However, their abilities are highly influenced by both input modality and prompting strategy. In addition, we find that through role-playing, VLLMs can simulate various functional deficits aligned with patients having impairments in cognitive flexibility, suggesting that VLLMs may possess a cognitive architecture, at least regarding the ability of set-shifting, similar to the brain. This study reveals the fact that VLLMs have already approached the human level on a key component underlying our higher cognition, and highlights the potential to use them to emulate complex brain processes.

Summary

Cognitive Flexibility of Visual LLMs Assessed via the Wisconsin Card Sorting Test

The paper "Visual LLMs Exhibit Human-Level Cognitive Flexibility in the Wisconsin Card Sorting Test" examines the cognitive abilities of Visual LLMs (VLLMs) using the classic Wisconsin Card Sorting Test (WCST) as a benchmark. The authors assess how much cognitive flexibility VLLMs exhibit, focusing on set-shifting, a fundamental aspect of human cognition related to adaptability and problem-solving.

Methodology and Experimental Design

The study tests the cognitive flexibility of three state-of-the-art VLLMs: GPT-4o, Gemini-1.5 Pro, and Claude-3.5 Sonnet. The models are evaluated under four conditions that cross input modality (Visual Input vs. Textual Input) with prompting strategy (Straight-to-Answer vs. Chain-of-Thought reasoning). Using a randomized design, each test consists of 64 trials that follow the WCST protocol: the model is asked to sort cards according to undisclosed rules that change intermittently. The paper also includes a human baseline for performance comparison and introduces a novel ALIEN Task variant to rule out memorization effects.
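The trial structure described above can be sketched as a short simulation. This is a minimal illustration, not the authors' implementation: the card attributes and key cards follow the classic WCST, the rule switch after 6 consecutive correct sorts (`switch_after`) is an assumed criterion (the summary only says the rules change intermittently), and the `choose` callable is a hypothetical stand-in for a call to a VLLM.

```python
import random
from dataclasses import dataclass

# Classic WCST dimensions; the exact stimuli used in the paper are an assumption here.
COLORS = ["red", "green", "yellow", "blue"]
SHAPES = ["triangle", "star", "cross", "circle"]
NUMBERS = [1, 2, 3, 4]
RULES = ["color", "shape", "number"]


@dataclass(frozen=True)
class Card:
    color: str
    shape: str
    number: int


# The four key cards; each attribute value appears on exactly one of them.
KEY_CARDS = [
    Card("red", "triangle", 1),
    Card("green", "star", 2),
    Card("yellow", "cross", 3),
    Card("blue", "circle", 4),
]


def correct_pile(card, rule):
    """Return the index of the key card that matches `card` on `rule`."""
    for i, key in enumerate(KEY_CARDS):
        if getattr(key, rule) == getattr(card, rule):
            return i
    raise ValueError(f"no key card matches {card} on {rule}")


def run_wcst(choose, n_trials=64, switch_after=6, seed=0):
    """Run one session; `choose(card, feedback)` returns a pile index.

    `feedback` is None on the first trial, then True/False for the
    previous sort. The hidden rule is never shown to `choose`.
    """
    rng = random.Random(seed)
    rule = rng.choice(RULES)
    streak, feedback, outcomes = 0, None, []
    for _ in range(n_trials):
        card = Card(rng.choice(COLORS), rng.choice(SHAPES), rng.choice(NUMBERS))
        correct = choose(card, feedback) == correct_pile(card, rule)
        outcomes.append(correct)
        feedback = correct
        streak = streak + 1 if correct else 0
        if streak >= switch_after:  # assumed switch criterion
            rule = rng.choice([r for r in RULES if r != rule])
            streak = 0
    return outcomes


def make_win_stay_lose_shift():
    """Hypothetical baseline sorter: keep the rule after a hit, rotate after a miss."""
    state = {"rule": RULES[0]}

    def choose(card, feedback):
        if feedback is False:
            nxt = (RULES.index(state["rule"]) + 1) % len(RULES)
            state["rule"] = RULES[nxt]
        return correct_pile(card, state["rule"])

    return choose


outcomes = run_wcst(make_win_stay_lose_shift())
print(f"accuracy over {len(outcomes)} trials: {sum(outcomes) / len(outcomes):.2f}")
```

To evaluate a model instead of the baseline sorter, `choose` would wrap a prompt that presents the card (as text or as an image) together with the previous feedback, which is where the four modality and prompting conditions would differ.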

Key Findings

Performance Variability: Set-shifting performance varies significantly across input modalities and prompting strategies. Models perform best under Chain-of-Thought reasoning with textual inputs, where they surpass the human baseline and Claude-3.5 Sonnet achieves perfect performance. By contrast, all models fail to apply the sorting rules consistently in visual input conditions without explicit reasoning prompts.
Input Modalities: Textual inputs yield better cognitive flexibility than visual inputs across all models. Although visual processing accuracy is largely satisfactory, occasional perceptual errors increase performance variability in visual conditions, highlighting potential areas for improvement in current VLLM architectures.
Rule Exclusivity and Real-World Implications: Without explicit constraints on rule exclusivity, performance drops significantly. This indicates that VLLMs depend heavily on precise task instructions, which could influence their application in ambiguous or complex real-world scenarios where input clarity is not guaranteed.
Simulating Cognitive Impairment: The use of role-playing prompts reveals that VLLMs can mimic patterns of cognitive dysfunction similar to those observed in neuropsychological studies of human patients with prefrontal cortex impairments. Claude-3.5 Sonnet, which has excellent baseline flexibility, shows substantial vulnerability under simulated impairment conditions.

Implications and Future Directions

The paper posits that VLLMs are closing the gap toward human-level cognitive flexibility, particularly when tasks are mediated by deliberate reasoning prompts. The ability to simulate cognitive impairments expands these models' potential applications in clinical research and in strengthening AI safety protocols. Nonetheless, the reliance on explicit instructions shows that further refinement is needed, particularly in adaptability to less structured environments.

Future research should examine the internal mechanisms that enable such sophisticated cognitive modeling, broaden the application range of these findings, and explore the integration of multimodal information processing. Enhancing visual processing capabilities and systematically addressing ambiguities in real-world scenarios are key areas for advancement. Moreover, investigating the cognitive architecture underlying VLLM performance will be essential for understanding their limitations and potential as mirrors of human cognition.

In conclusion, the paper provides a thorough analysis of VLLMs' cognitive flexibility, demonstrating remarkable potential and important limitations. It outlines a trajectory for future developments in AI, suggesting an increasingly nuanced approach to understanding and replicating human cognitive processes.