
Evaluating Cognitive Maps and Planning in Large Language Models with CogEval (2309.15129v1)

Published 25 Sep 2023 in cs.AI, cs.CL, and cs.LG

Abstract: Recently, an influx of studies claims emergent cognitive abilities in LLMs. Yet, most rely on anecdotes, overlook contamination of training sets, or lack systematic evaluation involving multiple tasks, control conditions, multiple iterations, and statistical robustness tests. Here we make two major contributions. First, we propose CogEval, a cognitive science-inspired protocol for the systematic evaluation of cognitive capacities in LLMs. The CogEval protocol can be followed for the evaluation of various abilities. Second, here we follow CogEval to systematically evaluate cognitive maps and planning ability across eight LLMs (OpenAI GPT-4, GPT-3.5-turbo-175B, davinci-003-175B, Google Bard, Cohere-xlarge-52.4B, Anthropic Claude-1-52B, LLaMA-13B, and Alpaca-7B). We base our task prompts on human experiments, which offer both established construct validity for evaluating planning, and are absent from LLM training sets. We find that, while LLMs show apparent competence in a few planning tasks with simpler structures, systematic evaluation reveals striking failure modes in planning tasks, including hallucinations of invalid trajectories and getting trapped in loops. These findings do not support the idea of emergent out-of-the-box planning ability in LLMs. This could be because LLMs do not understand the latent relational structures underlying planning problems, known as cognitive maps, and fail at unrolling goal-directed trajectories based on the underlying structure. Implications for application and future directions are discussed.

Evaluating Cognitive Maps and Planning in LLMs with CogEval

The paper "Evaluating Cognitive Maps and Planning in LLMs with CogEval" challenges prevailing narratives around the emergent cognitive abilities of LLMs by investigating their capacity for cognitive mapping and planning. The authors introduce a comprehensive evaluation framework, CogEval, which is methodically inspired by cognitive science, for assessing the cognitive capabilities exhibited by LLMs.

Core Contributions

  1. CogEval Protocol: The paper's foremost contribution is CogEval, a protocol designed to systematically evaluate cognitive abilities such as theory of mind, causal reasoning, and planning in LLMs. CogEval emphasizes avoiding dataset contamination, employing multiple tasks and control conditions, generating multiple responses per prompt, and ensuring robust statistical analysis. The framework applies broadly to cognitive constructs beyond cognitive maps and planning; a minimal harness sketch follows this list.
  2. Analysis Across Models: The authors employ CogEval to assess cognitive maps and planning abilities in eight LLMs, including OpenAI GPT-4 and GPT-3 variants, Google Bard, LLaMA, and others. The evaluation is conducted using task prompts derived from established human cognitive science experiments adapted into novel linguistic formats, ensuring minimal overlap with training datasets.
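
To make the protocol concrete, here is a minimal sketch of a CogEval-style harness in Python. The `Task` container, the `query_llm` stub, and the repeat counts are illustrative assumptions rather than the paper's published code; the structure simply mirrors the protocol's requirements of multiple tasks, multiple prompt variants, repeated generations, and aggregate statistics.

```python
import statistics
from dataclasses import dataclass

@dataclass
class Task:
    name: str          # e.g. "community graph: shortest path"
    prompts: list      # paraphrased variants to control for wording effects
    grader: callable   # maps a raw response to 1 (correct) or 0 (incorrect)

def query_llm(model: str, prompt: str, temperature: float) -> str:
    """Hypothetical stub: call the model under test and return its text reply."""
    raise NotImplementedError

def evaluate(model: str, tasks: list, temperatures=(0.0, 0.5, 1.0), n_repeats=30):
    """Score every (task, temperature) cell over repeated generations."""
    results = {}
    for task in tasks:
        for temp in temperatures:
            scores = [
                task.grader(query_llm(model, prompt, temp))
                for prompt in task.prompts
                for _ in range(n_repeats)   # repeated generations per prompt
            ]
            results[(task.name, temp)] = (statistics.mean(scores),
                                          statistics.stdev(scores))
    return results
```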

Experimental Findings

The systematic examination of LLMs using CogEval reveals that although some LLMs display competence in simpler planning tasks, they generally falter on complex tasks requiring genuine cognitive mapping and planning. Notable failure modes include hallucinating invalid state transitions and becoming trapped in repetitive loops, suggesting that the models do not form the latent relational structures, i.e. cognitive maps, that goal-directed planning requires.
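
Once the environment's true transition structure is known, both failure modes are mechanical to detect. The sketch below (the graph encoding and function names are hypothetical, not the authors' evaluation code) flags hallucinated edges, revisited states, and wrong endpoints in a model-proposed trajectory:

```python
def check_trajectory(path, adjacency, goal):
    """Classify a proposed path through a graph given as {node: set(neighbors)}.

    Returns one of: "invalid_transition", "loop", "wrong_goal", "ok".
    """
    visited = set(path[:1])
    for a, b in zip(path, path[1:]):
        if b not in adjacency.get(a, set()):
            return "invalid_transition"   # hallucinated edge
        if b in visited:
            return "loop"                 # revisits a state: trapped in a cycle
        visited.add(b)
    return "ok" if path and path[-1] == goal else "wrong_goal"

# Example: a 6-node graph; the model proposes 1 -> 3, but no such edge exists.
graph = {1: {2}, 2: {3, 4}, 3: {5}, 4: {6}, 5: set(), 6: set()}
print(check_trajectory([1, 3, 5], graph, goal=5))     # -> "invalid_transition"
print(check_trajectory([1, 2, 3, 5], graph, goal=5))  # -> "ok"
```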

Statistical Insights

Statistical analysis within this paper underscores that temperature settings in LLMs, as well as the complexity of task structures and domains, significantly impact performance. However, no substantial evidence was found to support the presence of inherent planning capabilities across any of the evaluated LLMs, even in highly sophisticated models like GPT-4.
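
A simplified version of such an analysis can be run with an ordinary least-squares fit over per-run scores; the data frame, column names, and the use of a linear probability model below are assumptions for illustration, not the paper's actual statistical pipeline.

```python
import pandas as pd
import statsmodels.formula.api as smf
from statsmodels.stats.anova import anova_lm

# Hypothetical long-format results: one row per scored generation.
df = pd.DataFrame({
    "score":       [1, 0, 1, 0, 0, 1, 1, 0],  # 1 = valid plan produced
    "temperature": [0.0, 0.0, 0.5, 0.5, 1.0, 1.0, 0.0, 0.5],
    "task":        ["A", "A", "A", "B", "B", "B", "A", "B"],
})

# Linear probability model: accuracy as a function of task structure
# and sampling temperature.
fit = smf.ols("score ~ C(task) + temperature", data=df).fit()
print(anova_lm(fit, typ=2))   # F-tests for each factor
```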

Implications and Future Directions

This research holds significant implications for the future development and application of AI technologies. The findings caution against presuming innate high-level cognitive abilities in current-generation LLMs, particularly in domains requiring planning and navigation over complex relational structures. Consequently, there is a growing imperative to augment LLMs with mechanisms analogous to human executive functioning for memory and planning, such as external executive-control systems or specialized architectural advances.
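
One speculative shape such augmentation could take is a propose-verify-replan loop, in which an external module with access to the true task structure rejects invalid plans and feeds the failure back to the model. The sketch below reuses the hypothetical `query_llm` and `check_trajectory` helpers from earlier and adds an equally hypothetical `parse_path` extractor:

```python
import re

def parse_path(text: str) -> list:
    """Hypothetical extractor: pull an ordered list of node ids from a reply."""
    return [int(tok) for tok in re.findall(r"\d+", text)]

def plan_with_verifier(model, prompt, adjacency, goal, max_attempts=5):
    """Speculative executive-control loop: the LLM proposes, and a verifier
    with access to the true transition structure accepts or requests a retry."""
    feedback = ""
    for _ in range(max_attempts):
        reply = query_llm(model, prompt + feedback, temperature=0.7)
        path = parse_path(reply)
        verdict = check_trajectory(path, adjacency, goal)
        if verdict == "ok":
            return path
        feedback = f"\nYour previous answer failed ({verdict}). Please try again."
    return None   # no valid plan found within the attempt budget
```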

Speculatively, future advancements may necessitate a shift towards models exhibiting more nuanced, energy-efficient architectures that mirror the specialized faculties of biological brains. Such models could potentially achieve cognitive tasks with smaller scale and resource footprints, emphasizing effective task and domain specialization.

The paper contributes to advancing the discourse on the relationship between model size and cognitive functionality, steering the focus toward methodological rigor and architectural efficacy in enhancing AI capabilities. The findings underscore the need for continued research into the cognitive mechanisms that underpin planning and reasoning within artificial neural networks.

Authors (8)
  1. Ida Momennejad (21 papers)
  2. Hosein Hasanbeig (8 papers)
  3. Felipe Vieira (6 papers)
  4. Hiteshi Sharma (12 papers)
  5. Robert Osazuwa Ness (6 papers)
  6. Nebojsa Jojic (43 papers)
  7. Hamid Palangi (52 papers)
  8. Jonathan Larson (23 papers)
Citations (52)