An Evaluation of LLMs: Leetspeak as a Modality Test
This paper, authored by Evelina Leivada and collaborators, critically examines the linguistic competence of contemporary LLMs. The question under scrutiny is whether these models understand and process language in a human-like way and, by extension, whether they represent credible approximations of human linguistic cognition.
The investigation focuses on whether the models can handle surface perturbations of text that human readers resolve with ease. To probe this, the authors introduce a novel 'leet task': sentences are rewritten with alphanumeric substitutions (leetspeak), and the models must both decode the transformed text and explain the substitutions they identified. The task rests on the hypothesis that decoding such transformations, and explaining them accurately, requires the kind of higher-level, top-down processing typically absent in LLMs. A sketch of such a transformation appears below.
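To make the setup concrete, here is a minimal sketch of how a leet transformation with a controllable substitution rate could be generated. The character mapping, the `degree` parameter, and the example sentence are illustrative assumptions, not the exact materials used by Leivada et al.

```python
import random

# Hypothetical character-to-leet mapping; the substitution table actually
# used in the paper is not reproduced here, so this set is illustrative only.
LEET_MAP = {"a": "4", "e": "3", "i": "1", "o": "0", "s": "5", "t": "7"}

def to_leet(sentence: str, degree: float = 1.0, seed: int = 0) -> str:
    """Replace a fraction `degree` of substitutable characters with leet digits."""
    rng = random.Random(seed)
    out = []
    for ch in sentence:
        key = ch.lower()
        if key in LEET_MAP and rng.random() < degree:
            out.append(LEET_MAP[key])
        else:
            out.append(ch)
    return "".join(out)

# A low degree leaves most letters intact; a high degree replaces nearly all of them.
print(to_leet("The cat sat on the mat", degree=0.3))  # partially transformed, e.g. "The ca7 sat 0n the mat"
print(to_leet("The cat sat on the mat", degree=1.0))  # "7h3 c47 547 0n 7h3 m47"
```

The `degree` knob mirrors the idea, reported in the paper, that performance is sensitive to how heavily a sentence is transformed: the model's job is to recover the original sentence and state which characters were substituted.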
Key Findings
- Model Performance and Human Comparison: The results show clearly that human subjects outperform LLMs in the leet task, both in decoding accuracy and in the ability to reason about the substitutions. Even the best-performing model, ChatGPT-4o, handles sentences with few substitutions reasonably well but struggles substantially as the number of substitutions increases. This reflects a quantifiable gap between human cognitive processing and LLM outputs.
- Mismatch in Accuracy and Reasoning: The research documents a clear gap between the models' decoding accuracy and their ability to reason about their own outputs, a dissociation not paralleled in human performance. ChatGPT-4o, for example, sometimes decodes a phrase correctly yet cannot coherently explain how it arrived at that decoding.
- Core Conceptual Understanding: The paper argues that these performance gaps arise because LLMs operate over fossilized language data without real-world grounding, a hypothesis further supported by the models' failure to articulate the semantic nuances underlying linguistic exchanges. On this account, LLMs lack not only robust form-meaning mappings but also the ability to situate those mappings within the pragmatic context that human comprehension takes for granted.
Implications and Future Directions
The investigation considers the broader implications of these capabilities and limitations for the claim that LLMs are cognitively equivalent to human language users. The authors suggest that current LLM architectures fall short not primarily because of scale, but because they fundamentally diverge from cognitive representations grounded in real-world semantics and interaction. On this view, LLMs are powerful predictive tools rather than explanatory models of human-like linguistic understanding.
A salient implication of this research is that models need richer, more structured cognitive world models. Future work might explore integrative approaches that combine deep learning with symbolic cognitive frameworks to better approximate the structure of human language processing. Another direction is to equip LLMs with syntactic priors and structured abstractions that go beyond statistical correlation.
Conclusion
Leivada et al.'s paper serves as a critical reassessment of the current state and theoretical understanding of LLMs. It highlights their fundamental cognitive limitations and advocates for advances in deep semantic learning and cognitive interfacing as prerequisites for human-like linguistic comprehension. The paper reinforces the point that, while LLMs may excel at formal linguistic tasks, they have yet to achieve the nuanced, grounded understanding characteristic of human language cognition, and it cautions against overestimating these models as full replicas of human linguistic intelligence.