Large Language Models Are Human-Like Internally (2502.01615v1)

Published 3 Feb 2025 in cs.CL

Abstract: Recent cognitive modeling studies have reported that larger LLMs (LMs) exhibit a poorer fit to human reading behavior, leading to claims of their cognitive implausibility. In this paper, we revisit this argument through the lens of mechanistic interpretability and argue that prior conclusions were skewed by an exclusive focus on the final layers of LMs. Our analysis reveals that next-word probabilities derived from internal layers of larger LMs align with human sentence processing data as well as, or better than, those from smaller LMs. This alignment holds consistently across behavioral (self-paced reading times, gaze durations, MAZE task processing times) and neurophysiological (N400 brain potentials) measures, challenging earlier mixed results and suggesting that the cognitive plausibility of larger LMs has been underestimated. Furthermore, we first identify an intriguing relationship between LM layers and human measures: earlier layers correspond more closely with fast gaze durations, while later layers better align with relatively slower signals such as N400 potentials and MAZE processing times. Our work opens new avenues for interdisciplinary research at the intersection of mechanistic interpretability and cognitive modeling.

Authors (5)
  1. Tatsuki Kuribayashi (31 papers)
  2. Yohei Oseki (22 papers)
  3. Souhaib Ben Taieb (18 papers)
  4. Kentaro Inui (119 papers)
  5. Timothy Baldwin (125 papers)

Summary

LLMs Are Human-Like Internally

This paper examines the cognitive plausibility of LLMs with respect to human reading behavior, challenging the previous claim that larger models are inherently less human-like in how they predict upcoming words. It investigates how next-word probabilities derived from the models' internal layers correlate with human sentence-processing measures, highlighting the role of mechanistic interpretability in uncovering such correlations.

LLMs have traditionally been evaluated for cognitive plausibility by mapping their final-layer outputs to human reading-time data. The present paper shows, however, that surprisal, the negative log-probability of a word given its context, aligns more strongly with human reading measures when derived from internal layers than from the final layer. This reframes past studies, which focused exclusively on the final layer and consequently underestimated the cognitive alignment of larger LMs.
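
To make the layer-wise analysis concrete, here is a minimal sketch (not the authors' implementation) of logit-lens surprisal: an intermediate layer's hidden states are projected through the model's final layer norm and unembedding matrix to obtain next-word probabilities. The model name, example sentence, and layer choice are placeholders; the paper's tuned-lens variant additionally trains a small per-layer probe.

```python
import math
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

# Hypothetical setup: GPT-2 stands in for the larger LMs studied in the paper.
model = AutoModelForCausalLM.from_pretrained("gpt2")
tokenizer = AutoTokenizer.from_pretrained("gpt2")
model.eval()

inputs = tokenizer("The editor that the author recommended praised the book.",
                   return_tensors="pt")

with torch.no_grad():
    out = model(**inputs, output_hidden_states=True)

def layer_surprisal(layer_idx):
    """Surprisal (in bits) of each token given its prefix, read off at layer_idx
    by projecting that layer's hidden states through the unembedding (logit lens)."""
    h = out.hidden_states[layer_idx]            # (1, seq_len, hidden_dim)
    if layer_idx == len(out.hidden_states) - 1:
        logits = model.lm_head(h)               # final hidden state is already layer-normed
    else:
        logits = model.lm_head(model.transformer.ln_f(h))  # reuse final norm + unembedding
    log_probs = torch.log_softmax(logits, dim=-1)
    targets = inputs["input_ids"][:, 1:]        # token t+1 is predicted from position t
    token_logp = log_probs[:, :-1].gather(-1, targets.unsqueeze(-1)).squeeze(-1)
    return (-token_logp / math.log(2)).squeeze(0)

for layer in (6, len(out.hidden_states) - 1):   # an intermediate layer vs. the final layer
    print(layer, [round(s, 2) for s in layer_surprisal(layer).tolist()])
```

Each layer yields its own surprisal estimates; comparing how well these estimates predict human reading measures, layer by layer, is the core of the paper's analysis.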

Several key observations were made:

  1. Layer-Specific Alignment with Cognitive Tasks: Different internal layers of LLMs are associated with different facets of human sentence processing. Surprisal derived from early layers aligns more closely with fast responses, such as first-pass gaze durations, implicating early-stage lexical processing. In contrast, later layers align better with slower, more integrative processes indexed by N400 event-related potentials and MAZE task processing times.
  2. Implications for Model Scaling: When internal layers are considered, larger LLMs match or surpass smaller models in cognitive plausibility. This suggests that, taken across all layers, larger models hold representations aligned with different stages of human processing. A layered analysis thus counters the earlier "bigger is not always better" conclusion and supports scaling LMs for cognitive modeling (a minimal sketch of such a per-layer fit comparison follows this list).
  3. Mechanistic Interpretability as a Tool: Tools such as the "logit lens" and "tuned lens" map internal representations directly to next-word distributions. They enable analysis beyond final-layer outputs, enhancing interpretability and revealing layer-specific human-like patterns in language processing.
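
How a layer's surprisal is scored against human data can be sketched as a standard regression comparison: fit reading times with baseline predictors alone, then with that layer's surprisal added, and credit the layer with the resulting gain in log-likelihood. The arrays below are synthetic placeholders standing in for aligned per-word data, not the paper's corpora.

```python
import numpy as np

# Synthetic placeholders: in practice, reading times, baseline predictors, and
# per-layer surprisal would be aligned per word over a reading-time corpus.
rng = np.random.default_rng(0)
n_words = 500
baseline = rng.normal(size=(n_words, 2))                  # e.g., word length, log frequency
reading_times = rng.normal(350.0, 60.0, size=n_words)     # self-paced reading times (ms)
surprisal_by_layer = {layer: rng.normal(8.0, 2.0, size=n_words) for layer in range(13)}

def gaussian_loglik(X, y):
    """Maximum-likelihood log-likelihood of an ordinary least-squares fit."""
    X = np.column_stack([np.ones(len(y)), X])
    beta, *_ = np.linalg.lstsq(X, y, rcond=None)
    resid = y - X @ beta
    sigma2 = resid.var()
    return -0.5 * len(y) * (np.log(2 * np.pi * sigma2) + 1)

def delta_loglik(surprisal):
    """Gain in log-likelihood from adding one layer's surprisal to the baseline."""
    return (gaussian_loglik(np.column_stack([baseline, surprisal]), reading_times)
            - gaussian_loglik(baseline, reading_times))

scores = {layer: delta_loglik(s) for layer, s in surprisal_by_layer.items()}
best = max(scores, key=scores.get)
print(f"best-fitting layer: {best} (delta log-likelihood = {scores[best]:.2f})")
```

Repeating this comparison across layers and models gives the kind of layer-by-layer fit profile the paper reports, where intermediate layers of larger LMs score highest.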

These findings motivate a shift in LLM-based cognitive modeling: integrate internal representational evidence rather than relying solely on final-layer outputs. This approach opens avenues for improved cognitive modeling of human language comprehension and suggests that larger models, examined layer by layer, may capture intermediate stages of processing akin to those in humans.

In terms of practical and theoretical implications, the paper posits that insights gained from internal model layers can be critical for advancing adaptive and efficient human-like AI systems. These insights not only deepen our understanding of how LLMs process language but also pave the way for future research on layer-wise alignment with human cognitive processes, strengthening the case for LLMs as joint models of linguistic and cognitive behavior. Overall, the paper bridges a gap between neural network architectures and neurocognitive evidence of human language processing, informing the design of explanatory and human-compatible AI systems.
