On the Predictive Power of Neural Language Models for Human Real-Time Comprehension Behavior (2006.01912v1)

Published 2 Jun 2020 in cs.CL

Abstract: Human reading behavior is tuned to the statistics of natural language: the time it takes human subjects to read a word can be predicted from estimates of the word's probability in context. However, it remains an open question what computational architecture best characterizes the expectations deployed in real time by humans that determine the behavioral signatures of reading. Here we test over two dozen models, independently manipulating computational architecture and training dataset size, on how well their next-word expectations predict human reading time behavior on naturalistic text corpora. We find that across model architectures and training dataset sizes the relationship between word log-probability and reading time is (near-)linear. We next evaluate how features of these models determine their psychometric predictive power, or ability to predict human reading behavior. In general, the better a model's next-word expectations, the better its psychometric predictive power. However, we find nontrivial differences across model architectures. For any given perplexity, deep Transformer models and n-gram models generally show superior psychometric predictive power over LSTM or structurally supervised neural models, especially for eye movement data. Finally, we compare models' psychometric predictive power to the depth of their syntactic knowledge, as measured by a battery of syntactic generalization tests developed using methods from controlled psycholinguistic experiments. Once perplexity is controlled for, we find no significant relationship between syntactic knowledge and predictive power. These results suggest that different approaches may be required to best model human real-time language comprehension behavior in naturalistic reading versus behavior for controlled linguistic materials designed for targeted probing of syntactic knowledge.

An Analytical Overview of Neural Language Models' Predictive Power for Human Comprehension Behavior

The paper "On the Predictive Power of Neural LLMs for Human Real-Time Comprehension Behavior" provides a detailed examination of the relationship between neural LLM (LM) architectures, their training data scale, and their capability to predict human reading behavior. Conducted by a group of researchers from Harvard University and the Massachusetts Institute of Technology, this paper systematically assesses a wide spectrum of LLMs to determine which architectures and training paradigms most closely align with human reading behaviors.

Key Findings

The authors test over two dozen models varying in architecture, including LSTM-RNNs, Recurrent Neural Network Grammars (RNNGs), Transformers, and n-gram models. They report that, consistent with previous work, the relationship between word log-probability (surprisal) and human reading time is generally linear across model types and training sizes. The models' predictions were evaluated against human reading times from datasets including the Dundee eye-tracking corpus and self-paced reading data.
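As a minimal sketch of the surprisal pipeline these evaluations rely on (not the paper's exact models or preprocessing), per-token surprisal can be computed with an off-the-shelf GPT-2 from the Hugging Face transformers library; aligning subword tokens to corpus words and fixation times is glossed over here:

```python
# Minimal sketch: per-token surprisal from an off-the-shelf GPT-2.
# Assumes the Hugging Face `transformers` library; the paper's own models
# (LSTMs, RNNGs, n-grams, Transformers trained on controlled corpora) differ.
import math

import torch
from transformers import GPT2LMHeadModel, GPT2TokenizerFast

tokenizer = GPT2TokenizerFast.from_pretrained("gpt2")
model = GPT2LMHeadModel.from_pretrained("gpt2").eval()

def token_surprisals(text: str) -> list[tuple[str, float]]:
    """Surprisal in bits (-log2 p) of each token given its left context."""
    ids = tokenizer(text, return_tensors="pt").input_ids
    with torch.no_grad():
        logits = model(ids).logits
    # The distribution at position t predicts the token at position t + 1.
    logprobs = torch.log_softmax(logits[0, :-1], dim=-1)
    token_lp = logprobs[torch.arange(ids.size(1) - 1), ids[0, 1:]]
    tokens = tokenizer.convert_ids_to_tokens(ids[0, 1:].tolist())
    return [(tok, -lp.item() / math.log(2)) for tok, lp in zip(tokens, token_lp)]
```

Given word-aligned surprisals and reading times, the (near-)linearity claim can then be checked with a simple one-degree fit, e.g. `np.polyfit(surprisals, reading_times, deg=1)`.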

Psychometric Predictive Power

A significant portion of the paper focuses on "psychometric predictive power": a model's ability to predict human reading behavior from its next-word expectations. The analysis reveals a strong relationship between a model's perplexity and its psychometric predictive power: in general, the better the model's next-word expectations, the better it predicts reading behavior. However, notable differences emerged across architectures. In particular, deep Transformer models and n-gram models exhibited superior predictive power over LSTM-RNNs and RNNGs, especially for eye-tracking data.
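Psychometric predictive power is quantified as the gain in held-out log-likelihood when surprisal is added to a regression of reading times on baseline predictors (per-word delta log-likelihood). The sketch below captures that idea with a single train/test split and an ordinary linear-Gaussian regression; the baseline features shown are illustrative, and the paper's actual control predictors, spillover handling, and cross-validation are more involved.

```python
# Hedged sketch of delta log-likelihood: the mean per-word gain in held-out
# log-likelihood from adding surprisal to a baseline regression of reading
# times. The single split and plain linear regression are simplifications.
import numpy as np
from sklearn.linear_model import LinearRegression
from sklearn.model_selection import train_test_split

def gaussian_loglik(y: np.ndarray, y_hat: np.ndarray, sigma2: float) -> np.ndarray:
    """Per-observation log-likelihood under a linear-Gaussian model."""
    return -0.5 * (np.log(2 * np.pi * sigma2) + (y - y_hat) ** 2 / sigma2)

def delta_loglik(X_base: np.ndarray, surprisal: np.ndarray, rt: np.ndarray) -> float:
    """Mean held-out log-likelihood gain per word from adding surprisal."""
    X_full = np.column_stack([X_base, surprisal])
    Xb_tr, Xb_te, Xf_tr, Xf_te, y_tr, y_te = train_test_split(
        X_base, X_full, rt, test_size=0.2, random_state=0)
    base = LinearRegression().fit(Xb_tr, y_tr)
    full = LinearRegression().fit(Xf_tr, y_tr)
    # Residual variance estimated on training data for each model.
    s2_base = np.var(y_tr - base.predict(Xb_tr))
    s2_full = np.var(y_tr - full.predict(Xf_tr))
    ll_base = gaussian_loglik(y_te, base.predict(Xb_te), s2_base)
    ll_full = gaussian_loglik(y_te, full.predict(Xf_te), s2_full)
    return float((ll_full - ll_base).mean())
```

A positive value means the model's surprisals carry information about reading times beyond baseline predictors such as word length and frequency.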

Syntactic Knowledge and Predictive Power

Another substantial component of the research examined the link between a model's syntactic knowledge and its psychometric predictive power. Syntactic knowledge was evaluated using a battery of syntactic generalization tests built on methods from controlled psycholinguistic experiments. Interestingly, once perplexity was controlled for, the results indicated no significant relationship between syntactic knowledge and a model's ability to predict human reading times. This suggests that linguistic knowledge beyond syntax may be implicated in real-time comprehension of naturalistic texts.
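As a simplified illustration (the paper's battery scores surprisal at designated regions of many controlled items, which is more nuanced than a whole-word minimal-pair check), a single subject-verb agreement test might compare the surprisal of the critical verb across grammatical and ungrammatical variants, reusing the `token_surprisals` helper sketched above:

```python
# Simplified single-item syntactic test: the grammatical verb form should be
# less surprising than the ungrammatical one. The paper's test battery uses
# region-level surprisal criteria; this minimal pair is only illustrative.
def agreement_test(prefix: str, good: str, bad: str) -> bool:
    """True if the critical word is less surprising in the grammatical variant."""
    s_good = token_surprisals(f"{prefix} {good}")[-1][1]
    s_bad = token_surprisals(f"{prefix} {bad}")[-1][1]
    return s_good < s_bad

# A well-generalizing model should prefer "are" after a plural head noun
# despite the intervening singular distractor:
# agreement_test("The keys to the cabinet", "are", "is")  # ideally True
```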

Implications and Future Directions

The paper elucidates critical insights into the design and deployment of effective language models in applications requiring human-comparable understanding. Practically, these findings could inform the development of more sophisticated human-computer interaction systems and adaptive learning environments, and may influence the evaluation metrics used in language model training.

Theoretically, the dissociation between syntactic capabilities and real-time comprehension prediction opens new avenues for research on language processing. Future work could involve refining LM architectures not only to improve perplexity but also to capture a broader range of linguistic features pertinent to human comprehension.

In conclusion, while computational models continue to evolve, studies such as this provide valuable benchmarks and insights into their alignment with human cognitive processes. The nuanced understanding of how LMs predict human reading behaviors paves the way for improved models that can handle the complexity and variability inherent in human language use.

Authors (5)
  1. Ethan Gotlieb Wilcox
  2. Jon Gauthier
  3. Jennifer Hu
  4. Peng Qian
  5. Roger Levy
Citations (143)