This paper (Zhang et al., 28 Jan 2024) introduces a novel approach to understanding human reading by integrating word embeddings from LLMs such as BERT with neurophysiological data from electroencephalography (EEG) and behavioral data from eye-tracking. The goal is to create a "reading embedding" that captures not only the semantic meaning of words but also the cognitive processes involved when a person reads them. This work aims to serve as a foundation for AI-assisted tools that improve reading comprehension.
The paper uses the Zurich Cognitive Language Processing Corpus (ZuCo) 1.0 dataset, focusing on the Task-Specific Reading (TSR) paradigm, in which subjects answer questions about the text. The core idea is to train a model to predict whether a word is a high-relevance word (HRW) or a low-relevance word (LRW) for an inference task, using labels generated by powerful LLMs (GPT-3.5 Turbo and GPT-4) as a form of "fuzzy ground truth". This LLM-guided labeling is validated by showing that BERT word embeddings alone can predict the labels with high accuracy (92.7%).
For practical implementation, the paper details the processing pipeline:
- Word Embedding: Each word/token in the sentence is processed using a pre-trained BERT model. The hidden state from the second-to-last layer (dimension 768) is extracted and L2 normalized to represent the word's semantic meaning within its context. Padding is applied for sentences shorter than the maximum length.
- Biomarker Feature Extraction:
  - Eye-gaze: 12 distinct features are extracted per word, including various fixation durations, total reading time, gaze duration, and pupil-size metrics. These features are L1-normalized within each sentence.
  - EEG: Features are extracted using the conditional entropy method, yielding a 5460-dimensional vector per word.
  - Handling Multiple/Zero Fixations: When a word has multiple fixations, each fixation's eye-gaze and EEG vectors are first L2-normalized and then summed element-wise across all fixations for that word. Words with no fixation data are assigned zero vectors.
- Reading-Embedding Model Architecture:
  - The processed eye-gaze and EEG features are each linearly projected into a common lower-dimensional space (128 dimensions).
  - The projected features are fused by element-wise addition.
  - Sinusoidal positional encoding is applied to the fused features.
  - The result is fed into a single attention-based transformer encoder block.
  - An MLP layer follows the transformer encoder and outputs a probability for the binary HRW/LRW classification.
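The word-embedding step of the pipeline can be sketched in PyTorch. The function below works on the tuple of per-layer hidden states returned by a HuggingFace-style BERT (with `output_hidden_states=True`), so the normalization and padding can be illustrated without loading the model itself; the function name and `max_len` padding convention are assumptions:

```python
import torch
import torch.nn.functional as F

def embed_tokens(hidden_states, max_len: int) -> torch.Tensor:
    """Build per-token semantic embeddings from BERT hidden states.

    hidden_states: tuple of per-layer tensors, each (1, seq_len, 768),
    as returned by HuggingFace models with output_hidden_states=True.
    Takes the second-to-last layer, L2-normalizes each token vector,
    and zero-pads the sequence to max_len.
    """
    h = hidden_states[-2].squeeze(0)          # (seq_len, 768), second-to-last layer
    h = F.normalize(h, p=2, dim=-1)           # unit-length token vectors
    pad = max_len - h.size(0)
    if pad > 0:                               # pad short sentences with zero vectors
        h = torch.cat([h, torch.zeros(pad, h.size(-1))], dim=0)
    return h                                  # (max_len, 768)
```

In practice the tuple would come from `BertModel.from_pretrained(..., output_hidden_states=True)` applied to a tokenized sentence.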
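The multi-fixation aggregation described under "Handling Multiple/Zero Fixations" might look like the following sketch; the function name and the (n_fixations, feature_dim) tensor layout are assumptions:

```python
import torch
import torch.nn.functional as F

def aggregate_fixations(fixation_vectors, dim: int) -> torch.Tensor:
    """Combine per-fixation feature vectors (eye-gaze or EEG) for one word.

    Per the paper's description: L2-normalize each fixation's vector, then
    sum element-wise across fixations. A word with no fixations receives
    a zero vector of the same dimensionality.
    """
    if fixation_vectors is None or fixation_vectors.numel() == 0:
        return torch.zeros(dim)                           # no-fixation case
    normed = F.normalize(fixation_vectors, p=2, dim=-1)   # (n_fix, dim)
    return normed.sum(dim=0)                              # (dim,)
```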
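Putting the architecture bullets together, a minimal PyTorch sketch could look like this. The 128-dimensional projection and single encoder block follow the description above; the number of attention heads, the MLP width, and the maximum sequence length are not stated and are illustrative assumptions:

```python
import math
import torch
import torch.nn as nn

class ReadingEmbeddingModel(nn.Module):
    """Sketch: project gaze + EEG features, fuse, encode, classify per word."""

    def __init__(self, gaze_dim=12, eeg_dim=5460, d_model=128, max_len=64):
        super().__init__()
        self.gaze_proj = nn.Linear(gaze_dim, d_model)
        self.eeg_proj = nn.Linear(eeg_dim, d_model)
        # Fixed sinusoidal positional encoding, registered as a buffer.
        pos = torch.arange(max_len).unsqueeze(1)
        div = torch.exp(torch.arange(0, d_model, 2) * (-math.log(10000.0) / d_model))
        pe = torch.zeros(max_len, d_model)
        pe[:, 0::2] = torch.sin(pos * div)
        pe[:, 1::2] = torch.cos(pos * div)
        self.register_buffer("pe", pe)
        # Single attention-based transformer encoder block (nhead is an assumption).
        self.encoder = nn.TransformerEncoderLayer(d_model=d_model, nhead=4,
                                                  batch_first=True)
        # MLP head producing a per-word HRW probability (width is an assumption).
        self.head = nn.Sequential(nn.Linear(d_model, 64), nn.ReLU(),
                                  nn.Linear(64, 1))

    def forward(self, gaze, eeg):
        # gaze: (batch, seq, 12); eeg: (batch, seq, 5460)
        x = self.gaze_proj(gaze) + self.eeg_proj(eeg)   # element-wise fusion
        x = x + self.pe[: x.size(1)]                    # add positional encoding
        x = self.encoder(x)                             # one encoder block
        return torch.sigmoid(self.head(x)).squeeze(-1)  # (batch, seq) probabilities
```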
Training: The model is trained with a combined loss consisting of Masked Binary Cross-Entropy, Masked Mean Squared Error, and Masked Soft-F1 Loss, with the three terms weighted equally. Stochastic Gradient Descent (SGD) with a learning rate of 0.05 is used for optimization. To handle class imbalance, LRW samples are downsampled during training and testing to match the number of HRW samples. Evaluation is performed using 5-fold cross-validation applied separately to each subject's data.
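The combined loss might be sketched as follows; the exact masking convention, reduction, and the soft-F1 formulation are assumptions, since only the loss names and equal weighting are given above:

```python
import torch

def combined_masked_loss(probs, targets, mask):
    """Equal-weight sum of masked BCE, masked MSE, and masked soft-F1 loss.

    probs, targets, mask: (batch, seq) tensors; mask is 1 for real words
    and 0 for padding positions, which are excluded from every term.
    """
    eps = 1e-7
    m = mask.float()
    n = m.sum().clamp(min=1.0)                 # number of unmasked positions
    # Masked binary cross-entropy, averaged over unmasked positions.
    bce = -(targets * torch.log(probs + eps)
            + (1 - targets) * torch.log(1 - probs + eps))
    bce = (bce * m).sum() / n
    # Masked mean squared error.
    mse = (((probs - targets) ** 2) * m).sum() / n
    # Soft F1: differentiable TP/FP/FN counts over unmasked positions;
    # minimizing (1 - soft_f1) maximizes F1.
    tp = (probs * targets * m).sum()
    fp = (probs * (1 - targets) * m).sum()
    fn = ((1 - probs) * targets * m).sum()
    soft_f1 = 2 * tp / (2 * tp + fp + fn + eps)
    return bce + mse + (1 - soft_f1)           # equal weights on all three terms
```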
The results demonstrate that integrating EEG and eye-gaze data and processing them through the transformer model achieves better accuracy (average 68.7%, max 71.2%) in classifying HRW vs. LRW words than either modality alone or simpler linear classifiers such as SVM. This indicates that the multi-modal approach, combined with the transformer's ability to capture relationships within the sequence, enhances the prediction of relevance from human reading patterns.
The practical implications of this research lie in its potential to power future brain-computer interface (BCI) applications for reading assistance. By identifying words that correlate with signs of cognitive difficulty or lack of attention (potentially when human reading patterns diverge from the LLM-predicted relevance), an assistive tool could provide real-time feedback or interventions. The use of readily available datasets like ZuCo and the implementation details provided (including code availability) make this a practical step towards developing such tools. Future work is suggested to apply this method to reading tasks where subjects exhibit lower comprehension, as these scenarios are where assistance would be most valuable.