
Multilingual Language Models Predict Human Reading Behavior (2104.05433v1)

Published 12 Apr 2021 in cs.CL

Abstract: We analyze if LLMs are able to predict patterns of human reading behavior. We compare the performance of language-specific and multilingual pretrained transformer models to predict reading time measures reflecting natural human sentence processing on Dutch, English, German, and Russian texts. This results in accurate models of human reading behavior, which indicates that transformer models implicitly encode relative importance in language in a way that is comparable to human processing mechanisms. We find that BERT and XLM models successfully predict a range of eye tracking features. In a series of experiments, we analyze the cross-domain and cross-language abilities of these models and show how they reflect human sentence processing.

This paper "Multilingual LLMs Predict Human Reading Behavior" (Hollenstein et al., 2021 ) investigates the extent to which pretrained transformer LLMs can predict patterns of human reading behavior, as captured by eye-tracking data. The core idea is to determine if these models implicitly encode concepts like "relative importance" in language in a way comparable to human cognitive processing.

The research addresses the practical question of whether state-of-the-art pretrained language models, specifically BERT and XLM, can be fine-tuned to accurately predict detailed token-level eye-tracking features from naturalistic reading. This is relevant for developers and researchers interested in creating more cognitively plausible AI models, leveraging human behavioral data for NLP tasks, or building applications that adapt to human reading patterns.

Data and Features:

The paper uses eye-tracking data collected from native speakers reading natural texts in four Indo-European languages: English, Dutch, German, and Russian. The datasets include Dundee, GECO, and ZuCo for English, GECO for Dutch, PoTeC for German, and RSC for Russian. For each word (token) in the text, eight specific eye-tracking features are predicted:

  • Word-level characteristics: Number of fixations (NFIX), mean fixation duration (MFD), fixation proportion (FPROP).
  • Early processing: First fixation duration (FFD), first pass duration (FPD).
  • Late processing: Total reading time (TRT), number of re-fixations (NREFIX), re-read proportion (REPROP).

These features, measured in milliseconds or as proportions, were standardized to a range of 0-100 so that a single loss can be computed uniformly across features during training.
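
The paper reports only that features are scaled to the 0-100 range, so the following min-max normalization is a hedged illustration of that step; the function name and example values are assumptions:

import numpy as np

def scale_to_0_100(values):
    # Min-max scale raw feature values (e.g., TRT in milliseconds) to the 0-100 range
    values = np.asarray(values, dtype=float)
    vmin, vmax = values.min(), values.max()
    if vmax == vmin:
        return np.zeros_like(values)  # constant feature: avoid division by zero
    return 100.0 * (values - vmin) / (vmax - vmin)

# Hypothetical total reading times (ms) for five words
print(scale_to_0_100([180, 250, 0, 420, 310]))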

Language Models and Fine-tuning:

The paper evaluates several pretrained transformer models, including monolingual BERT models for Dutch, English, German, and Russian, the multilingual BERT (bert-base-multilingual-cased), and multilingual XLM models (XLM-MLM-EN-2048, XLM-MLM-ENDE-1024, XLM-MLM-17-1280, XLM-MLM-100-1280).
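
For reference, a hedged sketch of loading the multilingual checkpoints with the Hugging Face transformers library; the Hub identifiers below are assumptions inferred from the model names reported in the paper:

from transformers import AutoModel, AutoTokenizer

checkpoints = [
    "bert-base-multilingual-cased",  # BERT-MULTI
    "xlm-mlm-en-2048",               # XLM-MLM-EN-2048
    "xlm-mlm-ende-1024",             # XLM-MLM-ENDE-1024
    "xlm-mlm-17-1280",               # XLM-MLM-17-1280
    "xlm-mlm-100-1280",              # XLM-MLM-100-1280
]

for name in checkpoints:
    tokenizer = AutoTokenizer.from_pretrained(name)
    model = AutoModel.from_pretrained(name)
    print(name, "hidden size:", model.config.hidden_size)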

The practical implementation involves fine-tuning these pretrained models for a token regression task. A linear dense layer is added on top of the transformer's output for each token. This layer projects the hidden state representation of the token to an 8-dimensional vector corresponding to the eight eye-tracking features.

The model is trained using the Mean Squared Error (MSE) loss between the predicted 8-dimensional vector and the ground truth eye-tracking feature vector for each token. The AdamW optimizer is used with a linear learning rate decay schedule. Training involves splitting data into 90% training, 5% validation, and 5% test sets, with early stopping based on validation performance.
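
A minimal sketch of this fine-tuning setup, assuming a model like the GazePredictionModel defined in the code at the end of this summary and DataLoaders that yield input IDs, attention masks, gaze-feature targets aligned to the subtoken sequence, and a boolean mask marking each word's first subtoken (the function names, data format, and hyperparameter values are assumptions; only the MSE loss, AdamW optimizer, linear decay, and early stopping come from the paper):

import torch
from torch.optim import AdamW
from transformers import get_linear_schedule_with_warmup

def mse_on_first_subtokens(preds, targets, loss_mask):
    # preds/targets: (batch, seq_len, 8); loss_mask: (batch, seq_len) bool,
    # True only at the first subtoken of each word (see the next paragraph)
    return torch.nn.functional.mse_loss(preds[loss_mask], targets[loss_mask])

def fine_tune(model, train_loader, val_loader, epochs=20, lr=5e-5, patience=3, device="cuda"):
    model.to(device)
    optimizer = AdamW(model.parameters(), lr=lr)
    scheduler = get_linear_schedule_with_warmup(
        optimizer, num_warmup_steps=0, num_training_steps=epochs * len(train_loader))

    best_val, bad_epochs = float("inf"), 0
    for epoch in range(epochs):
        model.train()
        for input_ids, attention_mask, targets, loss_mask in train_loader:
            preds = model(input_ids.to(device), attention_mask.to(device))
            loss = mse_on_first_subtokens(preds, targets.to(device), loss_mask.to(device))
            optimizer.zero_grad()
            loss.backward()
            optimizer.step()
            scheduler.step()          # linear learning rate decay

        model.eval()
        with torch.no_grad():         # validation loss for early stopping
            val_loss = sum(
                mse_on_first_subtokens(model(i.to(device), a.to(device)),
                                       t.to(device), m.to(device)).item()
                for i, a, t, m in val_loader) / len(val_loader)

        if val_loss < best_val:
            best_val, bad_epochs = val_loss, 0
        else:
            bad_epochs += 1
            if bad_epochs >= patience:
                break                 # early stopping on validation performance
    return model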

A key implementation detail highlighted is the handling of subword tokenization. Since transformers like BERT and XLM use subword units (e.g., WordPiece or SentencePiece), a single word might be split into multiple tokens (e.g., "Philammon" -> ['phil', '##am', '##mon']). However, eye-tracking data is typically aggregated at the word level. The authors address this by computing the loss only with respect to the first subtoken of a multi-subword word.
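
One way to construct such a first-subtoken mask with a fast Hugging Face tokenizer is sketched below; the helper name and the use of word_ids() are illustrative assumptions rather than the authors' code:

from transformers import AutoTokenizer

tokenizer = AutoTokenizer.from_pretrained("bert-base-multilingual-cased")

def first_subtoken_mask(words):
    # Returns the encoded batch plus a boolean mask that is True only at the
    # first subtoken of each word (special tokens and continuation pieces are False)
    enc = tokenizer(words, is_split_into_words=True, return_tensors="pt")
    word_ids = enc.word_ids(0)   # maps each subtoken to its source word (None for specials)
    mask, previous = [], None
    for wid in word_ids:
        mask.append(wid is not None and wid != previous)
        previous = wid
    return enc, mask

enc, mask = first_subtoken_mask(["Philammon", "walked", "home"])
print(tokenizer.convert_ids_to_tokens(enc["input_ids"][0].tolist()))
print(mask)  # the loss (and word-level predictions) use only the True positions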

Key Findings and Practical Implications:

  1. High Prediction Accuracy: The fine-tuned models achieve surprisingly high accuracy (reported as 100-MAE) in predicting human eye-tracking features. For English and Dutch datasets, accuracy often exceeds 90%. While lower for German and Russian due to smaller dataset sizes, the performance is still significant. This suggests that transformer models can learn complex patterns of human reading.
  2. Multilingual Models Perform Well: Multilingual models, particularly XLM models, show advantages when fine-tuned on smaller datasets. BERT-MULTI demonstrates strong cross-language generalization capabilities, even across different scripts (Latin vs. Cyrillic), performing better than XLM in direct cross-lingual evaluation. This is valuable for building systems intended to work across multiple languages.
  3. Data Efficiency: XLM models are shown to be more data-efficient during fine-tuning, maintaining stable performance even with smaller percentages of eye-tracking training data compared to BERT models, which show a significant performance drop.
  4. Feature Predictability: Some features are more easily predicted than others. First Pass Duration (FPD) and Number of Re-fixations (NREFIX) are the most accurately predicted. Features based on proportions (FPROP, REPROP) are harder due to higher subject variability but show the largest relative improvement over a mean baseline.
  5. Cognitive Plausibility: The fine-tuned models successfully learn to reflect well-known human reading phenomena, such as the word length effect (longer words receive higher fixation proportion predictions) and sensitivity to text readability (higher accuracy on easier sentences, particularly for pretrained models before fine-tuning). This indicates that the models' learned representations align with human processing strategies.
  6. Generalization: The models demonstrate good generalization across different text domains within the same language and, for multilingual models like BERT-MULTI, across different languages.

Implementation Considerations for Applications:

  • Model Selection: Choose between monolingual and multilingual models based on the target application's language requirements and the availability of eye-tracking data for fine-tuning. Multilingual models (like BERT-MULTI or XLM-100) are suitable for cross-lingual applications or when fine-tuning data is scarce for specific languages. Monolingual models might be slightly better for large datasets in a single language.
  • Data Requirements: Fine-tuning requires access to eye-tracking data paired with text. While large datasets are beneficial, the paper shows that multilingual models can perform well even with limited data per language. The quality and naturalism of the eye-tracking data are crucial.
  • Handling Subwords: Implement logic to map predictions from subword tokens back to word-level predictions, consistent with how the eye-tracking data is represented. The paper's approach of using the first subtoken's prediction is one method, though alternatives (e.g., pooling, averaging) could also be explored.
  • Computational Resources: Fine-tuning large transformer models requires significant computational resources (GPUs). The paper mentions using a single GPU (Titan X with 12GB memory) but notes the need to adjust batch sizes depending on the model size to manage memory constraints (see Appendix C.2 for batch sizes used).
  • Regression Head: The implementation requires adding a simple linear layer with 8 output units on top of the pretrained model's token output. This layer is trained end-to-end with the pretrained transformer weights.
  • Evaluation: Mean Absolute Error (MAE) or 100-MAE (accuracy) is a suitable metric for evaluating prediction performance for continuous eye-tracking features. Analyzing results per feature provides deeper insights.
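
As a concrete illustration of the 100-MAE metric on features scaled to 0-100, a per-feature computation could look like this sketch (the function name and the random example data are assumptions):

import numpy as np

FEATURES = ["NFIX", "MFD", "FPROP", "FFD", "FPD", "TRT", "NREFIX", "REPROP"]

def accuracy_100_minus_mae(predictions, targets):
    # predictions, targets: arrays of shape (num_words, 8) with values in [0, 100]
    mae = np.abs(predictions - targets).mean(axis=0)
    return dict(zip(FEATURES, 100.0 - mae))

rng = np.random.default_rng(0)
preds = rng.uniform(0, 100, size=(50, 8))   # hypothetical predictions
gold = rng.uniform(0, 100, size=(50, 8))    # hypothetical ground truth
print(accuracy_100_minus_mae(preds, gold))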

Potential Applications:

  • Text Readability and Difficulty Assessment: Predict gaze patterns to estimate how difficult a text is likely to be for human readers, going beyond simple formulaic scores like Flesch.
  • Adaptive Reading Interfaces: Develop systems that adjust text presentation, highlighting, or pacing based on predicted reading difficulty or predicted gaze fixations.
  • Synthetic Gaze Data Generation: Generate realistic synthetic eye-tracking data to augment training datasets for other models that might use gaze as input.
  • Improving NLP Models: Use predicted eye-tracking features or train models with eye-tracking data as a weak supervisory signal or inductive bias to potentially improve performance on downstream NLP tasks like reading comprehension or summarization.
  • Cognitive Science Research: Use the models as computational tools to simulate human reading processes and test psycholinguistic hypotheses.

In summary, the paper provides a practical methodology for fine-tuning transformer language models to predict human eye-tracking features. It demonstrates high accuracy and strong generalization, suggesting that these models and human readers assign relative importance to words in comparable ways. This opens avenues for leveraging human behavioral data in practical AI applications and for advancing our understanding of complex language models.

import torch
from transformers import AutoModel, AutoTokenizer
import torch.nn as nn

class GazePredictionModel(nn.Module):
    def __init__(self, model_name="bert-base-multilingual-cased", num_gaze_features=8):
        super(GazePredictionModel, self).__init__()
        self.transformer = AutoModel.from_pretrained(model_name)
        # Add a linear layer on top for regression
        self.regression_head = nn.Linear(self.transformer.config.hidden_size, num_gaze_features)

    def forward(self, input_ids, attention_mask):
        # Get the hidden states from the transformer
        outputs = self.transformer(input_ids, attention_mask=attention_mask)
        # Typically, we'd use the hidden states for each token.
        # The paper uses the representation of the *first subtoken* for each word.
        # This requires careful alignment between word indices and subtoken indices.
        # For simplicity here, we show the per-token output.
        # In a real implementation, you'd map subtoken outputs back to words.
        token_embeddings = outputs.last_hidden_state # Shape: (batch_size, seq_len, hidden_size)

        # Apply the regression head to each token's embedding
        gaze_predictions = self.regression_head(token_embeddings) # Shape: (batch_size, seq_len, num_gaze_features)

        return gaze_predictions

model_name = "bert-base-multilingual-cased"
tokenizer = AutoTokenizer.from_pretrained(model_name)

model = GazePredictionModel(model_name=model_name)
model.eval()  # inference only in this example; fine-tuning would use model.train()

# Example input sentences (stand-ins for text from a naturalistic reading corpus)
texts = [
    "The quick brown fox jumps over the lazy dog.",
    "Eye tracking shows how readers allocate attention across a sentence.",
]

encoded_inputs = tokenizer(texts, return_tensors='pt', padding=True, truncation=True)
input_ids = encoded_inputs['input_ids']
attention_mask = encoded_inputs['attention_mask']

with torch.no_grad():
    predicted_gaze = model(input_ids, attention_mask)  # Shape: (batch_size, seq_len, 8)

This code sketch outlines the basic structure: load a pretrained model, add a regression layer, and fine-tune on eye-tracking data with an MSE loss. The main practical challenge is handling subword tokenization so that model predictions align with word-level human gaze data; the paper's approach is to predict for all subtokens but compute the loss only on the first subtoken of each original word.
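
Continuing from the code above, word-level predictions can be recovered by keeping only the output at each word's first subtoken; the snippet below reuses the model and tokenizer defined above and is an illustrative sketch, not the authors' implementation:

# Map subtoken-level predictions back to word-level predictions
words = ["Philammon", "walked", "home"]
enc = tokenizer(words, is_split_into_words=True, return_tensors='pt')
word_ids = enc.word_ids(0)  # subtoken index -> source word index (None for special tokens)
first_subtoken = [i for i, wid in enumerate(word_ids)
                  if wid is not None and (i == 0 or wid != word_ids[i - 1])]

with torch.no_grad():
    preds = model(enc['input_ids'], enc['attention_mask'])  # (1, seq_len, 8)
word_level_preds = preds[0, first_subtoken]                 # (len(words), 8)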

Authors (5)
  1. Nora Hollenstein (21 papers)
  2. Federico Pirovano (3 papers)
  3. Ce Zhang (215 papers)
  4. Lena Jäger (4 papers)
  5. Lisa Beinborn (17 papers)
Citations (42)