Incremental Comprehension of Garden-Path Sentences by Large Language Models: Semantic Interpretation, Syntactic Re-Analysis, and Attention

Published 25 May 2024 in cs.CL (arXiv:2405.16042v1)

Abstract: When reading temporarily ambiguous garden-path sentences, misinterpretations sometimes linger past the point of disambiguation. This phenomenon has traditionally been studied in psycholinguistic experiments using online measures such as reading times and offline measures such as comprehension questions. Here, we investigate the processing of garden-path sentences and the fate of lingering misinterpretations using four LLMs: GPT-2, LLaMA-2, Flan-T5, and RoBERTa. The overall goal is to evaluate whether humans and LLMs are aligned in their processing of garden-path sentences and in the lingering misinterpretations past the point of disambiguation, especially when extra-syntactic information (e.g., a comma delimiting a clause boundary) is present to guide processing. We address this goal using 24 garden-path sentences that have optional transitive and reflexive verbs leading to temporary ambiguities. For each sentence, there is a pair of comprehension questions corresponding to the misinterpretation and the correct interpretation. In three experiments, we (1) measure the dynamic semantic interpretations of LLMs using the question-answering task; (2) track whether these models shift their implicit parse tree at the point of disambiguation (or by the end of the sentence); and (3) visualize the model components that attend to disambiguating information when processing the question probes. These experiments show promising alignment between humans and LLMs in the processing of garden-path sentences, especially when extra-syntactic information is available to guide processing.

Summary

  • The paper demonstrates that LLMs initially adopt misinterpretations and later correct their understanding using disambiguating cues.
  • The methodology tracks semantic probabilities and incremental parse tree adjustments across 24 garden-path sentences to mirror human comprehension.
  • The study shows that attention mechanisms, especially in RoBERTa and LLaMA-2, become sensitive to commas, guiding effective syntactic re-analysis.

Incremental Comprehension of Garden-Path Sentences by LLMs: Semantic Interpretation, Syntactic Re-Analysis, and Attention

This paper investigates the processing of garden-path sentences by LLMs to understand their capability in parsing temporarily ambiguous sentences and resolving lingering misinterpretations. The primary aim is to discern whether LLMs align with human processing of garden-path sentences, particularly in leveraging extra-syntactic cues such as commas to guide interpretation.

The authors evaluate four LLMs—GPT-2, LLaMA-2, Flan-T5, and RoBERTa—against three main research questions: (1) Do LLMs initially adopt the misinterpretation and shift to the correct interpretation upon encountering the disambiguating information? (2) Is this shift reflected in the syntactic parse trees constructed by the LLMs? (3) Are the attentional mechanisms of transformer-based LLMs sensitive to disambiguating cues?

Methodology

The study employs 24 garden-path sentences that create temporary ambiguities via optional transitive and reflexive verbs. For each sentence, a pair of comprehension questions corresponds to the misinterpretation and the correct interpretation. The sentences are processed in chunks, and the LLMs' semantic interpretations, parse tree adjustments, and attention mechanisms are analyzed.
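
To make the design concrete, the sketch below lays out one item in the format described above. The example sentence, chunking, and question wording are illustrative stand-ins of the classic optional-transitive garden-path type, not necessarily the paper's actual stimuli.

```python
# One hypothetical item in the design described above (not the paper's actual
# stimuli): an optionally transitive verb ("hunted") makes "the deer" temporarily
# ambiguous between object of "hunted" and subject of the main clause, and the
# main-clause verb "ran" is the disambiguating word.
stimulus = {
    "chunks_no_comma":   ["While the man hunted", "the deer", "ran", "into the woods."],
    "chunks_with_comma": ["While the man hunted,", "the deer", "ran", "into the woods."],
    "disambiguating_chunk": 2,                                   # index of "ran"
    "question_misinterpretation": "Did the man hunt the deer?",  # lingering reading
    "question_correct": "Did the deer run into the woods?",      # correct reading
}

# Incremental presentation: after each chunk the growing prefix is paired with a
# comprehension question and handed to the model (see the probe sketch below).
prefixes = [" ".join(stimulus["chunks_no_comma"][:i + 1])
            for i in range(len(stimulus["chunks_no_comma"]))]
print(prefixes)
```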

Experiments

  1. Dynamic Semantic Interpretations: After each chunk, the probabilities the models assign to "yes" and "no" answers to the comprehension questions are tracked, measuring their endorsement of the misinterpretation and any subsequent correction (a minimal probe sketch follows this list).
  2. Syntactic Re-Analysis: Parse trees are extracted incrementally to check whether the models reanalyze the syntactic structure from the misinterpretation to the correct parse at the point of disambiguation.
  3. Attention Mechanisms: Attention weights are examined to identify which model components focus on the disambiguating information.
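
The following is a minimal sketch of how the question-answering probe in experiment 1 could be implemented with an off-the-shelf causal LM such as GPT-2 via Hugging Face transformers. The prompt template, the answer tokens " Yes"/" No", and the renormalization over the two answers are illustrative assumptions; the paper's exact probing setup may differ, and encoder models such as RoBERTa would need a masked-token variant.

```python
# Yes/no probability probe after each sentence chunk (illustrative setup).
import torch
from transformers import GPT2LMHeadModel, GPT2TokenizerFast

tokenizer = GPT2TokenizerFast.from_pretrained("gpt2")
model = GPT2LMHeadModel.from_pretrained("gpt2").eval()

def yes_no_probabilities(prefix: str, question: str) -> dict:
    """Relative preference for ' Yes' vs ' No' as the next token after the probe prompt."""
    prompt = f"{prefix}\nQuestion: {question}\nAnswer:"
    input_ids = tokenizer(prompt, return_tensors="pt").input_ids
    with torch.no_grad():
        next_token_logits = model(input_ids).logits[0, -1]
    probs = torch.softmax(next_token_logits, dim=-1)
    yes_id = tokenizer.encode(" Yes")[0]   # first sub-token if the answer word splits
    no_id = tokenizer.encode(" No")[0]
    p_yes, p_no = probs[yes_id].item(), probs[no_id].item()
    total = p_yes + p_no
    return {"yes": p_yes / total, "no": p_no / total}   # renormalized over the two answers

# Track the answer chunk by chunk for a hypothetical item: the misinterpretation
# question should be endorsed early and (ideally) retracted after "ran".
for prefix in ["While the man hunted the deer",
               "While the man hunted the deer ran",
               "While the man hunted the deer ran into the woods."]:
    print(prefix, yes_no_probabilities(prefix, "Did the man hunt the deer?"))
```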

Results

Surprisal Baseline

Surprisal values indicate that all models show modest increases at the point of disambiguation when there is no comma, consistent with temporary ambiguity. GPT-2 and RoBERTa show heightened sensitivity to the disambiguating information when a comma is present.
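
For reference, per-word surprisal of the kind used in this baseline can be computed directly from an autoregressive LM's token probabilities. The sketch below does this for GPT-2 with Hugging Face transformers; the example sentences are hypothetical items, and sub-word tokens would still need to be aggregated into words to match reading-time regions.

```python
# Per-token surprisal (in bits) from GPT-2: -log2 p(token | preceding tokens).
import math
import torch
from transformers import GPT2LMHeadModel, GPT2TokenizerFast

tokenizer = GPT2TokenizerFast.from_pretrained("gpt2")
model = GPT2LMHeadModel.from_pretrained("gpt2").eval()

def token_surprisals(sentence: str):
    ids = tokenizer(sentence, return_tensors="pt").input_ids
    with torch.no_grad():
        log_probs = torch.log_softmax(model(ids).logits, dim=-1)
    out = []
    for i in range(1, ids.size(1)):
        # Token i is predicted from the logits at position i - 1.
        lp = log_probs[0, i - 1, ids[0, i]].item()
        out.append((tokenizer.decode(ids[0, i]), -lp / math.log(2)))
    return out

# Hypothetical items: the disambiguating verb "ran" should be more surprising
# without the clause-boundary comma than with it.
for s in ["While the man hunted the deer ran into the woods.",
          "While the man hunted, the deer ran into the woods."]:
    print(s)
    for tok, bits in token_surprisals(s):
        print(f"  {tok!r}: {bits:.2f} bits")
```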

Semantic Interpretations

The models demonstrate varying degrees of alignment with human-like misinterpretations and corrections. Specifically:

  • LLaMA-2 shows a strong initial endorsement of the misinterpretation, which decreases substantially after disambiguation when a comma is present.
  • RoBERTa and GPT-2 initially favor the misinterpretation but trend toward the correct interpretation when the extra-syntactic comma is present.

Parse Tree Adjustments

The extracted parse trees reveal the following (one way an implicit parse can be read out of a transformer is sketched after this list):

  • RoBERTa-large achieves human-like performance in reanalyzing and correcting its parse trees at and after the point of disambiguation, especially when commas are present.
  • GPT-2 shows moderate success, shifting to the correct parse more often when commas are included.
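
The summary does not state how the implicit parse trees are extracted. One approach from the interpretability literature is a structural probe: project hidden states with a trained matrix, treat squared distances between projected word vectors as tree distances, and decode an unlabeled tree as a minimum spanning tree. The sketch below shows only the decoding mechanics under that assumption; the probe matrix here is an untrained placeholder, so the resulting tree is structurally valid but not meaningful syntax without training on a treebank.

```python
# Decoding an unlabeled tree from pairwise distances over hidden states
# (structural-probe style; NOT necessarily the paper's extraction method).
import torch
from scipy.sparse.csgraph import minimum_spanning_tree
from transformers import AutoModel, AutoTokenizer

tokenizer = AutoTokenizer.from_pretrained("roberta-base")
model = AutoModel.from_pretrained("roberta-base").eval()

def decode_tree(sentence: str, probe_rank: int = 64):
    enc = tokenizer(sentence, return_tensors="pt")
    with torch.no_grad():
        hidden = model(**enc).last_hidden_state[0]       # (n_tokens, d_model), incl. special tokens
    # Placeholder for a probe matrix trained on a treebank; random weights here.
    probe_placeholder = torch.randn(hidden.size(1), probe_rank)
    proj = hidden @ probe_placeholder                    # (n_tokens, probe_rank)
    diffs = proj.unsqueeze(0) - proj.unsqueeze(1)
    dist = (diffs ** 2).sum(-1).numpy()                  # predicted squared tree distances
    mst = minimum_spanning_tree(dist).toarray()          # unlabeled, undirected tree
    n = dist.shape[0]
    return [(i, j) for i in range(n) for j in range(n) if mst[i, j] > 0]

print(decode_tree("While the man hunted, the deer ran into the woods."))
```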

Attention Weights

Exploratory analysis of attention weights shows that specific attention heads in LLaMA-2 and RoBERTa-large are sensitive to the point of disambiguation and to the correct syntactic reanalysis. The presence of a comma enhances this sensitivity, consistent with these models' ability to correct misinterpretations.
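
A simple way to inspect this kind of sensitivity with Hugging Face transformers is to request the attention matrices and read off, per layer and head, the weight from the disambiguating verb to the comma (or to the ambiguous noun phrase). The model choice, token-matching heuristic, and example sentence below are illustrative assumptions, not the paper's exact analysis.

```python
# Per-layer, per-head attention from the disambiguating verb to the comma.
import torch
from transformers import AutoModel, AutoTokenizer

tokenizer = AutoTokenizer.from_pretrained("roberta-base")
model = AutoModel.from_pretrained("roberta-base", output_attentions=True).eval()

def attention_to(sentence: str, from_word: str, to_word: str) -> torch.Tensor:
    enc = tokenizer(sentence, return_tensors="pt")
    tokens = tokenizer.convert_ids_to_tokens(enc["input_ids"][0].tolist())
    # Crude heuristic: take the first sub-token containing each word of interest.
    src = next(i for i, t in enumerate(tokens) if from_word in t)
    tgt = next(i for i, t in enumerate(tokens) if to_word in t)
    with torch.no_grad():
        attentions = model(**enc).attentions      # tuple of (1, heads, n, n), one per layer
    # (layers, heads) matrix of attention weight from the source to the target token.
    return torch.stack([layer[0, :, src, tgt] for layer in attentions])

scores = attention_to("While the man hunted, the deer ran into the woods.",
                      from_word="ran", to_word=",")
layer, head = divmod(int(scores.argmax()), scores.size(1))
print(f"Strongest comma-directed attention: layer {layer}, head {head}, weight {scores.max().item():.3f}")
```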

Implications and Future Directions

This research contributes to the understanding of LLMs as potential models of human sentence parsing. The experiments highlight that while current LLMs like LLaMA-2 and RoBERTa can align with human-like comprehension patterns, especially with extra-syntactic cues, there are areas for improvement. Future research should expand the dataset of garden-path sentences, include a broader range of LLMs, and investigate emergent abilities in larger models. Additionally, examining a more extensive array of syntactic ambiguities could provide a more comprehensive understanding of these models' cognitive alignment.

In conclusion, the study underscores that LLMs exhibit promising capabilities in processing complex linguistic ambiguities similarly to humans, with certain models demonstrating notable sensitivity to syntactic cues. These results bolster the potential of LLMs not only as engineering feats but also as robust scientific models of human sentence processing.
