Probing Large Language Models for Scalar Adjective Lexical Semantics and Scalar Diversity Pragmatics (2404.03301v1)

Published 4 Apr 2024 in cs.CL

Abstract: Scalar adjectives pertain to various domain scales and vary in intensity within each scale (e.g. certain is more intense than likely on the likelihood scale). Scalar implicatures arise from the consideration of alternative statements which could have been made. They can be triggered by scalar adjectives and require listeners to reason pragmatically about them. Some scalar adjectives are more likely to trigger scalar implicatures than others. This phenomenon is referred to as scalar diversity. In this study, we probe different families of LLMs such as GPT-4 for their knowledge of the lexical semantics of scalar adjectives and one specific aspect of their pragmatics, namely scalar diversity. We find that they encode rich lexical-semantic information about scalar adjectives. However, the rich lexical-semantic knowledge does not entail a good understanding of scalar diversity. We also compare current models of different sizes and complexities and find that larger models are not always better. Finally, we explain our probing results by leveraging linguistic intuitions and model training objectives.

Summary

  • The paper demonstrates that LLMs robustly encode scalar adjective lexical semantics using novel probing methods.
  • The paper finds that LLMs struggle with scalar diversity pragmatic reasoning despite strong semantic encoding.
  • The paper highlights that model size and architecture have non-linear effects on performance in both semantic and pragmatic tasks.

Probing LLMs for Scalar Adjective Lexical Semantics and Scalar Diversity Pragmatics

Introduction to the Study

Exploring the intricacies of scalar adjectives (SAs) and their role in scalar implicature (SI) offers a nuanced avenue for evaluating large language models' (LLMs) semantic and pragmatic understanding. The paper examines LLMs' grasp of the lexical semantics of scalar adjectives and their ability to discern scalar diversity, the phenomenon whereby some scalar adjectives are more likely than others to trigger scalar implicatures. By probing an array of LLMs, including GPT-4, with novel methods, the research offers insights into the lexical-semantic information encoded in these models and into their pragmatic reasoning about scalar diversity.

Methodology Overview

Probing Lexical Semantics

The paper takes a novel approach to probing LLMs' understanding of SA lexical semantics, focusing on two primary aspects: scale membership and adjective intensity. Using three SA datasets, the researchers assess models across different architectures and sizes, including the BERT and RoBERTa families, to show how these factors influence lexical-semantic knowledge. To evaluate scale membership, the paper devises direct and indirect probing methods that leverage scale vectors derived from contextualized word embeddings. Scalar intensity probing, by contrast, tests models' capacity to recognize intensity differences among SAs on the same scale, using direct comparisons as well as indirect methods based on perplexity measurements over minimal-pair prompts; a sketch of the indirect method follows.
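
To make the indirect intensity probe concrete, here is a minimal sketch that scores a weak-then-strong versus strong-then-weak minimal pair with a masked LM, using pseudo-log-likelihood in the style of masked LM scoring (Salazar et al., 2020). The template and the adjective pair are illustrative assumptions, not the paper's exact materials.

```python
import torch
from transformers import AutoModelForMaskedLM, AutoTokenizer

tokenizer = AutoTokenizer.from_pretrained("bert-base-uncased")
model = AutoModelForMaskedLM.from_pretrained("bert-base-uncased")
model.eval()

def pseudo_log_likelihood(sentence: str) -> float:
    """Sum the log-probability of each token when it alone is masked.
    Higher values mean the model finds the sentence less surprising."""
    input_ids = tokenizer(sentence, return_tensors="pt")["input_ids"][0]
    total = 0.0
    for i in range(1, input_ids.size(0) - 1):  # skip [CLS] and [SEP]
        masked = input_ids.clone().unsqueeze(0)
        masked[0, i] = tokenizer.mask_token_id
        with torch.no_grad():
            logits = model(input_ids=masked).logits
        log_probs = torch.log_softmax(logits[0, i], dim=-1)
        total += log_probs[input_ids[i]].item()
    return total

# Minimal pair: a model that encodes scalar intensity should prefer the
# weak-then-strong ordering ("warm, in fact hot") over the reverse.
weak_then_strong = "The water was warm, in fact it was hot."
strong_then_weak = "The water was hot, in fact it was warm."
print(pseudo_log_likelihood(weak_then_strong) > pseudo_log_likelihood(strong_then_weak))
```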

Assessing Scalar Diversity Pragmatics

Scalar diversity pragmatics, indicative of an LLM's ability to reason pragmatically about scalar implicature, is gauged in naturalistic probing settings. This part of the study examines whether lexical-semantic comprehension of SAs correlates with proficiency in drawing pragmatic inferences related to scalar diversity. The analysis debiases models for their inherent answer preferences and uses neutral prompts to evaluate scalar diversity reasoning across various LLMs; a sketch of this calibration step follows.
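
One plausible form of the debiasing step, in the spirit of contextual calibration ("Calibrate Before Use", Zhao et al., 2021): estimate the model's prior preference for each answer option on a content-free prompt and divide it out of the scores for the real item. The prompt wording and the GPT-2 stand-in model below are illustrative assumptions, not the paper's exact setup.

```python
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

tokenizer = AutoTokenizer.from_pretrained("gpt2")
model = AutoModelForCausalLM.from_pretrained("gpt2")
model.eval()

def answer_probs(prompt: str) -> torch.Tensor:
    """Probability mass the model puts on ' Yes' vs ' No' as the next token."""
    ids = tokenizer(prompt, return_tensors="pt")["input_ids"]
    with torch.no_grad():
        logits = model(input_ids=ids).logits[0, -1]
    probs = torch.softmax(logits, dim=-1)
    yes_id = tokenizer(" Yes")["input_ids"][0]
    no_id = tokenizer(" No")["input_ids"][0]
    p = torch.tensor([probs[yes_id], probs[no_id]])
    return p / p.sum()

# Illustrative scalar-diversity item: does "warm" implicate "not hot"?
item = ("Mary says the soup is warm. "
        "Would you conclude that the soup is not hot? Answer Yes or No:")
# Content-free prompt used to estimate the model's inherent Yes/No bias.
neutral = "N/A. Answer Yes or No:"

p_item = answer_probs(item)
p_bias = answer_probs(neutral)

# Contextual calibration: divide out the prior preference, renormalize.
debiased = p_item / p_bias
debiased = debiased / debiased.sum()
print({"Yes": debiased[0].item(), "No": debiased[1].item()})
```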

Key Findings

  1. Lexical-Semantic Knowledge: Across LLMs, a rich encoding of lexical-semantic information about SAs was observed. The paper reports nuanced findings on scale membership and adjective intensity, with models generally showing a strong grasp of these concepts, though accuracy varied with architecture and size.
  2. Scalar Diversity Reasoning: Pragmatic reasoning about scalar diversity proved challenging. Despite their rich lexical-semantic knowledge of SAs, LLMs showed clear limitations in reasoning about scalar diversity. Among the tested models, Flan-T5 performed best on scalar diversity tasks, outperforming even GPT-4, which made relatively conservative implicature judgments (see the evaluation sketch after this list).
  3. Model Size and Architecture: The paper also highlights the non-linear relationship between a model's size and its performance on lexical-semantic and pragmatic tasks. Larger models did not invariably perform better; architectural differences and specific training objectives played significant roles.
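
As a rough illustration of how model judgments might be compared against human scalar diversity data, the sketch below correlates per-scale implicature endorsement rates. Every number here is invented for illustration; an actual comparison would use human measurements such as those of van Tiel et al. (2016).

```python
# Hypothetical per-scale implicature endorsement rates (invented numbers):
# the fraction of items on each scale where "X" was judged to implicate
# "not the stronger Y", for humans and for a probed model.
from scipy.stats import spearmanr

human_rates = {"warm/hot": 0.72, "possible/certain": 0.88,
               "good/excellent": 0.55, "intelligent/brilliant": 0.31}
model_rates = {"warm/hot": 0.60, "possible/certain": 0.70,
               "good/excellent": 0.52, "intelligent/brilliant": 0.45}

scales = sorted(human_rates)
rho, p = spearmanr([human_rates[s] for s in scales],
                   [model_rates[s] for s in scales])
print(f"Spearman rho = {rho:.2f} (p = {p:.2f})")
```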

Implications and Future Directions

This investigation into LLMs' comprehension of SA lexical semantics and scalar diversity pragmatics unveils critical insights into the semantic and pragmatic dimensions of language understanding by these models. The differential performance across tasks underscores the necessity of nuanced approaches in the development and evaluation of LLMs, especially concerning their pragmatic reasoning capabilities.

The findings prompt further inquiry into the mechanisms LLMs employ to comprehend and generate language, suggesting that improving models' pragmatic reasoning may require more than merely scaling model size. Future research could explore more sophisticated methods and training paradigms that foster deeper pragmatic understanding in LLMs, potentially bridging the gap between semantic knowledge and pragmatic inference.

The paper's exploration of LLMs through the lens of scalar adjectives and implicature introduces a novel paradigm for evaluating and enhancing LLMs' language understanding capabilities, laying the groundwork for future advances in AI language comprehension and generation.
