Probing Large Language Models for Scalar Adjective Lexical Semantics and Scalar Diversity Pragmatics (2404.03301v1)

Published 4 Apr 2024 in cs.CL

Abstract: Scalar adjectives pertain to various domain scales and vary in intensity within each scale (e.g. certain is more intense than likely on the likelihood scale). Scalar implicatures arise from the consideration of alternative statements which could have been made. They can be triggered by scalar adjectives and require listeners to reason pragmatically about them. Some scalar adjectives are more likely to trigger scalar implicatures than others. This phenomenon is referred to as scalar diversity. In this study, we probe different families of LLMs such as GPT-4 for their knowledge of the lexical semantics of scalar adjectives and one specific aspect of their pragmatics, namely scalar diversity. We find that they encode rich lexical-semantic information about scalar adjectives. However, the rich lexical-semantic knowledge does not entail a good understanding of scalar diversity. We also compare current models of different sizes and complexities and find that larger models are not always better. Finally, we explain our probing results by leveraging linguistic intuitions and model training objectives.

Summary

  • The paper demonstrates that LLMs robustly encode scalar adjective lexical semantics using novel probing methods.
  • The paper finds that LLMs struggle with scalar diversity pragmatic reasoning despite strong semantic encoding.
  • The paper highlights that model size and architecture have non-linear effects on performance in both semantic and pragmatic tasks.

Probing LLMs for Scalar Adjective Lexical Semantics and Scalar Diversity Pragmatics

Introduction to the Study

Exploring the intricacies of scalar adjectives (SAs) and their role in scalar implicature (SI) offers a nuanced avenue for evaluating large language models' (LLMs) semantic and pragmatic understanding. The paper examines LLMs' grasp of the lexical semantics of scalar adjectives and their ability to discern scalar diversity, the phenomenon whereby some scalar adjectives are more likely than others to trigger scalar implicatures. By probing an array of LLMs, including GPT-4, with novel methods, the research offers insights into the lexical-semantic information encoded in these models and into their pragmatic reasoning about scalar diversity.

Methodology Overview

Probing Lexical Semantics

The paper takes a novel approach to probing LLMs' understanding of SA lexical semantics, focusing on two primary aspects: scale membership and adjective intensity. Using three SA datasets, the researchers assess models across different architectures and sizes, including the BERT and RoBERTa families, to show how these factors influence lexical-semantic knowledge. To evaluate scale membership, the paper devises direct and indirect probing methods that leverage scale vectors derived from contextualized word embeddings. Scalar intensity probing, by contrast, tests models' capacity to recognize intensity differences among SAs on the same scale, using direct comparisons as well as indirect methods based on perplexity measurements over minimal-pair prompts; a sketch of the indirect method follows.
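
To make the indirect intensity probe concrete, here is a minimal sketch that scores a weak-then-strong versus strong-then-weak minimal pair with a masked LM, using pseudo-log-likelihood in the style of masked LM scoring (Salazar et al., 2020). The template and the adjective pair are illustrative assumptions, not the paper's exact materials.

```python
import torch
from transformers import AutoModelForMaskedLM, AutoTokenizer

tokenizer = AutoTokenizer.from_pretrained("bert-base-uncased")
model = AutoModelForMaskedLM.from_pretrained("bert-base-uncased")
model.eval()

def pseudo_log_likelihood(sentence: str) -> float:
    """Sum the log-probability of each token when it alone is masked.
    Higher values mean the model finds the sentence less surprising."""
    input_ids = tokenizer(sentence, return_tensors="pt")["input_ids"][0]
    total = 0.0
    for i in range(1, input_ids.size(0) - 1):  # skip [CLS] and [SEP]
        masked = input_ids.clone().unsqueeze(0)
        masked[0, i] = tokenizer.mask_token_id
        with torch.no_grad():
            logits = model(input_ids=masked).logits
        log_probs = torch.log_softmax(logits[0, i], dim=-1)
        total += log_probs[input_ids[i]].item()
    return total

# Minimal pair: a model that encodes scalar intensity should prefer the
# weak-then-strong ordering ("warm, in fact hot") over the reverse.
weak_then_strong = "The water was warm, in fact it was hot."
strong_then_weak = "The water was hot, in fact it was warm."
print(pseudo_log_likelihood(weak_then_strong) > pseudo_log_likelihood(strong_then_weak))
```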

Assessing Scalar Diversity Pragmatics

Scalar diversity pragmatics, indicative of an LLM's ability to reason pragmatically about scalar implicature, is gauged in naturalistic probing settings. This part of the study examines whether lexical-semantic comprehension of SAs correlates with proficiency in drawing pragmatic inferences related to scalar diversity. The analysis debiases models for their inherent answer preferences and uses neutral prompts to evaluate scalar diversity reasoning across various LLMs; a sketch of this calibration step follows.
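
One plausible form of the debiasing step, in the spirit of contextual calibration ("Calibrate Before Use", Zhao et al., 2021): estimate the model's prior preference for each answer option on a content-free prompt and divide it out of the scores for the real item. The prompt wording and the GPT-2 stand-in model below are illustrative assumptions, not the paper's exact setup.

```python
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

tokenizer = AutoTokenizer.from_pretrained("gpt2")
model = AutoModelForCausalLM.from_pretrained("gpt2")
model.eval()

def answer_probs(prompt: str) -> torch.Tensor:
    """Probability mass the model puts on ' Yes' vs ' No' as the next token."""
    ids = tokenizer(prompt, return_tensors="pt")["input_ids"]
    with torch.no_grad():
        logits = model(input_ids=ids).logits[0, -1]
    probs = torch.softmax(logits, dim=-1)
    yes_id = tokenizer(" Yes")["input_ids"][0]
    no_id = tokenizer(" No")["input_ids"][0]
    p = torch.tensor([probs[yes_id], probs[no_id]])
    return p / p.sum()

# Illustrative scalar-diversity item: does "warm" implicate "not hot"?
item = ("Mary says the soup is warm. "
        "Would you conclude that the soup is not hot? Answer Yes or No:")
# Content-free prompt used to estimate the model's inherent Yes/No bias.
neutral = "N/A. Answer Yes or No:"

p_item = answer_probs(item)
p_bias = answer_probs(neutral)

# Contextual calibration: divide out the prior preference, renormalize.
debiased = p_item / p_bias
debiased = debiased / debiased.sum()
print({"Yes": debiased[0].item(), "No": debiased[1].item()})
```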

Key Findings

  1. Lexical-Semantic Knowledge: Across LLMs, a rich encoding of lexical-semantic information about SAs was observed. The paper reports nuanced findings on scale membership and adjective intensity, with models generally showing a strong grasp of these concepts, though accuracy varied with architecture and size.
  2. Scalar Diversity Reasoning: Pragmatic reasoning about scalar diversity proved challenging. Despite their rich lexical-semantic knowledge of SAs, LLMs showed clear limitations in reasoning about scalar diversity. Among the tested models, Flan-T5 performed best on scalar diversity tasks, outperforming even GPT-4, which made relatively conservative implicature judgments (see the evaluation sketch after this list).
  3. Model Size and Architecture: The paper also highlights the non-linear relationship between a model's size and its performance on lexical-semantic and pragmatic tasks. Larger models did not invariably perform better; architectural differences and specific training objectives played significant roles.
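
As a rough illustration of how model judgments might be compared against human scalar diversity data, the sketch below correlates per-scale implicature endorsement rates. Every number here is invented for illustration; an actual comparison would use human measurements such as those of van Tiel et al. (2016).

```python
# Hypothetical per-scale implicature endorsement rates (invented numbers):
# the fraction of items on each scale where "X" was judged to implicate
# "not the stronger Y", for humans and for a probed model.
from scipy.stats import spearmanr

human_rates = {"warm/hot": 0.72, "possible/certain": 0.88,
               "good/excellent": 0.55, "intelligent/brilliant": 0.31}
model_rates = {"warm/hot": 0.60, "possible/certain": 0.70,
               "good/excellent": 0.52, "intelligent/brilliant": 0.45}

scales = sorted(human_rates)
rho, p = spearmanr([human_rates[s] for s in scales],
                   [model_rates[s] for s in scales])
print(f"Spearman rho = {rho:.2f} (p = {p:.2f})")
```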

Implications and Future Directions

This investigation into LLMs' comprehension of SA lexical semantics and scalar diversity pragmatics unveils critical insights into the semantic and pragmatic dimensions of language understanding by these models. The differential performance across tasks underscores the necessity of nuanced approaches in the development and evaluation of LLMs, especially concerning their pragmatic reasoning capabilities.

The findings prompt further inquiry into the mechanisms LLMs employ to comprehend and generate language, suggesting that improving models' pragmatic reasoning may require more than merely scaling model size. Future research could explore more sophisticated methods and training paradigms that foster deeper pragmatic understanding in LLMs, potentially bridging the gap between semantic knowledge and pragmatic inference.

The paper's exploration of LLMs through the lens of scalar adjectives and implicature introduces a novel paradigm for evaluating and enhancing LLMs' language understanding capabilities, laying the groundwork for future advances in AI language comprehension and generation.
