Analysis of "Constructions are Revealed in Word Distributions"
Pre-trained language models (PLMs) are increasingly used to investigate and simulate facets of linguistic theory, particularly within the framework of Construction Grammar (CxG). In "Constructions are Revealed in Word Distributions," the authors hypothesize that constructions, as defined by CxG, are encoded in PLMs through statistical affinities observable in word distributions. The paper contributes substantially to our understanding of how different constructions surface in PLM outputs, offering insight into how well distributional models capture the nuances of linguistic constructions.
Hypothesis and Methods
The authors posit that constructions, form-meaning pairings acquired through linguistic exposure, can be reliably identified and analyzed by observing statistical affinities in the word distributions produced by PLMs. To test this hypothesis, they use RoBERTa, a bidirectional masked language model, and develop methods for examining both global and local affinities between words in sample sentences.
Two primary methods are employed: global affinity metrics, which measure how strongly the model predicts a word given its full sentential context, and local affinity metrics, which assess pairwise interactions between word positions using Jensen-Shannon divergence. Together, these metrics allow the authors to gather detailed insight into how the words of a construction interact syntactically and contextually.
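To make these two notions concrete, the following is a minimal sketch of how such metrics could be computed with RoBERTa. It assumes one plausible reading of the paper's description: global affinity as the masked-LM probability of the observed word at its position, and local affinity as the Jensen-Shannon divergence between the model's distribution at one position with and without another position masked. The function names and exact formulation are illustrative, not the authors' implementation.

```python
# Hedged sketch of "global" and "local" affinity with a masked LM.
# Assumption: global affinity ~ probability RoBERTa assigns to the observed
# word when its position is masked; local affinity ~ Jensen-Shannon divergence
# between the distribution at position i with and without position j masked.
import torch
from scipy.spatial.distance import jensenshannon
from transformers import AutoTokenizer, AutoModelForMaskedLM

tokenizer = AutoTokenizer.from_pretrained("roberta-base")
model = AutoModelForMaskedLM.from_pretrained("roberta-base")
model.eval()


def _distribution_at(input_ids, position):
    """Softmax distribution over the vocabulary at `position`."""
    with torch.no_grad():
        logits = model(input_ids=input_ids).logits
    return torch.softmax(logits[0, position], dim=-1)


def global_affinity(sentence, position):
    """Probability of the observed token at `position` when that position is masked."""
    input_ids = tokenizer(sentence, return_tensors="pt")["input_ids"].clone()
    original_token = input_ids[0, position].item()
    input_ids[0, position] = tokenizer.mask_token_id
    return _distribution_at(input_ids, position)[original_token].item()


def local_affinity(sentence, pos_i, pos_j):
    """JS divergence between the distribution at pos_i with pos_j intact vs. masked."""
    base_ids = tokenizer(sentence, return_tensors="pt")["input_ids"].clone()
    base_ids[0, pos_i] = tokenizer.mask_token_id        # always mask the target position
    perturbed_ids = base_ids.clone()
    perturbed_ids[0, pos_j] = tokenizer.mask_token_id   # additionally mask the context position
    p = _distribution_at(base_ids, pos_i).numpy()
    q = _distribution_at(perturbed_ids, pos_i).numpy()
    return jensenshannon(p, q) ** 2                     # squared distance = divergence
```

Positions here are token indices in RoBERTa's tokenized sequence (including the initial special token), so multi-token words would need additional handling in a fuller implementation.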
Key Findings
The paper presents evidence that RoBERTa can distinguish between several construction types that were previously difficult to differentiate. For instance, the model effectively separates Causal Excess Constructions (CEC) from Epistemic and Affective Adjective Phrases (EAP, AAP) based on the global affinity of certain key contextual words (e.g., "so"). These findings challenge earlier perceptions that PLMs might struggle with semantically distinct but superficially similar constructions.
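As an illustration of the kind of contrast reported for "so", one could compare its global affinity across a causal-excess sentence and an affective one using the helpers sketched above. The example sentences below are hypothetical, not drawn from the paper, and the comparison is only suggestive of the reported effect.

```python
# Illustrative comparison with hypothetical example sentences (not from the paper):
# a causal-excess use of "so" vs. an affective one.
def affinity_of_word(sentence, word):
    """Global affinity of the first token whose decoded form matches `word`."""
    ids = tokenizer(sentence, return_tensors="pt")["input_ids"][0]
    for pos, tok_id in enumerate(ids.tolist()):
        if tokenizer.decode([tok_id]).strip().lower() == word.lower():
            return global_affinity(sentence, pos)
    raise ValueError(f"'{word}' not found in: {sentence}")

cec = "The soup was so hot that she burned her tongue."   # causal excess
aap = "I am so happy that you could make it."             # affective reading
print("CEC:", affinity_of_word(cec, "so"))
print("AAP:", affinity_of_word(aap, "so"))
```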
Further analyses extend this approach to other construction types within the Construction Grammar Schematicity corpus (CoGS) and MAGPIE, a corpus of potentially idiomatic expressions. The results suggest that models like RoBERTa can reliably identify both fixed and schematic slots in diverse construction types, thereby capturing key syntactic and semantic properties of constructions.
Critical Evaluation and Implications
While the findings provide strong support for the distributional learning hypothesis and demonstrate that PLMs encode substantial constructional information, the authors note intrinsic limitations of their methods. Affinity scores alone cannot reveal every facet of a construction, because many different contextual interactions shape statistical affinity. This underscores the complexity of language and indicates that PLMs may encode constructions as partial signals rather than complete representations.
The implications of this study are twofold. Practically, it suggests that computational models can be valuable tools in linguistic analysis, enabling researchers to unearth nuanced patterns in language data efficiently. Theoretically, it raises intriguing questions about how language learners acquire constructions through exposure to statistical patterns and interactions.
Future Directions
Moving forward, there is ample opportunity to refine these methods so that they identify constructions with greater sensitivity and specificity. The paper itself suggests potential pathways, such as richer semantic tagging and exploring affinity interactions within other computational frameworks. Integrating these methods with other linguistic theories may also shed further light on the dynamics underlying construction grammar.
In conclusion, by elucidating how constructions manifest in PLMs, "Constructions are Revealed in Word Distributions" opens new avenues for computational linguistic research while reaffirming the critical role of distributional signals in language learning and processing. The robust findings and the methods developed here provide a foundation for subsequent work aiming to connect linguistic theory with computational representations more deeply.