
SemEval-2020 Task 1: Unsupervised Lexical Semantic Change Detection (2007.11464v2)

Published 22 Jul 2020 in cs.CL

Abstract: Lexical Semantic Change detection, i.e., the task of identifying words that change meaning over time, is a very active research area, with applications in NLP, lexicography, and linguistics. Evaluation is currently the most pressing problem in Lexical Semantic Change detection, as no gold standards are available to the community, which hinders progress. We present the results of the first shared task that addresses this gap by providing researchers with an evaluation framework and manually annotated, high-quality datasets for English, German, Latin, and Swedish. 33 teams submitted 186 systems, which were evaluated on two subtasks.

Citations (229)

Summary

  • The paper introduced a robust evaluation framework and manually annotated multilingual datasets for lexical semantic change detection.
  • Methodologies encompassed binary classification and ranking tasks across English, German, Latin, and Swedish with broad team participation.
  • Type-based (static) embeddings outperformed contextual token embeddings, highlighting open challenges and guiding future research on diachronic language analysis.

Analysis of SemEval-2020 Task 1: Unsupervised Lexical Semantic Change Detection

This paper details an effort to address a significant challenge in computational lexical semantic change (LSC) detection: the lack of standardized evaluation protocols and data. The researchers organized the first shared task on unsupervised LSC detection, SemEval-2020 Task 1, designed to give the community a foundation for standardized evaluation and for comparative analyses across computational models.

Evaluation Framework and Datasets

A major contribution of this paper is the introduction of a robust evaluation framework and the development of manually annotated multilingual datasets covering English, German, Latin, and Swedish, built from roughly 100,000 human judgments that together form a multilingual gold standard. Such an initiative is crucial, as earlier research was hard to compare due to differing evaluation methodologies. These datasets provide a consistent basis for further LSC studies across languages, including under-resourced ones.

Subtasks Definition

The paper defined two key subtasks for LSC detection (both output formats are sketched below):

  1. Binary Classification: Deciding, for each target word, whether it gained or lost senses between two time periods.
  2. Ranking: Ordering target words by the degree of their semantic change between the same two periods.

The tasks were designed to balance complexity and the practicality of manual annotation, thus enabling wide participation across varied model architectures.
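
To make the two output formats concrete, below is a minimal sketch in Python that derives both a binary label and a ranking from a single per-word change score. The words, scores, and threshold are hypothetical; the task does not prescribe how scores are produced.

```python
# Hypothetical change scores for target words (higher = more change).
# How scores are computed is up to each system; the shared task only
# fixes the two output formats derived below.
change_scores = {"plane": 0.82, "tip": 0.47, "ounce": 0.05, "fiction": 0.12}

# Subtask 1 (binary classification): flag words that gained or lost senses.
# The 0.3 threshold is an illustrative choice, not part of the task.
THRESHOLD = 0.3
binary_labels = {w: int(s > THRESHOLD) for w, s in change_scores.items()}

# Subtask 2 (ranking): order words by degree of semantic change.
ranking = sorted(change_scores, key=change_scores.get, reverse=True)

print(binary_labels)  # {'plane': 1, 'tip': 1, 'ounce': 0, 'fiction': 0}
print(ranking)        # ['plane', 'tip', 'fiction', 'ounce']
```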

Methodologies and Results

The paper attracted significant attention, resulting in 33 teams submitting 186 systems for evaluation. System performance was assessed using both accuracy for the binary classification task and Spearman's rank-order correlation coefficient for the ranking task. The performance was notably varied across languages, underlining the inherent complexity and differing characteristics of each linguistic dataset.
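
As a sketch of how these two metrics are computed (using scipy; the gold labels and scores below are invented for illustration):

```python
from scipy.stats import spearmanr

# Invented gold data and system predictions for four target words.
gold_binary = [1, 0, 1, 0]          # Subtask 1 gold labels
pred_binary = [1, 0, 0, 0]          # system's binary predictions
gold_graded = [0.9, 0.1, 0.7, 0.2]  # Subtask 2 gold change scores
pred_graded = [0.8, 0.3, 0.4, 0.1]  # system's change scores

# Subtask 1: plain classification accuracy.
accuracy = sum(g == p for g, p in zip(gold_binary, pred_binary)) / len(gold_binary)

# Subtask 2: Spearman's rank-order correlation between gold and predicted scores.
rho, p_value = spearmanr(gold_graded, pred_graded)

print(f"accuracy = {accuracy:.2f}, Spearman's rho = {rho:.2f}")
```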

Notably, models employing type-based (static) embeddings yielded superior results compared to contextual token embeddings, a somewhat unexpected finding given the latter's reputation for outperforming traditional methods in other NLP tasks. This disparity raises questions about the readiness of token embeddings for capturing diachronic change: these models may not yet be adapted to the task, or the corpus preprocessing may have been suboptimal for token-based models.
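
A common type-embedding pipeline of this kind trains one static embedding space per time period, aligns the spaces with orthogonal Procrustes, and scores each word by the cosine distance between its two aligned vectors. The sketch below uses random matrices as stand-ins for embeddings trained on the two corpora; it illustrates the technique, not any specific submission.

```python
import numpy as np
from scipy.linalg import orthogonal_procrustes

rng = np.random.default_rng(0)
vocab = ["plane", "tip", "ounce"]

# Stand-ins for type embeddings trained separately on the early and
# late corpora (in practice, e.g., SGNS vectors over a shared vocabulary).
emb_early = rng.normal(size=(len(vocab), 100))
emb_late = rng.normal(size=(len(vocab), 100))

# Independently trained spaces are only comparable up to rotation, so
# find the orthogonal map R that best aligns the late space to the early one.
R, _ = orthogonal_procrustes(emb_late, emb_early)
emb_late_aligned = emb_late @ R

def cosine_distance(u, v):
    return 1.0 - (u @ v) / (np.linalg.norm(u) * np.linalg.norm(v))

# Change score per word: cosine distance between its two aligned vectors.
scores = {w: cosine_distance(emb_early[i], emb_late_aligned[i])
          for i, w in enumerate(vocab)}
print(scores)
```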

The paper also highlights significant challenges in distinguishing genuinely strong models from confounded ones, chiefly frequency bias and polysemy. Word frequency was controlled for in the selection of target words, yet some models still displayed a strong bias toward frequency changes rather than semantic changes. Correlations with polysemy likewise persisted in the results, emphasizing the need for continued study of embeddings' sensitivity to meaning conflation.
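
To see why frequency is a confound, consider a pure frequency baseline (the counts below are invented): it ranks words by the change in their relative frequency while being entirely blind to meaning. A model whose rankings correlate strongly with such a baseline is tracking frequency, not semantics.

```python
import math

# Invented occurrence counts in the early and late corpora.
counts_early = {"plane": 120, "tip": 300, "ounce": 500}
counts_late = {"plane": 900, "tip": 310, "ounce": 90}
total_early = sum(counts_early.values())
total_late = sum(counts_late.values())

# Score = absolute log-ratio of relative frequencies: purely
# distributional and blind to meaning.
def freq_change(word):
    rel_early = counts_early[word] / total_early
    rel_late = counts_late[word] / total_late
    return abs(math.log(rel_late / rel_early))

baseline_ranking = sorted(counts_early, key=freq_change, reverse=True)
print(baseline_ranking)  # ['ounce', 'plane', 'tip']
```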

Implications and Future Work

This initiative marks a valuable step toward a standardized evaluative approach in lexical semantic change detection. By releasing dataset annotations and structuring a framework for subsequent studies, the research supports the community in transcending previous methodological limitations.

However, the variability of results across languages and the unexpectedly strong performance of type-based embeddings underscore the importance of addressing model and task suitability in future work. Further investigation into adapting token embeddings for diachronic LSC detection, the influence of corpus preprocessing, and the development of frequency-independent models will likely be key directions for progress.

In conclusion, SemEval-2020 Task 1 establishes a thorough evaluation paradigm and extends the scope of LSC detection beyond the traditionally dominant languages. The released framework and datasets will undoubtedly inspire future research, push the boundaries of our understanding of language evolution, and foster improved methodologies for lexical analysis across temporal corpora.