Lexical Semantic Change Detection

Updated 21 November 2025
  • Lexical Semantic Change Detection is the computational identification of evolving word meanings by comparing temporally distinct corpora.
  • It employs diverse methods, including type-based embeddings, contextualized models, clustering, and statistical metrics to measure semantic shifts.
  • Empirical benchmarks and hybrid approaches enhance its interpretability and accuracy, benefiting historical linguistics, lexicography, and NLP.

Lexical Semantic Change Detection (LSCD) is the computational identification and quantification of how the meanings of words evolve across time or between domains. LSCD has direct implications in historical linguistics, lexicography, digital humanities, NLP, and language documentation. The field is methodologically diverse, comprising a wide range of distributional, contextualized, and interpretable approaches. This survey synthesizes rigorous computational models, benchmark evaluations, and the theoretical underpinnings of LSCD, focusing on recent advances, empirical results, and open challenges.

1. Theoretical Foundations and Task Formulation

LSCD is defined as the data-driven detection of words whose meanings or senses have shifted, emerged, or disappeared over time by comparing their distributions in two temporally distinct corpora. Formally, given corpora $C_1$ (early) and $C_2$ (late) and a lexicon $W$, the LSCD task requires models to produce, for each target $w \in W$, either a binary change score $B(w) \in \{0,1\}$ or a graded change measure $G(w) \in [0,1]$ (Schlechtweg et al., 2020, Schlechtweg et al., 2019).

LSCD research is anchored by several foundational principles:

  • The Law of Differentiation (LD): Synonymous words tend to diverge in meaning over time or one disappears.
  • The Law of Parallel Change (LPC): Synonyms tend to undergo the same shifts and remain synonymous through time (Liétard et al., 2023).

Two core subproblems recur:

  1. Unsupervised Change Detection: No time-aligned supervision is available. Models infer change directly from raw corpora.
  2. Instance-Level vs. Corpus-Level: Detection can be aimed at aggregated word-level drift or fine-grained instance/sense-level changes (Wang et al., 2023, Kishino et al., 17 Dec 2024).

Benchmarking is underpinned by large-scale, gold-annotated datasets (e.g., SemEval-2020 Task 1: English, German, Latin, Swedish) with manual sense-clustering, relatedness rating, and change annotation (Schlechtweg et al., 2020).

2. Modeling Methodologies

2.1 Type-Based Embedding Approaches

Predictive Embeddings (SGNS, fastText): The canonical approach involves training separate Skip-Gram Negative Sampling (SGNS) or subword embeddings for $C_1$ and $C_2$. Spaces are aligned via orthogonal Procrustes or Canonical Correlation Analysis to correct coordinate-system drift (Pražák et al., 2020, Schlechtweg et al., 2020).
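
The alignment step can be sketched directly. The following is an illustrative NumPy implementation of orthogonal Procrustes, not any cited system's code; the matrices are random stand-ins for trained SGNS embedding spaces.

```python
import numpy as np

def procrustes_align(X1, X2):
    """Orthogonal matrix R minimizing ||X1 @ R - X2||_F (Procrustes)."""
    # SVD of the cross-covariance between the two (row-aligned) spaces.
    U, _, Vt = np.linalg.svd(X1.T @ X2)
    return U @ Vt

# Synthetic check: if the "early" space is an exact rotation of the
# "late" space, alignment should recover the late space perfectly.
rng = np.random.default_rng(0)
X2 = rng.normal(size=(100, 50))                # late-period embeddings
Q, _ = np.linalg.qr(rng.normal(size=(50, 50)))
X1 = X2 @ Q.T                                  # early space = rotated copy
R = procrustes_align(X1, X2)
print(np.allclose(X1 @ R, X2))                 # rotation fully recovered
```

In practice the rows of `X1` and `X2` would be restricted to shared, semantically stable landmark words before computing the alignment.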

Change Metrics:

  • Cosine Distance (CD): $1 - \cos(\mathbf{v}_{w,1}, \mathbf{v}_{w,2})$.
  • Mapped Neighborhood Distance (MAP): Second-order distance based on shifts in nearest neighbors (Gruppi et al., 2020).
  • Frequency Differential: $|f_1(w) - f_2(w)| / (f_1(w) + f_2(w))$.
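
The first and third metrics above are one-liners; here is a toy NumPy illustration (the vectors and frequency counts are invented, not drawn from a real diachronic corpus):

```python
import numpy as np

def cosine_distance(v1, v2):
    """CD: 1 - cos(v1, v2); higher means more change."""
    return 1.0 - np.dot(v1, v2) / (np.linalg.norm(v1) * np.linalg.norm(v2))

def frequency_differential(f1, f2):
    """|f1 - f2| / (f1 + f2), bounded in [0, 1]."""
    return abs(f1 - f2) / (f1 + f2)

v_early = np.array([1.0, 0.0, 0.0])   # aligned vector in period 1
v_late = np.array([0.0, 1.0, 0.0])    # aligned vector in period 2
print(cosine_distance(v_early, v_late))   # 1.0 (orthogonal usage)
print(frequency_differential(120, 40))    # 0.5
```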

Profile Divergence (EmbLexChange): KL divergence between softmax-normalized similarity profiles to stable pivot words, resampled for robustness and normalized via empirical bounds (Asgari et al., 2020).
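
A minimal sketch of this profile-divergence idea, with random stand-in vectors for the target word and pivot set (not the EmbLexChange implementation):

```python
import numpy as np

def similarity_profile(word_vec, pivot_vecs):
    """Softmax-normalized cosine similarities of a word to fixed pivots."""
    sims = pivot_vecs @ word_vec / (
        np.linalg.norm(pivot_vecs, axis=1) * np.linalg.norm(word_vec))
    e = np.exp(sims - sims.max())
    return e / e.sum()

def kl_divergence(p, q):
    return float(np.sum(p * np.log(p / q)))

rng = np.random.default_rng(1)
pivots = rng.normal(size=(20, 50))   # shared, semantically stable pivot words
p = similarity_profile(rng.normal(size=50), pivots)   # early-period profile
q = similarity_profile(rng.normal(size=50), pivots)   # late-period profile
print(kl_divergence(p, p))           # identical profiles: 0.0
print(kl_divergence(p, q) > 0)       # differing profiles: positive divergence
```

The cited method additionally resamples the pivot set for robustness and normalizes the divergence against empirical bounds.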

2.2 Contextualized Embedding and Clustering

Token-Based Models: BERT, XLM-R, and similar models provide per-token contextualized embeddings, enabling sense-differentiation. Typical approaches involve:

  • Clustering usage embeddings to infer sense distributions (K-means, HDBSCAN), then comparing sense-frequency distributions across periods via Jensen–Shannon divergence or entropy difference (Giulianelli et al., 2020).
  • Calculating average pairwise or prototype distances between usage vectors from each time period (Giulianelli et al., 2022).
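
A compact sketch of the clustering route, assuming scikit-learn and SciPy are available; random vectors stand in for BERT usage embeddings, and the second period injects a well-separated cluster to mimic an emerging sense:

```python
import numpy as np
from scipy.spatial.distance import jensenshannon
from sklearn.cluster import KMeans

rng = np.random.default_rng(2)
early = rng.normal(loc=0.0, size=(80, 16))               # period-1 usages
late = np.vstack([rng.normal(loc=0.0, size=(40, 16)),
                  rng.normal(loc=6.0, size=(40, 16))])   # new sense appears

# Cluster all usages jointly, then compare per-period sense frequencies.
labels = KMeans(n_clusters=2, n_init=10, random_state=0).fit_predict(
    np.vstack([early, late]))

def sense_distribution(lbls, k=2):
    counts = np.bincount(lbls, minlength=k).astype(float)
    return counts / counts.sum()

p = sense_distribution(labels[:80])
q = sense_distribution(labels[80:])
jsd = jensenshannon(p, q) ** 2       # scipy returns the square root of JSD
print(round(jsd, 3))                 # higher = larger sense-frequency shift
```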

Lexical Substitution: Masked LLMs (e.g., XLM-R large) generate "bags-of-substitutes" per occurrence, constructing interpretable sense vectors directly from model predictions; gain/loss of senses is detected by inspecting period-specific substitute distributions (Kudisov et al., 2022).
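
The interpretability of this approach is easy to illustrate. The hand-written substitute bags below (for a word like "cell") are invented examples, not masked-LM outputs:

```python
from collections import Counter

def substitute_distribution(bags):
    """Relative frequency of each substitute across all occurrences."""
    counts = Counter(s for bag in bags for s in bag)
    total = sum(counts.values())
    return {w: c / total for w, c in counts.items()}

early = [["chamber", "room"], ["room", "cage"]]      # "cell", 19th c.
late = [["phone", "mobile"], ["phone", "chamber"]]   # "cell", 21st c.

p = substitute_distribution(early)
q = substitute_distribution(late)
gained = sorted(set(q) - set(p))   # substitutes appearing only late
lost = sorted(set(p) - set(q))     # substitutes appearing only early
print(gained)   # ['mobile', 'phone']
print(lost)     # ['cage', 'room']
```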

2.3 Metric Learning and Sense-Aware Methods

Sense-Aware Metric Learning (SDML): Supervised fine-tuning on Word-in-Context (WiC) data yields sense-sensitive encoders. Afterwards, Mahalanobis metrics are learned to maximize separation between representations of different senses; change is quantified by measuring mean pairwise metric distance across all period-specific occurrences (Aida et al., 1 Mar 2024).
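
A hedged sketch of the final scoring step only: an identity matrix stands in for a learned Mahalanobis metric and random vectors for period-specific occurrences (the WiC fine-tuning and metric-learning stages are omitted).

```python
import numpy as np

def mahalanobis(x, y, M):
    """sqrt((x - y)^T M (x - y)) for a positive-definite metric M."""
    d = x - y
    return float(np.sqrt(d @ M @ d))

def change_score(X1, X2, M):
    """Mean pairwise metric distance across period-specific occurrences."""
    return float(np.mean([mahalanobis(x, y, M) for x in X1 for y in X2]))

rng = np.random.default_rng(5)
M = np.eye(8)                        # a learned M would stretch sense-relevant axes
X1 = rng.normal(0.0, 1.0, (10, 8))   # early occurrences of the target
X2 = rng.normal(3.0, 1.0, (10, 8))   # late occurrences, shifted in meaning
print(change_score(X1, X2, M) > change_score(X1, X1, M))   # shifted word scores higher
```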

2.4 Bayesian and Topic Models

Dynamic Bayesian Mixture Models (GASC): Assigns latent senses with genre-aware temporal smoothing and infers genre-conditioned sense trajectories, enabling detection and interpretation of sense appearance and disappearance (Perrone et al., 2021).

2.5 Optimal Transport & Instance-Level Alignment

Unbalanced Optimal Transport (UOT): Models instance-level semantic change by aligning sets of contextualized embeddings; tracks creation or disappearance of usage clusters and computes Sense Usage Shift (SUS) per instance (Kishino et al., 17 Dec 2024).

3. Benchmarking, Evaluation, and Empirical Results

Shared Tasks: The field is shaped by shared evaluations with robust manual gold standards. SemEval-2020 Task 1 is the principal benchmark, providing gold-annotated binary and graded change scores for English, German, Latin, and Swedish (Schlechtweg et al., 2020).

Performance Milestones:

  • Type-Based Models (SGNS+Procrustes): Achieve consistently high accuracy in binary change detection (max accuracy ~0.75–0.81 on German, ~0.62 English; Spearman’s ρ ~0.85 on German) (Ahmad et al., 2020, Pražák et al., 2020, Schlechtweg et al., 2019).
  • Clustering/Contextualized Models: Provide additional interpretability and enable instance-level change analysis but occasionally underperform static approaches on small datasets (Giulianelli et al., 2020, Giulianelli et al., 2022).
  • Ensembles (SChME): Combining multiple signals (CD, MAP, FREQ) improves robustness; alignment quality (landmark selection) is critical and language-dependent (Gruppi et al., 2020).
  • Sense-Aware Metric Learning (SDML): Sets new state-of-the-art on several SCD splits (ρ up to 0.90 on German; 0.77 English) (Aida et al., 1 Mar 2024).

Statistical Significance: Permutation-based hypothesis testing combined with false discovery rate correction provides rigorous control over false positives, especially for low-frequency words and small corpora (Liu et al., 2021).
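
A minimal, self-contained version of this testing recipe (the permutation statistic here is a simple difference of means over synthetic change scores, not the cited paper's exact statistic):

```python
import numpy as np

def permutation_pvalue(x1, x2, n_perm=2000, seed=0):
    """Two-sided permutation test on the difference of sample means."""
    rng = np.random.default_rng(seed)
    observed = abs(x1.mean() - x2.mean())
    pooled = np.concatenate([x1, x2])
    hits = 0
    for _ in range(n_perm):
        rng.shuffle(pooled)
        hits += abs(pooled[:len(x1)].mean() - pooled[len(x1):].mean()) >= observed
    return (hits + 1) / (n_perm + 1)

def benjamini_hochberg(pvals, alpha=0.05):
    """Boolean mask of hypotheses rejected under BH false-discovery control."""
    p = np.asarray(pvals)
    order = np.argsort(p)
    below = p[order] <= alpha * (np.arange(len(p)) + 1) / len(p)
    k = below.nonzero()[0].max() + 1 if below.any() else 0
    reject = np.zeros(len(p), dtype=bool)
    reject[order[:k]] = True
    return reject

rng = np.random.default_rng(3)
p_changed = permutation_pvalue(rng.normal(0, 1, 50), rng.normal(2, 1, 50))
p_stable = permutation_pvalue(rng.normal(0, 1, 50), rng.normal(0, 1, 50))
print(benjamini_hochberg([p_changed, p_stable])[0])   # clear shift survives FDR
```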

4. Interpretability, Sense Resolution, and Explainability

Interpretability remains a central concern; advances include:

  • Definition Generation: Fine-tuned LLMs generate definitions per usage, clustered via string-edit distance, with sense-frequency distributions compared by JSD. This achieves competitive or superior ranking correlations versus embedding-only and sense-embedding baselines, and enables inspection of changes in explicit human-readable definitions (2406.14167).
  • Bag-of-Substitutes (BOS): Yields interpretable evidence for sense gain/loss by inspecting emergent substitute patterns. Effective at sense gain/loss detection in Spanish LSCD (GAIN, LOSS F1 up to 0.60) (Kudisov et al., 2022).

5. Practical Challenges and Error Analysis

5.1 Known Issues

  • Alignment Sensitivity: Orthogonal Procrustes and CCA are state-of-the-art for cross-space alignment; alignment quality crucially depends on landmark/vocabulary selection, with frequency and language-dependent tradeoffs (Gruppi et al., 2020, Pražák et al., 2020).
  • Polysemy and Frequency Bias: Models misclassify highly polysemous or low-frequency words; frequency trajectories impact scores, complicating sense vs. usage-frequency shifts (Liétard et al., 2023).
  • Disambiguation Limitations: Static embeddings fail to separate senses, leading to sense-conflation errors (Giulianelli et al., 2020, Giulianelli et al., 2022).
  • Hypernym Confusion: Up to 30% of differentiated synonym pairs shift into direct hypernymy, causing misclassification (Liétard et al., 2023).

5.2 Model Selection and Optimization

  • Pre/Post-Processing: Mean-centering and principal component removal mitigate frequency artifacts. Pre-training on merged diachronic corpora alleviates small-data issues, and length-normalization prevents frequency bias from being amplified by large pre-trained models (Kaiser et al., 2021).
  • Contextual vs. Static Modeling: Type embeddings (SGNS) are robust for unsupervised change detection; contextualized embeddings are essential for instance-level or sense-aware interpretations (Giulianelli et al., 2020).
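
The centering and PC-removal recipe above can be sketched as follows; this is a common post-processing pattern on synthetic data with an artificially dominant direction, not the cited paper's code:

```python
import numpy as np

def postprocess(X, n_components=1):
    """Mean-center, remove top principal component(s), length-normalize."""
    X = X - X.mean(axis=0)
    _, _, Vt = np.linalg.svd(X, full_matrices=False)
    for i in range(n_components):
        X = X - np.outer(X @ Vt[i], Vt[i])   # project out dominant direction
    return X / np.linalg.norm(X, axis=1, keepdims=True)

rng = np.random.default_rng(4)
# A strong shared component mimics a frequency-correlated artifact.
X = rng.normal(size=(200, 50)) + 5.0 * rng.normal(size=(200, 1))
Y = postprocess(X)
print(np.allclose(np.linalg.norm(Y, axis=1), 1.0))   # rows are unit length
```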

6. Advances in LLMs and Hybrid Systems

Zero-Shot Prompting (LLMs): GPT-4, via carefully designed zero-shot prompts, outperforms both traditional embedding models and BERT on instance-level and corpus-level LSCD tasks in low-resource, noisy settings. Prompt design (instructional phrasing, meta-data) significantly impacts performance (accuracy up to 0.72, F1=0.65 on TempoWiC) (Wang et al., 2023).

Hybrid Systems: Combining grammatical profiles with multilingual Transformers (XLM-R) in geometric mean ensembles consistently outperforms individual models, especially for typologically diverse or long time-depth datasets (state-of-the-art on Latin and Norwegian) (Giulianelli et al., 2022).

7. Open Challenges and Future Directions

  • Sense-Specific Drift: Improved clustering, sense induction, and joint modeling of explicit sense trajectories are areas of active research.
  • Controlled Benchmarks: Need for larger, cross-lingual gold standards, polysemy-controlled and domain-variant evaluation sets (Schlechtweg et al., 2020).
  • Temporal Granularity: Methods for modeling gradual versus abrupt shifts across multiple time slices (beyond two periods) are in nascent stages (Kurtyigit et al., 2021).
  • Metadata and Sociolinguistics: Integration of genre, author, and domain meta-data increases explanatory power and mitigates genre imbalance (Perrone et al., 2021).
  • Explainable AI: Definition-based, substitute-based, and instance-level OT approaches pave the way for transparent, interpretable LSCD (2406.14167, Kishino et al., 17 Dec 2024).
  • Statistical Robustness: Incorporation of permutation/FDR, dependency-aware multiple-testing, and uncertainty quantification remains critical for rigorous application in digital humanities and historical linguistics (Liu et al., 2021).

Recent advances in LSCD comprise a mature suite of unsupervised and supervised pipelines, with robust benchmark evaluations, interpretable outputs, and growing integration of sense encoding, statistical significance estimation, and hybrid modeling paradigms. The field is poised for further innovation at the intersection of contextualized modeling, interpretability, and cross-domain evaluation.
