
Lexical Semantic Change Detection

Updated 13 January 2026
  • Lexical Semantic Change Detection is a computational task that identifies and quantifies shifts in word meanings over time using large-scale diachronic corpora.
  • Modern frameworks employ methods like SGNS, CCA, and BOS to align embeddings and measure changes using metrics such as the Jensen–Shannon distance.
  • Evaluations combine manual annotation, multilingual corpora, and precise benchmarks to address challenges like frequency bias and sense granularity.

Lexical Semantic Change Detection (LSC) is the computational task of identifying and quantifying shifts in word meaning over time using large-scale diachronic corpora. LSC supports investigations in linguistics, NLP, lexicography, and the digital humanities. Modern LSC frameworks distill sense change into tractable tasks: binary detection of sense gain/loss and graded quantification of change. State-of-the-art evaluation protocols combine manually annotated gold standards, precise mathematical metrics, and rigorous benchmarking across diverse languages and genres (Schlechtweg et al., 2020). This article synthesizes foundational principles, methodologies, prominent models, evaluation schemes, and current challenges in LSC.

1. Formal Problem Definition and Gold Standard Construction

Unsupervised LSC detection operates on two time-specific corpora, $C_1$ and $C_2$, and a fixed set of target lemmas $w$. For each $w$, annotation produces sense–frequency distributions $D = (D_1, \dots, D_K)$ in $C_1$ and $E = (E_1, \dots, E_K)$ in $C_2$, where each $D_i$ ($E_i$) is the count of uses of $w$ assigned to sense $i$ (Schlechtweg et al., 2020).

Two primary evaluation targets derive from these distributions:

  • Binary Change Label $B(w) \in \{0,1\}$: Encodes sense gain/loss. $B(w) = 1$ iff there exists a sense $i$ such that $D_i \leq k \wedge E_i \geq n$ (or vice versa), using small thresholds $k, n$ (e.g., $k = 2$, $n = 5$ per 100 uses) to limit annotation noise.
  • Graded Change Score $G(w) \in [0,1]$: Degree of shift, quantified by the Jensen–Shannon distance between the normalized distributions $P = D / \sum_i D_i$ and $Q = E / \sum_i E_i$:

$$G(w) = \mathrm{JSD}(P \| Q) = \sqrt{\tfrac{1}{2}\left[\mathrm{KL}(P \| M) + \mathrm{KL}(Q \| M)\right]}$$

with $M = (P + Q)/2$.

Gold annotations are constructed from native-speaker relatedness judgments on pairs of usage sentences (scale 1 = unrelated to 4 = identical; roughly 100,000 judgments in total), followed by correlation clustering on a usage graph to induce sense clusters for each period (Schlechtweg et al., 2020). From the clusterings, $D$ and $E$ are computed, allowing $B(w)$ and $G(w)$ to be derived for each word.
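As a concrete illustration, both evaluation targets can be computed directly from a pair of sense-frequency count vectors. This is a minimal NumPy sketch; the function names are ours, not from the shared-task code, and the base-2 logarithm is chosen so that the distance is bounded in $[0,1]$:

```python
import numpy as np

def graded_change(D, E):
    """G(w): Jensen-Shannon distance (base 2, so bounded in [0, 1])
    between the normalized sense-frequency distributions of two periods."""
    P = np.asarray(D, dtype=float)
    Q = np.asarray(E, dtype=float)
    P, Q = P / P.sum(), Q / Q.sum()
    M = (P + Q) / 2
    def kl(a, b):
        # KL divergence, skipping zero-probability senses (0 * log 0 := 0).
        mask = a > 0
        return float(np.sum(a[mask] * np.log2(a[mask] / b[mask])))
    return float(np.sqrt(0.5 * (kl(P, M) + kl(Q, M))))

def binary_change(D, E, k=2, n=5):
    """B(w): 1 iff some sense is (near-)absent in one period (count <= k)
    but clearly attested in the other (count >= n)."""
    D, E = np.asarray(D), np.asarray(E)
    return int(np.any((D <= k) & (E >= n)) or np.any((E <= k) & (D >= n)))
```

A word whose sense distribution flips completely between periods receives $G(w) = 1$; identical distributions give $G(w) = 0$.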

2. Corpora, Sampling, and Data Stratification

SemEval-2020 Task 1 established a multilingual standard for LSC evaluation across English, German, Latin, and Swedish (Schlechtweg et al., 2020). Each language is represented by two optimally stratified, lemmatized, POS-tagged, sentence-shuffled subcorpora of comparable size and domain:

| Language | C₁ Dates   | C₂ Dates  | C₁ Tokens | C₂ Tokens | TTR (C₁/C₂)   |
|----------|------------|-----------|-----------|-----------|---------------|
| English  | 1810–1860  | 1960–2010 | 6.5M      | 6.7M      | 13.38 / 22.38 |
| German   | 1800–1899  | 1946–1990 | 70M       | 72M       | 14.25 / 31.81 |
| Latin    | 200BC–0AD  | 0–2000AD  | 1.7M      | 9.4M      | 38.24 / 26.91 |
| Swedish  | 1790–1830  | 1895–1903 | 71M       | 110M      | 47.88 / 17.27 |

Target lemmas are balanced for POS and frequency. “Change” candidates are identified via historical dictionaries, with controls matched for part of speech and frequency trajectory. Up to 100 occurrences per word per period are selected for annotation.

3. Computational Approaches and Model Taxonomy

3.1 Type-based Embedding and Alignment Paradigms

Most top-performing systems use static ("type") embeddings, primarily Skip-Gram with Negative Sampling (SGNS) or fastText (Schlechtweg et al., 2020). Embeddings trained separately on $C_1$ and $C_2$ are aligned using linear transformations:

  • Orthogonal Procrustes (OP): Solve $W^* = \arg\min_W \sum_n \| B_n W - A_n \|^2$ for an orthogonal $W^*$.
  • Canonical Correlation Analysis (CCA): Find $W_x, W_y$ maximizing the correlation of anchor-word vectors across spaces.
  • Thresholding: Binary prediction uses mean cosine distance or quantile-based thresholds.

Other alignment approaches include Word Injection (WI; injecting time-tagged tokens into a jointly trained corpus), column intersection, and random indexing.
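The OP step above has a closed-form solution via SVD. The sketch below assumes `A` and `B` are matrices whose rows are anchor-word vectors from the two independently trained spaces; the function names are illustrative, not from any shared-task system:

```python
import numpy as np

def procrustes_align(B, A):
    """Map embedding matrix B (rows = anchor-word vectors) into A's space
    using the orthogonal W that minimizes ||B W - A||_F.
    Closed form: W = U V^T, where U S V^T is the SVD of B^T A."""
    U, _, Vt = np.linalg.svd(B.T @ A)
    return B @ (U @ Vt)

def cosine_distance(u, v):
    """Change score for a target word: cosine distance between its
    vector in the first space and its aligned vector from the second."""
    return float(1 - (u @ v) / (np.linalg.norm(u) * np.linalg.norm(v)))
```

After alignment, words are ranked by cosine distance for the graded task; binary predictions then threshold that score, e.g., at the mean or a chosen quantile over all targets.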

3.2 Contextual and Hybrid Models

Few shared-task systems utilize contextual ("token") embeddings (ELMo, BERT); these typically cluster usage representations and measure inter-period divergence with the Jensen–Shannon distance (Schlechtweg et al., 2020). Hybrid models, ensemble approaches, and graph-based resistance distances also appear, but type-based pipelines consistently outperform token models on both the binary and graded tasks.
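The token-based recipe described above can be sketched end to end: cluster the pooled usage vectors into coarse "senses", then compare period-wise cluster frequencies with JSD. The toy k-means below (with a deterministic farthest-point initialization) stands in for whatever clustering a real system would use; all names and the initialization scheme are ours:

```python
import numpy as np

def kmeans_labels(X, k, iters=20):
    """Toy Lloyd's k-means; greedy farthest-point init keeps it deterministic."""
    centers = [X[0].astype(float)]
    for _ in range(k - 1):
        d = np.min([np.linalg.norm(X - c, axis=1) for c in centers], axis=0)
        centers.append(X[int(d.argmax())].astype(float))
    centers = np.array(centers)
    for _ in range(iters):
        labels = np.linalg.norm(X[:, None] - centers[None], axis=2).argmin(axis=1)
        for j in range(k):
            if np.any(labels == j):
                centers[j] = X[labels == j].mean(axis=0)
    return labels

def token_change_score(vecs_c1, vecs_c2, k=2):
    """Cluster pooled usage vectors into k 'senses', then compare the two
    periods' cluster-frequency distributions via Jensen-Shannon distance."""
    X = np.vstack([vecs_c1, vecs_c2])
    labels = kmeans_labels(X, k)
    l1, l2 = labels[: len(vecs_c1)], labels[len(vecs_c1):]
    P = np.bincount(l1, minlength=k) / len(l1)
    Q = np.bincount(l2, minlength=k) / len(l2)
    M = (P + Q) / 2
    kl = lambda a, b: np.sum(a[a > 0] * np.log2(a[a > 0] / b[a > 0]))
    return float(np.sqrt(0.5 * (kl(P, M) + kl(Q, M))))
```

If every usage in $C_2$ falls into a cluster unattested in $C_1$, the score reaches 1; identical usage populations score 0.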

3.3 Bag-of-Substitutes (BOS)

The BOS approach, developed for the Spanish LSCDiscovery shared task, employs XLM-R masked language modeling to generate lexical substitute distributions for each usage sentence (Kudisov et al., 2022). Substitute distributions are then compared across periods:

  • Change is assessed via Average Pairwise Distance (APD) between BOS vectors.
  • Binary sense gain/loss is predicted using heuristics (AID, min-distance, percentile).
  • Substitutes unique to one period can be directly inspected for sense-level interpretability.
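The APD scoring step can be sketched as follows, assuming each usage sentence has already been reduced to a bag-of-substitutes count vector (rows of `X` and `Y`); the function name is illustrative:

```python
import numpy as np

def average_pairwise_distance(X, Y):
    """APD: mean cosine distance over all cross-period pairs of usage
    vectors (here, bag-of-substitutes count vectors, one row per usage)."""
    Xn = X / np.linalg.norm(X, axis=1, keepdims=True)
    Yn = Y / np.linalg.norm(Y, axis=1, keepdims=True)
    # Xn @ Yn.T holds every cross-period cosine similarity at once.
    return float(np.mean(1.0 - Xn @ Yn.T))
```

Because APD averages over all cross-period usage pairs rather than comparing two averaged vectors, it remains sensitive to changes affecting only a minority of usages.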

3.4 Evaluation Metrics

  • Binary Classification: Accuracy, Precision, Recall, F₁-score.
  • Ranking: Spearman's $\rho$ between predicted and gold change scores.
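Both metrics are simple to compute; here is a minimal NumPy sketch (the rank-correlation helper assumes no tied predicted scores, and the names are ours):

```python
import numpy as np

def accuracy(gold, pred):
    """Fraction of targets whose binary change label is predicted correctly."""
    return float(np.mean(np.asarray(gold) == np.asarray(pred)))

def spearman_rho(x, y):
    """Spearman's rho as the Pearson correlation of ranks (no-ties case)."""
    def ranks(a):
        order = np.argsort(a)
        r = np.empty(len(a))
        r[order] = np.arange(len(a))
        return r
    return float(np.corrcoef(ranks(np.asarray(x)), ranks(np.asarray(y)))[0, 1])
```

In practice, library implementations (e.g., SciPy's) handle ties via average ranks; the sketch above omits that detail.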

Best shared-task results:

  • UWB (SGNS+CCA): Subtask 1 accuracy 0.687; Subtask 2 $\rho = 0.481$ (Pražák et al., 2020).
  • BOS (Spanish): F₁=0.658 (change), 0.520 (gain), 0.600 (loss); interpretable output (Kudisov et al., 2022).
  • EmbLexChange (pivot-based KL profiles): Binary accuracy up to 77% (Swedish) (Asgari et al., 2020).

4. Findings, Error Analysis, and Methodological Recommendations

4.1 Comparative Model Performance

Type-based embeddings with alignment outperform contextual embeddings on both binary and graded LSC tasks, despite the theoretical inability of type models to directly encode sense distinctions (Schlechtweg et al., 2020). Possible explanations include:

  • Contextual models pre-trained on extraneous data, introducing signal distortion.
  • Sentence shuffling and lemmatization reducing effective context span.
  • Unclear best practices for diachronic modeling with contextual embeddings.

Frequency bias remains pronounced. Predicted change scores correlate with absolute frequency change and are sensitive to minimum frequency, even when target sets are frequency-balanced (Schlechtweg et al., 2020). Polysemy exhibits moderate correlation with gold scores, but model predictions are only weakly affected.

4.2 Explicit Modeling of Sense Inventories

Subtask 1 (sense gain/loss) is challenging for most embedding-based systems. Explicit clustering or definition matching to model sense inventories may yield improvements (Schlechtweg et al., 2020). Direct distribution-distance measures (e.g., JSD on sense distributions or embedding spaces) are more reliable for graded change and ranking.

4.3 Interpretability and Evaluation Protocols

Interpretability remains a challenge. BOS and dependency-profile methods enable explicit inspection of sense-level shifts and of the lexical contexts responsible for them (Kudisov et al., 2022; Phan-Tat et al., 2026). A recommended future direction is hybrid evaluation that controls for polysemy and frequency bias using transparent gold standards.

5. Challenges, Limitations, and Future Directions

Key obstacles include:

  • Frequency, data sparsity, and polysemy bias in both gold standards and system predictions.
  • The absence of robust evaluation datasets for many languages and genres.
  • Difficulty discriminating subtle sense splits and modeling narrowing/broadening phenomena.
  • Incomplete theoretical understanding of optimal contextual embedding utilization for diachronic tasks.

Emerging lines of inquiry:

  • Frequency-normalization or down-weighting strategies in type embeddings.
  • Improved contextual models designed for time-aware semantics with refined sampling and clustering.
  • Multilingual benchmarking and evaluation platforms extending beyond binary time splits.
  • Integration of external lexical resources and definition-based grounding for sense clusters.
  • Open, interpretable, and theory-aligned methodologies for LSC benchmarking and annotation.

The shared-task paradigm, multilingual gold standards, and comparative frameworks established by SemEval-2020 Task 1 provide the foundation for systematically advancing unsupervised lexical semantic change detection and refining both computational and theoretical approaches (Schlechtweg et al., 2020).
