
Lexical Semantic Change Detection

Updated 13 January 2026
  • Lexical Semantic Change Detection is a computational task that identifies and quantifies shifts in word meanings over time using large-scale diachronic corpora.
  • Modern frameworks employ methods like SGNS, CCA, and BOS to align embeddings and measure changes using metrics such as the Jensen–Shannon distance.
  • Evaluations combine manual annotation, multilingual corpora, and precise benchmarks to address challenges like frequency bias and sense granularity.

Lexical Semantic Change Detection (LSC) is the computational task of identifying and quantifying shifts in word meaning over time using large-scale diachronic corpora. LSC supports investigations in linguistics, NLP, lexicography, and the digital humanities. Modern LSC frameworks distill sense change into tractable tasks: binary detection of sense gain/loss and graded quantification of change. State-of-the-art evaluation protocols combine manually annotated gold standards, precise mathematical metrics, and rigorous benchmarking across diverse languages and genres (Schlechtweg et al., 2020). This article synthesizes foundational principles, methodologies, prominent models, evaluation schemes, and current challenges in LSC.

1. Formal Problem Definition and Gold Standard Construction

Unsupervised LSC detection operates on two time-specific corpora, $C_1$ and $C_2$, and a fixed set of target lemmas $w$. For each $w$, annotation produces sense–frequency distributions $D = (D_1, \dots, D_K)$ in $C_1$ and $E = (E_1, \dots, E_K)$ in $C_2$, where each $D_i$ ($E_i$) is the count of uses of $w$ assigned to sense $i$ (Schlechtweg et al., 2020).

Two primary evaluation targets derive from these distributions:

  • Binary Change Label $B(w) \in \{0,1\}$: Encodes sense gain/loss. $B(w) = 1$ iff there exists a sense $i$ such that $D_i \leq k \wedge E_i \geq n$ (or vice versa), using small thresholds $k, n$ (e.g., $k = 2$, $n = 5$ per 100 uses) to limit annotation noise.
  • Graded Change Score $G(w) \in [0,1]$: Degree of shift, quantified by the Jensen–Shannon distance between the normalized distributions $P = D / \sum_i D_i$ and $Q = E / \sum_i E_i$:

$$G(w) = \mathrm{JSD}(P \| Q) = \sqrt{\tfrac{1}{2}\left[\mathrm{KL}(P \| M) + \mathrm{KL}(Q \| M)\right]}$$

with $M = (P + Q)/2$.

Gold annotations are constructed from native-speaker relatedness judgments on pairs of usage sentences (scale 1 = unrelated to 4 = identical; roughly 100,000 judgments in total), followed by correlation clustering on a usage graph to induce sense clusters for each period (Schlechtweg et al., 2020). From the clusterings, $D$ and $E$ are computed, allowing $B(w)$ and $G(w)$ to be derived for each word.
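As a concrete illustration, both evaluation targets can be computed directly from a pair of sense-frequency count vectors. This is a minimal NumPy sketch; the function names are ours, not from the shared-task code, and the base-2 logarithm is chosen so that the distance is bounded in $[0,1]$:

```python
import numpy as np

def graded_change(D, E):
    """G(w): Jensen-Shannon distance (base 2, so bounded in [0, 1])
    between the normalized sense-frequency distributions of two periods."""
    P = np.asarray(D, dtype=float)
    Q = np.asarray(E, dtype=float)
    P, Q = P / P.sum(), Q / Q.sum()
    M = (P + Q) / 2
    def kl(a, b):
        # KL divergence, skipping zero-probability senses (0 * log 0 := 0).
        mask = a > 0
        return float(np.sum(a[mask] * np.log2(a[mask] / b[mask])))
    return float(np.sqrt(0.5 * (kl(P, M) + kl(Q, M))))

def binary_change(D, E, k=2, n=5):
    """B(w): 1 iff some sense is (near-)absent in one period (count <= k)
    but clearly attested in the other (count >= n)."""
    D, E = np.asarray(D), np.asarray(E)
    return int(np.any((D <= k) & (E >= n)) or np.any((E <= k) & (D >= n)))
```

A word whose sense distribution flips completely between periods receives $G(w) = 1$; identical distributions give $G(w) = 0$.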

2. Corpora, Sampling, and Data Stratification

SemEval-2020 Task 1 established a multilingual standard for LSC evaluation across English, German, Latin, and Swedish (Schlechtweg et al., 2020). Each language is represented by two optimally stratified, lemmatized, POS-tagged, sentence-shuffled subcorpora of comparable size and domain:

| Language | C₁ Dates   | C₂ Dates  | C₁ Tokens | C₂ Tokens | TTR (C₁/C₂)   |
|----------|------------|-----------|-----------|-----------|---------------|
| English  | 1810–1860  | 1960–2010 | 6.5M      | 6.7M      | 13.38 / 22.38 |
| German   | 1800–1899  | 1946–1990 | 70M       | 72M       | 14.25 / 31.81 |
| Latin    | 200BC–0AD  | 0–2000AD  | 1.7M      | 9.4M      | 38.24 / 26.91 |
| Swedish  | 1790–1830  | 1895–1903 | 71M       | 110M      | 47.88 / 17.27 |

Target lemmas are balanced for POS and frequency. “Change” candidates are identified via historical dictionaries, with controls matched for part of speech and frequency trajectory. Up to 100 occurrences per word per period are selected for annotation.

3. Computational Approaches and Model Taxonomy

3.1 Type-based Embedding and Alignment Paradigms

Most top-performing systems use static ("type") embeddings, primarily Skip-Gram with Negative Sampling (SGNS) or fastText (Schlechtweg et al., 2020). Embeddings trained separately on $C_1$ and $C_2$ are aligned using linear transformations:

  • Orthogonal Procrustes (OP): Solve $W^* = \arg\min_W \sum_n \| B_n W - A_n \|^2$ for an orthogonal $W^*$.
  • Canonical Correlation Analysis (CCA): Find $W_x, W_y$ maximizing the correlation of anchor-word vectors across spaces.
  • Thresholding: Binary prediction uses mean cosine distance or quantile-based thresholds.

Other alignment approaches include Word Injection (WI; injecting time-tagged tokens into a jointly trained corpus), column intersection, and random indexing.
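The OP step above has a closed-form solution via SVD. The sketch below assumes `A` and `B` are matrices whose rows are anchor-word vectors from the two independently trained spaces; the function names are illustrative, not from any shared-task system:

```python
import numpy as np

def procrustes_align(B, A):
    """Map embedding matrix B (rows = anchor-word vectors) into A's space
    using the orthogonal W that minimizes ||B W - A||_F.
    Closed form: W = U V^T, where U S V^T is the SVD of B^T A."""
    U, _, Vt = np.linalg.svd(B.T @ A)
    return B @ (U @ Vt)

def cosine_distance(u, v):
    """Change score for a target word: cosine distance between its
    vector in the first space and its aligned vector from the second."""
    return float(1 - (u @ v) / (np.linalg.norm(u) * np.linalg.norm(v)))
```

After alignment, words are ranked by cosine distance for the graded task; binary predictions then threshold that score, e.g., at the mean or a chosen quantile over all targets.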

3.2 Contextual and Hybrid Models

Few shared-task systems utilize contextual ("token") embeddings (ELMo, BERT); these typically cluster usage representations and measure inter-period divergence with the Jensen–Shannon distance (Schlechtweg et al., 2020). Hybrid models, ensemble approaches, and graph-based resistance distances also appear, but type-based pipelines consistently outperform token models on both the binary and graded tasks.
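The token-based recipe described above can be sketched end to end: cluster the pooled usage vectors into coarse "senses", then compare period-wise cluster frequencies with JSD. The toy k-means below (with a deterministic farthest-point initialization) stands in for whatever clustering a real system would use; all names and the initialization scheme are ours:

```python
import numpy as np

def kmeans_labels(X, k, iters=20):
    """Toy Lloyd's k-means; greedy farthest-point init keeps it deterministic."""
    centers = [X[0].astype(float)]
    for _ in range(k - 1):
        d = np.min([np.linalg.norm(X - c, axis=1) for c in centers], axis=0)
        centers.append(X[int(d.argmax())].astype(float))
    centers = np.array(centers)
    for _ in range(iters):
        labels = np.linalg.norm(X[:, None] - centers[None], axis=2).argmin(axis=1)
        for j in range(k):
            if np.any(labels == j):
                centers[j] = X[labels == j].mean(axis=0)
    return labels

def token_change_score(vecs_c1, vecs_c2, k=2):
    """Cluster pooled usage vectors into k 'senses', then compare the two
    periods' cluster-frequency distributions via Jensen-Shannon distance."""
    X = np.vstack([vecs_c1, vecs_c2])
    labels = kmeans_labels(X, k)
    l1, l2 = labels[: len(vecs_c1)], labels[len(vecs_c1):]
    P = np.bincount(l1, minlength=k) / len(l1)
    Q = np.bincount(l2, minlength=k) / len(l2)
    M = (P + Q) / 2
    kl = lambda a, b: np.sum(a[a > 0] * np.log2(a[a > 0] / b[a > 0]))
    return float(np.sqrt(0.5 * (kl(P, M) + kl(Q, M))))
```

If every usage in $C_2$ falls into a cluster unattested in $C_1$, the score reaches 1; identical usage populations score 0.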

3.3 Bag-of-Substitutes (BOS)

The BOS approach, developed for the Spanish LSCDiscovery shared task, employs XLM-R masked language modeling to generate lexical substitute distributions for each usage sentence (Kudisov et al., 2022). Substitute distributions are then compared across periods:

  • Change is assessed via Average Pairwise Distance (APD) between BOS vectors.
  • Binary sense gain/loss is predicted using heuristics (AID, min-distance, percentile).
  • Substitutes unique to one period can be directly inspected for sense-level interpretability.
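The APD scoring step can be sketched as follows, assuming each usage sentence has already been reduced to a bag-of-substitutes count vector (rows of `X` and `Y`); the function name is illustrative:

```python
import numpy as np

def average_pairwise_distance(X, Y):
    """APD: mean cosine distance over all cross-period pairs of usage
    vectors (here, bag-of-substitutes count vectors, one row per usage)."""
    Xn = X / np.linalg.norm(X, axis=1, keepdims=True)
    Yn = Y / np.linalg.norm(Y, axis=1, keepdims=True)
    # Xn @ Yn.T holds every cross-period cosine similarity at once.
    return float(np.mean(1.0 - Xn @ Yn.T))
```

Because APD averages over all cross-period usage pairs rather than comparing two averaged vectors, it remains sensitive to changes affecting only a minority of usages.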

3.4 Evaluation Metrics

  • Binary Classification: Accuracy, Precision, Recall, F₁-score.
  • Ranking: Spearman's $\rho$ between predicted and gold change scores.
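Both metrics are simple to compute; here is a minimal NumPy sketch (the rank-correlation helper assumes no tied predicted scores, and the names are ours):

```python
import numpy as np

def accuracy(gold, pred):
    """Fraction of targets whose binary change label is predicted correctly."""
    return float(np.mean(np.asarray(gold) == np.asarray(pred)))

def spearman_rho(x, y):
    """Spearman's rho as the Pearson correlation of ranks (no-ties case)."""
    def ranks(a):
        order = np.argsort(a)
        r = np.empty(len(a))
        r[order] = np.arange(len(a))
        return r
    return float(np.corrcoef(ranks(np.asarray(x)), ranks(np.asarray(y)))[0, 1])
```

In practice, library implementations (e.g., SciPy's) handle ties via average ranks; the sketch above omits that detail.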

Best shared-task results:

  • UWB (SGNS+CCA): Subtask 1 accuracy 0.687; Subtask 2 $\rho = 0.481$ (Pražák et al., 2020).
  • BOS (Spanish): F₁=0.658 (change), 0.520 (gain), 0.600 (loss); interpretable output (Kudisov et al., 2022).
  • EmbLexChange (pivot-based KL profiles): Binary accuracy up to 77% (Swedish) (Asgari et al., 2020).

4. Findings, Error Analysis, and Methodological Recommendations

4.1 Comparative Model Performance

Type-based embeddings with alignment outperform contextual embeddings on both binary and graded LSC tasks, despite the theoretical inability of type models to directly encode sense distinctions (Schlechtweg et al., 2020). Possible explanations include:

  • Contextual models pre-trained on extraneous data, introducing signal distortion.
  • Sentence shuffling and lemmatization reducing effective context span.
  • Unclear best practices for diachronic modeling with contextual embeddings.

Frequency bias remains pronounced. Predicted change scores correlate with absolute frequency change and are sensitive to minimum frequency, even when target sets are frequency-balanced (Schlechtweg et al., 2020). Polysemy exhibits moderate correlation with gold scores, but model predictions are only weakly affected.

4.2 Explicit Modeling of Sense Inventories

Subtask 1 (sense gain/loss) is challenging for most embedding-based systems. Explicit clustering or definition matching to model sense inventories may yield improvements (Schlechtweg et al., 2020). Direct distribution-distance measures (e.g., JSD on sense distributions or embedding spaces) are more reliable for graded change and ranking.

4.3 Interpretability and Evaluation Protocols

Interpretability remains a challenge. BOS and dependency-profile methods enable explicit inspection of sense-level shifts and of the lexical contexts responsible for them (Kudisov et al., 2022; Phan-Tat et al., 2026). A recommended future direction is hybrid evaluation that controls for polysemy and frequency bias using transparent gold standards.

5. Challenges, Limitations, and Future Directions

Key obstacles include:

  • Frequency, data sparsity, and polysemy bias in both gold standards and system predictions.
  • The absence of robust evaluation datasets for many languages and genres.
  • Difficulty discriminating subtle sense splits and modeling narrowing/broadening phenomena.
  • Incomplete theoretical understanding of optimal contextual embedding utilization for diachronic tasks.

Emerging lines of inquiry:

  • Frequency-normalization or down-weighting strategies in type embeddings.
  • Improved contextual models designed for time-aware semantics with refined sampling and clustering.
  • Multilingual benchmarking and evaluation platforms extending beyond binary time splits.
  • Integration of external lexical resources and definition-based grounding for sense clusters.
  • Open, interpretable, and theory-aligned methodologies for LSC benchmarking and annotation.

The shared-task paradigm, multilingual gold standards, and comparative frameworks established by SemEval-2020 Task 1 provide the foundation for systematically advancing unsupervised lexical semantic change detection and refining both computational and theoretical approaches (Schlechtweg et al., 2020).
