Word Similarity Matching (WSM)
- Word Similarity Matching is a suite of computational methods that quantifies semantic similarity between words, senses, or contexts, with broad applications in NLP.
- Modern approaches utilize distributional semantics and contextual embeddings like BERT and ELMo, achieving improved performance in synonym detection and word sense disambiguation.
- Techniques such as optimal transport (Word Mover’s Distance) and classifier confusion models enhance alignment with human judgments and capture nuanced semantic differences.
Word Similarity Matching (WSM) refers to the suite of computational methods, theoretical models, and evaluation strategies developed to quantify the semantic similarity between words, word senses, or contextualized word usages. WSM plays a crucial role in NLP tasks such as information retrieval, word sense disambiguation (WSD), semantic search, machine translation, and the evaluation of distributed representations. Modern approaches extend far beyond simple overlap or semantic networks, encompassing distributional semantics, optimal transport, contextual embeddings, and even feature-based classification confusion.
1. Foundations and Taxonomies of Word Similarity
Early WSM methods derived similarity as a property of word pairs within structured resources such as WordNet, employing taxonomy-based metrics. Classical measures include:
- Path-based similarity: based on the shortest edge distance in the ontology, e.g., $\mathrm{sim}_{\mathrm{path}}(c_1, c_2) = \frac{1}{1 + \mathrm{len}(c_1, c_2)}$, where $\mathrm{len}(c_1, c_2)$ is the shortest-path length between senses $c_1$ and $c_2$.
- Leacock–Chodorow (lch): $\mathrm{sim}_{\mathrm{lch}}(c_1, c_2) = -\log \frac{\mathrm{len}(c_1, c_2)}{2D}$, where $D$ is the maximum depth of the taxonomy.
- Wu–Palmer (wup) and Information Content (IC)–based metrics (Resnik, Lin, Jiang–Conrath): These quantify similarity using the properties of the least common subsumer (LCS) and empirical sense frequencies, e.g., $\mathrm{sim}_{\mathrm{wup}}(c_1, c_2) = \frac{2\,\mathrm{depth}(\mathrm{lcs}(c_1, c_2))}{\mathrm{depth}(c_1) + \mathrm{depth}(c_2)}$ and $\mathrm{sim}_{\mathrm{lin}}(c_1, c_2) = \frac{2\,\mathrm{IC}(\mathrm{lcs}(c_1, c_2))}{\mathrm{IC}(c_1) + \mathrm{IC}(c_2)}$ (Jacobs et al., 2018).
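As a concrete illustration, the sketch below queries these classical metrics through NLTK's WordNet interface; it assumes the `wordnet` and `wordnet_ic` NLTK data packages have already been downloaded.

```python
# Taxonomy-based similarities between two WordNet senses via NLTK.
# Assumes nltk.download("wordnet") and nltk.download("wordnet_ic") have been run.
from nltk.corpus import wordnet as wn
from nltk.corpus import wordnet_ic

brown_ic = wordnet_ic.ic("ic-brown.dat")           # empirical sense frequencies for IC metrics

dog, cat = wn.synset("dog.n.01"), wn.synset("cat.n.01")

print("path:", dog.path_similarity(cat))           # based on shortest path length
print("lch: ", dog.lch_similarity(cat))            # Leacock–Chodorow
print("wup: ", dog.wup_similarity(cat))            # Wu–Palmer (depth of the LCS)
print("res: ", dog.res_similarity(cat, brown_ic))  # Resnik: IC of the LCS
print("lin: ", dog.lin_similarity(cat, brown_ic))  # Lin: 2·IC(LCS) / (IC(c1) + IC(c2))
print("jcn: ", dog.jcn_similarity(cat, brown_ic))  # Jiang–Conrath (IC-based distance)
```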
Distributional approaches, such as LSA and word2vec-derived embeddings, encode words in dense vector spaces and operationalize similarity primarily as vector cosine. Hybrid models incorporate both surface-level and sublexical features—morphology, frequency, valence, imageability—alongside semantic similarity, expanding the explanatory power of WSM for predicting human rating data (Jacobs et al., 2018).
Process-compositional and hybrid regression models (e.g., Extra-Trees, neural nets) that combine taxonomic, distributional, and quantitative narrative analysis (QNA) features explain up to ~50% of the variance in benchmark ratings, with Lin and lch metrics, skip-gram embeddings, and surface/affective features all contributing (Jacobs et al., 2018).
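A minimal sketch of the hybrid-regression setup follows, with synthetic stand-ins for the taxonomic, distributional, and surface features described above; the printed R² is therefore illustrative only.

```python
# Hybrid regression sketch: predict human similarity ratings from heterogeneous
# features. The feature columns are synthetic stand-ins for WordNet metrics,
# skip-gram cosines, and surface/affective predictors.
import numpy as np
from sklearn.ensemble import ExtraTreesRegressor
from sklearn.model_selection import cross_val_score

rng = np.random.default_rng(0)
n_pairs = 500
X = np.column_stack([
    rng.uniform(0, 1, n_pairs),    # stand-in for lch similarity (rescaled)
    rng.uniform(0, 1, n_pairs),    # stand-in for Lin similarity
    rng.uniform(-1, 1, n_pairs),   # stand-in for skip-gram cosine
    rng.normal(0, 1, n_pairs),     # stand-in for a surface feature (e.g., log-frequency difference)
])
# Synthetic "human ratings": a noisy mixture of the features.
y = 0.4 * X[:, 0] + 0.3 * X[:, 2] + 0.1 * rng.normal(size=n_pairs)

model = ExtraTreesRegressor(n_estimators=300, random_state=0)
scores = cross_val_score(model, X, y, cv=5, scoring="r2")
print("mean R^2:", scores.mean())   # proportion of rating variance explained
```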
2. Distributional and Embedding-Based Methods
Modern WSM predominantly leverages distributional semantics. Given word embeddings $\mathbf{v}_{w_1}, \mathbf{v}_{w_2} \in \mathbb{R}^d$, similarity is most commonly computed via the cosine:
$$\mathrm{sim}(w_1, w_2) = \cos(\mathbf{v}_{w_1}, \mathbf{v}_{w_2}) = \frac{\mathbf{v}_{w_1} \cdot \mathbf{v}_{w_2}}{\lVert \mathbf{v}_{w_1} \rVert\, \lVert \mathbf{v}_{w_2} \rVert}.$$
Recent unsupervised measures such as APSyn focus on qualitative overlap in the top-$k$ mutually dependent contexts, weighting the intersection inversely by average rank, and outperform cosine similarity in tasks emphasizing true synonymy. Empirically, APSyn achieved 58.33% accuracy vs. 49.44% for cosine similarity on ESL synonym selection (Santus et al., 2016). APSyn's restriction to the top-$k$ most salient features keeps queries cheap and makes it particularly effective for synonym detection, whereas cosine similarity's global vector overlap is more diffuse.
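The toy sketch below contrasts plain cosine with an APSyn-style score following the description above (shared top-k contexts weighted by inverse average rank); the context features and their association scores are invented for illustration.

```python
# Cosine vs. an APSyn-style measure over sparse context vectors.
# Context dictionaries map features to association scores (e.g., PPMI); values are toy.
import numpy as np

def cosine(u, v):
    keys = sorted(set(u) | set(v))
    a = np.array([u.get(k, 0.0) for k in keys])
    b = np.array([v.get(k, 0.0) for k in keys])
    return float(a @ b / (np.linalg.norm(a) * np.linalg.norm(b)))

def apsyn(u, v, k=1000):
    """Sum of inverse average ranks over the shared top-k contexts of the two words."""
    top_u = {f: r for r, (f, _) in enumerate(sorted(u.items(), key=lambda x: -x[1])[:k], 1)}
    top_v = {f: r for r, (f, _) in enumerate(sorted(v.items(), key=lambda x: -x[1])[:k], 1)}
    shared = set(top_u) & set(top_v)
    return sum(1.0 / ((top_u[f] + top_v[f]) / 2.0) for f in shared)

dog = {"bark": 5.1, "pet": 4.2, "tail": 3.7, "walk": 2.9, "food": 1.1}
cat = {"purr": 5.3, "pet": 4.5, "tail": 3.9, "food": 1.4, "walk": 0.8}

print("cosine:", cosine(dog, cat))
print("APSyn :", apsyn(dog, cat, k=4))
```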
Contextual word and sentence embeddings (ELMo, BERT, USE, context2vec) further enable instance-level WSM—critical for capturing polysemy and graded usage similarity (Soler et al., 2019). BERT's target-token vectors, in particular, yield the strongest unsupervised correlation (ρ ≈ 0.51), and when combined with context2vec-derived substitute features in a linear model, can produce superior results in both graded and binary usage similarity tasks.
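A sketch of instance-level usage similarity from BERT's target-token vectors with Hugging Face Transformers follows; the choice of `bert-base-uncased` and mean-pooling over the target's WordPieces are assumptions, not the exact configuration of the cited work.

```python
# Usage similarity of a target word in two contexts, from BERT's last hidden layer.
import torch
from transformers import AutoModel, AutoTokenizer

tok = AutoTokenizer.from_pretrained("bert-base-uncased")
model = AutoModel.from_pretrained("bert-base-uncased")
model.eval()

def target_vector(sentence, start, end):
    """Mean-pooled contextual embedding of the characters sentence[start:end]."""
    enc = tok(sentence, return_tensors="pt", return_offsets_mapping=True)
    offsets = enc.pop("offset_mapping")[0].tolist()
    with torch.no_grad():
        hidden = model(**enc).last_hidden_state[0]          # (seq_len, dim)
    idx = [i for i, (s, e) in enumerate(offsets)
           if s < end and e > start and e > s]              # subwords overlapping the span
    return hidden[idx].mean(dim=0)

s1 = "She deposited the money in the bank."
s2 = "They had a picnic on the bank of the river."
v1 = target_vector(s1, s1.index("bank"), s1.index("bank") + len("bank"))
v2 = target_vector(s2, s2.index("bank"), s2.index("bank") + len("bank"))
print("usage cosine:", torch.cosine_similarity(v1, v2, dim=0).item())
```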
3. Optimal Transport and Word/Sentence Matching
The Word Mover’s Distance (WMD) generalizes WSM to align entire texts by solving an optimal transport (OT) problem over embeddings:
$$\mathrm{WMD}(d, d') = \min_{\mathbf{T} \ge 0} \sum_{i,j} T_{ij}\, \lVert \mathbf{x}_i - \mathbf{x}_j \rVert_2 \quad \text{s.t.} \quad \sum_j T_{ij} = d_i, \;\; \sum_i T_{ij} = d'_j,$$
where $d$ and $d'$ are the normalized bag-of-words (nBOW) distributions of the two texts and $\mathbf{x}_i$, $\mathbf{x}_j$ are the corresponding word embeddings.
WMD captures the minimum cumulative cost to “move” the distribution of one bag-of-words into another within the embedding space (Sato et al., 2021). However, its empirical superiority depends critically on proper normalization of baselines. When BOW and TF-IDF are L1-normalized, WMD offers only modest gains (5–8%) at much higher computational cost. In high-dimensional embedding spaces, WMD behaves much like the L1 distance between document histograms, diminishing the utility of the embedding geometry (Sato et al., 2021).
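The sketch below poses WMD as a plain OT problem with the POT library; the embeddings are random placeholders standing in for word2vec/GloVe vectors, so only the mechanics, not the numbers, are meaningful.

```python
# Word Mover's Distance as optimal transport between nBOW distributions.
import numpy as np
import ot  # POT: Python Optimal Transport

rng = np.random.default_rng(0)
doc1 = ["obama", "speaks", "media", "illinois"]
doc2 = ["president", "greets", "press", "chicago"]
emb = {w: rng.normal(size=300) for w in doc1 + doc2}   # placeholder word vectors

# Normalized bag-of-words weights (uniform here: each word occurs once).
a = np.full(len(doc1), 1.0 / len(doc1))
b = np.full(len(doc2), 1.0 / len(doc2))

# Ground cost: Euclidean distance between word embeddings.
M = np.array([[np.linalg.norm(emb[u] - emb[v]) for v in doc2] for u in doc1])

print("WMD:", ot.emd2(a, b, M))   # minimum cumulative transport cost
```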
Advances in OT-based WSM address missing linguistic structure:
- SynWMD augments WMD with importance weights computed via corpus-scale weighted PageRank over syntactic co-occurrence graphs, and injects subtree embedding similarity into the ground cost. This yields +2.5 to +4.6 point improvements in Spearman’s ρ over WMD+IDF on standard STS tasks (Wei et al., 2022).
- WSMD (Word and Structure Mover’s Distance) fuses BERT self-attention matrices’ structural cost with embedding costs via the Fused Gromov-Wasserstein metric, significantly enhancing paraphrase identification AUC while preserving performance on semantic textual similarity (Yamagiwa et al., 2022).
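To make the fused-cost idea concrete, the sketch below computes a fused Gromov–Wasserstein distance with POT; the token embeddings and structure matrices are random stand-ins for BERT vectors and self-attention-derived costs, and `alpha = 0.5` is an assumed trade-off value.

```python
# Fused Gromov-Wasserstein: combine a cross-sentence feature cost M with
# intra-sentence structure costs C1, C2 (random stand-ins here).
import numpy as np
import ot

rng = np.random.default_rng(0)
n1, n2, dim = 5, 6, 32
X1, X2 = rng.normal(size=(n1, dim)), rng.normal(size=(n2, dim))   # token "embeddings"
C1 = rng.uniform(size=(n1, n1)); C1 = (C1 + C1.T) / 2             # structure cost, sentence 1
C2 = rng.uniform(size=(n2, n2)); C2 = (C2 + C2.T) / 2             # structure cost, sentence 2

M = ot.dist(X1, X2)                   # pairwise feature cost between the two token sets
p = np.full(n1, 1.0 / n1)             # uniform token weights
q = np.full(n2, 1.0 / n2)

fgw = ot.gromov.fused_gromov_wasserstein2(M, C1, C2, p, q, alpha=0.5)
print("FGW distance:", fgw)
```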
4. Feature-Based and Confusion-Based Approaches
A distinct paradigm frames WSM as classification confusion. The Word Confusion metric leverages a classifier trained to predict masked words in context and operationalizes similarity as the probability of confounding one word for another across contextualized embeddings (Zhou et al., 8 Feb 2025):
$$\mathrm{WC}(w_1, w_2) = \tfrac{1}{2}\big(P(w_2 \mid w_1) + P(w_1 \mid w_2)\big),$$
where $P(w_2 \mid w_1)$ is the average probability that the classifier misclassifies $w_1$’s contexts as $w_2$.
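A self-contained sketch of confusion-based similarity follows; the context embeddings are synthetic Gaussian clusters rather than BERT vectors, and the symmetric averaging mirrors the reconstructed definition above rather than the paper's exact formulation.

```python
# Confusion-based similarity: train a classifier to name the word behind each
# context embedding, then read off how often contexts of w1 are classified as w2.
import numpy as np
from sklearn.linear_model import LogisticRegression

rng = np.random.default_rng(0)
centers = {"car": rng.normal(size=50), "banana": rng.normal(size=50) + 5.0}
centers["automobile"] = centers["car"] + 0.3 * rng.normal(size=50)   # near-synonym of "car"
words = ["car", "automobile", "banana"]

# 100 synthetic "contexts" per word, clustered around that word's center.
X = np.vstack([centers[w] + rng.normal(size=(100, 50)) for w in words])
y = np.repeat(words, 100)

clf = LogisticRegression(max_iter=1000).fit(X, y)
proba = clf.predict_proba(X)                       # columns follow clf.classes_
col = {w: i for i, w in enumerate(clf.classes_)}

def confusion(w1, w2):
    """Average probability that contexts of w1 are classified as w2."""
    return proba[y == w1][:, col[w2]].mean()

def word_confusion(w1, w2):
    return 0.5 * (confusion(w1, w2) + confusion(w2, w1))   # symmetrized

print("car / automobile:", word_confusion("car", "automobile"))   # relatively high
print("car / banana    :", word_confusion("car", "banana"))       # near zero
```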
Dynamic feature selection, inspired by Tversky, chooses the most discriminative confounders for each word pair, further improving model alignment with human similarity judgments on MEN, WordSim353, and SimLex-999. This approach enables finer analysis, including diachronic semantic shift, as demonstrated on the evolving sense of “révolution” in historical French corpora (Zhou et al., 8 Feb 2025).
5. Evaluation Datasets and Protocols
Robust WSM evaluation relies on carefully designed datasets ensuring coverage, POS balance, frequency bands, and explicit annotation of similarity (versus association):
- Language-specific benchmarks: JWSD for Japanese includes 4,851 pairs spanning verbs, adjectives, nouns, and adverbs across frequency bands (Sakaizawa et al., 2017). COS960 for Chinese consists solely of two-morpheme word pairs, annotated for similarity on a 0–4 scale and explicitly distinguished from relatedness (Huang et al., 2019).
- Contextual similarity: CoSimLex provides continuous, context-dependent similarity ratings for each pair in two distinct natural sentences, supporting explicit evaluation of the context sensitivity of embedding models via both change prediction and absolute similarity rating (Armendariz et al., 2019).
- Usage similarity: Graded and binary similarity of word senses in context are supported by datasets and protocols focusing on both context overlap and discrete sense separation (Soler et al., 2019).
Typical metrics include Spearman’s rank correlation ρ between model-predicted and gold similarity scores, Pearson’s r, and task-specific accuracies, e.g., for WSD or paraphrase identification.
| Dataset / Language | #Pairs | Coverage | Annotation Scale | Inter-Annotator Agreement |
|---|---|---|---|---|
| JWSD (JA) | 4,851 | POS+freq-balance | Integer 0–10 | ρ: 0.69 (verbs), 0.67 (adjs) |
| COS960 (ZH) | 960 | 2-morpheme, POS | Real 0–4 | Krippendorff α > 0.7 |
| CoSimLex (EN, etc.) | 333/111 | Contextual, graded | Real 0–6 (per context) | ρ ≈ 0.77–0.80 |
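A minimal evaluation sketch, assuming a list of gold ratings and model-predicted similarities is already in hand (the values below are illustrative):

```python
# Rank correlation between model scores and human ratings for a benchmark.
from scipy.stats import pearsonr, spearmanr

gold = [9.2, 7.5, 3.1, 0.8, 6.4]          # human similarity ratings
pred = [0.81, 0.66, 0.40, 0.12, 0.58]     # model-predicted similarities

rho, _ = spearmanr(gold, pred)
r, _ = pearsonr(gold, pred)
print(f"Spearman rho = {rho:.3f}, Pearson r = {r:.3f}")
```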
6. Polysemy, Sense Disambiguation, and Sense Embeddings
WSM accuracy degrades in the presence of polysemy. Multi-prototype embeddings—where each word sense has its own vector—offer a solution. The SWSDS model, for Chinese, integrates unsupervised WSD (using OpenHowNet sememes and synonym sets) with synonym-averaged sense embeddings and achieves a +4.0 percentage-point gain on semantic similarity prediction over single-prototype word2vec (Zhou et al., 2022).
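A toy sketch of synonym-averaged sense embeddings in the spirit of this approach; the vectors and synonym sets are placeholders, not the OpenHowNet resources used by SWSDS.

```python
# Multi-prototype sense embeddings: one vector per sense, built by averaging
# the vectors of that sense's synonym set; word-level similarity takes the max.
import numpy as np

rng = np.random.default_rng(0)
vec = {w: rng.normal(size=100) for w in
       ["money", "institution", "finance", "deposit", "river", "shore"]}

senses = {                                   # hypothetical sense inventory for "bank"
    "bank#finance": ["money", "institution", "finance"],
    "bank#river":   ["river", "shore"],
}
sense_vec = {s: np.mean([vec[w] for w in ws], axis=0) for s, ws in senses.items()}

def cosine(a, b):
    return float(a @ b / (np.linalg.norm(a) * np.linalg.norm(b)))

target = vec["deposit"]
per_sense = {s: round(cosine(v, target), 3) for s, v in sense_vec.items()}
print(per_sense)
print("word-level similarity (max over senses):", max(per_sense.values()))
```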
In unsupervised WSD, context-aware semantic similarity (CASS) strategies embed both the original context and contexts with candidate synonyms substituted for the target, and select the sense inducing the minimal perturbation of contextual meaning. On the CoarseWSD-20 benchmark, BERT-based CASS achieves 77.7% accuracy, outperforming both random and most-frequent-sense baselines (Martinez-Gil, 2023).
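The sketch below illustrates the substitution-and-compare idea behind CASS using Sentence-Transformers as a stand-in encoder; the model name, candidate synonym lists, and the use of sentence-level rather than token-level embeddings are all assumptions.

```python
# Context-aware WSD by substitution: replace the target with each sense's
# synonyms and keep the sense whose substitution perturbs the context least.
from sentence_transformers import SentenceTransformer, util

model = SentenceTransformer("all-MiniLM-L6-v2")

sentence = "She deposited the money in the bank."
target = "bank"
candidates = {                                   # hypothetical sense inventory
    "bank#finance": ["depository", "financial institution"],
    "bank#river":   ["riverside", "shore"],
}

orig = model.encode(sentence, convert_to_tensor=True)
scores = {}
for sense, synonyms in candidates.items():
    variants = [sentence.replace(target, syn) for syn in synonyms]
    embs = model.encode(variants, convert_to_tensor=True)
    scores[sense] = util.cos_sim(orig, embs).mean().item()   # higher = smaller perturbation

best = max(scores, key=scores.get)
print(best, scores)
```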
7. Web-Scale and Information-Theoretic Measures
The Normalized Web Distance (NWD) computes similarity from the global co-occurrence statistics available via search engine hit counts, approximating Kolmogorov complexity–derived information distances:
$$\mathrm{NWD}(x, y) = \frac{\max\{\log f(x), \log f(y)\} - \log f(x, y)}{\log N - \min\{\log f(x), \log f(y)\}},$$
where $f(x)$, $f(y)$ are the hit counts for each term, $f(x, y)$ the count for their conjunction, and $N$ the index size. NWD empirically clusters semantic categories, reconstructs translation pairs, and achieves high agreement with WordNet categories (mean 0.87). Limitations include non-metricity, lack of antonymy discrimination, and dependence on search engine noise (0905.4039).
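A small sketch of the NWD computation from raw counts; the hit counts and index size below are invented placeholders, not real search-engine statistics.

```python
# Normalized Web Distance from hit counts f(x), f(y), f(x, y) and index size N.
from math import log

def nwd(f_x, f_y, f_xy, N):
    num = max(log(f_x), log(f_y)) - log(f_xy)
    den = log(N) - min(log(f_x), log(f_y))
    return num / den

# Illustrative counts only.
print(nwd(f_x=9_000_000, f_y=8_500_000, f_xy=6_000_000, N=50_000_000_000))
```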
Word Similarity Matching encompasses a rich spectrum of methods, from taxonomy-driven and distributional approaches to optimal transport, classifier confusion, and web-scale information distances. Each methodology offers distinct advantages depending on the linguistic phenomena, data scale, and application context. Current research continues to address challenges in context-sensitivity, multi-prototype representation, structured and syntactic integration, and the search for evaluation protocols that reliably predict human semantic intuitions across linguistic and cultural contexts.