Word-based Metrics Overview
- Word-based Metrics (WBMs) are quantitative measures that assess semantic similarity, association, and bias between words using methods like co-occurrence, embeddings, and optimal transport.
- They incorporate various techniques such as Normalized Web Distance, PMI-based bias scores, Word Mover’s Distance, and ranking objectives to evaluate word relationships.
- WBMs enhance applications in bias diagnosis, document similarity, and natural language understanding while addressing frequency effects and computational efficiency.
Word-based Metrics (WBMs) are quantitative measures that assess properties such as similarity, semantic association, prediction accuracy, or bias between words, word sequences, or phrase sets. WBMs span a variety of formal and empirical approaches: some rely on corpus-level co-occurrence statistics, some use vector representations from word embeddings, others employ optimal transport or geometric formulations, and some quantify group-based biases via direct comparison of association scores. This article reviews fundamental principles, recent developments, and critical limitations in WBMs with reference to key methods, including Normalized Web Distance, PMI-based bias metrics, embedding-driven similarity, and ranking-based retrieval objectives.
1. Foundations and Diversity of Word-based Metrics
The foundational principle of WBMs is the reduction of lexical, semantic, or associative properties between words or phrases to real-valued scores or distances suitable for downstream applications, statistical analysis, or algorithmic decision-making. The principal categories include:
- Co-occurrence/statistical metrics: Employ hit counts or probability estimates, e.g., Pointwise Mutual Information (PMI), normalized web distance (NWD), or meaning-bound scores (0905.4039, Aerts, 2010).
- Embedding-based metrics: Operate in high-dimensional continuous spaces, using cosine similarity, Euclidean distance, or optimal transport between word or sentence embeddings (Kilickaya et al., 2016, Sato et al., 2021, Valentini et al., 2022).
- String-based and ranking objectives: Use edit distance or retrieval-specific objectives for direct optimization in tasks like word spotting (Riba et al., 2021).
- Bias quantification metrics: Measure the association between target word groups (such as gendered contexts) and attribute sets using either co-occurrence-based (PMI/log odds ratio) or embedding-based approaches (Valentini et al., 2021, Schröder et al., 2021, Valentini et al., 2023).
The formalization of WBMs varies widely, but the central goal is always the same: to measure semantic, lexical, or statistical relationships in a way that is consistent across varied domains and corpora.
2. Key Methodological Approaches
Co-occurrence and Statistical Methods
- Normalized Web Distance (NWD): For terms $x$ and $y$ and corpus size $N$ (e.g., the number of indexed pages), with search engine page counts $f(x)$, $f(y)$, and joint count $f(x,y)$:

$$\mathrm{NWD}(x,y) = \frac{\max\{\log f(x), \log f(y)\} - \log f(x,y)}{\log N - \min\{\log f(x), \log f(y)\}}$$

NWD interprets joint web frequencies as indicators of semantic proximity, enabling scalable similarity calculation via aggregate search statistics (0905.4039); a short sketch follows this list.
- Meaning Bound: An association score $\mu(A,B)$ defined for a pair of words $A$ and $B$ from their individual and joint web page counts; larger $\mu(A,B)$ indicates stronger mutual association on the web (Aerts, 2010).
- PMI-based Bias Metric: For a word $w$ and two group contexts $A$ and $B$ (e.g., female- and male-marked contexts):

$$\mathrm{Bias}(w) = \mathrm{PMI}(w,A) - \mathrm{PMI}(w,B) = \log \frac{P(w \mid A)}{P(w \mid B)}$$

This captures a direct comparison of the conditional probabilities of $w$ under the two group contexts (Valentini et al., 2021); a sketch with confidence intervals follows this list.
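The NWD formula translates directly into a few lines of code. Below is a minimal sketch, assuming page counts have already been retrieved from a search engine; the function name and arguments are illustrative, not from the cited work.

```python
import math

def nwd(fx, fy, fxy, N):
    """Normalized Web Distance between two terms.

    fx, fy: page counts for terms x and y; fxy: joint page count;
    N: corpus size (e.g., total number of indexed pages).
    """
    lx, ly, lxy = math.log(fx), math.log(fy), math.log(fxy)
    return (max(lx, ly) - lxy) / (math.log(N) - min(lx, ly))

# Example: terms that frequently co-occur get a small distance.
print(nwd(fx=1_000_000, fy=800_000, fxy=500_000, N=50_000_000_000))
```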
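For the PMI-based bias metric, the log-odds-ratio view makes confidence intervals straightforward. The sketch below uses the standard normal-theory error of a log odds ratio over a 2x2 co-occurrence table; the table layout and function name are assumptions for illustration, not the exact estimator of the cited papers.

```python
import math

def pmi_bias_with_ci(c_wA, c_wB, n_A, n_B, z=1.96):
    """PMI-based bias of word w between group contexts A and B,
    approximated as a log odds ratio with a normal-theory CI.

    c_wA, c_wB: co-occurrence counts of w with contexts A and B;
    n_A, n_B: total word counts of contexts A and B (assumed layout).
    """
    # 2x2 table: (w in A, w in B, not-w in A, not-w in B)
    a, b = c_wA, c_wB
    c, d = n_A - c_wA, n_B - c_wB
    lor = math.log((a * d) / (b * c))        # log odds ratio
    se = math.sqrt(1/a + 1/b + 1/c + 1/d)    # its standard error
    return lor, (lor - z * se, lor + z * se)

bias, (low, high) = pmi_bias_with_ci(c_wA=120, c_wB=40, n_A=10_000, n_B=9_000)
```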
Embedding-based Methods, Optimal Transport, and Ranking
- Word Mover's Distance (WMD): The minimum cumulative transport cost to move the words of document $d$ onto those of document $d'$ using distances between their embeddings. Given normalized bag-of-words vectors $d, d'$ and cost matrix $C$ (e.g., $C_{ij} = \lVert x_i - x'_j \rVert_2$ for embeddings $x_i, x'_j$):

$$\mathrm{WMD}(d,d') = \min_{T \geq 0} \sum_{i,j} T_{ij} C_{ij}$$

subject to the marginal constraints $\sum_j T_{ij} = d_i$ and $\sum_i T_{ij} = d'_j$ on $T$ (Kilickaya et al., 2016, Sato et al., 2021); an LP-based sketch follows this list.
- Tempered WMD (TWMD): Adds entropic regularization for computational efficiency and statistical smoothness:

$$\mathrm{TWMD}(d,d') = \min_{T \geq 0} \sum_{i,j} T_{ij} C_{ij} + \gamma \sum_{i,j} T_{ij} \log T_{ij}$$

under the same marginal constraints, computed efficiently using Sinkhorn iterations (Chen et al., 2020); see the Sinkhorn sketch below.
- WEmbSim: Applies mean-of-word-embeddings (MOWE) and cosine similarity for unsupervised evaluation. For a candidate caption $c$ and reference $r$:

$$\mathrm{WEmbSim}(c,r) = \cos(\bar{e}_c, \bar{e}_r)$$

with $\bar{e}_x$ the average embedding vector of the words in $x$ (Sharif et al., 2020); see the MOWE sketch below.
- Ranking-based Retrieval Metrics: Directly optimize smooth approximations of Average Precision (AP) and nDCG over word/image embeddings using differentiable loss functions (Riba et al., 2021); see the smooth-AP sketch below.
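WMD is a small linear program, so it can be solved exactly with off-the-shelf LP solvers. A minimal SciPy-based sketch, assuming word embeddings and normalized bag-of-words weights are given (dedicated optimal transport libraries are faster in practice):

```python
import numpy as np
from scipy.optimize import linprog
from scipy.spatial.distance import cdist

def wmd(X, w, Xp, wp):
    """Exact Word Mover's Distance via linear programming.

    X, Xp: (n, dim), (m, dim) word embedding matrices of the two documents;
    w, wp: normalized bag-of-words weights (each sums to 1).
    """
    C = cdist(X, Xp)                      # C[i, j] = ||x_i - x'_j||_2
    n, m = C.shape
    A_eq, b_eq = [], []
    for i in range(n):                    # row marginals: sum_j T[i, j] = w[i]
        row = np.zeros(n * m)
        row[i * m:(i + 1) * m] = 1.0
        A_eq.append(row); b_eq.append(w[i])
    for j in range(m):                    # column marginals: sum_i T[i, j] = wp[j]
        col = np.zeros(n * m)
        col[j::m] = 1.0
        A_eq.append(col); b_eq.append(wp[j])
    res = linprog(C.ravel(), A_eq=np.array(A_eq), b_eq=np.array(b_eq),
                  bounds=(0, None), method="highs")
    return res.fun
```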
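Entropic regularization replaces the LP with fast matrix scaling. A Sinkhorn-iteration sketch of the tempered objective (the regularization weight gamma and iteration count are illustrative choices, not values from Chen et al., 2020):

```python
import numpy as np

def twmd_sinkhorn(C, w, wp, gamma=0.1, n_iter=200):
    """Entropy-regularized transport cost via Sinkhorn iterations.

    C: (n, m) cost matrix; w, wp: source/target marginals (each sums to 1);
    gamma: entropic regularization strength (the 'temperature').
    """
    K = np.exp(-C / gamma)                 # Gibbs kernel
    u = np.ones_like(w)
    for _ in range(n_iter):                # alternate marginal projections
        v = wp / (K.T @ u)
        u = w / (K @ v)
    T = u[:, None] * K * v[None, :]        # regularized transport plan
    return float(np.sum(T * C))
```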
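WEmbSim itself is only an average and a cosine; a sketch under the assumption that both sentences are given as arrays of word vectors:

```python
import numpy as np

def wembsim(cand_vecs, ref_vecs):
    """Cosine similarity between mean-of-word-embeddings (MOWE) vectors."""
    c = np.asarray(cand_vecs).mean(axis=0)   # MOWE of candidate
    r = np.asarray(ref_vecs).mean(axis=0)    # MOWE of reference
    return float(c @ r / (np.linalg.norm(c) * np.linalg.norm(r)))
```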
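Ranking objectives become trainable once the non-differentiable rank indicator is smoothed. The sketch below approximates AP with a sigmoid relaxation in the style of smooth-AP losses; it is a generic illustration, not the exact objective of Riba et al. (2021).

```python
import torch

def smooth_ap_loss(scores, labels, tau=0.01):
    """Differentiable AP approximation: the indicator [s_j > s_i] is
    replaced by sigmoid((s_j - s_i) / tau).

    scores: (N,) similarity scores; labels: (N,) 1.0 relevant, 0.0 not.
    """
    diff = scores.unsqueeze(0) - scores.unsqueeze(1)   # diff[i, j] = s_j - s_i
    sig = torch.sigmoid(diff / tau)
    pos = labels.bool()
    # Smoothed rank of each item among all items / among positives only;
    # subtracting the diagonal removes the self-comparison term (0.5).
    rank_all = 1.0 + sig.sum(dim=1) - sig.diagonal()
    rank_pos = 1.0 + sig[:, pos].sum(dim=1) - sig.diagonal()
    ap = (rank_pos[pos] / rank_all[pos]).mean()
    return 1.0 - ap            # minimize loss = maximize smoothed AP
```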
3. Bias Quantification and Interpretation
Word-based metrics serve a central role in bias measurement and fairness diagnosis in NLP systems:
- Cosine-based Embedding Bias (WEAT/MAC/Direct Bias): Use differences in mean cosine similarity between target and protected groups, often suffering from non-comparable extremes and context sensitivity (Schröder et al., 2021); a WEAT-style sketch follows this list.
- SAME Metric: Normalizes attribute group means $\bar{a}_A$ (the mean embedding of attribute set $A$) so that scores are unbiased and magnitude-comparable, which guarantees trustworthiness and comparability (Schröder et al., 2021).
- PMI-based Metrics: Offer improved interpretability, directly reflecting first-order context while supporting confidence intervals via log odds ratio approximation (Valentini et al., 2021, Valentini et al., 2023).
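As a concrete instance of the cosine-based family above, the WEAT-style per-word association is just a difference of mean cosine similarities; the function below is a minimal sketch with illustrative names.

```python
import numpy as np

def cosine(u, v):
    return float(u @ v / (np.linalg.norm(u) * np.linalg.norm(v)))

def weat_association(w, A, B):
    """s(w, A, B): mean cosine similarity of word vector w to attribute
    set A minus its mean cosine similarity to attribute set B."""
    return (np.mean([cosine(w, a) for a in A])
            - np.mean([cosine(w, b) for b in B]))
```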
A recurring finding is that embedding-based metrics may inherit undesirable properties from training regimens—most notably, spurious dependence on word frequency, which can dominate bias scores and reverse conclusions if not controlled (Valentini et al., 2022, Valentini et al., 2023).
4. Frequency Effects and Metric Robustness
Statistical and embedding-based WBMs are susceptible to frequency artifacts:
- Frequency-Related Similarity Distortion: High-frequency words in SGNS, GloVe, and FastText tend to exhibit inflated cosine similarity regardless of genuine semantics. Frequency distortion remains after corpus shuffling, underscoring its algorithmic rather than linguistic origin (Valentini et al., 2022, Valentini et al., 2023).
- Effect on Bias: Bias metrics built on embeddings may report stronger bias for more frequent words, even in the absence of contextual evidence. PMI-based metrics remain relatively robust across frequency bins, though at the expense of increased variance for rare words (Valentini et al., 2023).
- Mitigation: Strategies include frequency-balancing context sets, controlled resampling, or employing PMI/log odds ratio measures for higher reliability; a simple frequency-stratified diagnostic is sketched after this list.
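A quick diagnostic for the frequency artifacts above is to stratify words by corpus frequency and compare the metric per bin; roughly constant scores across bins suggest the metric is not frequency-driven. A sketch (the bin count and quantile scheme are arbitrary choices):

```python
import numpy as np

def mean_score_by_frequency_bin(freqs, scores, n_bins=5):
    """Mean metric score per frequency-quantile bin.

    freqs: (N,) corpus frequencies; scores: (N,) metric values (e.g.,
    bias scores); returns {bin_index: mean score}, low to high frequency.
    """
    freqs, scores = np.asarray(freqs), np.asarray(scores)
    edges = np.quantile(freqs, np.linspace(0, 1, n_bins + 1))
    bins = np.clip(np.digitize(freqs, edges[1:-1]), 0, n_bins - 1)
    return {b: float(scores[bins == b].mean())
            for b in range(n_bins) if np.any(bins == b)}
```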
5. Computational Efficiency, Interpretability, and Practical Implications
Efficiency and interpretability are central to metric choice:
- Computational Considerations: WMD and TWMD incur significant computational cost for long texts; batch mean centering and entropic regularization (TWMD) offer efficient alternatives with minimal loss of accuracy (Chen et al., 2020); a batch-centering sketch follows this list.
- Interpretability: PMI/log odds ratio-based approaches supply direct probabilistic meaning and facilitate the estimation of confidence intervals for bias scores, whereas embedding-based metrics often embed second-order relationships and corpus-level frequency information, reducing transparency (Valentini et al., 2021).
- Evaluation in Downstream Tasks: Image captioning, word spotting, and document similarity tasks benefit from tailored metrics (WMD, TWMD, WEmbSim, ranking-based objectives), but metric performance must be validated against human judgment and statistical robustness (Kilickaya et al., 2016, Sharif et al., 2020, Riba et al., 2021).
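Batch mean centering, mentioned above as a cheap efficiency/quality fix, is a one-line transform: subtract the batch-mean embedding before computing similarities. A minimal sketch:

```python
import numpy as np

def batch_center(E):
    """Subtract the batch-mean embedding from every row, so that
    subsequent cosine/transport costs use centered vectors."""
    E = np.asarray(E, dtype=float)
    return E - E.mean(axis=0, keepdims=True)
```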
6. Future Directions and Open Challenges
- Hybrid and Contextual Approaches: Integration of distributional semantics, window-based co-occurrence, and contextual embeddings is recommended for improved robustness and sensitivity.
- Frequency Artifact Correction: Further research into regularization, normalization, and explicit subtraction of frequency components in embedding spaces is needed for unbiased measurements (Valentini et al., 2022).
- Metric Properties and Generalization: Theoretical clarification of desirable metric properties (trustworthiness, magnitude comparability, unbiasedness) is essential for future bias and fairness evaluations (Schröder et al., 2021).
- Multimodal and Cross-linguistic Extensions: Cross-lingual alignment, concept-level metrics, and dynamic context adaptation are promising for wide applicability (0905.4039, Aerts, 2010).
- Scalability and Efficiency: As web-scale resources and deep contextual embeddings become standard, scalable, batch-processed, or approximate variants of optimal transport and semantic association measures become critically important.
Word-based Metrics thus constitute a broad methodological area balancing statistical rigor, computational tractability, interpretability, and sensitivity to linguistic and corpus-level phenomena. As research advances, the careful design, theoretical analysis, and empirical validation of WBMs remain central to progress in natural language understanding, generation, and bias diagnosis.