
Lexical & Embedding-Based Metrics

Updated 13 December 2025
  • Lexical and embedding-based metrics are quantitative measures that assess semantic similarity using surface text overlap and distributed vector representations.
  • Lexical approaches excel when texts share high surface form similarity, while embedding methods capture nuances such as paraphrase and semantic drift.
  • Hybrid methods combine both techniques to improve cross-lingual alignment and task-specific evaluations, enhancing reliability in applications like MT and code review.

Lexical and embedding-based metrics refer to a heterogeneous suite of quantitative measures designed to evaluate the semantic similarity, alignment, or appropriateness of linguistic expressions at the word, phrase, or sentence level. Lexical metrics operate primarily over surface forms (e.g., n-gram overlap), while embedding-based metrics rely on the geometric relationships within distributed vector spaces induced by data-driven or hybrid neural models. These metrics underpin the evaluation and design of nearly all modern NLP systems, including machine translation, lexical substitution, semantic drift detection, code review generation, stylistic scoring, and cross-lingual embedding alignment.

1. Fundamental Classes and Theoretical Distinctions

Lexical and embedding-based metrics are partitioned along two principal axes: the level of linguistic representation (surface form vs. distributed vector), and the aggregation granularity (pairwise/segment-level vs. corpus/system-level).

Lexical Similarity Metrics compute similarity based on explicit text overlap or prespecified lexical resources:

  • N-gram overlap measures (e.g., BLEU, chrF, ROUGE) quantify the proportion of shared surface substrings between candidate and reference texts.
  • Exact Match requires the candidate to reproduce the reference verbatim at the token level (relaxed variants count identical tokens).
  • Thesaurus/graph-based scores utilize semantic networks such as WordNet or PPDB for symbolic path similarity or reachability.

Embedding-Based Metrics exploit the geometric structure of vector spaces (static or contextualized):

  • Cosine similarity between vector representations of words/sentences.
  • Distributional profile divergence, as in semantic change detection, compares embedding neighborhoods across corpora or time periods.
  • Rank-based and cloud-based metrics generalize or supplement cosine similarity by taking into account dimensional salience, ranking, or the full set of context-induced vectors (Dutkiewicz et al., 2017, Karidi et al., 7 Oct 2024).

Aggregation Approaches, notably in MT evaluation, impact metric faithfulness:

  • Corpus-level (ratio of averages): e.g., BLEU_corpus pools n-gram statistics over all outputs and computes precision globally.
  • Segment-level (average of ratios): e.g., m-BLEU, which scores each sentence and then averages, provides stronger correlation with human judgment (Cavalin et al., 3 Jul 2024).

This typology dictates downstream behaviors, with lexical metrics functioning best under high surface similarity, and embedding-based metrics excelling in assessing paraphrase, semantic drift, or code-mixed language.

2. Lexical Metrics: Formalization, Aggregation, and Limitations

Lexical metrics are traditionally parameterized by strict text overlap, with the most prominent examples being BLEU and its variants:

$$\mathrm{BLEU_{corpus}} = \mathrm{BP} \cdot \exp\left(\sum_{n=1}^{N} w_n \log p_n\right)$$

where $p_n$ is the clipped n-gram precision, $w_n$ is typically $1/N$, and $\mathrm{BP}$ is a brevity penalty.

Segment-level aggregation (m-BLEU, m-chrF) provides:

$$\mathrm{m\text{-}BLEU} = \frac{1}{n} \sum_{i=1}^{n} \frac{m_i}{w_i}$$

where $m_i$ is the number of matches in segment $i$ and $w_i$ its n-gram count.

Recent empirical results on MT evaluation (Cavalin et al., 3 Jul 2024) show that segment-level averaging dramatically increases the Pearson correlation with human assessments (e.g., BLEU_corpus $r = 0.425$ vs. m-BLEU $r = 0.776$ for MQM), with analogous findings for chrF.
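
The aggregation difference is easy to reproduce. Below is a minimal sketch contrasting corpus-level BLEU with a segment-averaged (m-BLEU-style) score using NLTK's reference implementation; the sentence pairs and smoothing choice are illustrative assumptions, and the exact m-BLEU weighting in (Cavalin et al., 3 Jul 2024) may differ.

```python
# Corpus-level BLEU (pooled n-gram statistics) vs. segment-level averaging
# (mean of per-sentence scores); data are illustrative toy examples.
from nltk.translate.bleu_score import corpus_bleu, sentence_bleu, SmoothingFunction

refs = [
    [["unnecessary", "call", "to", "super"]],
    [["rename", "this", "variable", "for", "clarity"]],
]
hyps = [
    ["we", "do", "not", "need", "super", "here"],
    ["please", "rename", "this", "variable", "for", "clarity"],
]

smooth = SmoothingFunction().method1  # avoid zero scores on short segments

# Corpus-level: n-gram counts are pooled over all segments before the ratio.
bleu_corpus = corpus_bleu(refs, hyps, smoothing_function=smooth)

# Segment-level: score each sentence, then average the ratios.
m_bleu = sum(
    sentence_bleu(r, h, smoothing_function=smooth) for r, h in zip(refs, hyps)
) / len(hyps)

print(f"corpus-level BLEU: {bleu_corpus:.3f}  segment-averaged BLEU: {m_bleu:.3f}")
```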

Known limitations of strict lexical metrics include their insensitivity to paraphrase ("We don’t need super here" vs. "Unnecessary call to super" yields BLEU ≈ 17.5 but high human score) and their overestimation of semantically unrelated but lexically similar outputs (BLEU ≈ 70.7 for non-equivalent reviews) (Jiang et al., 9 Jan 2025).

3. Embedding-Based Metrics: Cosine, Rank-Based, and Cloud Methods

Cosine similarity is foundational for embedding-based evaluation:

$$\mathrm{CosSim}(u,v) = \frac{u^\top v}{\|u\|\,\|v\|}$$

It underpins system-level similarity estimation for both generated and reference texts, code review assessment, and word similarity tasks (Jiang et al., 9 Jan 2025, Bakarov, 2018).
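
As a point of reference, the operation itself is a one-liner over embedding vectors; the random vectors below merely stand in for trained word or sentence embeddings.

```python
# Minimal cosine-similarity helper over embedding vectors.
import numpy as np

def cos_sim(u: np.ndarray, v: np.ndarray) -> float:
    """CosSim(u, v) = u.v / (||u|| ||v||)."""
    return float(u @ v / (np.linalg.norm(u) * np.linalg.norm(v)))

rng = np.random.default_rng(0)
u, v = rng.normal(size=300), rng.normal(size=300)
print(cos_sim(u, v))  # near 0 for random vectors; close to 1 for near-synonyms
```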

Rank-based similarity, such as RESM (Dutkiewicz et al., 2017), gives exponential or top-k weight to the most salient vector dimensions:

$$\mathrm{RESM}(w_i, w_j) = \mathrm{RESM}^d(w_i, w_j) + \mathrm{RESM}^a(w_i, w_j)$$

where $\mathrm{RESM}^d$ and $\mathrm{RESM}^a$ sum dimension-wise similarities with scores decaying exponentially by rank, ensuring maximal sensitivity to principal semantic axes. This approach substantially outperforms vanilla cosine on fine-grained synonym detection and robust clustering (e.g., TOEFL accuracy up to 97.5%).

Cloud-based cross-lingual metrics, such as SNC-CLOUD (Karidi et al., 7 Oct 2024), use distributions of contextualized embeddings ("clouds") to assess alignment via setwise distances (e.g., minimal inter-cloud cosine distance), successfully capturing polysemy and sense-distributional phenomena.

Divergence-based semantic change metrics (e.g., EmbLexChange (Asgari et al., 2020)) define a profile for a word relative to a fixed set of pivots, compute time/comparison-point softmax-normalized similarities, and use the Kullback–Leibler divergence to quantify semantic drift.
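
A minimal sketch of this divergence-based scheme, in the spirit of EmbLexChange, is shown below; the pivot set, temperature, and lack of symmetrization are assumptions, and the random vectors stand in for period-specific embeddings.

```python
# A word's profile is its softmax-normalized similarity to a fixed pivot set
# in each period; drift is the KL divergence between the two profiles.
import numpy as np

def profile(word_vec, pivot_vecs, temperature=1.0):
    sims = pivot_vecs @ word_vec / (
        np.linalg.norm(pivot_vecs, axis=1) * np.linalg.norm(word_vec) + 1e-12
    )
    z = np.exp(sims / temperature)
    return z / z.sum()

def kl(p, q, eps=1e-12):
    return float(np.sum(p * np.log((p + eps) / (q + eps))))

rng = np.random.default_rng(1)
pivots_t1 = rng.normal(size=(50, 100))   # pivot embeddings, period 1
pivots_t2 = rng.normal(size=(50, 100))   # pivot embeddings, period 2
w_t1, w_t2 = rng.normal(size=100), rng.normal(size=100)

drift = kl(profile(w_t1, pivots_t1), profile(w_t2, pivots_t2))
print(f"semantic drift (KL): {drift:.4f}")
```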

4. Hybridization with Lexical Knowledge: Retrofitting, Sprinkling, and Lexicon-Aware Metrics

Hybrid metrics and embeddings combine distributional (corpus-based) data with structured lexical resources:

  • Retrofitting optimizes vectors $q_i$ to remain close both to their pre-trained values $\hat{q}_i$ and to their lexicon neighbors, with weights reflecting semantic relatedness (e.g., based on WordNet path similarity or Jiang–Conrath) (Srinivasan et al., 2019, Dutkiewicz et al., 2017): $q_i \leftarrow \frac{\alpha_i \hat{q}_i + \sum_{j} \beta_{ij} q_j}{\alpha_i + \sum_j \beta_{ij}}$ (see the retrofitting sketch after this list).
  • Sprinkling augments the co-occurrence matrix with multi-hop lexical relations before SVD, enforcing that words connected in a lexical graph are closer in the embedding space (Srinivasan et al., 2019).
  • Hybrid similarity metrics further combine cosine similarity and graph-derived measures: $\mathrm{Sim}_{\mathrm{hybrid}}(u, v) = \lambda \cos(E_u, E_v) + (1-\lambda)\,\mathrm{LexSim}(u, v)$. Hybrid methods yield statistically significant gains on both intrinsic (Spearman $\rho$ on SimLex-999) and extrinsic (POS, NER) benchmarks (Srinivasan et al., 2019).
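
A compact sketch of the retrofitting update above follows; uniform $\alpha$/$\beta$ weights, a fixed number of iterations, and the toy lexicon graph are assumptions, whereas the cited work derives the weights from lexical relatedness.

```python
# Iterative retrofitting: each vector is pulled toward its pre-trained value
# and the current vectors of its lexicon neighbours.
import numpy as np

def retrofit(q_hat: np.ndarray, neighbours: dict, alpha=1.0, beta=1.0, iters=10):
    """q_hat: (V, d) pre-trained vectors; neighbours: {i: [j, ...]} lexicon graph."""
    q = q_hat.copy()
    for _ in range(iters):
        for i, nbrs in neighbours.items():
            if not nbrs:
                continue
            num = alpha * q_hat[i] + beta * q[nbrs].sum(axis=0)
            q[i] = num / (alpha + beta * len(nbrs))
    return q

rng = np.random.default_rng(2)
vectors = rng.normal(size=(5, 50))
graph = {0: [1, 2], 1: [0], 2: [0], 3: [4], 4: [3]}  # toy synonym graph
retro = retrofit(vectors, graph)
print(np.linalg.norm(retro - vectors, axis=1))  # how far each vector moved
```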

5. Specialized Embedding-Based Metrics: Lexical Substitution, Stylistic Scoring, and Code Review

Lexical Substitution Evaluation, as in LexSubCon (Michalopoulos et al., 2021), employs:

  • Mix-up proposal probability: blending contextual input with synonym averages.
  • Gloss-sentence cosine: comparing contextualized embeddings of candidate and gold glosses.
  • Sentence similarity: employing fine-tuned encoders to penalize semantic drift upon substitution.
  • Candidate validation: tracking token-wise representation change.

Combined linearly, these metrics provide robust candidate rankings, yielding improvements on LS07 and CoInCo datasets, with performance ablations confirming complementary contributions from each signal.

Stylistic scoring in embedding space (Lyu et al., 2023) constructs "style vectors" (e.g., for complexity, formality, figurativeness) from seed paraphrase differences:

$$v_{\text{style}} = \frac{1}{N} \sum_{i=1}^{N} \left(E(y_i) - E(x_i)\right)$$

Cosine similarity to $v_{\text{style}}$ yields a continuous style metric applicable at the token, phrase, or sentence level. Correction for embedding anisotropy (ABTT, standardization) is critical for context-sensitive applications, especially for LMs.
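
A minimal sketch of the style-vector construction and scoring is given below; the `encode` placeholder stands in for an actual sentence encoder, the single seed pair is illustrative, and no anisotropy correction is applied.

```python
# Style direction = mean embedding difference over seed paraphrase pairs
# (plain -> styled); new text is scored by cosine similarity to that direction.
import numpy as np

def encode(text: str) -> np.ndarray:
    # Placeholder: hash-seeded random vector standing in for a real encoder.
    rng = np.random.default_rng(abs(hash(text)) % (2**32))
    return rng.normal(size=256)

def style_vector(pairs):
    """pairs: list of (plain, styled) seed paraphrases."""
    diffs = [encode(y) - encode(x) for x, y in pairs]
    return np.mean(diffs, axis=0)

def style_score(text, v_style):
    e = encode(text)
    return float(e @ v_style / (np.linalg.norm(e) * np.linalg.norm(v_style)))

seed = [("the dog is big", "the canine is of considerable size")]
v_formal = style_vector(seed)
print(style_score("kindly submit the requisite documents", v_formal))
```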

Automated code review evaluation demonstrates that embedding-based metrics decisively improve correlation with human grades compared to lexical (BLEU/METEOR/ROUGE) scores (Spearman $\rho = 0.38$ versus BLEU's $\rho = 0.22$). Prompt-based LLM assessment, with systematized criteria and majority voting, pushes correlation further (up to $\rho = 0.49$), establishing a new level of human-likeness in automated text judgment (Jiang et al., 9 Jan 2025).

6. Alignment and Cross-Lingual Metrics: BLI, SNC, and Local Alignment

Bilingual Lexicon Induction (BLI) is the principal metric for alignment between embedding spaces (Wickramasinghe et al., 17 Nov 2025):

  • Precision@k: proportion of queries for which the correct translation appears among the top $k$ candidates.
  • Mean Reciprocal Rank (MRR): average inverse rank position of the correct translation.

Stem-based BLI addresses surface mismatch in highly inflected languages, crediting matches on stems rather than strict surface forms:

$$\mathrm{score}^{\mathrm{stem}}_{@k} = \frac{1}{N} \sum_{i=1}^{N} \mathbf{1}\left[\, t_i \in R_k(s_i) \;\lor\; \mathrm{Stem}(t_i) \in \{\mathrm{Stem}(r) : r \in R_k(s_i)\} \,\right]$$

where $s_i$ is a source query, $t_i$ its gold translation, and $R_k(s_i)$ the top-$k$ retrieved candidates.
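
The sketch below computes Precision@k and MRR from the list above together with the stem-credited score; the toy dictionary, the crude prefix stemmer, and the assumption that candidates are already pruned to the target-language vocabulary are illustrative.

```python
# BLI scoring: Precision@k, MRR, and stem-credited Precision@k over ranked
# retrieval lists (one ranked candidate list per source query).
def stem(w: str) -> str:
    return w[:4]  # crude placeholder stemmer (assumption)

def bli_scores(gold: dict, retrieved: dict, k: int = 1):
    p_at_k = mrr = p_at_k_stem = 0.0
    for src, tgt in gold.items():
        ranked = retrieved.get(src, [])  # assumed pruned to target vocabulary
        topk = ranked[:k]
        if tgt in topk:
            p_at_k += 1
        if tgt in topk or stem(tgt) in {stem(r) for r in topk}:
            p_at_k_stem += 1
        if tgt in ranked:
            mrr += 1.0 / (ranked.index(tgt) + 1)
    n = len(gold)
    return p_at_k / n, mrr / n, p_at_k_stem / n

gold = {"house": "haus", "dogs": "hunde"}
retrieved = {"house": ["haus", "heim"], "dogs": ["hund", "hunde"]}
print(bli_scores(gold, retrieved, k=1))  # stem credit rescues "hund" vs "hunde"
```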

Vocabulary pruning restricts candidate retrieval to the appropriate script/language, dramatically increasing BLI scores on multilingual embeddings (e.g., LaBSE English→Sinhala P@1 increases from 4.5% to 46.6%).

Semantic Neighborhood Comparison (SNC) for local cross-lingual alignment (Karidi et al., 7 Oct 2024):

  • SNC-STATIC: rank-based Pearson correlation of neighborhood distances, per word (a sketch follows this list).
  • SNC-AVE and SNC-CLOUD: employing contextualized embedding average and cloud representations to capture fine-grained, sense-specific, domain-dependent alignment.
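
One plausible reading of SNC-STATIC is sketched below under stated assumptions: Euclidean neighborhood distances, a seed dictionary mapping source neighbors to their translations, and Spearman as the rank-based correlation; the exact procedure in (Karidi et al., 7 Oct 2024) may differ.

```python
# Per-word local alignment: correlate the distances from a word to its source
# neighbours with the distances between the corresponding translations.
import numpy as np
from scipy.stats import spearmanr

def snc_static(word, src_vecs, tgt_vecs, seed_dict, k=10):
    """src_vecs/tgt_vecs: {word: vector}; seed_dict: {src_word: tgt_word}."""
    w = src_vecs[word]
    cands = [x for x in src_vecs if x != word and x in seed_dict]
    cands.sort(key=lambda x: np.linalg.norm(src_vecs[x] - w))
    nbrs = cands[:k]
    d_src = [np.linalg.norm(src_vecs[x] - w) for x in nbrs]
    t_w = tgt_vecs[seed_dict[word]]
    d_tgt = [np.linalg.norm(tgt_vecs[seed_dict[x]] - t_w) for x in nbrs]
    rho, _ = spearmanr(d_src, d_tgt)
    return rho

rng = np.random.default_rng(4)
src = {w: rng.normal(size=50) for w in ["cat", "dog", "car", "bus", "tree"]}
tgt = {w + "_t": v + rng.normal(scale=0.1, size=50) for w, v in src.items()}
seed = {w: w + "_t" for w in src}
print(snc_static("cat", src, tgt, seed, k=3))  # high for well-aligned spaces
```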

Empirical studies demonstrate that SNC-CLOUD, stem-based BLI, and vocabulary pruning significantly improve reliability and robustness of cross-lingual evaluation, particularly for low-resource or highly inflected language pairs.

7. Evaluation Methodologies, Statistical Robustness, and Best Practices

Intrinsic evaluation metrics (correlation with human similarity ratings, analogy accuracy, synonym detection, outlier detection, semantic drift, and clustering purity) are complemented by extrinsic task benchmarks (NER, POS, classification, MT) (Bakarov, 2018).
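
For intrinsic evaluation against human similarity ratings, the standard recipe is a Spearman correlation between model similarities and human scores; the word pairs, ratings, and random embeddings below are illustrative stand-ins for a benchmark such as SimLex-999.

```python
# Intrinsic evaluation: rank correlation between cosine similarities and
# human similarity ratings over word pairs.
import numpy as np
from scipy.stats import spearmanr

def cos(u, v):
    return float(u @ v / (np.linalg.norm(u) * np.linalg.norm(v)))

# Illustrative word pairs with made-up human similarity ratings (0-10 scale).
pairs = [
    ("old", "new", 1.6),
    ("smart", "intelligent", 9.2),
    ("hard", "easy", 0.9),
    ("happy", "cheerful", 9.5),
]

rng = np.random.default_rng(3)
emb = {w: rng.normal(size=100) for p in pairs for w in p[:2]}  # toy embeddings

model_scores = [cos(emb[a], emb[b]) for a, b, _ in pairs]
human_scores = [h for _, _, h in pairs]
rho, _ = spearmanr(model_scores, human_scores)
print(f"Spearman rho vs. human ratings: {rho:.2f}")
```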

Recommended best practices, following directly from the findings above:

  • Prefer segment-level (average-of-ratios) aggregation over corpus-level scoring when correlation with human judgment matters (Cavalin et al., 3 Jul 2024).
  • Complement lexical metrics with embedding-based or LLM-based scoring for paraphrase-heavy outputs such as code review comments (Jiang et al., 9 Jan 2025).
  • Correct for embedding anisotropy (e.g., ABTT, standardization) before cosine-based scoring of contextualized representations (Lyu et al., 2023).
  • Use stem-based BLI and vocabulary pruning for highly inflected or low-resource language pairs (Wickramasinghe et al., 17 Nov 2025).
  • Report both intrinsic and extrinsic results, with significance testing across benchmarks (Bakarov, 2018, Srinivasan et al., 2019).

These approaches ensure reproducibility, strong alignment to human intuition, and statistical robustness across languages, domains, and downstream applications.
