
Average Normalized Levenshtein Distance (ANLD)

Updated 2 December 2025
  • Average Normalized Levenshtein Distance (ANLD) is a metric that quantifies character-level edit differences by normalizing the edit distance relative to word length.
  • It normalizes distances to allow direct comparison across tokens and languages, effectively detecting over-aggressive modifications in NLP normalization pipelines.
  • ANLD serves as an objective tool for automated language phylogeny, providing a reproducible scalar measure as an alternative to subjective expert judgments.

Average Normalized Levenshtein Distance (ANLD) is a quantitative metric that measures the average normalized character-level edit distance between pairs of words and plays a central role in both NLP evaluation pipelines and automated comparative historical linguistics. ANLD enables the detection of over-aggressive modifications during normalization procedures in NLP, and serves as an objective, automated substitute for expert judgment in computational phylogenetics. By normalizing for word length, ANLD ensures comparability across tokens and across different languages or normalization regimes, offering a bounded, interpretable scalar that reflects the degree of surface form similarity.

1. Formal Definition and Mathematical Formulation

The Levenshtein distance $d_\mathrm{LD}(s, t)$ between two strings $s$ and $t$ is the minimum number of single-character operations (insertions, deletions, or substitutions) required to transform $s$ into $t$. For words $w_1$ and $w_2$ over an alphabet, the normalized Levenshtein distance is

$$d_\mathrm{norm}(w_1, w_2) = \frac{d_\mathrm{LD}(w_1, w_2)}{L(w_1, w_2)}$$

where $L(w_1, w_2)$ is a normalization term, commonly either $|w_1|$, $|w_2|$, or $\max(|w_1|, |w_2|)$. State-of-the-art practice in computational linguistics employs normalization by either the original word length or the maximum of both word lengths depending on task convention (Kafi et al., 25 Nov 2025, 0912.0884, Serva, 2011).

Given a dataset of $N$ word pairs $(w_i, \hat{w}_i)$, the Average Normalized Levenshtein Distance is

$$\text{ANLD} = \frac{1}{N}\sum_{i=1}^{N} \frac{d_\mathrm{LD}(w_i, \hat{w}_i)}{\ell(w_i)}$$

where $\ell(w_i)$ denotes the normalization term. In historical linguistics, when comparing languages $a$ and $b$ across a list of $M$ aligned meanings, this generalizes to

$$D(a, b) = \frac{1}{M}\sum_{i=1}^{M}\frac{d_\mathrm{LD}(a_i, b_i)}{\max(|a_i|, |b_i|)}$$

with $a_i, b_i$ representing words expressing meaning $i$ in languages $a$ and $b$, respectively (Serva, 2011, 0912.0884).
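The definitions above can be sketched directly in Python. This is an illustrative reference implementation, not code from the cited papers; the function names `levenshtein` and `normalized_levenshtein` are our own, and production pipelines would typically substitute an optimized library.

```python
def levenshtein(s: str, t: str) -> int:
    """Minimum number of insertions, deletions, and substitutions turning s into t,
    computed with the standard dynamic-programming recurrence (unit edit costs)."""
    prev = list(range(len(t) + 1))  # distances from "" to each prefix of t
    for i, cs in enumerate(s, start=1):
        curr = [i]  # distance from s[:i] to the empty string
        for j, ct in enumerate(t, start=1):
            curr.append(min(
                prev[j] + 1,               # deletion from s
                curr[j - 1] + 1,           # insertion into s
                prev[j - 1] + (cs != ct),  # substitution (free if characters match)
            ))
        prev = curr
    return prev[-1]

def normalized_levenshtein(w1: str, w2: str) -> float:
    """d_norm with the max-length denominator (historical-linguistics convention)."""
    if not w1 and not w2:
        return 0.0
    return levenshtein(w1, w2) / max(len(w1), len(w2))
```

The max-length denominator guarantees the result stays in $[0, 1]$, since at most $\max(|w_1|, |w_2|)$ operations are ever needed.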

2. Procedures for Computation

The computation of ANLD in evaluation pipelines proceeds in well-defined steps (Kafi et al., 25 Nov 2025, 0912.0821, 0911.3292):

  1. Pair Extraction: Obtain aligned word pairs, either as original/normalized forms in NLP or corresponding meanings in language comparison.
  2. Preprocessing: Apply Unicode normalization (NFC/NFD), case-folding, and elimination of punctuation-only tokens to ensure reliable character-level matching.
  3. Edit Distance Calculation: For each pair $(w, \hat{w})$, compute $d_\mathrm{LD}(w, \hat{w})$, and normalize by the prescribed denominator (state-of-the-art: $|w|$ for NLP normalization, $\max(|w|, |\hat{w}|)$ for historical linguistics).
  4. Averaging: Accumulate normalized distances and divide by the total number of pairs.

The process is stable with respect to the range of edit costs (typically unit cost per operation) and robust to token length variability due to per-pair normalization.

Pseudocode

```python
def compute_anld(word_pairs):
    """Mean of per-pair normalized edit distances; assumes a levenshtein(s, t) helper."""
    if not word_pairs:
        return 0.0
    total = 0.0
    for w, w_hat in word_pairs:
        d = levenshtein(w, w_hat)
        total += d / max(1, len(w))  # or max(len(w), len(w_hat)), per task convention
    return total / len(word_pairs)
```

This algorithm is directly implemented in empirical evaluation and phylogenetic pipelines (Kafi et al., 25 Nov 2025, 0912.0821).

3. Interpretive Properties and Theoretical Analysis

  • Boundedness: $0 \leq \text{ANLD} \leq 1$ by construction. Zero values indicate identity, while values near one indicate maximal character-level divergence (i.e., all characters require edit).
  • Sensitivity: The normalization ensures that changes to short words are weighted appropriately; a single substitution in a two-letter word yields $0.5$, whereas in an eight-letter word it yields $0.125$ (Serva, 2011).
  • Symmetry: At the pair level, Levenshtein distance is symmetric; when the denominator is symmetric (e.g., using max-length), $d_\mathrm{norm}$ is symmetric.
  • Metric Properties: The (normalized) Levenshtein distance is a metric (non-negativity, identity of indiscernibles, symmetry, triangle inequality), and averaging preserves the essential interpretive properties (0912.0884).
  • Interpretability: ANLD quantifies the aggregate extent of character-level surface editing, serving as a "micro-level safety gate" to detect aggressive surface-level divergence in normalization or inter-language comparison (Kafi et al., 25 Nov 2025).
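The boundedness and symmetry properties can be verified numerically. The sketch below is illustrative; `lev` and `d_norm` are our own helper names, with `lev` written as a compact memoized recursion rather than an optimized implementation.

```python
from functools import lru_cache

@lru_cache(maxsize=None)
def lev(s: str, t: str) -> int:
    """Levenshtein distance via memoized recursion (unit edit costs)."""
    if not s:
        return len(t)
    if not t:
        return len(s)
    return min(lev(s[1:], t) + 1,                      # delete from s
               lev(s, t[1:]) + 1,                      # insert into s
               lev(s[1:], t[1:]) + (s[0] != t[0]))     # substitute / match

def d_norm(w1: str, w2: str) -> float:
    """Max-length-normalized distance (denominator clamped to avoid 0/0)."""
    return lev(w1, w2) / max(1, len(w1), len(w2))

for a, b in [("playing", "play"), ("ab", "cd"), ("run", "run")]:
    d = d_norm(a, b)
    assert 0.0 <= d <= 1.0       # boundedness holds for every pair
    assert d == d_norm(b, a)     # symmetric denominator => symmetric d_norm
```

Note that with an asymmetric denominator such as $|w_1|$ alone, the last assertion would not hold in general, which is why the denominator choice must be fixed per task.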

4. Roles in NLP Evaluation and Phylogenetics

In modern NLP normalization frameworks, ANLD is deployed alongside task-utility and downstream performance metrics:

| Metric | Measures | Typical Use |
|---|---|---|
| Stemming Effectiveness Score (SES) | Lexical compression and semantic similarity | Utility of normalization |
| Model Performance Delta (MPD) | Downstream model accuracy change | End-task impact |
| ANLD | Edit-based surface distortion | Safety/diagnosis |
  • High SES can arise from over-stemming, which ANLD exposes by revealing high average normalized distances. Safe normalization is characterized by low ANLD; this flagging enables practitioners to distinguish between genuine efficiency gains and destructive alterations (Kafi et al., 25 Nov 2025).
  • In computational historical linguistics, ANLD yields a reproducible language distance matrix for phylogenetic inference, bypassing subjective cognate identification. The single-normalization approach empirically outperforms bi-normalized (ASJP) distances, which suppress phylogenetic signal by dividing out global orthographic similarity (0912.0884, 0912.0821).
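The phylogenetic use case can be sketched as follows: compute $D(a, b)$ over meaning-aligned word lists and collect the results into a distance matrix. The word lists below are a toy illustration, not real Swadesh data, and `language_distance` is our own name for the $D(a, b)$ formula from Section 1.

```python
from functools import lru_cache

@lru_cache(maxsize=None)
def lev(s: str, t: str) -> int:
    """Levenshtein distance via memoized recursion (unit edit costs)."""
    if not s:
        return len(t)
    if not t:
        return len(s)
    return min(lev(s[1:], t) + 1, lev(s, t[1:]) + 1,
               lev(s[1:], t[1:]) + (s[0] != t[0]))

def language_distance(words_a, words_b):
    """D(a, b): mean of max-length-normalized edit distances over aligned meanings."""
    assert len(words_a) == len(words_b), "lists must be meaning-aligned"
    return sum(lev(a, b) / max(len(a), len(b))
               for a, b in zip(words_a, words_b)) / len(words_a)

# Toy three-meaning lists (illustrative only).
english = ["water", "stone", "night"]
german  = ["wasser", "stein", "nacht"]
distance_matrix = {("en", "de"): language_distance(english, german)}
```

Running this over all language pairs in a dataset yields the symmetric distance matrix that downstream tree-building methods consume.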

5. Empirical Examples and Implementation Issues

Concrete NLP example (Kafi et al., 25 Nov 2025):

| Word Pair | $d_\mathrm{LD}$ | $\lvert w \rvert$ | $d_\mathrm{norm}$ |
|---|---|---|---|
| "playing"–"play" | 3 | 7 | 0.43 |
| "houses"–"hous" | 2 | 6 | 0.33 |
| "run"–"run" | 0 | 3 | 0.00 |

Averaging gives $\text{ANLD} \approx (0.43 + 0.33 + 0)/3 \approx 0.25$.
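The worked example can be recomputed directly (note that transforming "houses" into "hous" requires two deletions). This is an illustrative sketch with a self-contained `levenshtein` helper, not code from the cited paper.

```python
def levenshtein(s: str, t: str) -> int:
    """Standard dynamic-programming Levenshtein distance (unit edit costs)."""
    prev = list(range(len(t) + 1))
    for i, cs in enumerate(s, start=1):
        curr = [i]
        for j, ct in enumerate(t, start=1):
            curr.append(min(prev[j] + 1, curr[j - 1] + 1,
                            prev[j - 1] + (cs != ct)))
        prev = curr
    return prev[-1]

pairs = [("playing", "play"), ("houses", "hous"), ("run", "run")]
# NLP convention: normalize by the original word length |w|.
anld = sum(levenshtein(w, w_hat) / len(w) for w, w_hat in pairs) / len(pairs)
print(f"ANLD = {anld:.2f}")
```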

Operational considerations:

  • Use fast, robust edit distance libraries (e.g., C-optimized) to accommodate large datasets.
  • Unicode normalization is critical in cross-linguistic or diacritic-rich data.
  • Choice of normalization denominator must remain consistent throughout the evaluation or comparative paper.
  • When processing languages with multi-character graphemes, treat code points in accordance with language-specific requirements.
  • To prevent division by zero, substitute length with $1$ in the denominator when necessary (Kafi et al., 25 Nov 2025).
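The Unicode-normalization point deserves emphasis: the same accented word can arrive as precomposed or decomposed code points, and without NFC normalization the two forms register spurious edits. A minimal preprocessing sketch (the function name `preprocess` is ours):

```python
import unicodedata

def preprocess(token: str) -> str:
    """Canonically compose (NFC) and case-fold a token before edit-distance comparison."""
    return unicodedata.normalize("NFC", token).casefold()

# 'e' + combining acute (U+0301) and precomposed 'é' (U+00E9) now compare equal,
# so they contribute zero edits instead of a spurious insertion + substitution.
assert preprocess("cafe\u0301") == preprocess("caf\u00e9")
```

Skipping this step inflates ANLD on diacritic-rich data for reasons that have nothing to do with actual surface divergence.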

6. Comparative Approaches and Extensions

ANLD, as per Serva, Petroni, and collaborators, is calculated over meaning-aligned lists (e.g., Swadesh 100 or 200) and utilizes either the original or maximal word length for normalization (Serva, 2011, 0912.0821). Double-normalization strategies, such as the ASJP approach, adjust for overall orthographic affinity:

$$D_s(\alpha, \beta) = \frac{D(\alpha, \beta)}{\Gamma(\alpha,\beta)}$$

where $\Gamma(\alpha,\beta)$ is the cross-meaning average (over pairs with $i\neq j$). However, this adjustment reduces the correlation with phylogenetic signal, making the single-normalization ANLD preferable for vertical classification (0912.0884). Stability-aware subsampling based on ANLD-derived stability scores $S(i)$ can further focus analysis on the most phylogenetically informative items (0912.0821).
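The double-normalization scheme can be sketched as follows, computing $\Gamma$ as the average normalized distance over all cross-meaning (non-aligned) pairs. Function names and toy data are illustrative; this is our reading of the ASJP-style formula, not the reference implementation.

```python
from functools import lru_cache

@lru_cache(maxsize=None)
def lev(s: str, t: str) -> int:
    """Levenshtein distance via memoized recursion (unit edit costs)."""
    if not s:
        return len(t)
    if not t:
        return len(s)
    return min(lev(s[1:], t) + 1, lev(s, t[1:]) + 1,
               lev(s[1:], t[1:]) + (s[0] != t[0]))

def dn(a: str, b: str) -> float:
    """Max-length-normalized pairwise distance."""
    return lev(a, b) / max(len(a), len(b))

def double_normalized_distance(words_a, words_b):
    """D_s = D / Gamma: aligned-pair average divided by the cross-meaning average."""
    M = len(words_a)
    D = sum(dn(a, b) for a, b in zip(words_a, words_b)) / M
    gamma = sum(dn(words_a[i], words_b[j])
                for i in range(M) for j in range(M) if i != j) / (M * (M - 1))
    return D / gamma
```

Dividing by $\Gamma$ discounts global orthographic similarity shared across all meanings, which is precisely the signal the single-normalization ANLD retains.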

7. Applications and Limitations

ANLD is deployed in:

  • Safety validation of normalization in NLP pipelines (preventing destructive over-stemming) (Kafi et al., 25 Nov 2025).
  • Quantifying lexical divergence for automated language phylogeny reconstruction (Serva, 2011, 0912.0884).
  • Measuring the stability of meanings/items across language families for optimal concept list design (0912.0821).

Limitations include sensitivity to orthographic conventions, potential distortion due to erroneous token alignment, and the inability to model semantic shift or borrowing directly. In all applications, ANLD's interpretive strength lies in its transparency, reproducibility, and clear correspondence between character-level surface form differences and the scalar output (Kafi et al., 25 Nov 2025, 0911.3292).


References:

  • "A Task-Oriented Evaluation Framework for Text Normalization in Modern NLP Pipelines" (Kafi et al., 25 Nov 2025)
  • "Measures of lexical distance between languages" (0912.0884)
  • "Phylogeny and geometry of languages from normalized Levenshtein distance" (Serva, 2011)
  • "Lexical evolution rates by automated stability measure" (0912.0821)
  • "Automated words stability and languages phylogeny" (0911.3292)
