Lexical Diversity Metrics Overview

Updated 13 March 2026
  • Lexical diversity metrics are quantitative indices assessing vocabulary richness and complexity, enabling analysis across genres and languages.
  • They encompass classical measures like TTR and extensions such as MTLD, HD-D, and entropy-based indices to mitigate sample size effects.
  • Recent approaches integrate probabilistic sampling, ecological indices, and deep-learning frameworks to offer multidimensional insights into textual diversity.

Lexical diversity metrics are quantitative indices devised to assess the variability, richness, and structural complexity of vocabulary in natural language texts. These indices are fundamental to both linguistic analysis and practical NLP pipeline design, enabling researchers to compare texts, assess language acquisition, evaluate generated or translated content, and characterize corpus heterogeneity across genres, languages, or authorial style. Multiple metrics have emerged, each with distinct formal properties, sensitivity to text length, and interpretability. The choice of metric is non-trivial, particularly given the dual challenge of sample size dependency and the desire for length-invariance, as well as the need to reflect not just token-level novelty but also distributional and contextual vocabulary phenomena.

1. Classical Metrics: Type–Token Ratio and Its Derivatives

The most basic and longstanding metric is the type–token ratio (TTR), defined for a text of $N$ tokens and $V$ distinct types as $TTR = V/N$. While trivial to compute, TTR is strongly confounded by sample size: it decreases as $N$ increases because vocabulary grows sublinearly with text length (Heaps’ law). Variants such as the Root TTR ($RTTR = V/\sqrt{N}$, also known as Guiraud’s $R$) and Maas’s index

$$a^2 = \frac{\log N - \log V}{(\log N)^2}$$

attempt to compensate for this bias through non-linear normalization, but all remain susceptible to significant drift as $N$ grows (Bestgen, 2023, Powell et al., 2022, Luis et al., 21 Nov 2025).
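As a minimal sketch, the three classical indices can be computed from a token list as follows. The function name `ttr_family` and the whitespace tokenization are illustrative assumptions, and Maas's index is reported in its usual squared form $a^2$.

```python
import math

def ttr_family(tokens):
    """Return TTR, Root TTR (Guiraud's R) and Maas's a^2 for a token list.

    Hypothetical helper for illustration; tokenization is left to the caller.
    """
    n = len(tokens)        # N: number of tokens
    v = len(set(tokens))   # V: number of distinct types
    if n < 2:
        raise ValueError("need at least two tokens so that log N is non-zero")
    return {
        "ttr": v / n,
        "rttr": v / math.sqrt(n),
        "maas_a2": (math.log(n) - math.log(v)) / math.log(n) ** 2,
    }

# Toy input: 11 tokens, 8 distinct types.
tokens = "the cat sat on the mat and the dog sat too".split()
scores = ttr_family(tokens)
```

Unlike the other two, Maas's index decreases as diversity increases, so it should be read in the opposite direction.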

To model vocabulary growth as a function of length, Heaps’ law asserts $V(N) \sim C N^\beta$ ($0 < \beta < 1$), while associated indices such as Mass–Markov TTR (MMTTR) use an empirically fitted $\beta$ to correct TTR:

$$MMTTR = \frac{V}{N^\beta}$$

where $\beta$ is estimated from the corpus via linear regression in log–log space (Powell et al., 2022).
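The log–log regression can be sketched with ordinary least squares on points of the vocabulary growth curve. The helper names (`vocab_growth`, `fit_heaps`, `mmttr`) and the choice of sampling the curve every `step` tokens are assumptions made for illustration, not details taken from the cited work.

```python
import math

def vocab_growth(tokens, step=1):
    """Return (N, V(N)) pairs, recorded every `step` tokens."""
    seen, points = set(), []
    for i, tok in enumerate(tokens, start=1):
        seen.add(tok)
        if i % step == 0:
            points.append((i, len(seen)))
    return points

def fit_heaps(points):
    """Least-squares fit of log V = log C + beta * log N; returns (C, beta)."""
    xs = [math.log(n) for n, _ in points]
    ys = [math.log(v) for _, v in points]
    mx, my = sum(xs) / len(xs), sum(ys) / len(ys)
    sxx = sum((x - mx) ** 2 for x in xs)
    sxy = sum((x - mx) * (y - my) for x, y in zip(xs, ys))
    beta = sxy / sxx
    return math.exp(my - beta * mx), beta

def mmttr(tokens, beta):
    """V / N**beta, using a previously fitted Heaps exponent."""
    return len(set(tokens)) / len(tokens) ** beta
```

In practice one would fit $\beta$ on the whole corpus (for example `fit_heaps(vocab_growth(corpus_tokens, step=100))`) and then reuse that single exponent when scoring individual documents.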

Shannon entropy is also frequently applied at the token or type level:

$$H = -\sum_{i=1}^V p_i\,\ln p_i$$

where $p_i$ is the empirical frequency of word type $i$. Entropy $H$ grows logarithmically with vocabulary and is, for massive corpora, analytically related to TTR through Zipfian and Heapsian statistical structure:

$$H(\mathrm{TTR}) = A + B\,\ln(\mathrm{TTR}) + \ln(C + D\,\ln(\mathrm{TTR}))$$

revealing tight coupling of macro–micro lexical diversity (Rosillo-Rodes et al., 2024).
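For reference, the entropy $H$ itself is straightforward to compute from type counts. The sketch below assumes $p_i$ is the relative frequency of each type and uses natural logarithms; the function name is illustrative.

```python
import math
from collections import Counter

def shannon_entropy(tokens):
    """Shannon entropy (in nats) of the empirical type distribution."""
    n = len(tokens)
    return -sum((c / n) * math.log(c / n) for c in Counter(tokens).values())
```

A text with $V$ equally frequent types attains the maximum $H = \ln V$, while a text that repeats a single type has $H = 0$.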

2. Length-Normalized and Local Sampling Metrics

Compensating for the sample size effect has driven the development of explicitly length-stabilized metrics.

2.1. Subsampling Index: HD–D

The HD–D index (hypergeometric diversity), also known as $ATTR_n$, estimates the expected TTR in a random subsample of length $n$:

$$\mathrm{HD\text{-}D}(n) = \frac{1}{n}\sum_{i=1}^V \Bigl[1 - \frac{\binom{N-f_i}{n}}{\binom{N}{n}}\Bigr]$$

where $f_i$ is the frequency of type $i$. This achieves empirical near-invariance to text length, provided $n$ is chosen less than or equal to the length of the shortest document under comparison (Bestgen, 2023).
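The hypergeometric expectation can be evaluated exactly with integer binomial coefficients. In this sketch the function name `hdd` is illustrative and the default `n=42` is an assumption (a value commonly used for HD–D), not something prescribed by the text above.

```python
from collections import Counter
from math import comb

def hdd(tokens, n=42):
    """Expected TTR of a random n-token subsample drawn without replacement."""
    total = len(tokens)
    if not 0 < n <= total:
        raise ValueError("sample size n must satisfy 0 < n <= len(tokens)")
    denom = comb(total, n)
    expected_types = 0.0
    for f in Counter(tokens).values():
        # Probability that a type with frequency f is absent from the sample.
        # comb() returns 0 when total - f < n, i.e. the type cannot be missed.
        p_absent = comb(total - f, n) / denom
        expected_types += 1.0 - p_absent
    return expected_types / n
```

For the four-token text `a a b b` and `n=2`, two of the six possible samples contain one type and four contain two, giving an expected TTR of $5/6$, which is what the function returns.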

2.2. Moving-Average and Segmentation Approaches

Moving-Average TTR (MATTR) applies a fixed-length sliding window (of size $w$) and averages local TTR values:

$$MATTR(w) = \frac{1}{N-w+1} \sum_{j=1}^{N-w+1} \frac{V_j}{w}$$

where $V_j$ is the number of unique types in window $j$. This smooths the local diversity profile and is robust for texts $\gtrsim 100$ tokens long (Bestgen, 2023, Kendro et al., 31 Jul 2025).
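A sliding-window implementation can update the type counts incrementally rather than recounting each window. The sketch below is one possible approach; the name `mattr` and the default `w=50` are assumptions chosen for illustration.

```python
from collections import Counter

def mattr(tokens, w=50):
    """Moving-average TTR over all windows of w consecutive tokens."""
    n = len(tokens)
    if not 0 < w <= n:
        raise ValueError("window size w must satisfy 0 < w <= len(tokens)")
    counts = Counter(tokens[:w])
    type_sum = len(counts)               # V_1, types in the first window
    for j in range(w, n):
        leaving, entering = tokens[j - w], tokens[j]
        counts[leaving] -= 1
        if counts[leaving] == 0:
            del counts[leaving]          # type no longer present in the window
        counts[entering] += 1
        type_sum += len(counts)          # V_j for the shifted window
    return type_sum / (w * (n - w + 1))
```

For `a a b b` with `w=2` the three windows contain 1, 2, and 1 types, so the result is $4/6 = 2/3$.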

Measure of Textual Lexical Diversity (MTLD) divides a text into sequential “factors”, closing each factor once its running TTR drops below a threshold $t$ (commonly $0.72$), then averages segment lengths:

$$\mathrm{MTLD} = \frac{\sum_{i=1}^S L_i + R/(1-t)}{S + 1}$$

where $L_i$ are the lengths of the $S$ complete segments and $R$ is the trailing fragment (Powell et al., 2022, Fu et al., 2021, Luis et al., 21 Nov 2025). MTLD is especially stable for varied text lengths and is the default recommendation in studies of short, scientific, or keyphrase-augmented texts (Powell et al., 2022, Luis et al., 21 Nov 2025).
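The factor-counting procedure can be sketched as below. This sketch follows the widely used partial-factor convention, in which the trailing fragment contributes a fractional factor $(1 - \mathrm{TTR}_R)/(1 - t)$ and the score is the token count divided by the factor count, averaged over a forward and a backward pass. That convention, the strict `<` comparison, and the function names are assumptions here and may differ in detail from the closed form quoted above.

```python
def mtld_forward(tokens, threshold=0.72):
    """One-directional MTLD pass: tokens per (possibly fractional) factor."""
    if not tokens:
        raise ValueError("tokens must be non-empty")
    factors = 0.0
    types, count = set(), 0
    for tok in tokens:
        count += 1
        types.add(tok)
        if len(types) / count < threshold:
            factors += 1.0               # factor complete: reset the segment
            types, count = set(), 0
    if count > 0:
        # Credit the trailing fragment as a fraction of a full factor.
        factors += (1.0 - len(types) / count) / (1.0 - threshold)
    # No repetition at all means no factor was even partly completed.
    return len(tokens) / factors if factors > 0 else float("inf")

def mtld(tokens, threshold=0.72):
    """Average of the forward and backward passes."""
    return (mtld_forward(tokens, threshold)
            + mtld_forward(tokens[::-1], threshold)) / 2
```

A text that repeats one word closes a factor every two tokens, so `mtld(["a"] * 10)` evaluates to 2.0, the lowest score this procedure can produce.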

2.3. Probabilistic and Sequential Sampling

Methods such as Mean Segmental TTR (MSTTR) and Mean TTR in Sequential Samples (MTTRSS) average TTR over non-overlapping or randomly sampled fixed-length chunks. While these reduce local bias, they may retain some instability for short windows or highly inhomogeneous texts (Bestgen, 2023).
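A minimal sketch of the non-overlapping variant follows. Dropping the incomplete final segment is an assumption of this sketch (implementations differ on how they treat the remainder), as are the function name and the default segment length.

```python
def msttr(tokens, segment=50):
    """Mean TTR over consecutive non-overlapping segments of fixed length.

    The trailing remainder shorter than `segment` is discarded.
    """
    chunks = [tokens[i:i + segment]
              for i in range(0, len(tokens) - segment + 1, segment)]
    if not chunks:
        raise ValueError("text is shorter than one segment")
    return sum(len(set(chunk)) / segment for chunk in chunks) / len(chunks)
```

For `a a b c d` with `segment=2`, the segments `a a` and `b c` have TTRs 0.5 and 1.0, the leftover `d` is ignored, and the result is 0.75.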

2.4. Expectation-Adjusted Normalization: PATTR, EAD

Penalty-Adjusted Type–Token Ratio (PATTR) augments the denominator with an absolute-deviation penalty to directly enforce a target response length $L_T$:

$$PATTR(w; L_T) = \frac{V(w)}{L + |L - L_T|}$$

where $V(w)$ is the number of distinct types in response $w$ and $L$ is its length. This formulation neutralizes the pathological preference of classical metrics for short outputs in synthetic text filtering (Deshpande et al., 20 Jul 2025).
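The effect of the penalty is easy to see numerically. The sketch below assumes length is measured in tokens and that $V$ counts distinct tokens; the function and parameter names are illustrative.

```python
def pattr(tokens, target_len):
    """Distinct types divided by length plus |length - target_len|."""
    length = len(tokens)
    if length == 0:
        return 0.0
    return len(set(tokens)) / (length + abs(length - target_len))

short = ["a", "b"]                               # plain TTR = 1.0
on_target = ["a", "b", "c", "d", "a", "b", "c", "e"]   # plain TTR = 0.625
```

With `target_len=8`, the two-token response scores $2/(2+6) = 0.25$ while the eight-token response scores $5/8 = 0.625$, reversing the ranking that plain TTR would give.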

Expectation-Adjusted Distinct (EAD) replaces the denominator of the widely used Distinct-$n$ metric with the expected count of novel $n$-grams given the vocabulary size $V$ and total $n$-gram count $C$ under a uniform null model:

$$EAD_n(R) = \frac{N}{V \left[1 - \left( \frac{V-1}{V} \right)^C \right]}$$

where $N$ is the number of unique $n$-grams. EAD provides stable, length-insensitive measurement for dialog and generative tasks (Liu et al., 2022).
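A direct transcription of the formula is sketched below. The vocabulary size is passed in explicitly as `vocab_size` because it is a property of the model or corpus rather than of a single response; that parameterization and the function name are assumptions of this sketch.

```python
def ead(tokens, n, vocab_size):
    """Unique n-grams divided by their expected count under a uniform null model."""
    if vocab_size < 1:
        raise ValueError("vocab_size must be at least 1")
    ngrams = [tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1)]
    total = len(ngrams)                  # C: total n-gram count
    if total == 0:
        return 0.0
    expected = vocab_size * (1.0 - ((vocab_size - 1) / vocab_size) ** total)
    return len(set(ngrams)) / expected   # N / expected number of distinct items
```

For the unigrams of `a b` with `vocab_size=2`, the expected number of distinct draws is $2(1 - 0.5^2) = 1.5$, so the score is $2/1.5 \approx 1.33$.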

3. Diversity Indices from Ecology: Hill Numbers, Shannon, Simpson, Evenness

Carrasco et al. (2023) extend the ecological family of Hill numbers to lexical diversity:

$$D^{[k]} = \Bigl( \sum_{n=1}^N p_n^k \Bigr)^{1/(1-k)}$$

where $p_n$ is the relative frequency of type $n$ and $N$ here counts distinct types. $k=0$ gives plain richness ($D^{[0]}=N$), the limit $k \to 1$ yields the “effective number of types” through Shannon entropy ($D^{[1]} = \exp(H)$), and $k=2$ gives the inverse Simpson index ($D^{[2]} = [\sum p_n^2]^{-1}$) (Carrasco et al., 2023). This parameterization allows tuning sensitivity to rare versus common tokens.
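The whole family can be computed with one function, treating $k = 1$ as the limiting case $\exp(H)$ since the general exponent $1/(1-k)$ is undefined there. The function name is an illustrative assumption.

```python
import math
from collections import Counter

def hill_number(tokens, k):
    """Hill number of order k for the empirical type distribution."""
    n = len(tokens)
    probs = [c / n for c in Counter(tokens).values()]
    if math.isclose(k, 1.0):
        # Limit k -> 1: exponential of Shannon entropy.
        return math.exp(-sum(p * math.log(p) for p in probs))
    return sum(p ** k for p in probs) ** (1.0 / (1.0 - k))
```

For a perfectly even text all orders coincide with the number of types; for a skewed text such as `a a a b` the values fall as $k$ rises (2 at $k=0$, 1.6 at $k=2$), because higher orders give more weight to frequent types.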

Evenness metrics, such as $E = H'/\ln A$, where $A$ is abundance and $H'$ the Shannon entropy over type frequencies, are used to summarize how uniformly tokens are distributed across types and are robust to sample size variation in human–machine comparison studies (Kendro et al., 31 Jul 2025). The diversity-to-richness ratio ($D/R$) further quantifies vocabulary usage uniformity in controlled vocabularies or metadata fields (Carrasco et al., 2023).

Yule’s K focuses on the concentration of repeated types:

$$K = 10^4 \cdot \left( \frac{\sum_w f_w^2}{N^2} - \frac{1}{N} \right)$$

where $f_w$ is the frequency of type $w$. High $K$ reflects heavy repetition (low diversity); low $K$ signals high evenness (Cortes, 2021).
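A short sketch of the computation, using the formula exactly as written above (the function name is illustrative):

```python
from collections import Counter

def yules_k(tokens):
    """Yule's K: 10^4 * (sum of squared type frequencies / N^2 - 1 / N)."""
    n = len(tokens)
    sum_sq = sum(f * f for f in Counter(tokens).values())
    return 1e4 * (sum_sq / n ** 2 - 1 / n)
```

A text with no repeated types gives $K = 0$, while four copies of the same token give $10^4 \cdot (16/16 - 1/4) = 7500$.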

4. Cross-Document, Redundancy, and Generation-Specific Measures

Certain metrics are developed for comparing lexical diversity across multiple generated outputs—critical in NLG, MT, and synthetic data evaluation:

  • Inter-sentence BLEU/chrF diversity (i-BLEU, i-chrF) calculates mean similarity across all output pairs from the same input, then inverts it:

$$i\text{-}BLEU = 1 - \frac{1}{\binom{N}{2}} \sum_{1 \le i < j \le N} \mathrm{BLEU}(x_i, x_j)$$

This metric directly quantifies output variety rather than within-text richness and is substantially less length- or domain-sensitive than traditional indices (Burchell et al., 2022).

  • NDS, an $n$-gram diversity score pooled over $n$-gram orders $n = 1, \dots, N$:

$$NDS = \frac{\sum_{n=1}^N V_n}{\sum_{n=1}^N T_n}$$

where $V_n$ and $T_n$ are the numbers of unique and total $n$-grams, respectively. This is used to assess prompt and response sets for generative models on a comparable basis (Kambhatla et al., 23 May 2025).

  • Compression Ratio (CR):

$$CR(x) = \frac{|x|}{|\mathrm{gzip}(x)|}$$

that is, original size over compressed size, so that highly redundant text compresses well and scores high. Lower CR indicates higher lexical unpredictability; it is applied both at token and POS-sequence levels to benchmark synthetic redundancy (Kambhatla et al., 23 May 2025, Deshpande et al., 20 Jul 2025).

  • Self-Repetition (SR) and Homogenization with BERTScore extend diversity assessment to long-range, context-sensitive, and semantic redundancy, crucial for persona-prompting and LLM evaluation pipelines (Kambhatla et al., 23 May 2025).
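The three formula-based measures in this list can be sketched in a few lines. Several choices here are assumptions for illustration: the function names, the default `max_n=4`, taking the compression ratio as original bytes over gzip-compressed bytes (the orientation under which lower values mean less redundancy), and the `jaccard` helper, which is only a stand-in similarity and is not BLEU or chrF. A real evaluation would pass a BLEU or chrF scorer as `similarity`.

```python
import gzip
from itertools import combinations

def ngram_diversity(tokens, max_n=4):
    """Unique n-grams over total n-grams, pooled across orders 1..max_n."""
    unique = total = 0
    for n in range(1, max_n + 1):
        grams = [tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1)]
        unique += len(set(grams))
        total += len(grams)
    return unique / total if total else 0.0

def compression_ratio(text):
    """Original size divided by gzip-compressed size, both in bytes."""
    raw = text.encode("utf-8")
    return len(raw) / len(gzip.compress(raw))

def pairwise_diversity(outputs, similarity):
    """One minus the mean similarity over all unordered pairs of outputs."""
    pairs = list(combinations(outputs, 2))
    if not pairs:
        raise ValueError("need at least two outputs")
    return 1.0 - sum(similarity(a, b) for a, b in pairs) / len(pairs)

def jaccard(a, b):
    """Stand-in similarity (word-set overlap); NOT BLEU or chrF."""
    sa, sb = set(a.split()), set(b.split())
    return len(sa & sb) / len(sa | sb) if sa | sb else 1.0
```

Note that gzip adds a fixed header, so the compression ratio of very short strings falls below one and is only meaningful when comparing texts of substantial and similar size.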

5. Advanced, Multidimensional, and Model-Based Indices

Recent work emphasizes the limits of single-number metrics and the importance of multidimensional or data-driven assessment:

  • Autoencoder Diversity Score (ADS): This framework posits that the neural network capacity required to reconstruct a corpus at a given accuracy is a direct proxy for its lexical diversity and complexity:

$$ADS(X) = \min \{ W : A(W; X) \geq \tau \}$$

where $A(W; X)$ is the reconstruction accuracy at hidden-layer width $W$ and $\tau$ is the accuracy threshold. ADS is robust to length, language, and highly duplicated samples, and captures not just token-level diversity but the contextual and structural intricacy of the corpus (Dang et al., 28 Feb 2025).

  • Composite Dimensional Framework: Six dimensions—volume, abundance, variety–repetition (MATTR), evenness, disparity (semantic overlap via WordNet), and dispersion (proximity-based repetition)—allow for fine-grained discrimination between human and LLM output (Kendro et al., 31 Jul 2025). MATTR and dispersion, in combination with evenness and disparity, are especially effective for this purpose.
  • Lexical Density and Categorically Conditioned TTRs are useful in morphologically rich or typologically diverse languages and can be systematically profiled for individual token classes (nouns, verbs, content words, etc.) (Luis et al., 21 Nov 2025).

The PUCP-Metrix package implements more than a dozen such measures (including TTR variants, MTLD, VoCD, Maas, lexical density, and lemma-based indices) for Spanish, offering comprehensive cross-category analysis (Luis et al., 21 Nov 2025).

6. Parameterization, Length Sensitivity, and Practical Recommendations

Regardless of metric, text length effects remain a central methodological concern. Evaluation by Carrasco et al. and others confirms:

  • Raw TTR, and even transformed variants, retain strong downward length bias.
  • Subsampling indices (HD–D, MATTR, MTLD) achieve high but not perfect length-invariance; their sensitivity to the window or threshold parameter must be empirically assessed on the target corpus (Bestgen, 2023).
  • For any parameterized index, users should:
    • Set the window/segment/threshold to match the shortest or most “typical” text in the dataset.
    • Empirically verify parameter robustness (e.g., using ICC across length variation).
    • Pair at least one length-invariant metric (e.g., MTLD, HD–D) with an untransformed TTR-type index for side-by-side reporting (Bestgen, 2023, Powell et al., 2022).

For NLG and filtering tasks, controlling text length (via PATTR or explicit truncation) is as important as maximizing the raw diversity measure, and adjustments of the “target length” parameter $L_T$ allow for application-specific tuning (Deshpande et al., 20 Jul 2025).

Researchers are encouraged to report: (i) raw diversity curve vs. sample size, (ii) parameter settings for all indices, and (iii) cross-length stability analysis to ensure meaningful, fair cross-document or cross-group comparison (Carrasco et al., 2023, Bestgen, 2023).
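A diversity curve of the kind recommended in (i) can be produced by evaluating a metric on growing prefixes of a text. The sketch below is a toy demonstration under stated assumptions: the helper names are illustrative, the simple `mattr` recounts each window for brevity, and the synthetic text is a 10-word cycle chosen so that the contrast between TTR and MATTR is exact.

```python
def ttr(tokens):
    """Plain type-token ratio."""
    return len(set(tokens)) / len(tokens)

def mattr(tokens, w):
    """Moving-average TTR (naive version that recounts every window)."""
    windows = [tokens[i:i + w] for i in range(len(tokens) - w + 1)]
    return sum(len(set(win)) / w for win in windows) / len(windows)

def diversity_curve(tokens, metric, lengths):
    """Evaluate `metric` on prefixes of the requested lengths."""
    return [(n, metric(tokens[:n])) for n in lengths if n <= len(tokens)]

# Synthetic text: the same 10 words repeated in a fixed cycle.
tokens = [f"w{i % 10}" for i in range(200)]
lengths = [20, 50, 100, 200]
ttr_curve = diversity_curve(tokens, ttr, lengths)
mattr_curve = diversity_curve(tokens, lambda t: mattr(t, 10), lengths)
```

On this toy text TTR falls from 0.5 at 20 tokens to 0.05 at 200 tokens, while MATTR with a 10-token window stays at 1.0 throughout, which is the kind of side-by-side stability evidence the reporting guidance asks for.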

7. Comparative Table of Major Lexical Diversity Metrics

| Metric/Class | Formula / Method | Length Sensitivity |
|---|---|---|
| TTR | $V/N$ | High |
| RTTR | $V/\sqrt{N}$ | Moderate |
| HD–D | see above (hypergeometric expectation) | Low |
| MATTR | avg. unique types in window $w$ | Low (window-dependent) |
| MTLD | mean factor length until TTR $\leq t$ | Low (threshold-dependent) |
| Hill diversity ($D^{[k]}$) | $(\sum p_n^k)^{1/(1-k)}$ | Moderate; can be extrapolated |
| Entropy ($H$) | $-\sum p_n \ln p_n$ | Moderate |
| Yule’s K | $10^4 \cdot (\sum f_w^2/N^2 - 1/N)$ | Lower than TTR |
| EAD | $N / \left[ V(1-(\frac{V-1}{V})^C) \right]$ | Minimal |
| PATTR | $V/(L + \lvert L - L_T \rvert)$ | Tunable (by $L_T$) |
| Autoencoder Diversity | Min net width $W$ s.t. $A(W;X)\geq\tau$ | Minimal |
| i-BLEU, i-chrF | $1-\text{mean pairwise sim}$ across outputs | None (cross-output) |

Length-invariant or quantitatively adjusted measures are essential for robust comparison whenever document size is unconstrained or systematically varies.


In sum, the landscape of lexical diversity measurement is characterized by a progression from naive but interpretable type–token ratios, through length-stabilized and ecological indices, to advanced data-driven and multidimensional frameworks. The optimal metric depends on corpus properties, task requirements, and the need for length or genre comparability. Systematic reporting, empirical assessment of parameter sensitivity, and explicit normalization for text length or register are now considered best practice in arXiv-scale analytic research.
