Lexical Diversity Measures

Updated 17 November 2025
  • Lexical Diversity Measures are quantitative frameworks that assess vocabulary richness and balance via probability distributions over word types.
  • They integrate metrics like TTR, Shannon entropy, and Hill numbers to capture key aspects such as variety, evenness, and effective vocabulary size.
  • Practical applications span NLP, synthetic text evaluation, and corpus analysis, with recommendations for length normalization and multi-metric triangulation.

Lexical diversity measures provide quantitative frameworks for assessing the richness, variety, and distributional balance of vocabularies in language data. These metrics are central to computational linguistics, digital library science, textual analysis, NLP dataset creation, and the evaluation of synthetic text generation. Modern research in lexical diversity formalizes the concept using probability distributions over word or lemma types, with measures derived from information theory, ecology, and corpus linguistics. The following sections delineate foundational definitions, principal classes of metrics, normalization strategies, empirical findings, practical applications, and current methodological issues.

1. Foundational Definitions and Probability Models

Lexical diversity in a finite text or corpus is modeled as a probability distribution over observed word types (or lemmas). Given $m$ tokens and $n$ types, each type $i$ occurs with empirical frequency $f_i$ and probability $p_i = f_i / m$. The central idea is to measure both the variety (distinct types) and the balance (evenness of frequencies) of this distribution $\Delta = \{p_1, \ldots, p_n\}$ (Estève et al., 14 Jan 2025).

Table: Foundational Measures and Interpretations

| Metric / Index | Formula | Interpretation |
|---|---|---|
| Type–Token Ratio (TTR) | $\mathrm{TTR} = n/m$ | Pure variety; length-biased |
| Shannon entropy ($H$) | $H = -\sum_{i=1}^{n} p_i \log p_i$ | Mix of variety and balance |
| Rényi entropy ($H_\alpha$) | $H_\alpha = \frac{1}{1-\alpha} \log \sum_i p_i^\alpha$ | Tunable weighting of rare vs. common types |
| Hill number ($D^{[k]}$) | $D^{[k]} = \left( \sum_i p_i^k \right)^{1/(1-k)}$ | "Effective" vocabulary size |

TTR is straightforward but declines monotonically with increasing text length (Heaps’ law, $V(N) \sim N^\beta$ with $0 < \beta < 1$) (Rosillo-Rodes et al., 15 Nov 2024). Entropy-based measures (Shannon and Rényi) capture both how many types exist and how tokens are distributed among them. The Hill numbers unify these into an "effective vocabulary size" interpretation, applicable for direct cross-domain comparisons (Carrasco et al., 2023).
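
A minimal sketch of these foundational measures, assuming whitespace-tokenized, lowercased input; the function names are illustrative rather than drawn from the cited papers:

```python
from collections import Counter
import math

def type_probabilities(tokens):
    """Empirical distribution p_i = f_i / m over observed word types."""
    counts = Counter(tokens)
    m = len(tokens)
    return [f / m for f in counts.values()]

def ttr(tokens):
    """Type-Token Ratio: n / m."""
    return len(set(tokens)) / len(tokens)

def shannon_entropy(probs):
    """H = -sum_i p_i log p_i (natural log)."""
    return -sum(p * math.log(p) for p in probs)

def renyi_entropy(probs, alpha):
    """H_alpha = log(sum_i p_i^alpha) / (1 - alpha); alpha = 1 reduces to Shannon."""
    if abs(alpha - 1.0) < 1e-9:
        return shannon_entropy(probs)
    return math.log(sum(p ** alpha for p in probs)) / (1.0 - alpha)

def hill_number(probs, k):
    """D^[k] = (sum_i p_i^k)^(1/(1-k)); k=0 richness, k=1 exp(H), k=2 inverse Simpson."""
    if abs(k - 1.0) < 1e-9:
        return math.exp(shannon_entropy(probs))
    return sum(p ** k for p in probs) ** (1.0 / (1.0 - k))

tokens = "the cat sat on the mat and the dog sat on the rug".split()
probs = type_probabilities(tokens)
print(ttr(tokens), shannon_entropy(probs), hill_number(probs, 2))
```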

2. Principal Families of Lexical Diversity Metrics

Research literature distinguishes several broad classes of lexical diversity metrics, each with distinct theoretical and practical properties:

A. Global Diversity Indices

  • Type–Token Ratio (TTR): $\mathrm{TTR} = n/m$. Rapidly decreases with length; sensitive to rare types (Bestgen, 2023).
  • Yule’s K: $K = 10^4 \frac{\sum_{i=1}^{N} i^2 f_i - N}{N^2}$, where $f_i$ is the count of types occurring $i$ times and $N$ is the token count. Higher $K$ means more repetition and less diversity; K is less sensitive to length than TTR (Cortes, 2021). A code sketch follows this list.
  • Hill Numbers / Effective Species: Special cases ($k=0$: richness, $k=1$: Shannon diversity, $k=2$: Simpson diversity) provide an interpretable "number of equally frequent types"; $D^{[1]} = \exp(H)$ (Carrasco et al., 2023).
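
A sketch of Yule’s K as defined above, where $f_i$ is the number of types occurring exactly $i$ times and $N$ the token count; `yules_k` is an illustrative helper name:

```python
from collections import Counter

def yules_k(tokens):
    """Yule's K = 10^4 * (sum_i i^2 f_i - N) / N^2; higher K = more repetition."""
    n_tokens = len(tokens)
    type_freqs = Counter(tokens)                  # frequency of each type
    freq_of_freqs = Counter(type_freqs.values())  # f_i: number of types occurring i times
    s = sum(i * i * f for i, f in freq_of_freqs.items())
    return 1e4 * (s - n_tokens) / (n_tokens ** 2)

# A text with no repeated types yields K = 0, since sum_i i^2 f_i = N there.
print(yules_k("to be or not to be".split()))  # ~1111.1: heavy repetition
```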

B. Local/Windowed Indices

  • MATTR: Moving-average TTR over all windows of fixed size $W$: $\text{MATTR}(w; W) = \frac{1}{L-W+1} \sum_{i=0}^{L-W} \frac{|\mathrm{set}(w_{i:i+W-1})|}{W}$, where $L$ is the token count (Kendro et al., 31 Jul 2025).
  • MTLD: Segments the text at points where the running TTR drops below a threshold (e.g., 0.72); the average segment length quantifies diversity (Dang et al., 28 Feb 2025). Both are sketched below.
  • Distinct-n / N-Gram Diversity Score (NDS): Ratio of unique n-grams to total n-grams, averaged over $n \in \{1, \ldots, N\}$ (Kambhatla et al., 23 May 2025).
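
Sketches of the two windowed measures. `mattr` follows the formula above; `mtld_forward` is a simplified one-directional pass with the common 0.72 threshold (the standard MTLD averages forward and backward passes):

```python
def mattr(tokens, window=50):
    """Moving-average TTR over all contiguous windows of size `window`."""
    L = len(tokens)
    if L <= window:
        return len(set(tokens)) / L
    ttrs = [len(set(tokens[i:i + window])) / window for i in range(L - window + 1)]
    return sum(ttrs) / len(ttrs)

def mtld_forward(tokens, threshold=0.72):
    """Count 'factors' (segments whose running TTR falls to the threshold);
    diversity is mean tokens per factor."""
    factors, types, count = 0.0, set(), 0
    for tok in tokens:
        count += 1
        types.add(tok)
        if len(types) / count <= threshold:   # factor complete: reset the segment
            factors += 1
            types, count = set(), 0
    if count > 0:                             # partial factor, scaled by TTR shortfall
        factors += (1 - len(types) / count) / (1 - threshold)
    return len(tokens) / factors if factors else float(len(tokens))
```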

C. Redundancy and Compression

  • Compression Ratio (CR): $CR(T) = |\mathrm{Compress}(T)| / |T|$; lower CR signals more repetition (Kambhatla et al., 23 May 2025).
  • POS-Compression Ratio (CR-POS): The same ratio computed on the POS-tag sequence, capturing syntactic redundancy. Both are sketched below.
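
A sketch of CR using zlib (DEFLATE) as the compressor; the compressor choice is an assumption, and the CR-POS usage in the comment presumes an external POS tagger:

```python
import zlib

def compression_ratio(text):
    """CR(T) = |Compress(T)| / |T|; lower values signal more repetition."""
    raw = text.encode("utf-8")
    return len(zlib.compress(raw, 9)) / len(raw)

# CR-POS is the same computation applied to the POS-tag sequence, e.g.
# (hypothetically, with a spaCy pipeline `nlp` already loaded):
#   cr_pos = compression_ratio(" ".join(tok.pos_ for tok in nlp(text)))
print(compression_ratio("the cat sat on the mat " * 20))  # repetitive -> low CR
```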

D. Edit-Distance and Overlap

  • Bag-of-Words Overlap: $L_{\mathrm{BOW}}(s,p) = 1 - \frac{\sum_t \min(\mathrm{count}_s(t),\, \mathrm{count}_p(t))}{\sum_t \mathrm{count}_s(t)}$ (Jayawardena et al., 18 Apr 2024).
  • Jaccard Token Overlap: $L_{\mathrm{Jaccard}}(s,p) = 1 - \frac{|T_s \cap T_p|}{|T_s \cup T_p|}$.
  • BLEU/ROUGE Inversions: $L_{\mathrm{BLEU}} = 1 - \mathrm{BLEU}$, $L_{\mathrm{ROUGE}} = 1 - \mathrm{ROUGE}$ (Jayawardena et al., 18 Apr 2024).
  • Self-Repetition (SR): Average pairwise n-gram overlap across outputs (Kambhatla et al., 23 May 2025).
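
Sketches of the first two distances above on pre-tokenized inputs, where $s$ is the source (reference) text and $p$ the prediction:

```python
from collections import Counter

def bow_overlap_loss(source_tokens, pred_tokens):
    """L_BOW = 1 - sum_t min(count_s(t), count_p(t)) / sum_t count_s(t)."""
    cs, cp = Counter(source_tokens), Counter(pred_tokens)
    overlap = sum(min(cs[t], cp[t]) for t in cs)
    return 1.0 - overlap / sum(cs.values())

def jaccard_loss(source_tokens, pred_tokens):
    """L_Jaccard = 1 - |T_s ∩ T_p| / |T_s ∪ T_p| over type sets."""
    ts, tp = set(source_tokens), set(pred_tokens)
    return 1.0 - len(ts & tp) / len(ts | tp)
```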

E. Novel Metrics

  • Penalty-Adjusted TTR (PATTR): $\mathrm{PATTR}(w; L_T) = \frac{|\mathrm{set}(w)|}{|w| + \left|\,|w| - L_T\,\right|}$; penalizes deviation from the target length $L_T$, mitigating short-text bias (Deshpande et al., 20 Jul 2025).
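
A direct transcription of the PATTR definition; at $|w| = L_T$ it reduces to plain TTR:

```python
def pattr(tokens, target_length):
    """PATTR = |set(w)| / (|w| + abs(|w| - L_T)); penalizes length deviation."""
    m = len(tokens)
    return len(set(tokens)) / (m + abs(m - target_length))
```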

3. Length Normalization and Robust Measurement Strategies

Length bias is intrinsic to many measures, especially TTR and CR. Modern best practices prescribe length control via two main classes of strategies:

  • Probabilistic Subsampling: Randomly sample fixed-length segments ($m$ tokens), compute the LD metric per sample, and average (Bestgen, 2023); a sketch appears after the next paragraph.
  • Deterministic Window/Chunking: Partition text into non-overlapping or overlapping segments of fixed window size (MATTR, MSTTR), aggregate the diversity scores (Bestgen, 2023).

Metrics like HD–D (hypergeometric distribution D) achieve near-complete length invariance when the subsample size is set to the length of the shortest text in the set. Intraclass correlation (ICC) is used to empirically check robustness: ICC $> .90$ denotes insensitivity to arbitrary variation in text length. Measures such as MATTR ($W = 50$) are recommended for a balance of sensitivity and stability (Bestgen, 2023). PATTR is specifically designed to mitigate bias from prompt-induced response-length variation in synthetic data (Deshpande et al., 20 Jul 2025).
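
A sketch of the probabilistic-subsampling strategy from the list above: average any LD metric (passed as a callable, e.g. the `ttr` sketch from Section 1) over random fixed-length contiguous segments; the parameter defaults are illustrative:

```python
import random

def subsampled_metric(tokens, metric, sample_len=100, n_samples=200, seed=0):
    """Mean of `metric` over random contiguous segments of `sample_len` tokens."""
    if len(tokens) < sample_len:
        raise ValueError("text shorter than the subsample length")
    rng = random.Random(seed)  # fixed seed for reproducible comparisons
    scores = []
    for _ in range(n_samples):
        start = rng.randrange(len(tokens) - sample_len + 1)
        scores.append(metric(tokens[start:start + sample_len]))
    return sum(scores) / len(scores)
```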

4. Empirical Findings Across Corpora and Applications

Systematic analysis of diverse corpora—including fiction, library metadata, and synthetic LLM-generated text—yields several domain-general results:

  • Zipf’s and Heaps’ Law Underpinning: The empirical relationship between Shannon entropy $H$ and TTR is analytically derived from Zipfian frequency distributions and Heaps’ vocabulary growth law; the fit $H(\mathrm{TTR}) = A + B \log \mathrm{TTR} + \log(C + D \log \mathrm{TTR})$ matches very large datasets (gigaword scale) across languages and genres (Rosillo-Rodes et al., 15 Nov 2024).
  • Diversity Indices in Metadata: Shannon diversity and the Hill numbers successfully measure effective vocabulary size in author and subject metadata. The evenness ratio $D/R$ indicates balance in type usage; values near 1 signal even exploitation of available terms (Carrasco et al., 2023).
  • LLM vs. Human Lexical Diversity: Recent studies demonstrate that LLM-generated texts systematically diverge from human benchmarks along six dimensions: volume, abundance, MATTR, evenness, disparity (WordNet), and dispersion (local repetition). SVM classifiers achieve >97% accuracy in discriminating LLM-generated from human text using MATTR, evenness, disparity, and dispersion (Kendro et al., 31 Jul 2025).
  • Synthetic Data and Persona Prompting: Compression ratio, NDS, SR, and Homogenic BERTScore collectively document redundancy and diversity levels. Length control is critical; persona detail increases diversity only in large models, and fine-grained personas do not outperform coarse-grained versions on any metric (Kambhatla et al., 23 May 2025).
  • Machine Translation and Literary Style: Tailored recovery (classifier-driven n-best reranking) can restore lost diversity in MT outputs to levels close to human translations, but effectiveness depends on individual book style (as measured by TTR, Yule’s I, and MTLD) (Ploeger et al., 30 Aug 2024).
  • Sampling for Diversity: Greedy sampling heuristics that maximize Shannon or Rényi entropy yield samples 350$\sigma$ above random in entropy, but lexical and syntactic entropies do not correlate stably across genres or corpus partitions (Estève et al., 14 Jan 2025); a greedy variant is sketched after this list.
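
A minimal sketch of the greedy idea referenced in the last bullet: repeatedly add whichever candidate most increases the pooled Shannon entropy over types. The scoring and stopping rule here are illustrative assumptions, not the exact procedure of Estève et al.:

```python
from collections import Counter
import math

def shannon_from_counts(counts):
    """Shannon entropy of a frequency table (Counter over types)."""
    m = sum(counts.values())
    return -sum((f / m) * math.log(f / m) for f in counts.values())

def greedy_entropy_sample(texts, k):
    """Select k token-lists from `texts`, greedily maximizing pooled entropy."""
    pool, chosen, remaining = Counter(), [], list(texts)
    for _ in range(min(k, len(remaining))):
        best = max(remaining, key=lambda t: shannon_from_counts(pool + Counter(t)))
        chosen.append(best)
        pool += Counter(best)
        remaining.remove(best)
    return chosen
```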

5. Metric Selection, Practical Recommendations, and Limitations

Selecting an appropriate diversity metric is task- and application-dependent. Key recommendations include:

  • For global corpus comparisons with variable-length texts, prefer asymptotically normalizing indices (HD–D, MATTR with windowed sampling, PATTR for synthetic data).
  • Use several complementary measures covering richness ($R$, TTR), balance (Shannon $H$, evenness), and repetition (Yule’s K, CR, SR) to capture multiple facets, as individual metrics are weakly correlated and any single one may miss substantial diversity signals (Kambhatla et al., 23 May 2025); a combined report is sketched after this list.
  • Disclose and sensitivity-check all parameter choices (window size, subsample length, segmentation thresholds).
  • In synthetic and generative pipelines, always control for output length when measuring diversity to avoid spurious inflation due to short-text bias (Deshpande et al., 20 Jul 2025).
  • For semantic diversity or content-level distinctness, surface metrics (CR, NDS) must be complemented by embedding-based measures (Hom-BS), though current methods remain limited by computational cost and domain biases.
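
A sketch of the triangulation recommendation as a single report covering richness, balance, and repetition facets. It assumes the helper functions sketched in the sections above are in scope; `diversity_report`, its keys, and the window size are illustrative:

```python
def diversity_report(tokens, window=50):
    """Multi-facet diversity profile; no single number stands in for all facets."""
    probs = type_probabilities(tokens)
    richness = len(set(tokens))
    return {
        "richness_R": richness,
        "ttr": ttr(tokens),
        "mattr": mattr(tokens, window=window),                    # length-robust variety
        "shannon_H": shannon_entropy(probs),                      # variety + balance
        "evenness_D_over_R": hill_number(probs, 1) / richness,    # near 1 = balanced
        "yules_k": yules_k(tokens),                               # repetition
        "compression_ratio": compression_ratio(" ".join(tokens)), # redundancy
    }
```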

6. Theoretical Issues and Future Directions

The spectrum of Rényi entropies ($H_\alpha$) formalizes the tradeoff between sensitivity to rare forms and balance. Tuning $\alpha$ allows diversity measures to be aligned with the research aims: rare-word coverage ($\alpha \rightarrow 0$) or robustness to repetition ($\alpha \rightarrow 2$) (Estève et al., 14 Jan 2025). Current open directions include formal exploration of mixed lexical-syntactic diversity, active annotation loops, semantic or discourse-level diversity quantification, and refinement of disparity/dispersion metrics for more precise cross-comparison in multilingual and genre-rich corpora.
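
A small numeric illustration of this tradeoff on a skewed distribution (one dominant type plus nine rare ones), reusing the `renyi_entropy` sketch from Section 1; effective diversity $\exp(H_\alpha)$ shrinks as $\alpha$ grows because large $\alpha$ discounts rare types:

```python
import math

skewed = [0.82] + [0.02] * 9  # 10 types, one dominant
for alpha in (0.01, 1.0, 2.0):
    h = renyi_entropy(skewed, alpha)  # from the Section 1 sketch
    print(f"alpha={alpha:>4}: effective types ~ {math.exp(h):.2f}")
# alpha -> 0 counts all 10 types almost equally (~9.9); alpha = 2 is
# dominated by the frequent type and reports ~1.5 effective types.
```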

Researchers are urged to triangulate metrics, disclose parameterizations, and validate robustness empirically if relying on automatic diversity quantification for comparative studies, dataset sampling, or synthetic data filtering. The increasing availability of large-scale annotated corpora and advanced generative models further motivates continued evolution of length-robust and content-aware lexical diversity measures.
