Lexical Diversity Metrics Overview
- Lexical diversity metrics are quantitative indices assessing vocabulary richness and complexity, enabling analysis across genres and languages.
- They encompass classical measures like TTR and extensions such as MTLD, HD-D, and entropy-based indices to mitigate sample size effects.
- Recent approaches integrate probabilistic sampling, ecological indices, and deep-learning frameworks to offer multidimensional insights into textual diversity.
Lexical diversity metrics are quantitative indices devised to assess the variability, richness, and structural complexity of vocabulary in natural language texts. These indices are fundamental to both linguistic analysis and practical NLP pipeline design, enabling researchers to compare texts, assess language acquisition, evaluate generated or translated content, and characterize corpus heterogeneity across genres, languages, or authorial style. Multiple metrics have emerged, each with distinct formal properties, sensitivity to text length, and interpretability. The choice of metric is non-trivial, particularly given the dual challenge of sample size dependency and the desire for length-invariance, as well as the need to reflect not just token-level novelty but also distributional and contextual vocabulary phenomena.
1. Classical Metrics: Type–Token Ratio and Its Derivatives
The most basic and longstanding metric is the Type–Token Ratio (TTR), defined for a text of $N$ tokens and $V$ distinct types as $\mathrm{TTR} = V/N$. While trivial to compute, TTR is strongly confounded by sample size, with $\mathrm{TTR}$ decreasing as $N$ increases due to sublinear vocabulary growth (Heaps’ law). Variants such as the Root TTR ($\mathrm{RTTR} = V/\sqrt{N}$, identical to Guiraud’s $R$) and Maas’s index
$$a^2 = \frac{\log N - \log V}{(\log N)^2}$$
attempt to compensate for this bias through non-linear normalization, but all remain susceptible to significant drift as $N$ grows (Bestgen, 2023, Powell et al., 2022, Luis et al., 21 Nov 2025).
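As a minimal sketch of these classical indices, the following assumes the standard definitions $\mathrm{TTR} = V/N$, $\mathrm{RTTR} = V/\sqrt{N}$, and Maas's $a^2 = (\log N - \log V)/(\log N)^2$. The function name `basic_indices` and the whitespace tokenization in the usage note are illustrative assumptions, not part of any cited toolkit.

```python
import math


def basic_indices(tokens):
    """Return TTR, Root TTR, and Maas's a^2 for a list of tokens."""
    n = len(tokens)        # N: number of tokens
    v = len(set(tokens))   # V: number of distinct types
    if n < 2:
        # log(N) is zero for N = 1, so Maas's index is undefined there.
        raise ValueError("need at least two tokens")
    return {
        "ttr": v / n,
        "rttr": v / math.sqrt(n),
        "maas_a2": (math.log(n) - math.log(v)) / math.log(n) ** 2,
    }
```

Calling `basic_indices("the cat sat on the mat".split())` uses $N = 6$ and $V = 5$; all three values shift as the text grows, which is the length drift described above.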
To model vocabulary growth as a function of length, Heaps’ law asserts $V(N) = K N^{\beta}$ (with $0 < \beta < 1$), while associated indices such as Mass–Markov TTR (MMTTR) use an empirically fitted exponent $\beta$ to correct TTR, where $\beta$ is estimated from the corpus via linear regression in log–log space (Powell et al., 2022).
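The log–log regression step can be sketched as follows, assuming Heaps' law in the form $V(N) = K N^{\beta}$ so that $\log V = \log K + \beta \log N$. The helper `heaps_fit` and its `step` parameter are hypothetical names; the sketch only estimates $K$ and $\beta$ and does not implement MMTTR itself.

```python
import math


def heaps_fit(tokens, step=100):
    """Estimate (K, beta) by least squares on (log N, log V) pairs."""
    seen = set()
    xs, ys = [], []
    for i, tok in enumerate(tokens, start=1):
        seen.add(tok)
        if i % step == 0:
            # Record vocabulary size V after every `step` tokens.
            xs.append(math.log(i))
            ys.append(math.log(len(seen)))
    if len(xs) < 2:
        raise ValueError("need at least two measurement points")
    mean_x = sum(xs) / len(xs)
    mean_y = sum(ys) / len(ys)
    sxx = sum((x - mean_x) ** 2 for x in xs)
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    beta = sxy / sxx                         # slope in log-log space
    k = math.exp(mean_y - beta * mean_x)     # intercept back-transformed
    return k, beta
```

On natural text the fitted $\beta$ would be expected to fall below 1, reflecting sublinear vocabulary growth.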
Shannon entropy is also frequently applied at the token or type level:
$$H = -\sum_{i=1}^{V} p_i \log p_i,$$
where $p_i$ is the empirical relative frequency of word type $i$. Entropy grows logarithmically with vocabulary and is, for massive corpora, analytically related to TTR through Zipfian and Heapsian statistical structure, revealing tight coupling of macro–micro lexical diversity (Rosillo-Rodes et al., 2024).
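A short sketch of the entropy computation, assuming the usual plug-in estimator $H = -\sum_i p_i \log p_i$ over empirical type frequencies. The function name and the default base of 2 (bits) are illustrative choices.

```python
import math
from collections import Counter


def shannon_entropy(tokens, base=2.0):
    """Plug-in Shannon entropy of the type distribution of `tokens`."""
    n = len(tokens)
    counts = Counter(tokens)
    # Each type contributes -p * log(p), with p its relative frequency.
    return -sum((c / n) * math.log(c / n, base) for c in counts.values())
```

A text that alternates evenly between two types carries one bit per token, while a text repeating a single type carries none.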
2. Length-Normalized and Local Sampling Metrics
Compensating for the sample size effect has driven the development of explicitly length-stabilized metrics.
2.1. Subsampling Index: HD–D
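The HD–D index (hypergeometric diversity) estimates the expected TTR in a random subsample of length $n$ drawn without replacement from the $N$ tokens:
$$\mathrm{HD\text{-}D} = \frac{1}{n}\sum_{i=1}^{V}\left[1 - \frac{\binom{N - f_i}{n}}{\binom{N}{n}}\right],$$
where $f_i$ is the frequency of type $i$. This achieves empirical near-invariance to text length, provided $n$ is chosen less than or equal to the length of the shortest document under comparison (Bestgen, 2023).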
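The subsampling idea can be sketched with exact hypergeometric probabilities: each type contributes the probability that it appears at least once in a random sample drawn without replacement. The function name `hdd` is hypothetical, and the default sample size of 42 follows common convention rather than anything stated in this section.

```python
from collections import Counter
from math import comb


def hdd(tokens, sample_size=42):
    """Expected TTR of a random sample of `sample_size` tokens."""
    n_tokens = len(tokens)
    if sample_size < 1 or n_tokens < sample_size:
        raise ValueError("text must be at least as long as the sample size")
    total_samples = comb(n_tokens, sample_size)
    expected_types = 0.0
    for freq in Counter(tokens).values():
        # Probability that none of this type's occurrences is sampled;
        # comb() returns 0 when fewer than sample_size other tokens exist.
        p_absent = comb(n_tokens - freq, sample_size) / total_samples
        expected_types += 1.0 - p_absent
    return expected_types / sample_size
```

Because the expectation is computed in closed form, no actual random sampling is needed and the result is deterministic.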
2.2. Moving-Average and Segmentation Approaches
Moving-Average TTR (MATTR) applies a fixed-length sliding window (of size $W$) and averages local TTR values:
$$\mathrm{MATTR} = \frac{1}{N - W + 1}\sum_{i=1}^{N - W + 1} \frac{V_i}{W},$$
where $V_i$ is the number of unique types in window $i$. This smooths the local diversity profile and is robust to text length for sufficiently long texts (Bestgen, 2023, Kendro et al., 31 Jul 2025).
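A minimal sliding-window sketch, assuming every window of $W$ consecutive tokens is scored and the window TTRs are averaged. The name `mattr` and the default window of 50 are illustrative assumptions.

```python
from collections import Counter


def mattr(tokens, window=50):
    """Average TTR over all windows of `window` consecutive tokens."""
    n = len(tokens)
    if window < 1 or n < window:
        raise ValueError("text must be at least as long as the window")
    counts = Counter(tokens[:window])
    type_total = len(counts)  # unique types in the first window
    for i in range(window, n):
        # Slide by one token: drop the oldest, add the newest.
        oldest = tokens[i - window]
        counts[oldest] -= 1
        if counts[oldest] == 0:
            del counts[oldest]
        counts[tokens[i]] += 1
        type_total += len(counts)
    return type_total / (window * (n - window + 1))
```

Updating the counter incrementally keeps the cost linear in the text length instead of recomputing each window from scratch.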
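Measure of Textual Lexical Diversity (MTLD) divides a text into sequential “factors”, closing each factor once its running TTR drops below a threshold $\theta$ (commonly $0.72$), then averages segment lengths:
$$\mathrm{MTLD} = \frac{N}{k + \dfrac{1 - \mathrm{TTR}_r}{1 - \theta}}, \qquad N = \sum_{j=1}^{k} \ell_j + \ell_r,$$
where $\ell_1, \dots, \ell_k$ are complete segment lengths and $\ell_r$ is the length of the trailing fragment, whose TTR ($\mathrm{TTR}_r$) determines its fractional contribution to the factor count (Powell et al., 2022, Fu et al., 2021, Luis et al., 21 Nov 2025). MTLD is especially stable for varied text lengths and is the default recommendation in studies of short, scientific, or keyphrase-augmented texts (Powell et al., 2022, Luis et al., 21 Nov 2025).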
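The factor-counting procedure can be sketched as below. This follows the commonly described variant in which the text is scanned forward and backward and the two results are averaged, with the trailing fragment counted as a partial factor $(1 - \mathrm{TTR}_r)/(1 - \theta)$. Whether a factor closes at "below" or "at or below" the threshold differs between implementations, so the strict comparison here is an assumption.

```python
import math


def _mtld_one_direction(tokens, threshold):
    factors = 0.0
    types = set()
    count = 0
    for tok in tokens:
        count += 1
        types.add(tok)
        if len(types) / count < threshold:
            # Running TTR fell below the threshold: close this factor.
            factors += 1.0
            types = set()
            count = 0
    if count > 0:
        # Trailing fragment contributes a partial factor.
        remainder_ttr = len(types) / count
        factors += (1.0 - remainder_ttr) / (1.0 - threshold)
    # No repetition at all yields zero factors; report infinity.
    return len(tokens) / factors if factors > 0 else math.inf


def mtld(tokens, threshold=0.72):
    """Mean of forward and backward MTLD for a list of tokens."""
    if not tokens:
        raise ValueError("empty text")
    forward = _mtld_one_direction(tokens, threshold)
    backward = _mtld_one_direction(tokens[::-1], threshold)
    return (forward + backward) / 2.0
```

Higher values mean that, on average, more tokens are needed before the running TTR decays to the threshold.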
2.3. Probabilistic and Sequential Sampling
Methods such as Mean Segmental TTR (MSTTR) and Mean TTR in Sequential Samples (MTTRSS) average TTR over non-overlapping or randomly sampled fixed-length chunks. While these reduce local bias, they may retain some instability for short windows or highly inhomogeneous texts (Bestgen, 2023).
2.4. Expectation-Adjusted Normalization: PATTR, EAD
Penalty-Adjusted Type–Token Ratio (PATTR) augments the denominator with an absolute-deviation penalty to directly enforce a target response length $L_T$:
$$\mathrm{PATTR} = \frac{V}{N + \lvert N - L_T \rvert}.$$
This formulation neutralizes the pathological preference of classical metrics for short outputs in synthetic text filtering (Deshpande et al., 20 Jul 2025).
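As a hedged illustration of the penalty idea, the sketch below assumes the form $V / (N + \lvert N - L_T \rvert)$, that is, plain TTR with the absolute deviation from the target length added to the denominator. The exact formulation in the cited work may differ, and the function name `pattr` is hypothetical.

```python
def pattr(tokens, target_length):
    """TTR with an absolute-deviation length penalty in the denominator."""
    n = len(tokens)
    v = len(set(tokens))
    denominator = n + abs(n - target_length)
    if denominator == 0:
        raise ValueError("empty text with zero target length")
    return v / denominator
```

Under this assumed form, a three-token response with no repetition scores a raw TTR of 1.0 but only 0.3 against a target length of 10, so very short outputs no longer dominate the ranking.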
Expectation-Adjusted Distinct (EAD) replaces the denominator of the widely used Distinct-$n$ metric with the expected count of novel $n$-grams given the vocabulary size $\mathcal{V}$ and total $n$-gram count $C$ under a uniform null-model:
$$\mathrm{EAD} = \frac{U}{\mathcal{V}\left[1 - \left(\frac{\mathcal{V} - 1}{\mathcal{V}}\right)^{C}\right]},$$
where $U$ is the number of unique $n$-grams. EAD provides stable, length-insensitive measurement for dialog and generative tasks (Liu et al., 2022).
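The expectation adjustment can be sketched as follows, assuming the denominator is the expected number of distinct items after $C$ uniform draws from a vocabulary of a given size. The function name and the choice to pass `vocab_size` explicitly are assumptions for illustration.

```python
def expectation_adjusted_distinct(tokens, n, vocab_size):
    """Unique n-grams divided by their expected count under uniform draws."""
    if vocab_size < 1:
        raise ValueError("vocab_size must be positive")
    ngrams = [tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1)]
    total = len(ngrams)
    if total == 0:
        raise ValueError("text is shorter than n")
    unique = len(set(ngrams))
    # Expected number of distinct items seen in `total` uniform draws.
    expected = vocab_size * (1.0 - ((vocab_size - 1) / vocab_size) ** total)
    return unique / expected
```

Because the expected count saturates as the number of draws grows, the ratio does not collapse toward zero for long outputs the way plain Distinct-$n$ does.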
3. Diversity Indices from Ecology: Hill Numbers, Shannon, Simpson, Evenness
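Carrasco et al. (2023) extend the ecological family of Hill numbers to lexical diversity:
$${}^{q}D = \left(\sum_{i=1}^{V} p_i^{\,q}\right)^{1/(1-q)}.$$
Here $q = 0$ gives plain richness (${}^{0}D = V$), the limit $q \to 1$ yields the “effective number of types” through Shannon entropy (${}^{1}D = e^{H}$, with $H$ in natural-log units), and $q = 2$ defines the inverse Simpson index (${}^{2}D = 1/\sum_i p_i^2$) (Carrasco et al., 2023). This parameterization allows tuning sensitivity to rare versus common tokens.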
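A compact sketch of the Hill family, assuming the standard definition ${}^{q}D = (\sum_i p_i^q)^{1/(1-q)}$ with the $q \to 1$ limit handled separately as $e^{H}$. The function name `hill_number` is illustrative.

```python
import math
from collections import Counter


def hill_number(tokens, q):
    """Hill number of order q for the type distribution of `tokens`."""
    n = len(tokens)
    if n == 0:
        raise ValueError("empty text")
    probs = [c / n for c in Counter(tokens).values()]
    if abs(q - 1.0) < 1e-12:
        # Limit q -> 1: exponential of Shannon entropy (natural log).
        return math.exp(-sum(p * math.log(p) for p in probs))
    return sum(p ** q for p in probs) ** (1.0 / (1.0 - q))
```

For a perfectly even distribution every order returns the number of types; as the distribution becomes skewed, higher orders of $q$ drop faster because they weight frequent types more heavily.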
Evenness metrics, such as $E = H / \log A$, where $A$ is abundance and $H$ the Shannon entropy over type frequencies, are used to summarize how uniformly tokens are distributed across types and are robust to sample size variation in human–machine comparison studies (Kendro et al., 31 Jul 2025). The diversity-to-richness ratio further quantifies vocabulary usage uniformity in controlled vocabularies or metadata fields (Carrasco et al., 2023).
Yule’s K focuses on the concentration of repeated types:
$$K = 10^{4} \cdot \frac{\sum_{i=1}^{V} f_i^{2} - N}{N^{2}},$$
where $f_i$ is the frequency of type $i$. High $K$ reflects heavy repetition (low diversity); low $K$ signals high evenness (Cortes, 2021).
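A short sketch of Yule's K, assuming the conventional form $K = 10^4 (\sum_i f_i^2 - N)/N^2$ over type frequencies $f_i$. The function name is illustrative.

```python
from collections import Counter


def yules_k(tokens):
    """Yule's K: higher values indicate more concentrated repetition."""
    n = len(tokens)
    if n == 0:
        raise ValueError("empty text")
    sum_sq = sum(f * f for f in Counter(tokens).values())
    return 1e4 * (sum_sq - n) / (n * n)
```

A text with no repeated types scores 0, and the value rises as a few types account for a larger share of the tokens.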
4. Cross-Document, Redundancy, and Generation-Specific Measures
Certain metrics are developed for comparing lexical diversity across multiple generated outputs—critical in NLG, MT, and synthetic data evaluation:
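- Inter-sentence BLEU/chrF diversity ($i$-BLEU, $i$-chrF) calculates mean similarity across all output pairs from the same input, then inverts it on the 0–100 scale:
$$i\text{-BLEU} = 100 - \frac{1}{\lvert P \rvert}\sum_{(y, y') \in P} \mathrm{BLEU}(y, y'),$$
where $P$ is the set of output pairs; $i$-chrF is defined analogously. This metric directly quantifies output variety rather than within-text richness and is substantially less length- or domain-sensitive than traditional indices (Burchell et al., 2022).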
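- $n$-gram Diversity Score (NDS):
$$\mathrm{NDS} = \sum_{n=1}^{n_{\max}} \frac{U_n}{T_n},$$
where $U_n$ and $T_n$ are numbers of unique and total $n$-grams, respectively, summed up to a chosen maximum order $n_{\max}$. This is used to comparably assess prompt and response sets for generative models (Kambhatla et al., 23 May 2025).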
- Compression Ratio (CR):
$$\mathrm{CR} = \frac{\text{size of original text}}{\text{size of compressed text}}.$$
Lower CR indicates higher lexical unpredictability; it is applied both at token and POS-sequence levels to benchmark synthetic redundancy (Kambhatla et al., 23 May 2025, Deshpande et al., 20 Jul 2025).
- Self-Repetition (SR) and Homogenization with BERTScore extend diversity assessment to long-range, context-sensitive, and semantic redundancy, crucial for persona-prompting and LLM evaluation pipelines (Kambhatla et al., 23 May 2025).
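The n-gram and compression measures from the list above can be sketched together. Two assumptions are made here: the n-gram score sums unique-to-total ratios for orders 1 through a hypothetical `max_n` (default 4), and the compression ratio uses `zlib` (DEFLATE) as a stand-in for whichever compressor a given study actually used.

```python
import zlib


def ngram_diversity_score(tokens, max_n=4):
    """Sum of unique/total n-gram ratios for n = 1..max_n."""
    score = 0.0
    for n in range(1, max_n + 1):
        ngrams = [tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1)]
        if ngrams:  # skip orders longer than the text
            score += len(set(ngrams)) / len(ngrams)
    return score


def compression_ratio(text):
    """Original byte size divided by zlib-compressed byte size."""
    raw = text.encode("utf-8")
    if not raw:
        raise ValueError("empty text")
    return len(raw) / len(zlib.compress(raw))
```

Highly repetitive text compresses well and therefore yields a large ratio, while varied text yields a ratio closer to 1. Very short strings can even fall below 1 because of compressor overhead, so the ratio is most meaningful on pooled outputs.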
5. Advanced, Multidimensional, and Model-Based Indices
Recent work emphasizes the limits of single-number metrics and the importance of multidimensional or data-driven assessment:
- Autoencoder Diversity Score (ADS): This framework posits that the neural network capacity required to reconstruct a corpus at a given accuracy is a direct proxy for its lexical diversity and complexity:
$$\mathrm{ADS} = \min\{\, h : A(h) \ge \tau \,\},$$
where $A(h)$ is the reconstruction accuracy at hidden-layer width $h$ and $\tau$ is the accuracy threshold. ADS is robust to length, language, and highly duplicated samples, and captures not just token-level diversity but the contextual and structural intricacy of the corpus (Dang et al., 28 Feb 2025).
- Composite Dimensional Framework: Six dimensions—volume, abundance, variety–repetition (MATTR), evenness, disparity (semantic overlap via WordNet), and dispersion (proximity-based repetition)—allow for fine-grained discrimination between human and LLM output (Kendro et al., 31 Jul 2025). MATTR and dispersion, in combination with evenness and disparity, are especially effective for this purpose.
- Lexical Density and Categorically Conditioned TTRs are useful in morphologically rich or typologically diverse languages and can be systematically profiled for individual token classes (nouns, verbs, content words, etc.) (Luis et al., 21 Nov 2025).
The PUCP-Metrix package implements more than a dozen such measures (including TTR variants, MTLD, VoCD, Maas, lexical density, and lemma-based indices) for Spanish, offering comprehensive cross-category analysis (Luis et al., 21 Nov 2025).
6. Parameterization, Length Sensitivity, and Practical Recommendations
Regardless of metric, text length effects remain a central methodological concern. Evaluation by Carrasco et al. and others confirms:
- Raw TTR, and even transformed variants, retain strong downward length bias.
- Subsampling indices (HD–D, MATTR, MTLD) achieve high but not perfect length-invariance; their sensitivity to the window or threshold parameter must be empirically assessed on the target corpus (Bestgen, 2023).
- For any parameterized index, users should:
  - Set the window/segment/threshold to match the shortest or most “typical” text in the dataset.
  - Empirically verify parameter robustness (e.g., using ICC across length variation).
  - Pair at least one length-invariant metric (e.g., MTLD, HD–D) with an untransformed TTR-type index for side-by-side reporting (Bestgen, 2023, Powell et al., 2022).
For NLG and filtering tasks, controlling text length (via PATTR or explicit truncation) is as important as maximizing the raw diversity measure, and adjustments of the “target length” parameter allow for application-specific tuning (Deshpande et al., 20 Jul 2025).
Researchers are encouraged to report: (i) raw diversity curve vs. sample size, (ii) parameter settings for all indices, and (iii) cross-length stability analysis to ensure meaningful, fair cross-document or cross-group comparison (Carrasco et al., 2023, Bestgen, 2023).
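One simple way to produce the recommended diversity curve against sample size is to evaluate a metric on growing prefixes of the text. The helper names below are hypothetical, and prefix truncation is only one of several possible sampling schemes (random subsampling is another).

```python
def ttr(tokens):
    """Plain type-token ratio."""
    return len(set(tokens)) / len(tokens)


def diversity_curve(tokens, metric, lengths):
    """Evaluate `metric` on prefixes of the given lengths.

    Lengths that are non-positive or exceed the text are skipped.
    """
    return [
        (n, metric(tokens[:n]))
        for n in lengths
        if 0 < n <= len(tokens)
    ]
```

For example, `diversity_curve(tokens, ttr, [100, 200, 400, 800])` returns one `(length, value)` pair per usable prefix; a flat curve suggests length-invariance for that metric on that text, while a steady decline indicates length bias.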
7. Comparative Table of Major Lexical Diversity Metrics
| Metric/Class | Formula / Method | Length Sensitivity |
|---|---|---|
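| TTR | $V/N$ | High |
| RTTR | $V/\sqrt{N}$ | Moderate |
| HD–D | see above (hypergeom exp.) | Low |
| MATTR | avg. unique types in window | Low (window-dependent) |
| MTLD | mean factor length until TTR $< \theta$ | Low (threshold-dependent) |
| Hill diversity (${}^{q}D$) | $\left(\sum_i p_i^{\,q}\right)^{1/(1-q)}$ | Moderate; can be extrapolated |
| Entropy ($H$) | $-\sum_i p_i \log p_i$ | Moderate |
| Yule’s K | $10^{4}\left(\sum_i f_i^{2} - N\right)/N^{2}$ | Lower than TTR |
| EAD | $U / \left(\mathcal{V}\left[1 - \left((\mathcal{V}-1)/\mathcal{V}\right)^{C}\right]\right)$ | Minimal |
| PATTR | $V / (N + \lvert N - L_T \rvert)$ | Tunable (by $L_T$) |
| Autoencoder Diversity | Min net width $h$ s.t. $A(h) \ge \tau$ | Minimal |
| i-BLEU, i-chrF | $100 -$ mean pairwise BLEU/chrF across outputs | None (cross-output) |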
Length-invariant or quantitatively adjusted measures are essential for robust comparison whenever document size is unconstrained or systematically varies.
In sum, the landscape of lexical diversity measurement is characterized by a progression from naive but interpretable type–token ratios, through length-stabilized and ecological indices, to advanced data-driven and multidimensional frameworks. The optimal metric depends on corpus properties, task requirements, and the need for length or genre comparability. Systematic reporting, empirical assessment of parameter sensitivity, and explicit normalization for text length or register are now considered best practice in arXiv-scale analytic research.