Papers
Topics
Authors
Recent
Search
2000 character limit reached

TTR: Lexical Diversity & Corpus Analysis

Updated 13 March 2026
  • Type–Token Ratio (TTR) is a measure of lexical diversity defined as the ratio of unique word types to the total tokens in a text.
  • It applies mathematical formulations and empirical laws like Zipf's and Heaps' to model vocabulary growth and repetition patterns.
  • TTR informs practical applications in corpus design, synthetic text filtering, and entropy estimation, while requiring normalization due to length dependence.

The Type–Token Ratio (TTR) is a fundamental metric in quantitative linguistics, information theory, and statistical analysis of complex systems, widely used to quantify lexical diversity and vocabulary growth in large-scale corpora. Defined as the quotient of the number of distinct types (unique word forms or elements) and the total number of tokens (elemental instances), TTR encapsulates the degree of repetition or novelty present in a sequence. Its theoretical and applied significance is closely linked to statistical laws such as Zipf's law and Heaps' law, and its properties have motivated advanced generalizations and rigorous statistical modeling in diverse contexts (Rosillo-Rodes et al., 2024, Hidaka, 2013, Rosillo-Rodes et al., 3 Nov 2025, Font-Clos et al., 2014).

1. Fundamental Definitions and Mathematical Formulation

Let nn denote the total number of tokens in a text or sequence, and let V(n)V(n) be the number of distinct types observed among those tokens. The Type–Token Ratio is defined as

TTR(n)=V(n)n.\mathrm{TTR}(n) = \frac{V(n)}{n}\,.

A high TTR indicates low repetition and high diversity; a low TTR corresponds to a more repetitive, less diverse sequence. In practical corpus linguistics, TTR is “trivial to compute and to interpret as ‘new types per token’” (Choi et al., 2023). However, because V(n)V(n) almost always grows sublinearly in nn, TTR is inherently length-dependent and typically declines as the sample size increases, necessitating careful normalization for meaningful comparisons (Choi et al., 2023, Rosillo-Rodes et al., 2024, Hidaka, 2013).

2. Relation to Zipf’s and Heaps’ Laws

The statistical behavior of TTR is grounded in empirical scaling laws observed in language and other complex systems:

Substituting Heaps’ law into the TTR formula yields TTR(n)=knβ1\mathrm{TTR}(n) = k n^{\beta-1}, which decays as nn increases since β<1\beta<1.

Exact Asymptotics. A deterministic asymptotic theory relates TTR precisely to the Zipf exponent α\alpha (Rosillo-Rodes et al., 3 Nov 2025):

  • For 0α<10 \leq \alpha < 1, TTR(N)(1α)1/(1α)Nα/(1α)TTR(N) \sim (1-\alpha)^{1/(1-\alpha)}\,N^{\alpha/(1-\alpha)} (growing as a power of NN).
  • For α=1\alpha = 1, TTR(N)1/[lnN+γln(lnN+γ)]TTR(N) \sim 1/[\ln N + \gamma - \ln(\ln N + \gamma)] (decaying logarithmically slowly).
  • For α>1\alpha > 1, TTR(N)1/ζ(α)TTR(N)\to 1/\zeta(\alpha), saturating to a nonzero constant for large NN.

This theoretical framework demonstrates that classic Heaps’ law scaling emerges as a consequence of Zipfian count statistics, with critical corrections at α=1\alpha=1 and in the supercritical regime α>1\alpha>1 (Rosillo-Rodes et al., 3 Nov 2025).

3. Statistical Properties and Occupancy Theory

Probabilistically, the general type–token model (occupancy, or species–sampling) treats nn i.i.d. draws from a population of NN types, each with probability pip_i, yielding VnV_n observed types (Hidaka, 2013). The expected value is

E[Vn]=i=1N[1(1pi)n],E[V_n] = \sum_{i=1}^{N} \left[1 - (1 - p_i)^n\right],

so that

E[TTR(n)]=1ni=1N[1(1pi)n],E[\mathrm{TTR}(n)] = \frac{1}{n}\sum_{i=1}^{N} [1-(1-p_i)^n],

which is strictly decreasing in nn unless the underlying type distribution is uniform. The variance can be explicitly computed, and the distribution of VnV_n admits both multinomial and inclusion–exclusion forms. These exact formulas underpin advanced maximum-likelihood estimators of total type richness, outperforming classical Good–Turing and Horvitz–Thompson estimators, especially under heavy-tailed (Zipf) distributions (Hidaka, 2013). Standardization by sample size or use of fitted parametric models is required when TTR is used to compare lexical richness across texts of different lengths or corpora.

4. Empirical Behavior: Language, Morphology, and Register

Studies on massive gigaword corpora in English, Spanish, and Turkish reveal systematic dependence of TTR on morphological typology, register, and genre (Rosillo-Rodes et al., 2024):

  • Agglutinative languages (e.g., Turkish) display higher V(n)V(n) and TTR at fixed nn due to increased word-form diversity.
  • Informal registers (e.g., tweets) have higher TTR than formal registers (books), reflecting neologisms, abbreviations, and greater variability.
  • At very large nn (e.g., n109n\sim10^9), TTR values are typically in the range 10310^{-3} to 10210^{-2}, decreasing from Turkish tweets (TTR7.39×103\approx7.39\times10^{-3}) to English books (TTR1.20×103\approx1.20\times10^{-3}).

Analytically, fitting TTR(n)cnδTTR(n)\sim c n^\delta yields δβ1\delta\approx\beta-1 with high stability (r2>0.98r^2>0.98); baseline TTR must always be interpreted in terms of language and register-specific morphological structure (Rosillo-Rodes et al., 2024).

5. Applications, Generalizations, and Adjusted Measures

TTR serves multiple roles:

  • Corpus Design: In Mongolian, extrapolating TTR values using empirically fit Heaps’ laws, a minimal “sufficient” corpus size was identified at which incremental TTR gain drops below a small threshold, yielding the first practical stopping rule for balanced corpus construction (Choi et al., 2023).
  • Model Selection and Filtering: In synthetic text generation, TTR is used for diversity filtering. However, TTR and even sophisticated variants such as MATTR or Compression Ratio exhibit “short-text bias”, with short texts acquiring spuriously high TTR (Deshpande et al., 20 Jul 2025).
  • Penalty-Adjusted TTR (PATTR): This refinement penalizes deviations from a target length LTL_T, computing PATTR(w;LT)=Vocab(w)/(w+wLT)PATTR(w;L_T)=|\mathrm{Vocab}(w)|/(|w|+|\,|w|-L_T|), and substantially reduces length bias in LLM output selection tasks (Deshpande et al., 20 Jul 2025).

TTR is also tightly linked to entropy-based measures: across languages and registers, word entropy HH exhibits a strong, empirically determined functional dependence on TTR, allowing the latter to serve as a low-cost proxy for entropy-based lexical diversity, particularly in large-scale settings (Rosillo-Rodes et al., 2024).

6. Critical Discussion and Best-Practice Recommendations

Several limitations and methodological caveats must be noted:

  • Length Dependence: Raw TTR always declines with nn; for comparative and inferential uses, texts must be subsampled or TTR must be length-normalized via regression modeling (Rosillo-Rodes et al., 2024, Choi et al., 2023).
  • Type Definition Sensitivity: TTR values are sensitive to tokenization and type definition (word form, lemma, phrase), as well as corpus domain composition (Choi et al., 2023).
  • Universal Curves and Data Collapse: For systems adhering to Zipf distributions, vocabulary growth and TTR exhibit universal data collapse governed by the Zipf exponent, not a single power-law as posited by naive Heaps’ models. Empirical results confirm that in real corpora, including highly bursty natural language, the polylog/Lerch transcendent formulas provide superior fits (Font-Clos et al., 2014).
  • Estimator Choice: For vocabulary-size estimation, Poisson–binomial maximum-likelihood estimators operating under flexible Zipf-Poisson priors show the lowest bias and variance among all tested approaches (Hidaka, 2013).
  • Reporting: Standard practice is to report TTR at fixed nn or to present model-based TTR(n)TTR(n) curves, not raw observed TTR at varied lengths (Hidaka, 2013).

7. Summary of Key Theoretical and Empirical Relations

In summary, while the raw TTR is simple and widely adopted, rigorous statistical modeling, adjustment for sample size, and context-sensitive interpretation are essential for its scientific use in both empirical and theoretical domains.

Topic to Video (Beta)

No one has generated a video about this topic yet.

Whiteboard

No one has generated a whiteboard explanation for this topic yet.

Follow Topic

Get notified by email when new papers are published related to Type–Token Ratio (TTR).