TTR: Lexical Diversity & Corpus Analysis

Updated 13 March 2026

Type–Token Ratio (TTR) is a measure of lexical diversity defined as the ratio of unique word types to the total tokens in a text.
Its behavior is described by mathematical formulations and empirical laws, such as Zipf's and Heaps' laws, that model vocabulary growth and repetition patterns.
TTR informs practical applications in corpus design, synthetic text filtering, and entropy estimation, but it must be normalized because it depends on text length.

The Type–Token Ratio (TTR) is a fundamental metric in quantitative linguistics, information theory, and statistical analysis of complex systems, widely used to quantify lexical diversity and vocabulary growth in large-scale corpora. Defined as the quotient of the number of distinct types (unique word forms or elements) and the total number of tokens (elemental instances), TTR encapsulates the degree of repetition or novelty present in a sequence. Its theoretical and applied significance is closely linked to statistical laws such as Zipf's law and Heaps' law, and its properties have motivated advanced generalizations and rigorous statistical modeling in diverse contexts (Rosillo-Rodes et al., 2024, Hidaka, 2013, Rosillo-Rodes et al., 3 Nov 2025, Font-Clos et al., 2014).

1. Fundamental Definitions and Mathematical Formulation

Let $n$ denote the total number of tokens in a text or sequence, and let $V(n)$ be the number of distinct types observed among those tokens. The Type–Token Ratio is defined as

$\mathrm{TTR}(n) = \frac{V(n)}{n}\,.$

A high TTR indicates low repetition and high diversity; a low TTR corresponds to a more repetitive, less diverse sequence. In practical corpus linguistics, TTR is “trivial to compute and to interpret as ‘new types per token’” (Choi et al., 2023). However, because $V(n)$ almost always grows sublinearly in $n$, TTR is inherently length-dependent and typically declines as the sample size increases, necessitating careful normalization for meaningful comparisons (Choi et al., 2023, Rosillo-Rodes et al., 2024, Hidaka, 2013).
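As a minimal sketch of the definition, the following Python snippet computes $\mathrm{TTR}(n)$ for a token list and evaluates it on growing prefixes of a toy sentence. The lowercase-and-split tokenizer and the function names `tokenize` and `ttr` are illustrative assumptions, not part of any cited method.

```python
def tokenize(text):
    """Naive tokenizer (an assumption): lowercase and split on whitespace."""
    return text.lower().split()


def ttr(tokens):
    """Type-token ratio V(n)/n; returns 0.0 for an empty sequence."""
    n = len(tokens)
    return len(set(tokens)) / n if n else 0.0


tokens = tokenize("the cat sat on the mat and the dog sat on the rug")
# TTR on growing prefixes: repetition accumulates, so the ratio tends to fall.
for n in (4, 8, len(tokens)):
    print(n, round(ttr(tokens[:n]), 3))
```

Even on this toy input the ratio falls as the prefix grows, which is the length dependence that motivates normalization.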

2. Relation to Zipf’s and Heaps’ Laws

The statistical behavior of TTR is grounded in empirical scaling laws observed in language and other complex systems:

Zipf’s Law: The frequency of the $r$th most common type is inversely proportional to a power of its rank: $f(r) \propto r^{-\alpha}$. For natural language, $\alpha$ is typically close to $1$ (Rosillo-Rodes et al., 2024, Rosillo-Rodes et al., 3 Nov 2025).
Heaps’ Law: Vocabulary size $V(n)$ grows as a power law in $n$: $V(n) = k\,n^\beta$, where $0<\beta<1$ (Rosillo-Rodes et al., 2024, Choi et al., 2023).

Substituting Heaps’ law into the TTR formula yields $\mathrm{TTR}(n) = k\,n^{\beta-1}$, which decays as $n$ increases since $\beta<1$.
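
As an illustrative worked example with hypothetical parameters (chosen for convenience, not fitted to any corpus), take $k = 10$ and $\beta = 0.6$. Then

$\mathrm{TTR}(10^4) = 10 \cdot (10^4)^{-0.4} = 10^{-0.6} \approx 0.251, \qquad \mathrm{TTR}(10^6) = 10 \cdot (10^6)^{-0.4} = 10^{-1.4} \approx 0.040,$

so a hundredfold increase in $n$ lowers TTR by a factor of $100^{0.4} \approx 6.3$, even though the vocabulary itself grows from $V(10^4) \approx 2{,}512$ to $V(10^6) \approx 39{,}811$ types.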

Exact Asymptotics. A deterministic asymptotic theory relates TTR precisely to the Zipf exponent $\alpha$ (Rosillo-Rodes et al., 3 Nov 2025):

For $0 \leq \alpha < 1$, $\mathrm{TTR}(N) \sim (1-\alpha)^{1/(1-\alpha)}\,N^{\alpha/(1-\alpha)}$ (growing as a power of $N$).
For $\alpha = 1$, $\mathrm{TTR}(N) \sim 1/[\ln N + \gamma - \ln(\ln N + \gamma)]$ (decaying logarithmically slowly).
For $\alpha > 1$, $\mathrm{TTR}(N)\to 1/\zeta(\alpha)$, saturating to a nonzero constant for large $N$.

This theoretical framework demonstrates that classic Heaps’ law scaling emerges as a consequence of Zipfian count statistics, with critical corrections at $\alpha=1$ and in the supercritical regime $\alpha>1$ (Rosillo-Rodes et al., 3 Nov 2025).
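
To give a feel for how slow the logarithmic decay is, the sketch below simply evaluates the $\alpha = 1$ expression exactly as quoted above at a few values of $N$. The function name `ttr_zipf_alpha_one` is hypothetical, and the snippet is a numerical illustration of the quoted formula only, not a reimplementation of the cited derivation.

```python
import math

EULER_GAMMA = 0.5772156649015329  # Euler-Mascheroni constant


def ttr_zipf_alpha_one(N):
    """Evaluate 1 / [ln N + gamma - ln(ln N + gamma)] for N >= 1."""
    x = math.log(N) + EULER_GAMMA
    return 1.0 / (x - math.log(x))


# Each step multiplies N by 1000, yet the value shrinks only modestly.
for N in (10**3, 10**6, 10**9):
    print(N, round(ttr_zipf_alpha_one(N), 4))
```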

3. Statistical Properties and Occupancy Theory

Probabilistically, the general type–token model (occupancy, or species–sampling) treats $n$ i.i.d. draws from a population of $N$ types, each with probability $p_i$, yielding $V_n$ observed types (Hidaka, 2013). The expected value is

$E[V_n] = \sum_{i=1}^{N} \left[1 - (1 - p_i)^n\right],$

so that

$E[\mathrm{TTR}(n)] = \frac{1}{n}\sum_{i=1}^{N} [1-(1-p_i)^n],$

which is strictly decreasing in $n$ for any type distribution, because each term $[1-(1-p_i)^n]/n$ with $p_i>0$ decreases as $n$ grows; a uniform distribution slows the decline but does not remove it. The variance can be explicitly computed, and the distribution of $V_n$ admits both multinomial and inclusion–exclusion forms. These exact formulas underpin advanced maximum-likelihood estimators of total type richness, outperforming classical Good–Turing and Horvitz–Thompson estimators, especially under heavy-tailed (Zipf) distributions (Hidaka, 2013). Standardization by sample size or use of fitted parametric models is required when TTR is used to compare lexical richness across texts of different lengths or corpora.
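
The expectation formula can be checked numerically. The sketch below evaluates $E[\mathrm{TTR}(n)]$ for a finite Zipf distribution and compares it with a small Monte Carlo simulation of i.i.d. draws. The population size (1000 types), the exponent, the number of trials, and all function names are assumptions made for illustration.

```python
import random


def zipf_probs(num_types, alpha=1.0):
    """Normalized Zipf probabilities p_i proportional to i**(-alpha)."""
    weights = [i ** -alpha for i in range(1, num_types + 1)]
    total = sum(weights)
    return [w / total for w in weights]


def expected_ttr(probs, n):
    """E[TTR(n)] = (1/n) * sum_i [1 - (1 - p_i)**n] for i.i.d. draws."""
    return sum(1.0 - (1.0 - p) ** n for p in probs) / n


def simulated_ttr(probs, n, trials=200, seed=0):
    """Monte Carlo estimate of E[TTR(n)] as a sanity check."""
    rng = random.Random(seed)
    types = range(len(probs))
    total = 0.0
    for _ in range(trials):
        sample = rng.choices(types, weights=probs, k=n)
        total += len(set(sample)) / n
    return total / trials


probs = zipf_probs(1000, alpha=1.0)
for n in (10, 100, 1000):
    print(n, round(expected_ttr(probs, n), 3), round(simulated_ttr(probs, n), 3))
```

The exact and simulated columns should agree closely, and both fall as $n$ grows.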

4. Empirical Behavior: Language, Morphology, and Register

Studies on massive gigaword corpora in English, Spanish, and Turkish reveal systematic dependence of TTR on morphological typology, register, and genre (Rosillo-Rodes et al., 2024):

Agglutinative languages (e.g., Turkish) display higher $V(n)$ and TTR at fixed $n$ due to increased word-form diversity.
Informal registers (e.g., tweets) have higher TTR than formal registers (books), reflecting neologisms, abbreviations, and greater variability.
At very large $n$ (e.g., $n\sim10^9$), TTR values are typically in the range $10^{-3}$ to $10^{-2}$, decreasing from Turkish tweets (TTR $\approx7.39\times10^{-3}$) to English books (TTR $\approx1.20\times10^{-3}$).

Fitting $\mathrm{TTR}(n)\sim c\,n^\delta$ yields $\delta\approx\beta-1$ with high stability ($r^2>0.98$); baseline TTR must always be interpreted in terms of language- and register-specific morphological structure (Rosillo-Rodes et al., 2024).
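
A power-law fit of this kind is commonly done by ordinary least squares in log-log space. The sketch below shows one way to recover $c$, $\delta$, and $r^2$ from $(n, \mathrm{TTR})$ pairs. The function name `fit_power_law` and the synthetic data (generated from Heaps' law with hypothetical $k = 10$, $\beta = 0.6$) are assumptions; the fitting procedure used in the cited study may differ.

```python
import math


def fit_power_law(ns, ttrs):
    """Least-squares fit of log(TTR) = log(c) + delta * log(n).

    Returns (c, delta, r_squared).
    """
    xs = [math.log(n) for n in ns]
    ys = [math.log(t) for t in ttrs]
    m = len(xs)
    mx, my = sum(xs) / m, sum(ys) / m
    sxx = sum((x - mx) ** 2 for x in xs)
    sxy = sum((x - mx) * (y - my) for x, y in zip(xs, ys))
    syy = sum((y - my) ** 2 for y in ys)
    delta = sxy / sxx
    c = math.exp(my - delta * mx)
    r_squared = (sxy * sxy) / (sxx * syy)
    return c, delta, r_squared


# Synthetic check (hypothetical values): TTR generated from Heaps' law with
# k = 10 and beta = 0.6 should give delta close to beta - 1 = -0.4.
ns = [10 ** e for e in range(3, 8)]
ttrs = [10 * n ** (0.6 - 1) for n in ns]
c, delta, r2 = fit_power_law(ns, ttrs)
print(round(c, 3), round(delta, 3), round(r2, 3))
```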

5. Applications, Generalizations, and Adjusted Measures

TTR serves multiple roles:

Corpus Design: For Mongolian, TTR values were extrapolated using empirically fitted Heaps’ laws to identify a minimal “sufficient” corpus size at which the incremental change in TTR drops below a small threshold, yielding the first practical stopping rule for balanced corpus construction (Choi et al., 2023).
Model Selection and Filtering: In synthetic text generation, TTR is used for diversity filtering. However, TTR and even sophisticated variants such as MATTR or Compression Ratio exhibit “short-text bias”, with short texts acquiring spuriously high TTR (Deshpande et al., 20 Jul 2025).
Penalty-Adjusted TTR (PATTR): This refinement penalizes deviations from a target length $L_T$, computing $\mathrm{PATTR}(w;L_T)=\frac{|\mathrm{Vocab}(w)|}{|w|+\bigl|\,|w|-L_T\,\bigr|}$, and substantially reduces length bias in LLM output selection tasks (Deshpande et al., 20 Jul 2025).
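
The PATTR formula above translates directly into a few lines of Python. This is a minimal sketch of the formula as stated, not the cited authors' implementation; the function name `pattr`, the whitespace tokenization, and the toy inputs are assumptions.

```python
def pattr(tokens, target_len):
    """Penalty-adjusted TTR: |Vocab(w)| / (|w| + | |w| - L_T |)."""
    n = len(tokens)
    denom = n + abs(n - target_len)
    return len(set(tokens)) / denom if denom else 0.0


short = "a b c".split()                      # 3 tokens, all distinct
on_target = "a b c d e f a b c d".split()    # 10 tokens, 6 types
print(pattr(short, 10), pattr(on_target, 10))
```

With a target length of 10, the short text has a raw TTR of 1.0 but a PATTR of 0.3, while the on-target text keeps its raw value of 0.6, so the short text no longer wins on length alone.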

TTR is also tightly linked to entropy-based measures: across languages and registers, word entropy $H$ exhibits a strong, empirically determined functional dependence on TTR, allowing the latter to serve as a low-cost proxy for entropy-based lexical diversity, particularly in large-scale settings (Rosillo-Rodes et al., 2024).
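
To make the two quantities concrete, the sketch below computes both on the same token list. It uses the plug-in (maximum-likelihood) Shannon entropy of the unigram distribution, which is an assumption: the cited work may use a different entropy estimator, and the function names are hypothetical.

```python
import math
from collections import Counter


def word_entropy(tokens):
    """Plug-in Shannon entropy (bits) of the unigram distribution."""
    n = len(tokens)
    if n == 0:
        return 0.0
    counts = Counter(tokens)
    return sum(-(c / n) * math.log2(c / n) for c in counts.values())


def ttr(tokens):
    """Type-token ratio V(n)/n; returns 0.0 for an empty sequence."""
    return len(set(tokens)) / len(tokens) if tokens else 0.0


repetitive = "a a a a a a b b".split()
diverse = "a b c d e f g h".split()
for name, toks in (("repetitive", repetitive), ("diverse", diverse)):
    print(name, round(ttr(toks), 3), round(word_entropy(toks), 3))
```

On these toy inputs the more diverse sequence has both the higher TTR and the higher entropy, which is the qualitative direction of the dependence described above.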

6. Critical Discussion and Best-Practice Recommendations

Several limitations and methodological caveats must be noted:

Length Dependence: Raw TTR declines with $n$ in expectation; for comparative and inferential uses, texts must be subsampled to a common length or TTR must be length-normalized via regression modeling (Rosillo-Rodes et al., 2024, Choi et al., 2023).
Type Definition Sensitivity: TTR values are sensitive to tokenization and type definition (word form, lemma, phrase), as well as corpus domain composition (Choi et al., 2023).
Universal Curves and Data Collapse: For systems adhering to Zipf distributions, vocabulary growth and TTR exhibit universal data collapse governed by the Zipf exponent, not a single power-law as posited by naive Heaps’ models. Empirical results confirm that in real corpora, including highly bursty natural language, the polylog/Lerch transcendent formulas provide superior fits (Font-Clos et al., 2014).
Estimator Choice: For vocabulary-size estimation, Poisson–binomial maximum-likelihood estimators operating under flexible Zipf-Poisson priors show the lowest bias and variance among all tested approaches (Hidaka, 2013).
Reporting: Standard practice is to report TTR at fixed $n$ or to present model-based $TTR(n)$ curves, not raw observed TTR at varied lengths (Hidaka, 2013).
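
One simple way to report TTR at a fixed $n$ is to average it over consecutive non-overlapping chunks of exactly $n$ tokens. The sketch below illustrates that option; the chunking scheme and the function name `ttr_at_fixed_n` are assumptions, and random subsampling or model-based curves are equally valid alternatives.

```python
def ttr_at_fixed_n(tokens, n):
    """Mean TTR over consecutive non-overlapping chunks of exactly n tokens.

    Trailing tokens that do not fill a chunk are dropped. Returns None if the
    text is shorter than n.
    """
    chunks = [tokens[i:i + n] for i in range(0, len(tokens) - n + 1, n)]
    if not chunks:
        return None
    return sum(len(set(chunk)) / n for chunk in chunks) / len(chunks)


long_text = "a b c d a b c d a b c d".split()   # raw TTR = 4/12
short_text = "a b c d".split()                  # raw TTR = 4/4
print(ttr_at_fixed_n(long_text, 4), ttr_at_fixed_n(short_text, 4))
```

Raw TTR would rank the short text far above the long one, whereas at a fixed $n = 4$ the two texts receive the same score.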

7. Summary of Key Theoretical and Empirical Relations

TTR measures lexical diversity as $V(n)/n$, is analytically tractable, and is strongly governed by Zipf/Heaps statistical structure (Rosillo-Rodes et al., 3 Nov 2025, Rosillo-Rodes et al., 2024, Font-Clos et al., 2014).
For large-scale data, TTR vs. $n$ follows well-predicted, language- and register-specific scaling laws, with explicit closed-form corrections beyond naive power-law heuristics (Rosillo-Rodes et al., 3 Nov 2025, Font-Clos et al., 2014).
Advanced estimation and adjustment are necessary for robust, interpretable use, both in corpus linguistics, for corpus size optimization (Choi et al., 2023), and in synthetic data filtering, for controlling length bias (Deshpande et al., 20 Jul 2025).
The effects of morphological typology and register are nontrivial; cross-linguistic and cross-register comparisons require model-based standardization (Rosillo-Rodes et al., 2024).
Asymptotically, the theoretical link between TTR and entropy enables unified analysis of lexical diversity, making TTR a useful proxy metric under proper methodological controls (Rosillo-Rodes et al., 2024).

In summary, while the raw TTR is simple and widely adopted, rigorous statistical modeling, adjustment for sample size, and context-sensitive interpretation are essential for its scientific use in both empirical and theoretical domains.

References

Entropy and type-token ratio in gigaword corpora (2024)

General Type Token Distribution (2013)

Complete asymptotic type-token relationship for growing complex systems with inverse power-law count rankings (2025)

Log-log Convexity of Type-Token Growth in Zipf's Systems (2014)

A Study on the Appropriate size of the Mongolian general corpus (2023)

A Penalty Goes a Long Way: Measuring Lexical Diversity in Synthetic Texts Under Prompt-Influenced Length Variations (2025)
