Textual Frequency Law (TFL)
- Textual Frequency Law is a scaling principle governing type frequency distributions in texts, revealing length-invariant patterns and unifying classical Zipf and Heaps laws.
- It demonstrates that, when normalized, frequency distributions across texts collapse onto a universal scaling function, providing precise predictive and diagnostic tools.
- The framework employs robust statistical models and empirical validations across languages and systems, and it has practical applications in corpus analysis and large language model training.
The Textual Frequency Law (TFL) is a unifying scaling principle governing the distribution of type frequencies (words, tokens, or more general expression units) in human and artificial language texts. TFL posits that, under appropriate normalization, the frequency distributions of types in texts of different lengths or granularities collapse to a length-invariant form; the apparent “power laws” observed in word and type statistics are specific limiting cases or components of this broader regularity. TFL subsumes classical Zipf and Heaps laws, extends to various language systems (including code and character-based scripts), and provides both predictive and diagnostic tools for corpus analysis, language modeling, and natural language processing.
1. Scaling Formulation and Mathematical Statement
The foundational principle of TFL is that, for a homogeneous text or corpus of length $L$, the type-frequency distribution $D_L(n)$ admits the scaling ansatz
$$D_L(n) = \frac{1}{L\,V_L}\, g\!\left(\frac{n}{L}\right),$$
where $n$ is the count of a type, $V_L$ the number of distinct types, and $g$ an $L$-independent scaling function determined by the global properties of the corpus (Font-Clos et al., 2013, Corral et al., 2018). Equivalently, for relative frequency $x = n/L$: $L\,V_L\,D_L(n) = g(x)$. Upon plotting $L\,V_L\,D_L(n)$ versus $n/L$ for various $L$, data from all sub-texts or sizes collapse onto the curve $g(x)$, a nontrivial statement of scale invariance.
For lemmatized texts, the scaling function $g(x)$ typically exhibits a double power-law form, with an exponent $\gamma$ close to 2 in the high-frequency tail and a smaller exponent $\beta$ in the low-frequency regime. This induces corresponding asymptotic regimes for the frequency distribution $D_L(n)$:
- $D_L(n) \propto n^{-\gamma}$ for large $n$,
- $D_L(n) \propto n^{-\beta}$ for small $n$.
Such scaling holds for both raw tokenizations and lemmatized units, with the specific $g$ reflecting the degree of morphological aggregation (Corral et al., 2014).
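The data-collapse test described above can be sketched in a few lines. This is a minimal illustration, assuming the text is already tokenized into a Python list and that sub-texts are taken as prefixes; the function names `rescaled_distribution` and `collapse_curves` are hypothetical and not taken from the cited papers.

```python
from collections import Counter

def rescaled_distribution(tokens):
    """Return points (n/L, L*V_L*D_L(n)) for one text given as a token list."""
    L = len(tokens)                                   # text length in tokens
    type_counts = Counter(tokens)                     # count n of each type
    V = len(type_counts)                              # number of distinct types V_L
    count_of_counts = Counter(type_counts.values())   # number of types with count n
    points = []
    for n, num_types in sorted(count_of_counts.items()):
        D = num_types / V                             # D_L(n): share of types with count n
        points.append((n / L, L * V * D))
    return points

def collapse_curves(tokens, lengths):
    """Rescaled curves for prefixes of the text (assumption: prefixes as sub-texts)."""
    return {L: rescaled_distribution(tokens[:L])
            for L in lengths if 0 < L <= len(tokens)}
```

If the scaling hypothesis holds, the point sets returned for different lengths should fall on a common curve when plotted on log-log axes. The raw values are noisy at large counts, so a real analysis would likely bin counts logarithmically before comparing curves.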
2. Relationship to Zipf’s and Heaps' Laws
TFL formally unifies Zipf's law for rank–frequency relations and Heaps’ empirical vocabulary growth law, historically considered independent phenomena:
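- Zipf’s Law: In rank–frequency form, the normalized frequency $f(r)$ of the type ranked $r$ follows
$$f(r) \propto r^{-\alpha}$$
with $\alpha \approx 1$, typically valid only over a limited range of ranks (Allahverdyan et al., 2013, Moreno-Sánchez et al., 2015, Deng et al., 2013). The scaling function $g(x)$ reduces to a single power law $g(x) \propto x^{-\gamma}$ with $\gamma = 1 + 1/\alpha \approx 2$ in the classical regime.
- Heaps' Law: TFL predicts that the vocabulary size $V_L$ is determined by the integral of $g$: $V_L \simeq \int_{1/L}^{\infty} g(x)\,dx$, which, for a single power law $g(x) \propto x^{-\gamma}$ with $\gamma > 1$, yields $V_L \propto L^{\gamma-1}$. The double power-law form, with high-frequency exponent $\gamma$ and a smaller low-frequency exponent $\beta$, implies a crossover: $V_L \propto L^{\gamma-1}$ for small $L$ and slower growth ($V_L \propto L^{\beta-1}$ when $\beta > 1$) for large $L$, consistently observed in real texts (Font-Clos et al., 2013, Corral et al., 2018).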
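Table: Parameter Regimes of the TFL Scaling Function ($x = n/L$; $\beta$ and $\gamma$ denote the low- and high-frequency exponents)
| Regime | Scaling of $g(x)$ | Implication |
|---|---|---|
| Small $x$ | $g(x) \propto x^{-\beta}$ | Low-frequency: $D_L(n) \propto n^{-\beta}$ |
| Large $x$ | $g(x) \propto x^{-\gamma}$ | High-frequency: $D_L(n) \propto n^{-\gamma}$ |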
Simple power-law fits ignoring the existence of the crossover are misleading when analyzing real natural language corpora (Font-Clos et al., 2013, Moreno-Sánchez et al., 2015, Corral et al., 2014).
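The link between the scaling function and vocabulary growth can be made explicit with a short worked derivation. It is a sketch under stated assumptions: the scaling ansatz $D_L(n) = g(n/L)/(L V_L)$, normalization of $D_L(n)$ over counts $n \ge 1$, and a continuum approximation of the sum; the amplitude $A$ is a symbol introduced here for illustration.

```latex
\begin{align}
1 &= \sum_{n \ge 1} D_L(n)
   = \frac{1}{L\,V_L} \sum_{n \ge 1} g\!\left(\frac{n}{L}\right)
   \approx \frac{1}{V_L} \int_{1/L}^{\infty} g(x)\,dx \\
\Rightarrow \quad V_L &\approx \int_{1/L}^{\infty} g(x)\,dx \\
\text{If } g(x) = A\,x^{-\gamma},\ \gamma > 1: \quad
V_L &\approx A \int_{1/L}^{\infty} x^{-\gamma}\,dx
     = \frac{A}{\gamma - 1}\, L^{\gamma - 1}
\end{align}
```

Because the lower limit $1/L$ moves toward small $x$ as the text grows, a scaling function with two regimes makes the effective growth exponent change with $L$, which is why a single fitted Heaps exponent can mislead.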
3. Statistical and Theoretical Underpinnings
The TFL scaling arises naturally from both probabilistic models and nonparametric scaling arguments:
- Bayesian Latent-variable Model: Assuming a multinomial word-drawing process with an inverse-square prior on type probabilities, parameterized by a regularization constant and the vocabulary size, one obtains a generalized Zipf law for the rank–frequency relation, with corrections reproducing both the cutoff for frequent types and the hapax legomena tail. This prior reflects an efficient "mental lexicon" organization and is invariant under multiplicative preference updates (Allahverdyan et al., 2013, Deng et al., 2013).
- Finite-size Scaling: The scaling form $D_L(n) = g(n/L)/(L\,V_L)$ can be derived using generalized central-limit theorem arguments for heavy-tailed distributions with exponent $\gamma$: under these assumptions, the vocabulary $V_L$ grows as a power of $L$ fixed by $\gamma$, and the full frequency distribution at any $L$ is simply a rescaled version of $g$ (Corral et al., 2018).
- Random Book Transformation (RBT): Real texts' frequency distributions for arbitrary sections can be constructed exactly by sampling from an underlying "meta-book" distribution through the RBT matrix, implying that the functional shape is a prediction of the scaling law rather than a pure power law (0906.0716).
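The idea that a section of a text behaves like a random sample of a larger whole can be explored with a small simulation. The sketch below does not implement the RBT matrix itself; it swaps in plain uniform sampling of tokens without replacement, and all function names are hypothetical.

```python
import random
from collections import Counter

def frequency_spectrum(tokens):
    """Map count n -> number of types that occur exactly n times."""
    return Counter(Counter(tokens).values())

def random_subtext(tokens, length, seed=0):
    """Draw `length` tokens uniformly without replacement (word order is ignored)."""
    rng = random.Random(seed)          # fixed seed so runs are repeatable
    return rng.sample(tokens, length)

def vocabulary_growth(tokens, lengths, seed=0):
    """Vocabulary size of a random sub-text at each requested length."""
    return {L: len(set(random_subtext(tokens, L, seed))) for L in lengths}
```

Comparing `frequency_spectrum(random_subtext(tokens, L))` with the spectrum of an actual section of the same length gives a rough check of how far real sections depart from random sampling.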
4. Empirical Validation and Linguistic Universality
- Robust Data Collapse: Direct empirical evidence from long single-author texts in multiple languages shows that rescaled frequency distributions at varying $L$ superimpose, strongly supporting the scaling hypothesis (Font-Clos et al., 2013, Corral et al., 2018).
- Exponent Stability: In a large-scale analysis of English texts from Project Gutenberg, fitting the complementary cumulative distribution function (CCDF) of type frequencies with power-law forms yields clear support for a stable Zipf exponent across many datasets, confirming exponent universality under the TFL framework (Moreno-Sánchez et al., 2015).
- Morphological Level Invariance: TFL exponents are stable across both word forms and lemmatized units. Comparison across 10 major novels in English, Spanish, French, and Finnish reveals only small, systematic increases in low-frequency cutoffs and minor exponent drift after lemmatization; the core scaling remains robust (Corral et al., 2014).
- Multiple Language Systems: The scaling law applies to Chinese character frequencies, with short texts following a Zipf regime and long texts displaying an additional exponential decay region for less frequent types—a hierarchic two-layer structure explained by TFL extensions (Deng et al., 2013).
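Exponent fits of the kind summarized above can be illustrated with a short sketch. It computes an empirical CCDF of type counts and estimates a tail exponent with the continuous-approximation maximum-likelihood formula $\hat\gamma = 1 + m / \sum_i \ln(n_i / n_{\min})$. This estimator is a stand-in chosen for brevity, not necessarily the fitting procedure of the cited studies, and it is only approximate for integer counts; the function names and the cutoff `n_min` are assumptions.

```python
import math
from collections import Counter

def ccdf(counts):
    """Empirical CCDF: list of (n, share of types with count >= n)."""
    total = len(counts)
    spectrum = Counter(counts)
    remaining = total
    out = []
    for n in sorted(spectrum):
        out.append((n, remaining / total))
        remaining -= spectrum[n]       # drop the types with exactly this count
    return out

def tail_exponent_mle(counts, n_min):
    """Continuous-approximation MLE of gamma for counts >= n_min."""
    tail = [n for n in counts if n >= n_min]
    log_sum = sum(math.log(n / n_min) for n in tail)
    if not tail or log_sum == 0:
        raise ValueError("tail is empty or degenerate")
    return 1.0 + len(tail) / log_sum
```

Here `counts` is the list of per-type counts, for example `list(Counter(tokens).values())`. The estimate depends on the choice of `n_min`, so it is worth scanning several cutoffs before reading anything into the value.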
5. Generalizations, Special Cases, and Extensions
- Artificial Code and Benford Co-occurrence: In artificial languages (Java, C++), the rank-frequency scaling persists but exponents are markedly steeper, particularly in long code, and a Benford-like law simultaneously governs the distribution of leading digits in type frequencies; both signatures are highly robust to frequency outlier removal. This dual pattern is interpreted as a deeper, unifying statistical signature of linguistic systems, natural or artificial (Shulzinger et al., 2018).
- Narrative Hierarchy and Text Segmentation: In cohesive, meaningful texts, the TFL exhibits non-stationarity: when texts are split into halves, the onset rank of the Zipfian regime occurs earlier and with more homogeneous spatial distribution in the first half, correlating with thematic introduction and information flow. Random text shuffling or synthetic bag-of-words texts fail to reproduce these systematic differences (Deng et al., 2018).
- Critical Phenomena Analogy: TFL extends to the statistics of inter-appearance gaps for words: the distribution of gap lengths $\ell$ for a word of frequency $f$ follows a scaling form in which rescaling by frequency collapses the curves onto a universal function, supporting an analogy to universality classes and correlation lengths in critical phenomena (0901.2924).
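The Benford-like signature mentioned above for type frequencies can be checked with a short sketch. It compares the observed share of each leading digit with the Benford probabilities $P(d) = \log_{10}(1 + 1/d)$; the function names and the use of total variation distance as a summary are assumptions made for illustration, not the procedure of the cited study.

```python
import math
from collections import Counter

def benford_expected():
    """Benford probabilities P(d) = log10(1 + 1/d) for leading digits 1..9."""
    return {d: math.log10(1 + 1 / d) for d in range(1, 10)}

def leading_digit_distribution(frequencies):
    """Observed share of each leading digit among positive integer frequencies."""
    digits = [int(str(f)[0]) for f in frequencies if f > 0]
    if not digits:
        raise ValueError("no positive frequencies")
    counts = Counter(digits)
    return {d: counts.get(d, 0) / len(digits) for d in range(1, 10)}

def total_variation(observed, expected):
    """Half the L1 distance between two leading-digit distributions."""
    return 0.5 * sum(abs(observed[d] - expected[d]) for d in range(1, 10))
```

A small distance between `leading_digit_distribution(counts)` and `benford_expected()` is consistent with a Benford-like pattern, although a proper assessment would need a significance test and a large enough vocabulary.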
6. Applications in Language Modeling and LLM Training
- Textual Frequency in LLMs: TFL principles have recently been applied to prompt engineering and curriculum design in LLMs. Empirical studies show that high-frequency paraphrases of prompts consistently yield better downstream performance in tasks ranging from math reasoning and machine translation to commonsense QA. Sentence-level frequency estimates, calculated as the geometric mean of constituent word frequencies, can be used for input selection, and fine-tuning on frequency-sorted curricula (low-to-high) yields significant gains (Lu et al., 2 Apr 2026).
- Curriculum and Paraphrase Selection: The TFL framework supports paraphrase selection by maximizing estimated frequency and suggests Textual Frequency Distillation methods for refining frequency estimates using LLM-generated continuations. Such data-centric approaches demonstrate that textual frequency, rather than syntactic complexity, is the principal driver of LLM response quality, independently validated across multiple models and languages.
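The sentence-level estimate and the two selection strategies described above can be sketched as follows. This is an illustration under several assumptions: whitespace tokenization with lowercasing, a hypothetical `word_freq` table of relative word frequencies, and a small `floor` value for unseen words. None of these details come from the cited work.

```python
import math

def sentence_frequency(sentence, word_freq, floor=1e-9):
    """Geometric mean of per-word relative frequencies, computed in log space."""
    words = sentence.lower().split()           # assumption: whitespace tokenization
    if not words:
        raise ValueError("empty sentence")
    # Unseen words fall back to `floor` so the logarithm stays defined.
    log_sum = sum(math.log(max(word_freq.get(w, 0.0), floor)) for w in words)
    return math.exp(log_sum / len(words))

def pick_most_frequent(paraphrases, word_freq):
    """Return the paraphrase with the highest estimated sentence frequency."""
    return max(paraphrases, key=lambda s: sentence_frequency(s, word_freq))

def low_to_high_curriculum(sentences, word_freq):
    """Order training sentences from low to high estimated frequency."""
    return sorted(sentences, key=lambda s: sentence_frequency(s, word_freq))
```

Working in log space avoids underflow for long sentences. The choice of `floor` strongly affects how sentences containing rare or unseen words are ranked, so it should be treated as a tunable parameter.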
7. Controversies, Limitations, and Open Directions
- Exponent Drift and Scaling Validity: Claims that Zipf exponents drift systematically with text length have been attributed to model-fitting artifacts and failure to rescale frequency distributions. Proper scaling collapses reveal exponent constancy within the fit regime for broad classes of texts (Corral et al., 2018, 0906.0716).
- Cutoff and Finite-size Effects: The onset and extent of Zipfian scaling are sensitive to data size, corpus segmentation, and lemmatization. TFL prescribes correction for these effects by explicit dependence on both $L$ and $V_L$, yielding more accurate and transferable quantitative statements than naive power laws (Font-Clos et al., 2013, Corral et al., 2014).
- Generality and Meta-book Hypothesis: The random book transformation, and the TFL more widely, imply that author- or genre-specific "meta-book" distributions underlie observed type frequency regularities, with true universality restricted to scaling function shape rather than exponent value (0906.0716).
- Unresolved Questions: Quantitative connections between TFL, semantic information structure, and hierarchical text organization remain an active area. Improved models for the dynamic evolution of frequency distributions in interactive or adaptive systems, and extensions to sentence or document-level distributions, require further development (Deng et al., 2018, Lu et al., 2 Apr 2026).
References:
- (Font-Clos et al., 2013, Corral et al., 2014, Corral et al., 2018, Moreno-Sánchez et al., 2015, Allahverdyan et al., 2013, Deng et al., 2013, 0906.0716, Shulzinger et al., 2018, 0901.2924, Deng et al., 2018, Lu et al., 2 Apr 2026)