
Textual Frequency Law (TFL)

Updated 2 July 2026
  • Textual Frequency Law is a scaling principle governing type frequency distributions in texts, revealing length-invariant patterns and unifying classical Zipf and Heaps laws.
  • It demonstrates that, when normalized, frequency distributions across texts collapse onto a universal scaling function, providing precise predictive and diagnostic tools.
  • The framework employs robust statistical models and empirical validations across languages and systems, and it has practical applications in corpus analysis and large language model training.

The Textual Frequency Law (TFL) is a unifying scaling principle governing the distribution of type frequencies (words, tokens, or more general expression units) in human and artificial language texts. TFL posits that, under appropriate normalization, the frequency distributions of types in texts of different lengths or granularities collapse to a length-invariant form; the apparent “power laws” observed in word and type statistics are specific limiting cases or components of this broader regularity. TFL subsumes classical Zipf and Heaps laws, extends to various language systems (including code and character-based scripts), and provides both predictive and diagnostic tools for corpus analysis, language modeling, and natural language processing.

1. Scaling Formulation and Mathematical Statement

The foundational principle of TFL is that, for a homogeneous text or corpus of length $L$, the type-frequency distribution $D_L(n)$ admits the scaling ansatz

$$D_L(n) = \frac{g(n/L)}{L \, V_L},$$

where $n$ is the count of a type, $V_L$ the number of distinct types, and $g(\cdot)$ an $L$-independent scaling function determined by the global properties of the corpus (Font-Clos et al., 2013, Corral et al., 2018). Equivalently, for relative frequency $f = n/L$: $D_L(n) \, dn = [g(f)/V_L]\,df$. Upon plotting $L V_L D_L(n)$ versus $n/L$ for various $L$, data from all sub-texts or sizes collapse onto the single curve $g$, a nontrivial statement of scale invariance.
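
The collapse test described above can be sketched in a few lines. This is a minimal illustration, assuming the text is already tokenized into a list of types; the function names `frequency_distribution` and `rescaled_points` are hypothetical and not taken from the cited works.

```python
from collections import Counter

def frequency_distribution(tokens):
    """Return (L, V_L, D), where D[n] is the fraction of types occurring exactly n times."""
    counts = Counter(tokens)                      # type -> count n
    L = len(tokens)                               # text length in tokens
    V = len(counts)                               # number of distinct types, V_L
    types_per_count = Counter(counts.values())    # n -> number of types with count n
    D = {n: k / V for n, k in types_per_count.items()}
    return L, V, D

def rescaled_points(tokens):
    """Return sorted (n/L, L * V_L * D_L(n)) pairs, the axes used for the collapse plot."""
    L, V, D = frequency_distribution(tokens)
    return sorted((n / L, L * V * d) for n, d in D.items())

if __name__ == "__main__":
    sample = "a b a c a b d".split()
    for x, y in rescaled_points(sample):
        print(f"n/L = {x:.3f}   L*V_L*D_L(n) = {y:.1f}")
```

To probe the scaling hypothesis, one would call `rescaled_points` on prefixes of a long text at several lengths and overlay the resulting point sets on log-log axes; under TFL they should fall on one curve.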

For lemmatized texts, $g$ typically exhibits a double power-law form, with one exponent governing the high-frequency tail and a different exponent governing the low-frequency regime. Because $D_L(n)$ at fixed $L$ is a rescaled copy of $g$, this induces corresponding asymptotic regimes for $D_L(n)$:

  • a power-law decay with the high-frequency exponent for large $n$,
  • a power-law decay with the low-frequency exponent for small $n$.

Such scaling holds for both raw tokenizations and lemmatized units, with the specific shape of the scaling function reflecting the degree of morphological aggregation (Corral et al., 2014).

2. Relationship to Zipf’s and Heaps' Laws

TFL formally unifies Zipf's law for rank–frequency relations and Heaps’ empirical vocabulary growth law, historically considered independent phenomena:

  • Zipf’s Law: In rank–frequency form, the normalized frequency $f(r)$ of the type ranked $r$ follows

$$f(r) \propto r^{-\alpha},$$

with $\alpha$ close to 1 in the classical statement of the law, typically valid only over a limited range of ranks (Allahverdyan et al., 2013, Moreno-Sánchez et al., 2015, Deng et al., 2013). In the classical regime, the scaling function $g$ reduces to a single power law.

  • Heaps' Law: TFL predicts that the vocabulary size $V_L$ is determined by the integral of $g$, which follows from the normalization of $D_L(n)$. For a single power-law $g$, this yields a power-law growth of $V_L$ with $L$, that is, Heaps' law. The double power-law form implies a crossover between two growth regimes of $V_L$, one for small $L$ and one for large $L$, consistently observed in real texts (Font-Clos et al., 2013, Corral et al., 2018).
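
The step from the scaling ansatz to vocabulary growth can be written out as a short calculation. What follows is a sketch under a continuum approximation (the sum over $n$ is replaced by an integral), and the constant $C$ and exponent $\gamma$ are symbols introduced here purely for illustration rather than values taken from the cited works.

```latex
% Normalization of the type-frequency distribution, with x = n/L and 1 <= n <= L:
\sum_{n \ge 1} D_L(n) = 1
\;\Longrightarrow\;
1 \approx \int_{1}^{L} \frac{g(n/L)}{L\,V_L}\,dn
  = \frac{1}{V_L}\int_{1/L}^{1} g(x)\,dx
\;\Longrightarrow\;
V_L \approx \int_{1/L}^{1} g(x)\,dx .

% Illustrative single power law, g(x) = C x^{-\gamma} with \gamma > 1:
V_L \approx \frac{C}{\gamma-1}\left(L^{\gamma-1}-1\right) \;\propto\; L^{\gamma-1}
\qquad (L \gg 1).

% Illustrative regime g(x) = C x^{-1}:
V_L \approx C \ln L .
```

When $g$ changes exponent at some crossover value of $x$, the lower integration limit $1/L$ sweeps through that value as $L$ grows, which is how a crossover in $g$ becomes a crossover in the growth of $V_L$.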

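Table: Parameter Regimes of the TFL Scaling Function

Regime        Scaling of $g$                                Implication
Small $n/L$   Power law with the low-frequency exponent     Low-frequency: $D_L(n)$ decays with the same exponent at small $n$
Large $n/L$   Power law with the high-frequency exponent    High-frequency: $D_L(n)$ decays with the same exponent at large $n$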

Simple power-law fits ignoring the existence of the crossover are misleading when analyzing real natural language corpora (Font-Clos et al., 2013, Moreno-Sánchez et al., 2015, Corral et al., 2014).
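
One simple way to check whether a single exponent describes vocabulary growth is to look at local log-log slopes of $V_L$ against $L$ rather than fitting one global line. The sketch below assumes a pre-tokenized text; `vocabulary_growth` and `local_slopes` are hypothetical helper names, and the slope estimate is a plain finite difference, not the fitting procedure of the cited papers.

```python
import math

def vocabulary_growth(tokens, lengths):
    """Return [(L, V_L)] for each requested prefix length L of the token list."""
    wanted = set(lengths)
    seen, curve = set(), []
    for i, tok in enumerate(tokens, start=1):
        seen.add(tok)                     # types observed in the first i tokens
        if i in wanted:
            curve.append((i, len(seen)))
    return curve

def local_slopes(curve):
    """Log-log slope of V_L versus L between consecutive points of the curve."""
    return [
        (math.log(v2) - math.log(v1)) / (math.log(l2) - math.log(l1))
        for (l1, v1), (l2, v2) in zip(curve, curve[1:])
    ]

if __name__ == "__main__":
    text = ["a", "b", "a", "c", "a", "b", "d", "a"]
    curve = vocabulary_growth(text, [2, 4, 8])
    print(curve)
    print(local_slopes(curve))
```

A roughly constant slope across lengths is compatible with a single power law; a slope that drifts systematically with $L$ points to a crossover that a single fit would hide.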

3. Statistical and Theoretical Underpinnings

The TFL scaling arises naturally from both probabilistic models and nonparametric scaling arguments:

  • Bayesian Latent-variable Model: Assuming a multinomial word-drawing process with an inverse-square prior on type probabilities, parameterized by a regularization constant and the vocabulary size, one obtains a generalized Zipf law with corrections that reproduce both the cutoff for frequent types and the hapax legomena tail. This prior reflects an efficient "mental lexicon" organization and is invariant under multiplicative preference updates (Allahverdyan et al., 2013, Deng et al., 2013).

  • Finite-size Scaling: The scaling form $D_L(n) = g(n/L)/(L \, V_L)$ can be derived using generalized central-limit theorem arguments for heavy-tailed distributions: under these assumptions, the growth of the vocabulary size $V_L$ with $L$ follows from the heavy tail, and the full frequency distribution at any $L$ is simply a rescaled version of $g$ (Corral et al., 2018).
  • Random Book Transformation (RBT): Real texts' frequency distributions for arbitrary sections can be constructed exactly by sampling from an underlying "meta-book" distribution through the RBT matrix, implying that the functional shape is a prediction of the scaling law rather than a pure power law (0906.0716).
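
The idea of deriving a section's frequency statistics from those of a larger text can be illustrated with hypergeometric thinning. This is a sketch in the spirit of the random book transformation, assuming tokens are sampled uniformly without replacement; the exact matrix used in the cited work may differ, and the function name `thinned_type_counts` is hypothetical.

```python
from collections import Counter
from math import comb

def thinned_type_counts(tokens, m):
    """Expected number of types seen exactly k times in a uniform random
    sample of m tokens drawn without replacement from `tokens`.

    Key 0 holds the expected number of types that drop out of the sample.
    """
    M = len(tokens)
    if not 0 <= m <= M:
        raise ValueError("sample size must be between 0 and len(tokens)")
    # K -> number of types that occur K times in the full text
    full = Counter(Counter(tokens).values())
    total = comb(M, m)
    expected = Counter()
    for K, n_types in full.items():
        for k in range(min(K, m) + 1):
            # Hypergeometric probability that a type with K occurrences
            # contributes exactly k of the m sampled tokens.
            p = comb(K, k) * comb(M - K, m - k) / total
            expected[k] += n_types * p
    return dict(expected)

if __name__ == "__main__":
    print(thinned_type_counts(["a", "a", "b", "c"], 2))
```

Two sanity checks hold by construction: the expected counts sum to the number of types in the full text, and the expected counts weighted by `k` sum to the sample size `m`.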

4. Empirical Validation and Linguistic Universality

  • Robust Data Collapse: Direct empirical evidence from long single-author texts in multiple languages shows that rescaled frequency distributions at varying $L$ superimpose, strongly supporting the scaling hypothesis (Font-Clos et al., 2013, Corral et al., 2018).
  • Exponent Stability: In a large-scale analysis of English texts (Project Gutenberg), fitting the CCDF of type frequencies with a power-law-type model yields clear support for a stable exponent across many datasets, supporting exponent universality under the TFL framework (Moreno-Sánchez et al., 2015).
  • Morphological Level Invariance: TFL exponents are stable across both word forms and lemmatized units. Comparison across 10 major novels in English, Spanish, French, and Finnish reveals only small, systematic increases in low-frequency cutoffs and minor exponent drift after lemmatization; the core scaling remains robust (Corral et al., 2014).
  • Multiple Language Systems: The scaling law applies to Chinese character frequencies, with short texts following a Zipf regime and long texts displaying an additional exponential decay region for less frequent types—a hierarchic two-layer structure explained by TFL extensions (Deng et al., 2013).
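
The empirical CCDF of type frequencies mentioned above is straightforward to compute. The following is a minimal sketch, assuming a pre-tokenized text and taking the CCDF at count $n$ to be the fraction of types with count at least $n$; the name `type_frequency_ccdf` is hypothetical.

```python
from collections import Counter

def type_frequency_ccdf(tokens):
    """Return [(n, S(n))] in increasing n, where S(n) is the fraction of
    types whose count is greater than or equal to n."""
    per_count = Counter(Counter(tokens).values())   # n -> number of types with count n
    V = sum(per_count.values())                     # total number of types
    ccdf, running = [], 0
    for n in sorted(per_count, reverse=True):       # accumulate from the largest count down
        running += per_count[n]
        ccdf.append((n, running / V))
    return ccdf[::-1]

if __name__ == "__main__":
    print(type_frequency_ccdf("a b a c a b d".split()))
    # -> [(1, 1.0), (2, 0.5), (3, 0.25)]
```

Working with the CCDF instead of the raw histogram avoids binning choices, which is one reason it is a common starting point for fitting heavy-tailed data.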

5. Generalizations, Special Cases, and Extensions

  • Artificial Code and Benford Co-occurrence: In artificial languages (Java, C++), the rank-frequency scaling persists but exponents are markedly steeper, most of all in long code, and a Benford-like law simultaneously governs the distribution of leading digits in type frequencies; both signatures are highly robust to frequency outlier removal. This dual pattern is interpreted as a deeper, unifying statistical signature of linguistic systems, natural or artificial (Shulzinger et al., 2018).
  • Narrative Hierarchy and Text Segmentation: In cohesive, meaningful texts, the TFL exhibits non-stationarity: when texts are split into halves, the onset rank of the Zipfian regime occurs earlier and with more homogeneous spatial distribution in the first half, correlating with thematic introduction and information flow. Random text shuffling or synthetic bag-of-words texts fail to reproduce these systematic differences (Deng et al., 2018).
  • Critical Phenomena Analogy: TFL extends to the statistics of inter-appearance gaps for words: the gap distribution obeys a scaling form in word frequency and gap length with a universal scaling function, supporting an analogy to universality classes and correlation lengths in critical phenomena (0901.2924).
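
Two of the diagnostics in this section, leading digits of type frequencies and inter-appearance gaps, can be computed directly from a token list. The sketch below assumes a pre-tokenized text; the function names are hypothetical, and the Benford reference values use the standard formula $\log_{10}(1 + 1/d)$ rather than anything specific to the cited study.

```python
import math
from collections import Counter

def benford_expected():
    """Benford's law: probability of leading digit d is log10(1 + 1/d)."""
    return {d: math.log10(1 + 1 / d) for d in range(1, 10)}

def leading_digit_distribution(tokens):
    """Observed share of each leading digit (1-9) among the type counts."""
    counts = Counter(tokens).values()
    digits = Counter(int(str(n)[0]) for n in counts)
    V = sum(digits.values())
    return {d: digits.get(d, 0) / V for d in range(1, 10)}

def inter_appearance_gaps(tokens, word):
    """Distances between consecutive positions of `word` in the token list."""
    positions = [i for i, tok in enumerate(tokens) if tok == word]
    return [b - a for a, b in zip(positions, positions[1:])]

if __name__ == "__main__":
    sample = "a b a c a b d".split()
    print(leading_digit_distribution(sample))
    print(inter_appearance_gaps(sample, "a"))   # -> [2, 2]
```

Comparing `leading_digit_distribution` against `benford_expected` on a real corpus gives a quick check of the Benford-like signature, and the gap lists for words of similar frequency can be pooled to estimate a gap distribution.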

6. Applications in Language Modeling and LLM Training

  • Textual Frequency in LLMs: TFL principles have recently been applied to prompt engineering and curriculum design in LLMs. Empirical studies show that high-frequency paraphrases of prompts consistently yield better downstream performance in tasks ranging from math reasoning and machine translation to commonsense QA. Sentence-level frequency estimates, calculated as the geometric mean of constituent word frequencies, can be used for input selection, and fine-tuning on frequency-sorted curricula (low-to-high) yields significant gains (Lu et al., 2 Apr 2026).
  • Curriculum and Paraphrase Selection: The TFL framework supports paraphrase selection by maximizing estimated frequency and suggests Textual Frequency Distillation methods for refining frequency estimates using LLM-generated continuations. Such data-centric approaches demonstrate that textual frequency, rather than syntactic complexity, is the principal driver of LLM response quality, independently validated across multiple models and languages.
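
The sentence-level frequency estimate and its two uses, paraphrase selection and curriculum ordering, can be sketched as follows. This is an illustration only: the whitespace tokenization, the `floor` value substituted for unseen words, and all function names are assumptions made here, not details from the cited study.

```python
import math

def sentence_frequency(sentence, word_freq, floor=1e-9):
    """Geometric mean of the relative frequencies of the words in `sentence`.

    `word_freq` maps lowercase words to relative frequencies; `floor`
    (an assumption of this sketch) is used for words missing from the table.
    """
    words = sentence.lower().split()
    if not words:
        raise ValueError("sentence has no words")
    # Average in log space to avoid underflow for long sentences.
    log_sum = sum(math.log(max(word_freq.get(w, 0.0), floor)) for w in words)
    return math.exp(log_sum / len(words))

def pick_paraphrase(candidates, word_freq):
    """Choose the candidate with the highest estimated sentence frequency."""
    return max(candidates, key=lambda s: sentence_frequency(s, word_freq))

def curriculum_order(sentences, word_freq):
    """Order sentences from low to high estimated frequency."""
    return sorted(sentences, key=lambda s: sentence_frequency(s, word_freq))

if __name__ == "__main__":
    freqs = {"the": 0.04, "cat": 0.01, "feline": 0.0001}
    print(pick_paraphrase(["the feline", "the cat"], freqs))   # -> the cat
    print(curriculum_order(["the cat", "the feline"], freqs))
```

The geometric mean keeps the score comparable across sentences of different lengths, whereas a plain product of word frequencies would penalize longer sentences.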

7. Controversies, Limitations, and Open Directions

  • Exponent Drift and Scaling Validity: Claims that Zipf exponents drift systematically with text length have been attributed to model-fitting artifacts and failure to rescale frequency distributions. Proper scaling collapses reveal exponent constancy within the fit regime for broad classes of texts (Corral et al., 2018, 0906.0716).
  • Cutoff and Finite-size Effects: The onset and extent of Zipfian scaling are sensitive to data size, corpus segmentation, and lemmatization. TFL prescribes correction for these effects by explicit dependence on both $L$ and $V_L$, yielding more accurate and transferable quantitative statements than naive power laws (Font-Clos et al., 2013, Corral et al., 2014).
  • Generality and Meta-book Hypothesis: The random book transformation, and the TFL more widely, imply that author- or genre-specific "meta-book" distributions underlie observed type frequency regularities, with true universality restricted to scaling function shape rather than exponent value (0906.0716).
  • Unresolved Questions: Quantitative connections between TFL, semantic information structure, and hierarchical text organization remain an active area. Improved models for the dynamic evolution of frequency distributions in interactive or adaptive systems, and extensions to sentence or document-level distributions, require further development (Deng et al., 2018, Lu et al., 2 Apr 2026).
