Textual Frequency Distillation (TFD)
- Textual Frequency Distillation (TFD) is a method that distills an LLM’s internal sentence frequency through model-generated story completions to overcome discrepancies with open-corpus data.
- It fuses traditional frequency estimates with LLM-derived metrics using tunable hyperparameters, thereby creating model-aware and adaptive metrics.
- TFD supports applications like prompt selection and curriculum fine-tuning, driving significant performance gains in translation, reasoning, and other NLP tasks.
Textual Frequency Distillation (TFD) is a central mechanism within the Textual Frequency Law (TFL) framework, developed to estimate and utilize sentence-level frequency information in LLMs whose pretraining corpora are proprietary. TFD refines raw, off-the-shelf frequency counts into model-aware metrics by using story completions generated directly by the LLM of interest. This process enables the calibration of frequency measures to reflect the “knowledge” and token distribution internal to closed-source LLMs and thereby improves both prompting and fine-tuning strategies across diverse NLP tasks (Lu et al., 2 Apr 2026).
1. Definition and Role within the Textual Frequency Law
TFD is the process by which an LLM’s sense of sentence frequency is distilled through interaction. Standard sentence frequency estimates computed from an open-domain corpus may be insufficient because that corpus can differ from the (often undisclosed) data used to train modern LLMs. TFD addresses this limitation by prompting the LLM to generate an extended story for each candidate sentence and collecting the outputs into a distilled corpus. A new, distilled frequency estimate is then computed from this LLM-generated data.
The resulting fused frequency metric integrates both the open-corpus-based and the LLM-distilled estimates, with three tunable hyperparameters controlling the combination. This fused measure is used within TFL to select among paraphrases and to drive the curriculum in fine-tuning setups.
2. Theoretical Motivation and Objectives
TFD is motivated by the premise that an LLM’s ability to naturally extend an input sentence is correlated with its exposure to similar examples during pretraining. If a sentence was frequent in an LLM’s actual training data, the model should be more capable of reliably generating coherent continuations that preserve or reuse key vocabulary and phrasing. TFD thus serves the following objectives:
- Domain Adaptivity: Mitigates bias in the open-corpus frequency estimate caused by discrepancies between open corpora and the LLM’s training data, by producing a frequency estimate endogenous to the model.
- Model Alignment: Amplifies scores for sentences recognized as “common” by the model, while down-weighting rare or unfamiliar constructions.
- Zero-Frequency Recovery: Supplies frequency estimates for sentences unseen in the open corpus by leveraging the distilled estimate, with a dedicated hyperparameter boosting confidence in such cases (Lu et al., 2 Apr 2026).
3. Algorithmic Structure and Implementation
The TFD algorithm comprises four key steps:
- Off-the-Shelf Frequency Calculation: Compute initial sentence frequencies from an open-domain corpus, using the geometric mean of inverse word frequencies over the words of each sentence.
- Distilled Corpus Generation: For each sentence in the dataset, issue a story-completion prompt to the LLM. The returned stories are parsed into sentences, producing the synthetic (distilled) corpus.
- Distilled Frequency Computation: Compute the distilled sentence frequency identically to the off-the-shelf frequency, but based on unigram counts from the distilled corpus.
- Frequency Fusion and Storage: Integrate the two frequency measures to obtain the final fused frequency, which is stored for downstream use. Efficiency considerations include batching queries, truncating excessively long outputs, reusing stored frequencies, and consistent use of the wordfreq library.
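The four steps above can be sketched in Python as follows. This is a minimal illustration under stated assumptions, not the authors' implementation: the tokenizer, the sentence splitter, the prompt wording, the `complete_story` callable (a stand-in for an LLM call), and the function names are all hypothetical. The sentence score is a geometric mean over word frequencies oriented so that a higher value means a more common sentence; the exact convention around inverse word frequencies is not spelled out in this article, so that orientation is an assumption. The `fuse` rule and its `alpha`, `beta`, and `gamma` parameters are likewise placeholders for the three tunable hyperparameters, whose actual form is not given here.

```python
import math
import re
from collections import Counter
from typing import Callable, Dict, Iterable, List

_TOKEN = re.compile(r"[a-z0-9']+")


def tokenize(text: str) -> List[str]:
    """Lowercase and split into simple word tokens (toy tokenizer)."""
    return _TOKEN.findall(text.lower())


def split_sentences(text: str) -> List[str]:
    """Naive sentence splitter: break after '.', '!' or '?' plus whitespace."""
    return [p for p in re.split(r"(?<=[.!?])\s+", text.strip()) if p]


def unigram_frequencies(corpus: Iterable[str]) -> Dict[str, float]:
    """Relative unigram frequencies over a corpus given as sentences."""
    counts = Counter(tok for sent in corpus for tok in tokenize(sent))
    total = sum(counts.values())
    return {w: c / total for w, c in counts.items()} if total else {}


def sentence_frequency(sentence: str, word_freq: Dict[str, float]) -> float:
    """Geometric mean of word frequencies; 0.0 if any word is unseen."""
    tokens = tokenize(sentence)
    if not tokens or any(word_freq.get(t, 0.0) <= 0.0 for t in tokens):
        return 0.0
    # Work in log space to avoid underflow on long sentences.
    return math.exp(sum(math.log(word_freq[t]) for t in tokens) / len(tokens))


def distill_corpus(dataset: Iterable[str],
                   complete_story: Callable[[str], str],
                   max_chars: int = 2000) -> List[str]:
    """Ask the model for a story per sentence and split it into sentences."""
    corpus: List[str] = []
    for sentence in dataset:
        # Hypothetical prompt wording; long outputs are truncated.
        story = complete_story(f"Continue this as a short story: {sentence}")
        corpus.extend(split_sentences(story[:max_chars]))
    return corpus


def fuse(f_open: float, f_distill: float,
         alpha: float = 0.5, beta: float = 0.5, gamma: float = 1.0) -> float:
    """HYPOTHETICAL fusion rule, for illustration only."""
    if f_open == 0.0:
        # Sentence unseen in the open corpus: rely on the distilled estimate.
        return gamma * f_distill
    return alpha * f_open + beta * f_distill


def tfd_scores(dataset: List[str], open_corpus: Iterable[str],
               complete_story: Callable[[str], str]) -> Dict[str, float]:
    """Run all four steps and return a fused score per dataset sentence."""
    open_wf = unigram_frequencies(open_corpus)
    distilled_wf = unigram_frequencies(distill_corpus(dataset, complete_story))
    return {s: fuse(sentence_frequency(s, open_wf),
                    sentence_frequency(s, distilled_wf)) for s in dataset}


if __name__ == "__main__":
    toy_open = ["the cat sat on the mat", "the dog sat"]
    stub_llm = lambda prompt: "The zebra ran. The zebra sat!"  # stands in for an LLM
    print(tfd_scores(["the zebra sat"], toy_open, stub_llm))
```

In a real run, the toy open corpus would likely be replaced by word frequencies from the wordfreq library mentioned above (for example its `word_frequency(word, lang)` function), and `complete_story` by a call to the LLM under study.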
4. Mathematical Formalization
TFD’s frequency metrics can be summarized as follows:
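- Sentence-level frequency in the open corpus: the geometric mean of inverse word frequencies, taken over the words of the sentence, with word statistics drawn from the open-domain corpus.
- Distilled frequency from the distilled corpus: the same computation, with word statistics drawn from the LLM-generated corpus.
- Final combined frequency: a fusion of the open-corpus and distilled frequencies, governed by the three tunable hyperparameters.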
These formulations are specifically used to re-rank candidate paraphrases (prompt selection) and to design fine-tuning curricula (see next section).
5. Practical Integration and Application
After the TFD-derived fused frequency values are computed, two principal applications follow within the TFL pipeline:
- Prompt Selection: For a set of paraphrases, select the sentence that maximizes the fused frequency and use it as the LLM prompt.
- Curriculum Textual Frequency Training (CTFT): Fine-tune the LLM with the dataset sorted by increasing fused frequency at each epoch, so that the model sees lower-frequency sentences before higher-frequency ones.
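Both applications reduce to a lookup over precomputed scores. The sketch below assumes the fused frequencies are already available in a dictionary keyed by sentence; the function names `select_prompt` and `curriculum_order` are hypothetical, and treating a missing score as 0.0 is an assumption made for illustration.

```python
from typing import Dict, List


def select_prompt(paraphrases: List[str], fused: Dict[str, float]) -> str:
    """Return the paraphrase with the highest fused frequency."""
    # Sentences without a stored score are treated as frequency 0.0 (assumption).
    return max(paraphrases, key=lambda s: fused.get(s, 0.0))


def curriculum_order(dataset: List[str], fused: Dict[str, float]) -> List[str]:
    """Return the dataset sorted from lowest to highest fused frequency."""
    return sorted(dataset, key=lambda s: fused.get(s, 0.0))


if __name__ == "__main__":
    scores = {"Kindly compute the sum.": 0.1,
              "Add the numbers.": 0.5,
              "Work out the total.": 0.3}
    print(select_prompt(list(scores), scores))
    print(curriculum_order(list(scores), scores))
```

Because `sorted` is stable, sentences with equal scores keep their original relative order, and `max` returns the first of several tied paraphrases; a real pipeline might want an explicit tie-breaking rule.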
Table: Summary of TFD Workflow
| Step | Input | Output |
|---|---|---|
| Off-the-shelf frequency | Open corpus, dataset | Initial frequencies |
| Story completion | Dataset, LLM | Distilled corpus |
| Distilled frequency | Distilled corpus, dataset | Model-aware frequencies |
| Fusion and application | Off-the-shelf and distilled frequencies | Final fused frequency for selection/training |
Implementation practices such as batching LLM queries, managing API rate limits, and disk-based storage of synthesized corpora and frequency statistics are recommended for scalability and reproducibility.
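A rough sketch of how batching and disk-based storage might fit together is shown below. Everything in it is an assumption for illustration: `complete_batch` is a hypothetical callable that takes a list of sentences and returns one story per sentence, the JSON cache layout is arbitrary, and the fixed pause between batches is only a crude stand-in for real rate-limit handling.

```python
import json
import time
from pathlib import Path
from typing import Callable, Dict, Iterator, List


def batched(items: List[str], size: int) -> Iterator[List[str]]:
    """Yield consecutive chunks of at most `size` items."""
    for start in range(0, len(items), size):
        yield items[start:start + size]


def distill_with_cache(dataset: List[str],
                       complete_batch: Callable[[List[str]], List[str]],
                       cache_path: Path,
                       batch_size: int = 8,
                       pause_seconds: float = 0.0) -> Dict[str, str]:
    """Generate stories in batches, skipping sentences already on disk."""
    cache: Dict[str, str] = {}
    if cache_path.exists():
        cache = json.loads(cache_path.read_text(encoding="utf-8"))
    # Deduplicate while preserving order, then drop cached sentences.
    pending = [s for s in dict.fromkeys(dataset) if s not in cache]
    for batch in batched(pending, batch_size):
        for sentence, story in zip(batch, complete_batch(batch)):
            cache[sentence] = story
        # Persist after every batch so an interrupted run can resume.
        cache_path.write_text(json.dumps(cache, ensure_ascii=False, indent=2),
                              encoding="utf-8")
        time.sleep(pause_seconds)
    return cache
```

Writing the cache after each batch trades some I/O for resumability: a rerun with the same `cache_path` only queries the model for sentences that have no stored story yet.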
6. Empirical Impact and Evaluation
Empirical results confirm the centrality of TFD to the overall effectiveness of TFL strategies (Lu et al., 2 Apr 2026):
- Mathematical Reasoning (GSM8K): Accuracy improvements with TFL+TFD include GPT-4o-mini (60.7% → 68.7%), DeepSeek-V3 (63.6% → 71.5%), LLaMA-3.3-70B (80.5% → 88.8%).
- Machine Translation (100 Languages): BLEU, chrF, and COMET metrics all improve in 91–100% of language directions using high-frequency paraphrases from TFD.
- Commonsense Reasoning (CommonsenseQA): Accuracy gains of 2–3 points across multiple LLMs.
- Agentic Tool Calling: Tool selection accuracy and correctness improve by 3–6 points.
- Fine-Tuning (CTFT using TFD): Up to +30% relative BLEU improvement on low-resource translation directions compared to standard methods.
Ablation studies show that omitting TFD (i.e., using only the off-the-shelf, open-corpus frequency) consistently degrades performance, especially in machine translation evaluated with COMET, which indicates that the TFD component is necessary for the reported gains.
7. Significance within Model Understanding and Data Selection
TFD provides a robust methodology for inferring model-sensitive frequency statistics when direct access to training data is unavailable. By constructing a synthetic, model-generated corpus that reflects the LLM’s internal distribution, TFD enables nuanced selection and ordering of data for downstream tasks. This mechanism, by bridging the gap between external corpus statistics and the model’s latent knowledge, is critical for enhancing both prompt engineering and systematic fine-tuning, underpinning the performance improvements described in empirical evaluations. The architecture and workflow of TFD thus constitute a significant methodological advance in aligning data selection strategies with the idiosyncrasies of large, closed-source LLMs (Lu et al., 2 Apr 2026).