Curriculum Textual Frequency Training (CTFT)

Updated 2 July 2026

CTFT is a supervised training paradigm that orders examples by their textual frequency, presenting easy (high-frequency) instances before hard (low-frequency) ones.
CTFT is reported to improve optimization stability and generalization, with BLEU and chrF scores up to 30% higher in language model fine-tuning.
CTFT is implemented by sorting training data on scalar frequency scores and applying a pacing strategy, and has been used for both pretraining and fine-tuning.

Curriculum Textual Frequency Training (CTFT) is a supervised training paradigm in which training examples are presented to a model according to a time-varying schedule induced by their textual frequency. The core principle is to sequence fine-tuning or pretraining data in an “easy-to-hard” order defined by statistical frequency measures—such as unigram or geometric sentence-level frequencies—so that the model encounters high-frequency (easier) examples first, and lower-frequency (harder, tail) examples later. CTFT spans applications from sequence learning in cognitively-plausible neural network models to large-scale LLM pretraining and adaptation, and is empirically validated to stabilize optimization and improve generalization, especially in capacity-constrained regimes (Sen et al., 2020, Elgaar et al., 29 Jan 2026, Lu et al., 2 Apr 2026).

1. Formal Definition and Frequency Metrics

CTFT requires a scalar frequency score for every sample; every curriculum construction and pacing policy rests on this signal.

Word Frequency (LLM Pretraining):

For a sample $z$ of length $N$ (e.g., a 2048-token slice), the word-level frequency score is

$\mathrm{Score}_{\mathrm{freq}}(z) = \frac{1}{N} \sum_{i=1}^N \mathrm{Zipf}(w_i)$

where $\mathrm{Zipf}(w)$ is the $\log_{10}$ frequency of $w$ per billion words (SUBTLEX-US or equivalent) (Elgaar et al., 29 Jan 2026).
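
A minimal sketch of this score, assuming raw word counts from some reference corpus are available as a dictionary. The names `zipf`, `score_freq`, and `floor_count`, and the toy counts, are illustrative assumptions and do not come from the cited work; in particular, treating unseen words as if they occurred once is only one possible convention.

```python
import math

def zipf(count, corpus_tokens):
    """Zipf value: log10 of the word's frequency per billion tokens."""
    return math.log10(count / corpus_tokens * 1e9)

def score_freq(tokens, counts, corpus_tokens, floor_count=1):
    """Mean Zipf value over the tokens of one sample.

    Words missing from `counts` are treated as if seen `floor_count` times
    (an assumption made here so that the logarithm is always defined).
    """
    if not tokens:
        raise ValueError("sample must contain at least one token")
    values = [zipf(counts.get(tok, floor_count), corpus_tokens) for tok in tokens]
    return sum(values) / len(values)

# Hypothetical toy counts from a 1,000,000-token reference corpus.
counts = {"the": 50_000, "cat": 100, "sat": 10}
print(score_freq(["the", "cat", "sat"], counts, 1_000_000))
```

Higher scores mark samples dominated by common words, which the curriculum treats as easier.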

Sentence Frequency (Supervised Fine-tuning):

For a sentence $x = (x_1, \ldots, x_K)$, the geometric mean frequency is

$\mathrm{sfreq}(x; D) = \left( \prod_{k=1}^K \mathrm{wfreq}(x_k, D) \right)^{1/K}$

where $\mathrm{wfreq}(x_k, D)$ is the (unigram or bigram) token frequency drawn from a reference corpus $D$ (Lu et al., 2 Apr 2026).
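
The geometric mean is easiest to compute in log space, since a direct product of many small frequencies underflows. The sketch below assumes token frequencies are given as a dictionary of relative frequencies; the function name `sfreq_score`, the `floor` value for unseen tokens, and the toy table are assumptions made for illustration.

```python
import math

def sfreq_score(tokens, wfreq, floor=1e-9):
    """Geometric mean of token frequencies, computed as exp(mean(log f))."""
    if not tokens:
        raise ValueError("sentence must contain at least one token")
    # Unseen or zero-frequency tokens are clamped to `floor` (an assumption)
    # so that a single missing token does not send the score to zero.
    log_sum = sum(math.log(max(wfreq.get(tok, 0.0), floor)) for tok in tokens)
    return math.exp(log_sum / len(tokens))

# Hypothetical relative frequencies from a reference corpus.
wfreq = {"a": 0.04, "b": 0.01}
print(sfreq_score(["a", "b"], wfreq))  # square root of 0.04 * 0.01
```

Unlike an arithmetic mean, the geometric mean is pulled down sharply by a single rare token, so one tail word is enough to mark a sentence as hard.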

Time-Varying Distribution (Combinatorial Curriculum):

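For a sequence of training steps $t = 0, \ldots, T-1$, the sampling distribution is a convex combination of two endpoint distributions:

$p_t = (1 - \lambda_t)\, p_{\mathrm{start}} + \lambda_t\, p_{\mathrm{end}}$

where the mixing weight $\lambda_t \in [0, 1]$ increases linearly with $t$, so that curriculum pacing evolves linearly over time (Sen et al., 2020).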
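
The linear mixing can be sketched as follows, assuming the weight runs from 0 at the first step to 1 at the last. The exact schedule, the function names, and the two toy endpoint distributions are assumptions for illustration, not details taken from Sen et al.

```python
import random

def mixture_at_step(p_start, p_end, t, total_steps):
    """Convex combination of two distributions; the weight grows linearly in t."""
    lam = t / (total_steps - 1) if total_steps > 1 else 1.0
    return [(1.0 - lam) * a + lam * b for a, b in zip(p_start, p_end)]

def sample_index(probs, rng):
    """Draw one item index according to the given probabilities."""
    return rng.choices(range(len(probs)), weights=probs, k=1)[0]

# Hypothetical endpoints over three items: start skewed to item 0, end uniform.
p_start = [0.8, 0.15, 0.05]
p_end = [1 / 3, 1 / 3, 1 / 3]
rng = random.Random(0)
for t in range(5):
    p_t = mixture_at_step(p_start, p_end, t, total_steps=5)
    print(t, [round(p, 3) for p in p_t], sample_index(p_t, rng))
```

Because each step's distribution is a convex combination of two valid distributions, it is itself a valid distribution and needs no renormalization.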

2. Algorithmic Construction and Pacing Strategies

CTFT is implemented by sorting the training set according to the computed frequency scores and imposing deterministic or probabilistic pacing on batch selection:

Preprocessing: Compute a scalar frequency score for each sample using the chosen metric (Zipf average for pretraining, geometric mean for sentence-level fine-tuning).
Sorting: Arrange all samples from easiest (highest frequency) to hardest (lowest frequency).
Pacing Policy:
- Linear exposure: At each normalized training step, allow sampling only from the prefix of the ordered list up to a quantile that grows with training progress.
- Full sorting: For fine-tuning, iterate over the entire dataset in frequency-sorted order each epoch (Lu et al., 2 Apr 2026).
Batch Selection: Draw batches uniformly from the currently allowed prefix, or follow the fully sorted order in a single pass.
Hyperparameters: LLM fine-tuning typically uses AdamW with cosine learning-rate decay and 10% warmup for 10 epochs; the specific learning rate, batch size, and teaching-task sequence lengths are given in the cited papers (Sen et al., 2020, Lu et al., 2 Apr 2026).
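
The sorting and linear-exposure steps above can be sketched as follows. This is an illustration under stated assumptions rather than the procedure from any of the cited papers: the starting fraction `start_frac`, the strictly linear growth of the prefix, and sampling with replacement are all choices made here for brevity.

```python
import random

def sort_easy_to_hard(samples, scores):
    """Order samples from highest to lowest frequency score."""
    order = sorted(range(len(samples)), key=lambda i: scores[i], reverse=True)
    return [samples[i] for i in order]

def allowed_prefix(n_samples, step, total_steps, start_frac=0.1):
    """Linear exposure: the prefix grows from start_frac of the data to all of it."""
    progress = step / max(total_steps - 1, 1)
    frac = start_frac + (1.0 - start_frac) * progress
    return max(1, min(n_samples, round(frac * n_samples)))

def curriculum_batches(samples, scores, total_steps, batch_size, seed=0):
    """Yield one batch per step, drawn uniformly from the allowed prefix."""
    rng = random.Random(seed)
    ordered = sort_easy_to_hard(samples, scores)
    for step in range(total_steps):
        prefix = ordered[:allowed_prefix(len(ordered), step, total_steps)]
        # Sampling with replacement keeps the sketch short.
        yield [rng.choice(prefix) for _ in range(batch_size)]

# Toy data: sample i has score -i, so sample 0 is the easiest.
samples = list(range(100))
scores = [-float(i) for i in samples]
for step, batch in enumerate(curriculum_batches(samples, scores, 5, 4)):
    print(step, batch)
```

The full-sorting variant needs no pacing function: iterating over `sort_easy_to_hard(samples, scores)` in fixed-size chunks gives the one-pass sorted sequence.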

3. Comparison with Other Curricula

CTFT is situated among several curriculum learning approaches:

Curriculum	Ordering Signal	Reference Papers
Random/Uniform	None	(Sen et al., 2020, Elgaar et al., 29 Jan 2026, Lu et al., 2 Apr 2026)
Frequency-based	Static freq. (unigram/Zipf)	(Sen et al., 2020, Lu et al., 2 Apr 2026, Elgaar et al., 29 Jan 2026)
Age-of-Acquisition	Avg. human AoA per word	(Sen et al., 2020, Elgaar et al., 29 Jan 2026)
Verb Variation	Verb class/type count per sample	(Elgaar et al., 29 Jan 2026)
Dependency-Tree Depth	Parse depth as example difficulty	(Lu et al., 2 Apr 2026)
Reverse CTFT	Hard-to-easy (low-to-high freq.)	(Lu et al., 2 Apr 2026)

In the cited experiments, CTFT outperforms static reweighting and other easy-to-hard baselines, while reverse (hard-to-easy) orderings consistently underperform.

4. Theoretical Foundations and Optimization Dynamics

CTFT's effectiveness is theoretically supported by gradient-variance control in SGD optimization for overparameterized neural models:

Gradient Noise Reduction: Initial training on high-frequency (easy) data leads to a lower gradient noise scale, as rare words produce more stochastic gradients (Elgaar et al., 29 Jan 2026).
Spectral Stability: CTFT delays the collapse of singular entropy in model output heads, mitigating softmax bottleneck saturation in late training.
Phase Exposure: CTFT alters exposure within underlying learning phases—extending critical phase access to easy data—without introducing new optimization phases (verified by joint HMM latent-phase analysis).
Variance Bounds: When the training population is split piecewise into easy and hard subsets, curriculum pacing controls the effective gradient variance of the resulting mixture, keeping it bounded so long as the proportion of "hard" samples is tightly managed.
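
As a hedged illustration of why the hard-sample proportion matters (a standard identity, not the bound from the cited work), suppose a fraction $\rho$ of sampled examples comes from the hard subset and the rest from the easy subset. Writing $\mu_e, \mu_h$ for the mean per-sample gradients of the two subsets and $\sigma_e^2, \sigma_h^2$ for their within-subset gradient variances (all of these symbols are assumptions introduced here), the law of total variance gives

$\sigma_{\mathrm{mix}}^2(\rho) = (1-\rho)\,\sigma_e^2 + \rho\,\sigma_h^2 + \rho(1-\rho)\,\lVert \mu_e - \mu_h \rVert^2$

At $\rho = 0$ this reduces to $\sigma_e^2$, and near $\rho = 0$ it grows at rate $\sigma_h^2 - \sigma_e^2 + \lVert \mu_e - \mu_h \rVert^2$. If hard samples have noisier gradients, keeping $\rho$ small early in training therefore keeps the mixture variance close to the easy-subset level.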

5. Empirical Results and Performance Impact

The benefits of CTFT in a range of domains are supported by controlled experiments:

LLM Fine-Tuning (Qwen2.5-7B-Instruct, TFPD Corpus): On machine translation, CTFT yields 20–30% higher BLEU and chrF scores than high-frequency selection without curriculum ordering (Lu et al., 2 Apr 2026). Math reasoning and commonsense tasks show accuracy gains of 5–8 points.
Pretraining (Pythia Models, 14M–410M): Averaged downstream accuracy improves under CTFT at the smaller model sizes, but gains diminish at scale; at 410M parameters, random ordering matches CTFT (Elgaar et al., 29 Jan 2026).
Supervised Machine Teaching Tasks (Monosyllabic Word Reading): Time-varying frequency curricula surpass static frequency, AoA, and random baselines. Reported held-out accuracies reach 96% for the adult-corpus training pool (Sen et al., 2020).
Ablations: Removing frequency-based curriculum or reversing its order consistently reduces downstream task accuracy, spectral stability, and efficiency (Lu et al., 2 Apr 2026, Elgaar et al., 29 Jan 2026).

6. Implementation Guidelines and Practical Insights

Robust implementation of CTFT relies on precise metric computation, sorted data streaming, and phase-aware pacing:

Metric Selection: Use up-to-date Zipf or wordfreq statistics for the corpus at hand; for sentence-level fine-tuning, the geometric mean is preferred.
Data Handling: For large-scale distributed training, group the sorted samples into a fixed number of quantile bins and sequence the bins deterministically (Elgaar et al., 29 Jan 2026).
Phase Adaptation: Optionally, monitoring singular entropy or gradient noise can trigger adaptive pacing.
Cost: Frequency computation overhead is negligible compared to training; the primary challenge is sorting/preprocessing.
Generalization: CTFT is complementary to syntactic-complexity-based or AoA curricula, as frequency correlates only weakly with those measures (the reported Pearson correlations with sample loss are weak) (Elgaar et al., 29 Jan 2026, Lu et al., 2 Apr 2026).
Model Scale Sensitivity: Gains are highest for memory- or compute-constrained models (sub-160M parameters); larger models exhibit diminishing returns as the softmax bottleneck eases (Elgaar et al., 29 Jan 2026).
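
The quantile-bin approach to data handling can be sketched as below. The bin count, the near-equal bin sizes, and the function names are assumptions for illustration; the cited work may bin and shard differently.

```python
def quantile_bins(scores, n_bins):
    """Split sample indices into n_bins near-equal bins, easiest bin first."""
    if n_bins < 1:
        raise ValueError("n_bins must be at least 1")
    order = sorted(range(len(scores)), key=lambda i: scores[i], reverse=True)
    base, extra = divmod(len(order), n_bins)
    bins, start = [], 0
    for b in range(n_bins):
        # The first `extra` bins take one additional sample each.
        size = base + (1 if b < extra else 0)
        bins.append(order[start:start + size])
        start += size
    return bins

def bin_sequence(bins):
    """Deterministic stream: exhaust each bin before moving to the next."""
    for bin_id, members in enumerate(bins):
        for idx in members:
            yield bin_id, idx

# Hypothetical frequency scores for five samples.
scores = [0.5, 0.9, 0.1, 0.7, 0.3]
bins = quantile_bins(scores, n_bins=2)
print(bins)
print(list(bin_sequence(bins)))
```

Binning means only the bin order has to be agreed across workers; samples inside a bin can be shuffled or sharded freely without disturbing the coarse easy-to-hard progression.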

7. Extensions and Limitations

CTFT is adaptable and extensible:

Curriculum Mixing: Time-varying mixing of multiple frequency endpoints, via linear schedules or schedules with additional breakpoints, generalizes the scheduling for complex domains (Sen et al., 2020).
Domain Adaptation: Augmenting the frequency signal with domain- or language-specific norms enables cross-lingual or multimodal curriculum learning (Elgaar et al., 29 Jan 2026).
Limitations: At the largest scale tested (410M parameters), CTFT's optimization advantages, namely gradient variance reduction and spectral stability, are marginal, and random ordering suffices (Elgaar et al., 29 Jan 2026). A plausible implication is that the benefits of curriculum-based optimization diminish as model capacity increases, due to intrinsic regularization as the softmax bottleneck eases.
Complementarity: Because textual frequency is only partially aligned with lexical or syntactic complexity, CTFT can be combined with curricula based on other difficulty signals to balance stability and diversity (Lu et al., 2 Apr 2026).

CTFT provides a principled framework for data pacing in both cognitive modeling and large-scale language learning, translating psycholinguistic insights into practical machine learning improvements. Its methodological simplicity—frequency computation, sorting, and sequential exposure—facilitates adoption and experimental analysis across modalities and scales.

Sen et al., 2020: https://arxiv.org/abs/2006.16470
Elgaar et al., 29 Jan 2026: https://arxiv.org/abs/2601.21698
Lu et al., 2 Apr 2026: https://arxiv.org/abs/2604.02176