Sample-efficient Language Model Pretraining
- Sample-efficient pretraining is defined by methods that optimize data selection and dynamic curricula to achieve competitive LLM performance with fewer tokens.
- Techniques such as SampleMix, perplexity-based sampling, and model-based filtering ensure high-quality input, yielding faster convergence and improved downstream accuracy.
- Innovative architectures and adaptive learning schedules, including loss-based reweighting and curriculum-guided layer scaling, enable cost-effective training even in low-resource settings.
Sample-efficient LLM Pretraining refers to methods and principles that maximize the linguistic capabilities, generalization, and factual recall of LLMs given a sharply restricted data or compute budget—often 1–2 orders of magnitude below standard web-scale pretraining. The field encompasses data selection, dynamic reweighting, curriculum design, auxiliary objectives, architectural modifications, and efficient fine-tuning approaches. Contemporary research demonstrates that with judiciously curated corpora, targeted sampling/selection, and optimized training schedules and inductive biases, LLMs can match or outperform much larger models on a wide spectrum of tasks while using only a fraction of tokens or FLOPs.
1. Principles and Motivation for Sample-efficient Pretraining
Sample efficiency is motivated by the disparity between human language acquisition (∼10⁷–10⁸ words) and mainstream LLM training (∼10¹¹–10¹² tokens) (Warstadt et al., 2023, Warstadt et al., 10 Apr 2025, Hu et al., 6 Dec 2024). Constraints on data, compute, and energy require strategies that extract maximal value from limited training exposures, whether for small-scale developmental modeling (e.g., BabyLM) or cost-effective deployment in low-resource domains.
Key principles:
- Data Quality and Diversity: High-quality and information-rich examples facilitate faster convergence and stronger generalization (Xi et al., 3 Mar 2025).
- Task-aligned Composition: The choice and complexity of training data should match both the intended use and the model's parameterization (Yam et al., 11 Nov 2024).
- Dynamic Focus: Adaptive weighting and revisiting of data based on evolving model performance optimizes learning (Prakriya et al., 10 Sep 2024, Sow et al., 10 Feb 2025).
- Inductive Biases: Architectural choices (e.g., layer aggregation, gating, residual weighting) and auxiliary objectives can drive greater sample efficiency (Warstadt et al., 10 Apr 2025, Hu et al., 6 Dec 2024).
2. Data Selection, Mixing, and Filtering Strategies
Efficient data utilization is central to sample-efficient pretraining. Several paradigms have emerged:
- SampleMix (Quality and Diversity-based Global Sampling): Each document receives a scalar quality score integrating clarity, coherence, style, credibility, significance, richness, and analytical depth (via GPT-4–labelled ordinal regression), plus an embedding-derived diversity score. Sampling weights combine the two scores, e.g., as a convex combination w(x) = α·q(x) + (1−α)·d(x), with the trade-off coefficient α chosen empirically (Xi et al., 3 Mar 2025). This bottom-up approach outperforms standard domain-wise mixing, yielding up to 2× faster convergence and improved perplexity and downstream accuracy; a minimal sketch of the weighting appears after this list.
- Perplexity-based Sampling: Using lightweight n-gram models (KenLM), each document's perplexity is estimated and corpora are re-weighted to preferentially select mid-range (i.e., neither trivial nor noisy) samples via stepwise or Gaussian weighting. Pretraining on Spanish mC4 with only one-fifth of the data and half the steps matches or exceeds full-data RoBERTa baselines (Rosa et al., 2022).
- Model-based Filtering (Multilingual): Binary classifiers (FastText or Transformer+MLP) trained on high-quality exemplars label crawl-scale documents, retaining only the top 10–20%. In 20 languages, models show no loss—and often improvement—in accuracy when trained on 15% of tokens (Messmer et al., 14 Feb 2025).
- kNN Retrieval-augmented Expansion: Seed corpora are expanded by finding semantically similar documents in domain-related or in-domain pools via embedding-based kNN; "in-context" augmentation with retrieved text boosts domain adaptation efficiency (up to 85× corpus reduction, 4× GPU savings over standard DAPT) (Zhukova et al., 28 Apr 2025).
- Granular n-gram Importance Sampling: Multi-granular features (subword, word, n-gram up to length 3–4) are extracted from both target and raw data distributions; importance weights select documents aligning to target features. Models pretrained on ∼1% of 70B-token RefinedWeb match or exceed full-data performance (Chang et al., 23 Sep 2024).
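A minimal sketch of the quality–diversity weighting behind SampleMix-style global sampling, assuming per-document quality and diversity scores in [0, 1] are already available; the convex-combination form, the α value, and the budget-driven multinomial draw are illustrative assumptions rather than the exact published pipeline:

```python
import numpy as np

def mix_sampling_weights(quality: np.ndarray, diversity: np.ndarray, alpha: float = 0.5) -> np.ndarray:
    """Combine per-document quality and diversity scores into sampling probabilities.

    Both score arrays are assumed to lie in [0, 1]; alpha trades quality off
    against diversity via a convex combination, as sketched in the text.
    """
    scores = alpha * quality + (1.0 - alpha) * diversity
    return scores / scores.sum()  # normalize to a probability distribution

def sample_corpus(doc_ids, weights, token_budget: int, avg_doc_len: int, seed: int = 0):
    """Draw documents (with replacement) until the token budget is roughly met."""
    rng = np.random.default_rng(seed)
    n_draws = max(1, token_budget // avg_doc_len)
    return rng.choice(doc_ids, size=n_draws, replace=True, p=weights).tolist()

# Toy usage: four documents, favoring high-quality but non-redundant ones.
quality = np.array([0.9, 0.2, 0.7, 0.5])
diversity = np.array([0.3, 0.8, 0.6, 0.9])
weights = mix_sampling_weights(quality, diversity, alpha=0.6)
print(sample_corpus(["d0", "d1", "d2", "d3"], weights, token_budget=4000, avg_doc_len=1000))
```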
3. Dynamic Curriculum, Layer Scaling, and Reweighting Schedules
Adaptive pretraining regimes further optimize sample usage:
- Learn–Focus–Review (LFR): Blocks of data are dynamically reprioritized based on block-wise perplexity; "Focus" epochs concentrate on high-perplexity blocks, regularly reintroducing all data ("Review") to avoid forgetting (Prakriya et al., 10 Sep 2024). LFR yields up to 20× fewer training iterations to reach baseline accuracy.
- Instance-level Loss-based Reweighting: Within each minibatch, samples receive weights that are normalized functions of their per-sample loss, e.g., w_i ∝ s(ℓ_i), where the scoring function s(·) follows a LinUpper, Quadratic, or Extremes strategy. Downweighting low-loss "easy" items accelerates convergence and improves downstream accuracy for 7B-parameter Llama and GPT-2 models while adding only marginal per-batch overhead (Sow et al., 10 Feb 2025); see the sketch after this list.
- Curriculum Learning and Layer Stacking: Corpora are ordered (or mixed) by difficulty signals (compression ratio, lexical diversity, readability, etc.), with pacing functions (linear, quadratic, interleaved) deciding the exposure schedule (Zhang et al., 12 Jun 2025). Curriculum-Guided Layer Scaling (CGLS) progressively increases model depth as data difficulty increases; each stage couples depth expansion (stacking additional layers onto the current model) with bin-wise sampling, offering consistent 2–5% gains on PIQA, ARC, and MMLU-STEM (Singh et al., 13 Jun 2025).
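A hedged sketch of instance-level loss-based reweighting inside a single training step, assuming per-sample losses are available from a forward pass with no reduction; `linupper_scores` and its clipping threshold are illustrative stand-ins for the LinUpper/Quadratic/Extremes strategies mentioned above, not the paper's exact formulation:

```python
import torch

def linupper_scores(losses: torch.Tensor, clip: float = 2.0) -> torch.Tensor:
    """Illustrative 'LinUpper'-style score: grow linearly with normalized loss, then clip.

    Losses are normalized within the minibatch so scores are scale-free;
    low-loss ("easy") samples receive a score of zero.
    """
    normed = (losses - losses.mean()) / (losses.std() + 1e-8)
    return torch.clamp(normed, min=0.0, max=clip)

def reweighted_batch_loss(per_sample_losses: torch.Tensor) -> torch.Tensor:
    """Turn scores into weights that sum to one and return the reweighted batch loss."""
    with torch.no_grad():  # weights are treated as constants w.r.t. the gradient
        weights = torch.softmax(linupper_scores(per_sample_losses), dim=0)
    return (weights * per_sample_losses).sum()

# Usage: per-sequence losses from a forward pass with reduction="none".
per_sample_losses = torch.tensor([0.8, 2.3, 1.1, 3.0], requires_grad=True)
loss = reweighted_batch_loss(per_sample_losses)
loss.backward()
```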
4. Architectural and Objective Innovations
Inductive architectural modifications substantially improve sample efficiency:
- Layer Aggregation and Gated Attention: The BabyLM-winning architectures (LTG-BERT, ELC-BERT) incorporate disentangled attention and per-layer weighted aggregation, enabling stronger generalization under 100M-word budgets (Warstadt et al., 10 Apr 2025, Hu et al., 6 Dec 2024).
- Hybrid Objectives (CLM/MLM): Simultaneous training on causal LM and masked LM objectives (e.g., GPT-BERT, which mixes the two losses with a tunable ratio) improves performance across grammatical and pragmatic tasks (Hu et al., 6 Dec 2024); a hedged sketch of such a mixed objective follows this list.
- Subnetwork Selection and Distillation: Evolutionary search identifies structurally sparse subnetwork initializations from large LLM weights, and knowledge distillation from teacher models accelerates convergence and generalization. Best candidate SLMs match Pythia validation perplexity using 9.2× fewer pretraining tokens (Krishnakumar et al., 8 Oct 2025).
- Efficient Autoencoding Denoising: METRO incorporates an auxiliary generator, a main model with post-LayerNorm, and joint objectives of replaced token detection (RTD) plus simplified corrective language modeling. Model-generated corruption acts as a self-curriculum, leading to SOTA on GLUE/SuperGLUE benchmarks with ≤50% of the compute (Bajaj et al., 2022).
- Informativeness-aware Masking (Self-Evolution): Masked-LM pretraining prioritizes tokens with high prediction error (informative/neglected), and smooth labels interpolate between one-hot and model-predicted distributions, doubling per-token efficiency over vanilla masking (Zhong et al., 2022).
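To make the hybrid-objective idea concrete, here is a minimal sketch of mixing a causal-LM and a masked-LM loss on one shared backbone, in the spirit of GPT-BERT; the `backbone(..., causal=...)` interface, the two separate heads, and the fixed mixing weight `lam` are simplifying assumptions rather than the published architecture:

```python
import torch
import torch.nn.functional as F

def hybrid_lm_loss(backbone, clm_head, mlm_head, causal_batch, masked_batch, lam: float = 0.5):
    """Mixed objective: lam * causal-LM loss + (1 - lam) * masked-LM loss.

    causal_batch: dict with 'input_ids' (B, T); targets are the inputs shifted by one.
    masked_batch: dict with 'input_ids' (B, T) containing [MASK] tokens and
                  'labels' (B, T) set to -100 at unmasked positions.
    """
    # Causal branch: predict token t+1 from tokens <= t.
    h_causal = backbone(causal_batch["input_ids"], causal=True)
    clm_logits = clm_head(h_causal[:, :-1, :])
    clm_loss = F.cross_entropy(
        clm_logits.reshape(-1, clm_logits.size(-1)),
        causal_batch["input_ids"][:, 1:].reshape(-1),
    )

    # Masked branch: predict original tokens only at masked positions.
    h_masked = backbone(masked_batch["input_ids"], causal=False)
    mlm_logits = mlm_head(h_masked)
    mlm_loss = F.cross_entropy(
        mlm_logits.reshape(-1, mlm_logits.size(-1)),
        masked_batch["labels"].reshape(-1),
        ignore_index=-100,  # skip unmasked positions
    )

    return lam * clm_loss + (1.0 - lam) * mlm_loss
```

In practice the mixing weight (or the fraction of each batch routed to either formulation) would itself be tuned to the data budget.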
5. Quantitative Results, Empirical Findings, and Practical Guidelines
Empirical studies across tracks and scales consistently show substantial efficiency gains:
- SampleMix: Achieves 47.77% downstream accuracy vs 46.4% for DoReMi, with nearly 2× faster convergence (Xi et al., 3 Mar 2025).
- BabyLM Outcomes: With 100M words, ELC-BERT reaches aggregate scores (Agg ≈ 0.74) exceeding RoBERTa trained on original full-size corpora, approaching human-level generalization (Warstadt et al., 10 Apr 2025). Sentence-level units and shorter sequence lengths (32–64 tokens) further boost efficiency.
- Curriculum Learning: Warmup strategies with CL yield up to +3.5% improvement and allow models to reach baseline peaks using 20–40% fewer tokens (Zhang et al., 12 Jun 2025).
- Model-based Filtering: Multilingual LLMs match MMLU baselines with only 15% of tokens, generalizable across 20 languages (Messmer et al., 14 Feb 2025).
- CGLS: Layer/depth scaling matched to curriculum stages leads to 2–5% higher zero-shot accuracy on QA and reasoning tasks at both 100M and 1B parameter scales (Singh et al., 13 Jun 2025).
- Target-aware Sampling: Multi-granular n-gram importance sampling preserves generality and task performance at ~1% of full-corpus scale (Chang et al., 23 Sep 2024).
Best practices:
- Align dataset complexity and diversity with model size and capacity (Yam et al., 11 Nov 2024).
- Periodically reassess sampling weights when data pools change or deduplication occurs (Xi et al., 3 Mar 2025).
- For stringent budgets, favor cognitively inspired, mixed-source corpora over brute scaling (Hu et al., 6 Dec 2024).
- Implement layered curriculum schedules and monitor intermediate rare-fact learning metrics (e.g., WASB) for early model selection (Christoph et al., 20 Jun 2025).
- Tune architectural and masking hyperparameters specifically for small data regimes (Warstadt et al., 10 Apr 2025).
- For efficient domain adaptation, bootstrap small seeds via kNN retrieval and in-context expansion (Zhukova et al., 28 Apr 2025).
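As an illustration of the kNN-based seed expansion recommended above, the following sketch retrieves nearest neighbours of a small in-domain seed set from a large generic pool using cosine similarity over precomputed document embeddings; the embedding source and the union-based expansion rule are assumptions for illustration, not the cited pipeline:

```python
import numpy as np

def expand_seed_corpus(seed_emb: np.ndarray, pool_emb: np.ndarray, k: int = 5) -> np.ndarray:
    """Return indices of pool documents that are k-nearest neighbours of any seed document.

    seed_emb: (S, D) L2-normalized embeddings of the seed corpus.
    pool_emb: (P, D) L2-normalized embeddings of the large generic pool.
    """
    sims = seed_emb @ pool_emb.T             # (S, P) cosine similarities
    topk = np.argsort(-sims, axis=1)[:, :k]  # k best pool documents per seed document
    return np.unique(topk)                   # union over seeds, deduplicated

# Toy usage (in practice the embeddings would come from a sentence encoder):
rng = np.random.default_rng(0)
seed_emb = rng.normal(size=(3, 16)); seed_emb /= np.linalg.norm(seed_emb, axis=1, keepdims=True)
pool_emb = rng.normal(size=(100, 16)); pool_emb /= np.linalg.norm(pool_emb, axis=1, keepdims=True)
selected = expand_seed_corpus(seed_emb, pool_emb, k=5)
print(f"Expanded corpus with {selected.size} retrieved documents")
```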
6. Limitations, Controversies, and Open Challenges
- Curriculum learning: Despite theoretical appeal, curriculum schedules based solely on data ordering/difficulty have shown only modest or inconsistent improvements except in conjunction with augmentation (Warstadt et al., 10 Apr 2025).
- Domain and factual recall: Sample-efficient models are robust for high-frequency facts but show marked architectural and scale-related differences in rare-fact acquisition; maximizing effective exposure to rare facts is crucial (Christoph et al., 20 Jun 2025).
- Multimodal integration: Image–text modeling remains a major open challenge, with no current sample-efficient methods outperforming standard baselines on vision–language tasks (Hu et al., 6 Dec 2024).
- Compute/budget trade-offs: Cognitively implausible epoch counts (hundreds–thousands) are sometimes required to approach human generalization; future challenges should target realistic compute schedules (Warstadt et al., 10 Apr 2025, Hu et al., 6 Dec 2024).
- Bias and representation: Current filtering, sampling, and augmentation pipelines do not explicitly address fairness, toxicity, or demographic parity (Chang et al., 23 Sep 2024).
7. Future Directions and Research Opportunities
Prioritized areas for future advancement include:
- Adaptive and instance-aware curricula: Integrate example-level loss, importance, and factual rarity for dynamic pacing (Sow et al., 10 Feb 2025, Singh et al., 13 Jun 2025).
- Architectural innovation: Advance subnetwork extraction, aggregation, gating, and non-transformer backbone designs for optimal per-token value (Krishnakumar et al., 8 Oct 2025, Warstadt et al., 10 Apr 2025).
- Efficient multimodal objectives: Develop sample-efficient recipes specifically for vision–language settings, targeting semantic/pragmatic transfer (Hu et al., 6 Dec 2024).
- Low-resource adaptation: Expand retrieval and filtering paradigms to more languages/domains, possibly via cross-lingual embedding transfer (Messmer et al., 14 Feb 2025, Zhukova et al., 28 Apr 2025).
- Data augmentation and synthetic expansion: Refine splicing, context mixing, and proxy-based expansion methods for sustainable diversity (Warstadt et al., 10 Apr 2025).
- Factual knowledge-centric pretraining: Directly increase rare-fact recall using exposure-boosting or knowledge-targeted sampling (Christoph et al., 20 Jun 2025).
Sample-efficient LLM pretraining remains a fast-evolving field, with substantial scope for theoretical, empirical, and practical innovation. The core trajectory points toward architectures, objectives, and schedules that enable LLMs to achieve human-level linguistic competence with training budgets orders of magnitude below contemporary standards.