Sample-Efficient Pretraining Techniques

Updated 19 November 2025
  • Sample-efficient pretraining is a set of methods that reduce data and compute requirements while maintaining competitive performance through innovative sampling, curriculum design, and architectural optimizations.
  • Data-centric approaches like perplexity sampling and domain-relevance selection dramatically cut down pretraining tokens with minimal performance loss on complex tasks.
  • Incorporating curriculum learning, hybrid objectives, and subnetwork distillation accelerates convergence and enhances generalization across diverse language and vision models.

Sample-efficient pretraining refers to algorithmic and architectural techniques that enable neural models—most notably LLMs and vision networks—to achieve competitive performance while leveraging substantially less pretraining data, compute, or both, compared to conventional large-scale pretraining regimes. The field emerged in response to the observation that contemporary LLMs and vision models consume multiple orders of magnitude more data than human learners, yet their sample efficiency (the ratio of generalization performance to data consumed) remains suboptimal. Recent advances center on data filtering, instance-level and curriculum-based sampling, specialized architectures, hybrid objectives, subnetwork initialization, distillation, and efficient self-supervision.

1. Data-centric Techniques for Sample-efficient Pretraining

The optimization of the pretraining corpus—via quality-filtering, adaptive sampling, or relevance scoring—has yielded significant improvements in sample efficiency.

Perplexity-Based Sampling: BERTIN's perplexity sampling method scores each document $W$ in a web crawl using an external language model (e.g., a KenLM 5-gram model trained on Wikipedia), computing the document perplexity as

pp(W) = 10^{-\frac{1}{L}\sum_{i=1}^L \mathrm{KenLM}(W_i)}

Documents with very high or low perplexity are downsampled, and intermediate-perplexity documents are upsampled via either piecewise or Gaussian-shaped probability distributions. This recovers or exceeds baseline performance using only one fifth of the data and half the training steps, as shown for a RoBERTa-Base Spanish model on NER and XNLI (Rosa et al., 2022).
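
A minimal sketch of the sampling step, assuming per-document perplexities have already been computed with an external n-gram model; the Gaussian center/width and the toy corpus are illustrative choices, not BERTIN's exact settings:

```python
import numpy as np

def gaussian_keep_probs(perplexities, width_factor=2.0):
    """Assign each document a sampling weight that peaks at the median
    perplexity and decays for very low / very high values (a Gaussian-shaped
    weighting in the spirit of BERTIN's 'Gaussian' variant).
    `width_factor` is an illustrative knob, not the paper's exact setting."""
    pp = np.asarray(perplexities, dtype=float)
    center = np.median(pp)
    scale = width_factor * pp.std()
    return np.exp(-0.5 * ((pp - center) / scale) ** 2)

def subsample_corpus(doc_ids, perplexities, budget, seed=0):
    """Draw `budget` documents without replacement, biased toward
    intermediate-perplexity text."""
    rng = np.random.default_rng(seed)
    probs = gaussian_keep_probs(perplexities)
    probs = probs / probs.sum()
    return rng.choice(doc_ids, size=budget, replace=False, p=probs)

if __name__ == "__main__":
    rng = np.random.default_rng(42)
    doc_ids = np.arange(10_000)
    # Log-normal perplexities mimic a web crawl's heavy right tail.
    perplexities = rng.lognormal(mean=5.0, sigma=1.0, size=10_000)
    kept = subsample_corpus(doc_ids, perplexities, budget=2_000)
    print(f"kept {len(kept)} of {len(doc_ids)} documents")
```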

Domain-relevance Sampling: Conditional pretraining via clustering or domain-classifier filtering selects a subset $D'_s \subset D_s$ of the source corpus that minimizes the representational or distributional distance to the target dataset $D_t$. Clustering feature vectors of the target data and selecting the source images closest to cluster centers, or leveraging a binary discriminator between the target and the source pool, enables ImageNet-level transfer at 6–12% of the original ImageNet scale, with only marginal performance loss for classification, segmentation, and detection (Chakraborty et al., 2020).
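
A hedged sketch of the clustering variant, assuming image features have already been extracted by a pretrained backbone; the cluster count, budget, and synthetic features below are placeholders:

```python
import numpy as np
from sklearn.cluster import KMeans

def select_relevant_subset(source_feats, target_feats, budget, n_clusters=8, seed=0):
    """Cluster target-domain feature vectors and keep the source examples
    closest to the cluster centers (a sketch of domain-relevance sampling;
    feature extraction is assumed to have happened upstream)."""
    km = KMeans(n_clusters=n_clusters, random_state=seed, n_init=10)
    km.fit(target_feats)
    # Distance of every source example to its nearest target cluster center.
    dists = np.min(
        np.linalg.norm(
            source_feats[:, None, :] - km.cluster_centers_[None, :, :], axis=-1
        ),
        axis=1,
    )
    # Keep the `budget` source examples with the smallest distances.
    return np.argsort(dists)[:budget]

if __name__ == "__main__":
    rng = np.random.default_rng(0)
    source = rng.normal(size=(5_000, 128))          # stand-in for source features
    target = rng.normal(loc=0.5, size=(500, 128))   # stand-in for target features
    idx = select_relevant_subset(source, target, budget=300)
    print("selected", len(idx), "source examples")
```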

Importance Sampling and Reweighting: Dynamic instance-level loss-based reweighting (Sow et al., 10 Feb 2025) replaces uniform minibatch updates with weighted gradients per batch:

\theta^{t+1} = \theta^t - \eta_t \sum_{i\in B} w_i(t)\, \nabla f(x_i;\theta^t),\quad \sum_{i\in B} w_i(t) = 1

The weights $w_i(t) \propto \exp(s_i/r_t)$ are computed from the normalized per-sample loss with temperature-scheduled shaping functions. Empirically, down-weighting low-loss (easy/redundant) samples accelerates convergence and improves downstream zero- and few-shot performance.
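
A minimal PyTorch sketch of one reweighted minibatch update; the linear toy model and the increasing temperature schedule are illustrative, not the paper's exact recipe:

```python
import torch
import torch.nn.functional as F

def reweighted_step(model, optimizer, x, y, temperature):
    """One minibatch update with loss-based instance reweighting:
    per-sample losses are turned into temperature-scaled softmax weights,
    so low-loss (easy/redundant) samples contribute less to the gradient."""
    logits = model(x)
    per_sample_loss = F.cross_entropy(logits, y, reduction="none")
    # Weights w_i ∝ exp(s_i / r_t), normalized within the batch; detached so
    # the weighting itself is not differentiated through.
    weights = torch.softmax(per_sample_loss.detach() / temperature, dim=0)
    loss = (weights * per_sample_loss).sum()
    optimizer.zero_grad()
    loss.backward()
    optimizer.step()
    return loss.item()

if __name__ == "__main__":
    torch.manual_seed(0)
    model = torch.nn.Linear(16, 4)                 # toy stand-in model
    opt = torch.optim.SGD(model.parameters(), lr=0.1)
    x, y = torch.randn(32, 16), torch.randint(0, 4, (32,))
    for step in range(5):
        temperature = 1.0 + 0.1 * step             # toy increasing schedule
        print(reweighted_step(model, opt, x, y, temperature))
```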

SwiftLearn Framework: This method prunes the training set to the most informative samples by measuring the per-sample logit change during the initial epochs and retaining only the top $rN$ examples with the highest mean-squared error across epochs. The active set is periodically refreshed to allow re-entry for samples whose importance grows. For BERT finetuning on GLUE, up to 90% of the data can be dropped with negligible accuracy loss and a 3–4× speedup (Hajimolahoseini et al., 2023).
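
A sketch of the scoring and selection step, assuming per-sample logits from a few early epochs have been recorded; the array shapes and keep fraction are illustrative:

```python
import numpy as np

def importance_scores(logit_history):
    """Score each sample by the mean-squared change of its logits across
    consecutive early epochs (a SwiftLearn-style importance proxy).

    logit_history: array of shape (n_epochs, n_samples, n_classes)
    returns: array of shape (n_samples,)
    """
    diffs = np.diff(logit_history, axis=0)      # epoch-to-epoch logit change
    return (diffs ** 2).mean(axis=(0, 2))       # mean-squared change per sample

def select_active_set(logit_history, keep_fraction=0.1):
    """Keep the top `keep_fraction` most informative samples; in practice the
    active set would be refreshed periodically so pruned samples can re-enter."""
    scores = importance_scores(logit_history)
    k = max(1, int(keep_fraction * scores.shape[0]))
    return np.argsort(scores)[::-1][:k]

if __name__ == "__main__":
    rng = np.random.default_rng(0)
    history = rng.normal(size=(3, 1_000, 10))   # 3 epochs, 1000 samples, 10 classes
    active = select_active_set(history, keep_fraction=0.1)
    print("active set size:", active.size)
```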

2. Curriculum Learning and Adaptive Data Pacing

Curriculum learning mechanisms reorder or resample the training data such that the model initially sees easier (e.g., lower information density, higher readability) examples and later transitions to more difficult ones. Systematic evaluation in large-scale language modeling reveals several robust practical schemes (Zhang et al., 12 Jun 2025):

  • Difficulty Metrics: Compression Ratio, Lexical Diversity (MTLD), Readability (Flesch Reading Ease), sequence length, fertility, and model-predicted perplexity.
  • Vanilla CL: Train strictly from lowest to highest difficulty, delivering early and mid-training speed-ups and an 18–28% reduction in the data required to reach baseline accuracy on MTLD and length.
  • Pacing-based Sampling: Approximate curricula via linear, quadratic, or inverse-quadratic allocation of data to difficulty bins, enabling flexible exposure balancing; linear pacing is effective for redundancy/density, quadratic for length/readability (see the pacing sketch after this list).
  • Interleaved CL: Repeats exposure to all difficulty bins in uniform or shuffled order across training “interleaves,” maximizing continual learning benefits.
  • Practical Impact: CL as a warmup (20–30% of the budget) yields persistent accuracy gains of up to 3.5% over random sampling that baseline training does not recover, even with double the tokens.
  • Purely static “easy-to-hard” curricula often fail to yield aggregate improvements unless paired with dynamic pacing, group sampling, or as a warmup phase.
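
A small sketch of pacing-based allocation over difficulty bins; the frontier-growth function (power=1.0 linear, power=2.0 quadratic) is an illustrative stand-in for the paper's pacing schedules:

```python
import numpy as np

def pacing_allocation(n_bins, n_steps, power=1.0):
    """Fraction of each training step's batch drawn from each difficulty bin.

    At step t, bins up to a 'frontier' are eligible; the frontier grows as
    ((t + 1) / n_steps) ** power. power=1.0 gives linear pacing, power=2.0
    quadratic (slower early growth). Returns an array of shape
    (n_steps, n_bins) whose rows sum to 1.
    """
    alloc = np.zeros((n_steps, n_bins))
    for t in range(n_steps):
        frontier = max(1, int(np.ceil(n_bins * ((t + 1) / n_steps) ** power)))
        alloc[t, :frontier] = 1.0 / frontier    # uniform over eligible bins
    return alloc

if __name__ == "__main__":
    linear = pacing_allocation(n_bins=5, n_steps=10, power=1.0)
    quadratic = pacing_allocation(n_bins=5, n_steps=10, power=2.0)
    print("linear pacing, step 3:   ", linear[3])
    print("quadratic pacing, step 3:", quadratic[3])
```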

3. Architectural and Objective Innovations

Sample efficiency is strongly impacted by both model structure and the training objective.

Hybrid MLM+CLM Objectives: Experiments from BabyLM (GPT-BERT) (Hu et al., 6 Dec 2024) and the LTG-BERT/ELC-BERT lineage (Warstadt et al., 10 Apr 2025) demonstrate that blending masked (bidirectional) and causal (autoregressive) language modeling with a mixing ratio ($p \approx 0.125$ for CLM, $1-p$ for MLM) outperforms pure objectives, particularly in small-corpus settings (<100M words). This hybrid loss is

\mathcal{L}(\theta) = p\,\mathcal{L}_{\text{CLM}}(\theta) + (1-p)\,\mathcal{L}_{\text{MLM}}(\theta)

with precise loss and data-mixing recipes. Gated attention and residual reweighting further sharpen signal propagation.
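
A toy PyTorch sketch of the hybrid objective on a shared transformer. GPT-BERT mixes the two objectives at the data/batch level; this simplified version interpolates both losses on the same batch, and the tiny model, vocabulary, and masking rate are placeholders:

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class TinySharedLM(nn.Module):
    """A toy shared transformer used for both objectives (illustrative only)."""
    def __init__(self, vocab_size=1000, d_model=64, mask_token_id=0):
        super().__init__()
        self.mask_token_id = mask_token_id
        self.embed = nn.Embedding(vocab_size, d_model)
        layer = nn.TransformerEncoderLayer(d_model, nhead=4, batch_first=True)
        self.encoder = nn.TransformerEncoder(layer, num_layers=2)
        self.lm_head = nn.Linear(d_model, vocab_size)

    def forward(self, tokens, causal=False):
        h = self.embed(tokens)
        attn_mask = None
        if causal:  # causal attention for the CLM pass, full attention for MLM
            attn_mask = nn.Transformer.generate_square_subsequent_mask(tokens.size(1))
        return self.lm_head(self.encoder(h, mask=attn_mask))

def hybrid_loss(model, tokens, p=0.125, mask_prob=0.15):
    """L = p * L_CLM + (1 - p) * L_MLM on the same shared weights."""
    # Causal LM: predict token t+1 from tokens <= t.
    clm_logits = model(tokens[:, :-1], causal=True)
    clm_loss = F.cross_entropy(
        clm_logits.reshape(-1, clm_logits.size(-1)), tokens[:, 1:].reshape(-1)
    )
    # Masked LM: corrupt a random subset of positions and predict them.
    masked = tokens.clone()
    is_masked = torch.rand_like(tokens, dtype=torch.float) < mask_prob
    masked[is_masked] = model.mask_token_id
    mlm_logits = model(masked, causal=False)
    labels = tokens.masked_fill(~is_masked, -100)   # -100 ignored by cross_entropy
    mlm_loss = F.cross_entropy(
        mlm_logits.reshape(-1, mlm_logits.size(-1)), labels.reshape(-1),
        ignore_index=-100,
    )
    return p * clm_loss + (1 - p) * mlm_loss

if __name__ == "__main__":
    torch.manual_seed(0)
    model = TinySharedLM()
    tokens = torch.randint(1, 1000, (4, 32))        # id 0 reserved as [MASK]
    print(hybrid_loss(model, tokens).item())
```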

Denoising with Model-generated Signals (METRO): The METRO recipe combines an auxiliary masked LM, a main discriminator, and a corrective LM loss on masked positions. The core replaced-token detection (RTD) task enables models like METRO-LM to reach benchmark performance with 2–3× fewer tokens than standard MLM, scaling up to 5.4B parameters (Bajaj et al., 2022). The total loss is

\mathcal{L}_{\text{total}} = \mathcal{L}_{\text{aux}} + \lambda\, \mathcal{L}_{\text{RTD}} + \mathcal{L}_{\text{s-clm}}

with an optimal $\lambda = 50$.
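
A schematic PyTorch sketch of how the three loss terms combine; the generator sampling, the stand-in main-model callable, and the toy shapes are assumptions for illustration, not METRO's exact architecture:

```python
import torch
import torch.nn.functional as F

def metro_style_losses(gen_logits, main_model, tokens, masked_positions, lam=50.0):
    """Schematic combination of the three METRO-style loss terms:
        L_total = L_aux (generator MLM) + lam * L_RTD + L_corrective
    `main_model` is any callable mapping corrupted tokens (B, T) to a pair
    (rtd_logits (B, T), corrective_logits (B, T, V))."""
    # Auxiliary MLM loss for the generator, only at masked positions.
    aux_labels = tokens.masked_fill(~masked_positions, -100)
    l_aux = F.cross_entropy(gen_logits.transpose(1, 2), aux_labels, ignore_index=-100)

    # Build the corrupted sequence: sample generator tokens at masked positions.
    with torch.no_grad():
        sampled = torch.distributions.Categorical(logits=gen_logits).sample()
    corrupted = torch.where(masked_positions, sampled, tokens)
    replaced = corrupted.ne(tokens)  # replaced-token-detection targets

    rtd_logits, corr_logits = main_model(corrupted)
    l_rtd = F.binary_cross_entropy_with_logits(rtd_logits, replaced.float())

    # Corrective LM loss: recover the original token at masked positions.
    l_corr = F.cross_entropy(corr_logits.transpose(1, 2), aux_labels, ignore_index=-100)

    return l_aux + lam * l_rtd + l_corr

if __name__ == "__main__":
    torch.manual_seed(0)
    B, T, V = 2, 16, 100
    tokens = torch.randint(0, V, (B, T))
    masked = torch.rand(B, T) < 0.15
    masked[0, 0] = True                          # ensure at least one masked position
    gen_logits = torch.randn(B, T, V)            # stand-in generator output
    main_model = lambda x: (torch.randn(B, T), torch.randn(B, T, V))  # stand-in
    print(metro_style_losses(gen_logits, main_model, tokens, masked).item())
```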

Contrastive and Self-supervised Pretraining on Task-internal Data: For tasks with severe resource constraints, self-supervision via contrastive objectives on “task-internal” data (e.g., pseudo-labels mined from the same corpus) enables small CNN-based models to match or outperform RoBERTa pretrained on 1000× more data in both zero-shot and few-shot downstream average precision (AP), especially for long-tail and minority-label generalization (Rethmeier et al., 2020).

Subnetwork Selection and Distillation: Partial parameter extraction from large LLM checkpoints (e.g., Pythia-6.9B) via evolutionary search over subnetwork architectures initializes small models closer to optimality, yielding 1.5–10× reductions in token budgets for equivalent validation perplexity. Top-down knowledge distillation, blending hard and soft KL-divergence losses, further increases token efficiency and downstream accuracy for resource-constrained small LLMs (Krishnakumar et al., 8 Oct 2025).
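
A minimal sketch of the blended distillation loss; alpha, the temperature, and the random stand-in logits are illustrative hyperparameters and inputs rather than the paper's settings:

```python
import torch
import torch.nn.functional as F

def distillation_loss(student_logits, teacher_logits, labels, alpha=0.5, temperature=2.0):
    """Blend of hard cross-entropy and temperature-scaled soft KL divergence,
    as used when distilling a large teacher into a subnetwork-initialized
    small model."""
    hard = F.cross_entropy(student_logits, labels)
    soft = F.kl_div(
        F.log_softmax(student_logits / temperature, dim=-1),
        F.log_softmax(teacher_logits / temperature, dim=-1),
        reduction="batchmean",
        log_target=True,
    ) * (temperature ** 2)   # standard T^2 scaling keeps gradient magnitudes comparable
    return alpha * hard + (1.0 - alpha) * soft

if __name__ == "__main__":
    torch.manual_seed(0)
    student = torch.randn(8, 1000, requires_grad=True)   # (batch, vocab) stand-ins
    teacher = torch.randn(8, 1000)
    labels = torch.randint(0, 1000, (8,))
    print(distillation_loss(student, teacher, labels).item())
```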

4. Empirical Outcomes, Scaling Laws, and Evaluation Protocols

Systematic evaluations under standardized small-data challenges (BabyLM, both 2023 and 2024) provide comparative data on the absolute and relative performance of sample-efficient methodologies.

Leaderboards and Scaling: The best hybrid models (GPT-BERT, Boot-BERT, ELC-BERT) in the 100M-word track achieve BLiMP accuracy of 85–86%, with downstream GLUE macro-averages of 81.5%, outperforming pure encoder or decoder baselines and matching Llama 2/OPT models trained on orders of magnitude more data (Hu et al., 6 Dec 2024, Warstadt et al., 10 Apr 2025). Under 10M-word caps, these methods maintain a >70% macro-average. However, performance still grows linearly with $\log_{10}(\text{FLOPs})$ (Pearson $r \approx 0.8$), confirming a compute-limited regime.

Evaluation Metrics: Aggregate evaluation spans syntactic (BLiMP), downstream NLU ((Super)GLUE), generalization (MSGS), vision-language (VQA, Winoground), and pragmatic/world-knowledge (EWoK) dimensions. Perplexity, pseudo-perplexity (for masked LMs), F1/accuracy on probes, and minimal-pair accuracy constitute the core metrics.
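
For concreteness, a slow-but-simple pseudo-perplexity routine for a masked LM (exhaustive one-token-at-a-time masking); the distilbert-base-uncased checkpoint is only a small stand-in, and production evaluations typically batch or subsample positions:

```python
import math
import torch
from transformers import AutoModelForMaskedLM, AutoTokenizer

def pseudo_perplexity(model, tokenizer, text):
    """Pseudo-perplexity for a masked LM: mask one token at a time and
    accumulate its log-probability under the model."""
    enc = tokenizer(text, return_tensors="pt")
    input_ids = enc["input_ids"][0]
    nll, count = 0.0, 0
    model.eval()
    with torch.no_grad():
        for i in range(1, input_ids.size(0) - 1):      # skip [CLS]/[SEP]-style specials
            masked = input_ids.clone()
            masked[i] = tokenizer.mask_token_id
            logits = model(masked.unsqueeze(0)).logits[0, i]
            log_probs = torch.log_softmax(logits, dim=-1)
            nll -= log_probs[input_ids[i]].item()
            count += 1
    return math.exp(nll / max(count, 1))

if __name__ == "__main__":
    name = "distilbert-base-uncased"                    # small stand-in checkpoint
    tok = AutoTokenizer.from_pretrained(name)
    mlm = AutoModelForMaskedLM.from_pretrained(name)
    print(pseudo_perplexity(mlm, tok, "Sample-efficient pretraining reduces data needs."))
```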

Impact of Data Quality and Augmentation: Curated, child-directed, and narrative-heavy text (CHILDES, children's books) consistently drives stronger syntactic and pragmatic generalization than generic web text. Aggressive text augmentation in budget-constrained settings (e.g., randomized chunking, synthetic document inflation) also yields 2–4% gains on challenging syntactic tasks (Warstadt et al., 10 Apr 2025).

5. Practical Guidelines and Recommendations

Drawing from cross-paper analyses:

  • Target Multi-objective Losses: Combine MLM, CLM, span modeling, and discriminative objectives for maximal data utility, exploiting complementary inductive biases (Hu et al., 6 Dec 2024, Bajaj et al., 2022).
  • Invest in Corpus Design: Domain and audience-focused corpora (child speech, dialogue, stories) outperform web-scale generic text under strict budgets (Hu et al., 6 Dec 2024, Warstadt et al., 10 Apr 2025).
  • Architectural Simplicity with Bias: Layer aggregation (ELC-style), GEGLU activations, and residual reweighting improve efficiency per parameter while stabilizing high-depth training.
  • Instance and Group-weighted Sampling: Per-sample or per-domain importance weighting, dynamic per-batch loss scaling, and rapid sample reintroduction maximize use of informative examples while filtering unhelpful redundancy (Sow et al., 10 Feb 2025, Hajimolahoseini et al., 2023).
  • CL as Incremental/Early-phase Tool: Curriculum orderings, especially as warmup, yield permanent gains with marginal overhead (Zhang et al., 12 Jun 2025).
  • Distillation and Subnetwork Initialization: Knowledge transfer from large LLMs to smaller ones via evolutionary mask search and temperature-tuned KL losses allows competitive performance at a fraction (down to $1/9$) of the baseline token and compute cost (Krishnakumar et al., 8 Oct 2025).
  • Monitoring Compute and Sample Efficiency: Report and monitor both compute cost (wall-clock, FLOPs) and sample-efficiency metrics such as $(\mathrm{EvalScore} - \mathrm{BaselineScore})/\log_{10}(\#\,\mathrm{WordsSeen})$ to calibrate gains (Hu et al., 6 Dec 2024); a minimal helper follows this list.
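
A one-line helper for the sample-efficiency metric above; the example values are toy numbers, not reported results:

```python
import math

def sample_efficiency(eval_score, baseline_score, words_seen):
    """(EvalScore - BaselineScore) / log10(#WordsSeen)."""
    return (eval_score - baseline_score) / math.log10(words_seen)

if __name__ == "__main__":
    # Toy numbers: a 100M-word model scoring 81.5 vs. a 70.0 baseline.
    print(round(sample_efficiency(81.5, 70.0, 100_000_000), 3))
```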

6. Limitations, Controversies, and Future Directions

Despite substantial progress, several challenges remain:

  • Curriculum learning and static data orderings show mixed results unless applied dynamically or as a warmup phase; purely “easy-to-hard” data orders do not yield consistent improvements (Hu et al., 6 Dec 2024, Warstadt et al., 10 Apr 2025).
  • Instance weighting and importance sampling must be carefully tuned to avoid overfitting outliers, especially with aggressive data reduction.
  • Scaling to multimodal pretraining remains unsolved; no current strategies outperformed large vision-language baselines on joint text-image tasks under capped budgets (Hu et al., 6 Dec 2024).
  • Expanding to task-free or zero-shot regimes: Many subset selection strategies (domain-classifier, clustering) require a small in-domain or target dataset, limiting applicability in genuinely zero-knowledge settings (Chakraborty et al., 2020).
  • Automated hyperparameter and schedule selection: Efficient curricula, reweighting temperatures, and architectural hyperparameters often rely on hand-tuning or downstream validation.

Emerging research suggests that sample-efficient pretraining will require joint optimization across data, architecture, loss, and compute, along with robust, open evaluation protocols and cross-task generalization metrics.

7. Comparative Table of Sample-efficient Pretraining Methods

| Method | Key Principle | Data Reduction Achieved |
| --- | --- | --- |
| Perplexity Sampling (Rosa et al., 2022) | Up/down-bias moderate-perplexity text | 1/5 of the data, 1/2 of the steps |
| Loss-based Instance Reweighting (Sow et al., 10 Feb 2025) | Dynamic per-sample weighting | 20–40% of tokens pruned, ≲1% loss |
| SwiftLearn (Hajimolahoseini et al., 2023) | Drop low-importance samples | Up to 90% dropped, 3× speedup |
| Curriculum Learning (Zhang et al., 12 Jun 2025) | Difficulty-ordered, paced sampling | 20–40% fewer tokens to reach baseline |
| METRO (Denoising/RTD) (Bajaj et al., 2022) | Model-generated sampling/objectives | 2–3× fewer tokens vs. MLM |
| Subnetwork + Distillation (Krishnakumar et al., 8 Oct 2025) | Weight-greedy subnetwork init from an LLM, KD | Up to 9.2× fewer tokens |

This synthesis reflects current best practices and empirical regularities underpinning sample-efficient pretraining in both language and vision domains, highlighting the field's rapid transition from brute-force corpus maximization to principled, data-aware learning (Hu et al., 6 Dec 2024, Warstadt et al., 10 Apr 2025, Zhang et al., 12 Jun 2025, Rosa et al., 2022, Hajimolahoseini et al., 2023, Sow et al., 10 Feb 2025, Krishnakumar et al., 8 Oct 2025, Bajaj et al., 2022, Chakraborty et al., 2020, Rethmeier et al., 2020).
