Pretraining Data Selection

Updated 20 May 2026
  • Pretraining data selection is the process of curating optimal data subsets from large corpora to enhance model performance and efficiency.
  • It employs methods like distributional similarity, classifier-based ranking, LM-based proxies, and influence functions to balance quality, diversity, and target alignment.
  • Empirical studies show that targeted data selection can reduce compute needs and training tokens while boosting downstream accuracy and domain specialization.

Pretraining data selection refers to the set of methodologies, algorithms, and empirical strategies designed to identify, rank, or subsample subsets from large unlabeled corpora for efficient and performant pretraining of neural models—including LLMs, domain-specific transformers, and self-supervised speech recognizers. The overarching goal is to optimize data use either for universal capabilities or for highly specialized downstream targets, subject to constraints imposed by compute, domain requirements, or data quality variability.

1. Fundamental Concepts and Objectives

Pretraining data selection tackles two intertwined challenges in large-scale model development: (a) the immense heterogeneity and noise present in web-scale or domain-generic corpora, and (b) the computational infeasibility of pretraining on all available data. The process is grounded in the empirically validated principle that data quality, domain/target alignment, and diversity in training corpora are critical determinants of resulting model capabilities, convergence speed, and sample efficiency (Mizrahi et al., 16 Jul 2025, Shum et al., 2 Mar 2025, Zhang et al., 2024).

Key goals include:

  • Maximizing downstream performance (e.g., accuracy, F1, BLEU, NPM) for a fixed computational budget by prioritizing data projected to yield the strongest impact on evaluation tasks.
  • Reducing token count or compute needs required to reach a given level of downstream performance.
  • Explicitly aligning pretraining data with anticipated downstream (target) distributions or benchmarks, sometimes to the level of target-specific “expert” models (Mizrahi et al., 16 Jul 2025, Wang et al., 17 Apr 2026).
  • Balancing coverage across quality, semantic, topical, or stylistic axes to avoid overfitting to highly correlated or “redundant” modes, which is crucial for generalizing to a broad task suite (He et al., 21 Oct 2025, Liu et al., 23 Apr 2025).

2. Methodological Taxonomy

Modern research organizes data-selection methods into several distinct but sometimes overlapping families:

  1. Distributional Similarity and Importance Resampling: Selects examples whose empirical feature statistics (e.g., n-gram frequencies, neural embeddings) match or minimize divergence from a target distribution. DSIR (Xie et al., 2023) and GOT-D (Kang et al., 2024) instantiate this via importance resampling and optimal transport gradients, respectively.
  2. Classification and Domain Alignment: Employs discriminative classifiers trained on labeled or pseudo-labeled in-domain examples versus the general pool, assigning scores that reflect the probability of domain membership (Iter et al., 2021, Liu et al., 23 Apr 2025).
  3. Language-Model-Based Proxies: Measures per-sample likelihoods or perplexities under generic and in-domain LMs, then either ranks samples by the difference between the two (contrastive data selection) or filters directly by perplexity (Lu et al., 2022, Dai et al., 2020). Perplexity–performance correlations have also been used as rank-based selectors without explicit model training (Thrush et al., 2024).
  4. Influence Functions and Model-Driven Selection: Estimates the first-order effect of including a data instance on the loss over a reference task, using influence functions or finite difference approximations (Yu et al., 2024, Hao et al., 7 Oct 2025, Zhang et al., 2024).
  5. Policy Gradient and Mask Learning: Directly optimizes a selection mask via policy gradients using set-based reward functions that combine quality and diversity metrics, as in DATAMASK (Fan et al., 30 Dec 2025).
  6. Diversity-Aware and Orthogonalized Approaches: Explicitly decorrelates multiple quality or semantic dimensions via PCA or spectral decompositions to prevent reduced coverage and score collapse (He et al., 21 Oct 2025, Liu et al., 23 Apr 2025).
  7. Benchmark Alignment and Targeting: Scores candidate pretraining data by direct similarity to evaluation benchmarks (embedding-based), then learns lightweight scorers to propagate these scores for large-scale selection (Mizrahi et al., 16 Jul 2025, Wang et al., 17 Apr 2026).
  8. Attention- and Neuron-Centric Selection: Exploits neuron- or attention-head-level activation patterns, selecting data that maximally stimulates “high-impact” or retrieval-related sub-networks within a pretrained LLM (Hua et al., 12 May 2025, Wang et al., 17 Apr 2026).

This methodological landscape continues to evolve, and recent frameworks increasingly combine several of these axes in unified selection recipes (Liu et al., 23 Apr 2025, Fan et al., 30 Dec 2025, Bai et al., 2024).
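As an illustration of the first family, the following is a minimal sketch of importance resampling over hashed n-gram features. It is written in the spirit of DSIR but is not the authors' implementation: the bucket count, the add-one smoothing, the use of `zlib.crc32` as the hash, and the Gumbel top-k sampling step are all assumptions made for this example, and every function name is hypothetical.

```python
import zlib

import numpy as np

NUM_BUCKETS = 1 << 16  # hypothetical hash-table size, chosen for illustration


def hashed_ngram_counts(text, num_buckets=NUM_BUCKETS):
    """Count a text's unigrams and bigrams in a fixed number of hash buckets."""
    tokens = text.lower().split()
    ngrams = tokens + [" ".join(pair) for pair in zip(tokens, tokens[1:])]
    counts = np.zeros(num_buckets)
    for gram in ngrams:
        counts[zlib.crc32(gram.encode("utf-8")) % num_buckets] += 1
    return counts


def fit_bucket_log_probs(texts, num_buckets=NUM_BUCKETS):
    """Fit an add-one smoothed distribution over buckets; return log-probabilities."""
    total = np.ones(num_buckets)  # add-one smoothing (an assumption of this sketch)
    for text in texts:
        total += hashed_ngram_counts(text, num_buckets)
    return np.log(total / total.sum())


def log_importance_weights(pool, target_texts, num_buckets=NUM_BUCKETS):
    """Log of p_target(x) / p_pool(x) under bag-of-hashed-n-gram models."""
    log_ratio = (fit_bucket_log_probs(target_texts, num_buckets)
                 - fit_bucket_log_probs(pool, num_buckets))
    return np.array([hashed_ngram_counts(t, num_buckets) @ log_ratio for t in pool])


def resample(pool, target_texts, k, seed=0):
    """Draw k pool documents without replacement, favoring high importance weight.

    Adding Gumbel noise to the log-weights and taking the top k is one standard
    way to sample without replacement in proportion to the weights.
    """
    rng = np.random.default_rng(seed)
    log_w = log_importance_weights(pool, target_texts)
    noisy = log_w + rng.gumbel(size=len(pool))
    chosen = np.argsort(-noisy)[:k]
    return [pool[int(i)] for i in chosen]
```

Each document is touched once to build the pool model and once to score it, which is what makes this family attractive at web scale. In practice the feature extractor, smoothing, and bucket count would need tuning against the target corpus.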

3. Quality, Diversity, and Target Alignment: Core Criteria

Selection criteria revolve around three central axes:

Criterion | Operationalization | Representative Methods
Data Quality | Human/LLM scores, LM PPL, KL, compression eff., influence | GPT/LLM graders, FineWeb-Edu, DCLM, DSIR, QuRating, PRESELECT, Quad
Diversity | Clustering, low similarity, PCA decorrelation, MABs | DATAMASK, QuaDMix, ODiS, Quad
Target Alignment | Domain LM similarity, BMK sim, OT gradients, neuron-activation | CDS, Domain classifier, BETR, NAG, GOT-D, MASS, policy-gradient

Quality is quantified via linguistic attributes (coherence, grammar), factual or pedagogical content (knowledge richness, reasoning), model-based loss (perplexity or compression efficiency), and model-driven influence. Diversity is enforced through clustering and orthogonalization (PCA, spectral methods), probabilistic selection across clusters, or set-based submodular metrics. Target alignment relies on direct similarity to benchmarks, in-domain LMs or classifiers, optimal transport proximity, or neuron/attention patterns associated with target functionality.

A consistent finding is that naive top-k selection by any single score (e.g., highest overall “quality”) often diminishes generalization, because the underlying metrics are correlated and coverage is lost; the relationship between raw score and downstream accuracy is non-monotonic (He et al., 21 Oct 2025, Liu et al., 23 Apr 2025). To counteract this, most modern systems use multi-dimensional or adaptive mechanisms to balance or orthogonalize the information selected for pretraining.
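One way to picture the orthogonalization idea is the sketch below, which decorrelates several per-document quality scores with PCA and then spends the selection budget round-robin across components instead of on a single aggregate score. It is an assumption-laden illustration, not the procedure of any cited method: the function names, the sign convention for components, and the round-robin budget split are all hypothetical choices.

```python
import numpy as np


def pca_project(scores):
    """Standardize score columns and project them onto principal components.

    The resulting columns are mutually uncorrelated. Component signs are
    arbitrary in PCA, so each one is oriented to have a non-negative loading
    sum (a heuristic assumed here so that "high" tends to mean "better").
    """
    z = (scores - scores.mean(axis=0)) / (scores.std(axis=0) + 1e-12)
    _, _, vt = np.linalg.svd(z, full_matrices=False)  # rows = principal directions
    signs = np.where(vt.sum(axis=1) < 0, -1.0, 1.0)
    return z @ (vt * signs[:, None]).T


def select_per_component(scores, budget):
    """Pick up to `budget` distinct documents, cycling over components."""
    projected = pca_project(scores)
    n_docs, n_comp = projected.shape
    rankings = [np.argsort(-projected[:, j]).tolist() for j in range(n_comp)]
    cursors = [0] * n_comp
    chosen, seen = [], set()
    limit = min(budget, n_docs)
    while len(chosen) < limit:
        for j in range(n_comp):
            # Skip documents already taken via another component.
            while cursors[j] < n_docs and rankings[j][cursors[j]] in seen:
                cursors[j] += 1
            if cursors[j] < n_docs and len(chosen) < limit:
                idx = rankings[j][cursors[j]]
                seen.add(idx)
                chosen.append(idx)
    return chosen
```

Because two highly correlated raw scores collapse into one dominant component, they no longer count twice when the budget is divided, which is the intuition behind avoiding score collapse.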

4. Algorithms and Practical Workflows

Data selection algorithms differ substantially in computational complexity, scalability, and modularity. Leading approaches include:

  • Hashed n-gram importance weights and resampling: O(N) and stateless, so it scales easily to web-scale data (Xie et al., 2023).
  • Contrastive language-model likelihoods: Lightweight (O(|q|)), can leverage n-gram or deep LMs depending on desired speed/precision (Lu et al., 2022, Dai et al., 2020).
  • Classifier-based selection: Requires training on labeled or pseudo-labeled in-domain and general samples; more precise but compute-demanding (Iter et al., 2021).
  • Influence-function estimation with K-FAC/efficient Hessian approximations: Feasible when fast inverse-HVP is available or can be hybridized with further pruning/clustering (Yu et al., 2024, Zhang et al., 2024).
  • Policy-gradient selection (e.g., DATAMASK): Formulates mask learning over the full data pool and employs variance reduction, chunked subsetting, quality-aware initialization, and stochastic search for scalability (Fan et al., 30 Dec 2025).
  • Multi-actor, ensemble, or console-based methods: Multiple selectors (“actors”) update in parallel, and an external “console” adaptively balances or reweights their scores in response to proxy model rewards (Bai et al., 2024).
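The contrastive language-model approach in the list above can be sketched with the smallest possible LMs. The example below scores each candidate by the difference between its per-token cross-entropy under a general model and under an in-domain model, so higher scores indicate text that looks more in-domain. The add-one smoothed unigram models, the whitespace tokenizer, and all names are assumptions made for illustration; the cited work uses stronger n-gram or neural LMs.

```python
import math
from collections import Counter


class UnigramLM:
    """Add-one smoothed unigram model over lowercase whitespace tokens."""

    def __init__(self, texts, vocab):
        self.counts = Counter(tok for t in texts for tok in t.lower().split())
        self.total = sum(self.counts.values())
        self.vocab_size = len(vocab) + 1  # one extra slot for unseen tokens

    def cross_entropy(self, text):
        """Average negative log-probability per token (0.0 for empty text)."""
        tokens = text.lower().split()
        if not tokens:
            return 0.0
        denom = self.total + self.vocab_size
        log_prob = sum(math.log((self.counts[tok] + 1) / denom) for tok in tokens)
        return -log_prob / len(tokens)


def contrastive_scores(pool, domain_texts, general_texts):
    """Score = H_general(x) - H_domain(x); larger means more domain-like."""
    vocab = {tok for t in list(domain_texts) + list(general_texts)
             for tok in t.lower().split()}
    domain_lm = UnigramLM(domain_texts, vocab)
    general_lm = UnigramLM(general_texts, vocab)
    return [general_lm.cross_entropy(x) - domain_lm.cross_entropy(x) for x in pool]


def select_top(pool, scores, k):
    """Return the k highest-scoring documents."""
    order = sorted(range(len(pool)), key=lambda i: scores[i], reverse=True)
    return [pool[i] for i in order[:k]]
```

Sharing one vocabulary between the two models keeps their smoothing comparable, so the score reflects differences in word usage and not differences in vocabulary size.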

These workflows are further stratified by where in the training pipeline selection is performed.

5. Empirical Impact, Scaling Laws, and Domain Specialization

Empirical studies across benchmarks and domains consistently demonstrate that carefully crafted data selection provides significant reductions in compute and/or pretraining tokens for a given performance level. Notable empirical results include:

  • Compute reduction: PRESELECT achieves equal or superior accuracy with 10× fewer tokens versus baseline pretraining on RefinedWeb (Shum et al., 2 Mar 2025). DATAMASK reduces selection time by 98.9% relative to prior greedy methods and boosts downstream accuracy by +3.2 pp on dense models and +1.9 pp on MoE models (Fan et al., 30 Dec 2025).
  • Domain adaptation: Curió-Edu 7B, continued-pretrained on only 10% “educational/STEM” data, outperforms a model trained on the entire 100B-token ClassiCC-PT corpus by +1.8 NPM points on 40+ Portuguese benchmarks, with a fivefold reduction in compute (Almeida et al., 14 Dec 2025).
  • Task-specific selection: MASS achieves >3%–5.9% absolute accuracy gains on math reasoning with only 30%–70% of the raw data, leveraging skill graphs (Li et al., 19 Mar 2025).
  • Generalist vs. Specialist tradeoff: BETR shows that aggressively matching to benchmarks increases accuracy on those tasks but may reduce unrelated capabilities, necessitating careful scale-adaptive filtering as model size grows (Mizrahi et al., 16 Jul 2025).

Scaling law analysis indicates that larger models benefit from less aggressive filtering, with the optimal retention fraction increasing as a power of compute (Mizrahi et al., 16 Jul 2025). In specialized settings, quality-centric selection can dramatically outperform pure volume scaling, especially when model capacity is adequate to leverage curated structure (Almeida et al., 14 Dec 2025).
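The power-law statement above can be written schematically as follows. The symbols are placeholders introduced here for illustration: r* is the optimal retention fraction, C is the compute budget, and a and b are positive constants that would have to be fitted empirically. The source gives no values for them, and the exact functional form used in the cited work may differ.

```latex
\[
  r^{*}(C) \;\approx\; \min\!\left(1,\; a\,C^{\,b}\right),
  \qquad a > 0,\; b > 0
\]
\[
  \frac{r^{*}(kC)}{r^{*}(C)} \;\approx\; k^{\,b}
  \quad \text{while } r^{*} < 1
\]
```

Under this reading, scaling compute by a factor k multiplies the optimal retained fraction by k^b until the full pool is kept, which matches the qualitative claim that larger models benefit from less aggressive filtering.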

6. Recent Innovations and Future Challenges

Recent trends and ongoing challenges include:

  • Unified and joint selection frameworks: Approaches like QuaDMix and DATAMASK fuse multiple quality and diversity criteria with unified parametric selectors or policy gradients, empirically outperforming independent or siloed methods (Liu et al., 23 Apr 2025, Fan et al., 30 Dec 2025).
  • Neuron- and attention-centric interpretability: Methods extracting interpretable “functional backbones” (e.g., NAG, AttentionInfluence) point toward mechanisms for aligning pretraining with task-critical circuitry in LLMs (Wang et al., 17 Apr 2026, Hua et al., 12 May 2025).
  • Model-aware/dynamic feedback: Algorithms that incorporate proxy or main model state, including dynamic multi-actor weighting or influence modeling, achieve greater efficiency and robustness over static selector pipelines (Bai et al., 2024, Yu et al., 2024).
  • Computational and statistical bottlenecks: Influence-based methods require approximations (K-FAC, sub-sampling) to avoid prohibitive compute, and continued research is needed on balancing accuracy/overhead for trillion-token regimes (Zhang et al., 2024, Hao et al., 7 Oct 2025).
  • Diversity–quality–budget tradeoffs: Optimal policies are context- and scale-sensitive; too much “quality” erodes diversity/generalization, while excessive diversity dilutes impact (He et al., 21 Oct 2025, Liu et al., 23 Apr 2025).
  • Applicability to emerging modalities: Extensions to multimodal, non-English, and specialized technical domains (code, math, biomedical) require bespoke selection metrics and more data-efficient abstractions (e.g., skill graphs for math) (Li et al., 19 Mar 2025, Almeida et al., 14 Dec 2025).

Major open questions persist around theoretical guarantees of selection optimality, dynamic mixture adaptation in continual learning, and full automation of data–task alignment across pipeline stages.

7. Summary Table of Key Methods

Method/Class | Core Selection Approach | Reference(s) | Key Empirical Result
DSIR | Importance resampling (n-grams/features) | (Xie et al., 2023) | +2%–2.5% GLUE gain, scales to 1B docs
CDS/Contrastive LM | Cross-entropy diff of LM scores | (Lu et al., 2022, Iter et al., 2021) | –11.8% WER @6% data for ASR
Classifier/DC | Domain classifier (logreg/BERT) | (Iter et al., 2021) | +0.3–0.6 BLEU (MT), avoids overfit
Influential Data | Influence (iHVP, K-FAC) | (Yu et al., 2024, Zhang et al., 2024) | +1.39 pp, efficient UCB sampling
ODiS | PCA on multi-quality, then top-k/PC | (He et al., 21 Oct 2025) | +2.8% absolute avg gain
DATAMASK | Mask learning, joint quality+diversity | (Fan et al., 30 Dec 2025) | +3.2 (dense), +1.9 (MoE) points
QuaDMix | Unified param. function: qual.+diversity | (Liu et al., 23 Apr 2025) | +7.2% avg across multi-benchmark
BETR | Embedding-based sim. to benchmarks | (Mizrahi et al., 16 Jul 2025) | 2.1× compute efficiency
PRESELECT | Compression-prediction score, fastText | (Shum et al., 2 Mar 2025) | 10× compute reduction, +3.1pp avg
MASS | Skill-graph aggregation (math domain) | (Li et al., 19 Mar 2025) | +3.3–5.9 pp, 50%–70% fewer tokens
Curió-Edu 7B | Classifier-based domain filtering | (Almeida et al., 14 Dec 2025) | +4.69 NPM, 20% compute vs. baseline
NAG, AttentionInfluence | Neuron/attention impact-based ranking | (Wang et al., 17 Apr 2026, Hua et al., 12 May 2025) | +4.9–9% avg over random or baseline

In sum, pretraining data selection has emerged as a foundational component of LLM and domain-adaptive model pipelines. The current research landscape reveals a convergence on hybrid, interpretable, and compute-aware selection mechanisms that integrate quality, diversity, and explicit target alignment, substantially advancing the state of efficient large-scale model pretraining.
