Pre-Finetuning Data Screening

Updated 1 August 2025
  • Pre-finetuning data screening is a set of techniques that refine pre-training data by selecting high-quality, task-aligned samples.
  • The process involves methods like label-based, similarity-based, and optimal transport selection to reduce noise and mismatch.
  • Empirical results indicate that effective screening significantly improves model generalization while reducing training time and resource usage.

Pre-finetuning data screening refers to the set of methods and criteria used to select, curate, or refine data prior to fine-tuning a pre-trained model on a downstream task. The goal of these procedures is to maximize fine-tuning efficiency and downstream generalization by ensuring that only the most relevant, high-quality, and task-aligned data instances are used during the adaptation stage. Pre-finetuning data screening has become increasingly vital as model sizes and data volumes grow, as transfer learning applications expand to low-data or out-of-distribution settings, and as data-driven model bias and redundancy issues emerge.

1. Theoretical Foundations for Data Reuse and Screening

The principal theoretical motivation for pre-finetuning data screening derives from generalization bounds—specifically, excess risk analysis. When a model is initialized from pre-trained weights and subsequently fine-tuned on task-specific data, the downstream generalization gap $F(\theta) - F(\theta^*)$ can be decomposed into terms reflecting the distance between the pre-training and target distributions as well as the volume/quality of data used in both stages (Liu et al., 2021).

Let $F(\theta) = \mathbb{E}_{(x,y)\sim\mathcal{P}}[f(\theta; x, y)]$ denote the target loss and $G(\theta) = \mathbb{E}_{(x',y')\sim\mathcal{Q}}[g(\theta; x', y')]$ the pre-training loss, with the pre-training gradient mismatch bounded by $\Delta$:

$$\|\nabla F(\theta) - \nabla G(\theta)\| \leq \Delta \quad \forall\, \theta$$

Excess risk bounds show that:

  • The benefit of initialization from pre-trained weights is roughly $O(\Delta^2)$.
  • Fine-tuning on $n$ target samples yields an excess risk diminishing as $O(\log(n\Delta^2) / n)$; the benefit of pre-training thus weakens as the target sample size increases or if distributions diverge substantially.

Including select pre-training samples during fine-tuning improves the excess risk bound, introducing a term $\delta^2$ that captures the attainable match between pre-training and target-task gradients. By minimizing $\delta^2$ via screening, the model's final generalization can be optimally improved. This justifies not only reusing, but actively screening and weighting pre-training samples according to their alignment with the downstream task (Liu et al., 2021).
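As a concrete illustration of gradient alignment (a minimal sketch, not the authors' implementation; the PyTorch model, loss, and batch interfaces are assumed), the snippet below estimates the mismatch between a target-task gradient and a pre-training gradient, which can serve as a crude proxy for the $\delta^2$ term when ranking pre-training batches:

```python
# Hedged sketch: per-batch gradient mismatch as a proxy for pre-training/target alignment.
import torch

def batch_gradient(model, loss_fn, batch):
    """Flattened parameter gradient of the loss on a single (inputs, labels) batch."""
    model.zero_grad()
    inputs, labels = batch
    loss_fn(model(inputs), labels).backward()
    return torch.cat([p.grad.reshape(-1) for p in model.parameters() if p.grad is not None])

def gradient_mismatch(model, loss_fn, target_batch, pretrain_batch):
    """L2 distance between target and pre-training gradients; smaller values suggest
    the pre-training batch is better aligned with the downstream task."""
    g_target = batch_gradient(model, loss_fn, target_batch)      # torch.cat copies the gradients
    g_pretrain = batch_gradient(model, loss_fn, pretrain_batch)  # overwrites .grad, not g_target
    return torch.norm(g_target - g_pretrain).item()
```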

2. Methods for Screening and Curating Pre-training Data

Screening approaches range from simple label-based inclusion to advanced distributional matching. Key methods include:

  • Label-based Selection (if labels/classes overlap): Directly reusing data from pre-training classes that coincide with the target task. Effective when clear semantic overlap exists (Liu et al., 2021).
  • Random Selection: Sampling a random subset of pre-training data, a naive baseline that does not account for domain differences.
  • Similarity-based Selection: The preeminent approach involves computing feature means (e.g., penultimate-layer activations from a pre-trained backbone) for classes/clusters in both target and pre-training sets, then measuring similarity via cosine or $L_2$ distance.
  • Unbalanced Optimal Transport (UOT) Selection: Formally, an unbalanced OT problem is solved to align target and source data distributions—minimizing

$$\min_{P \geq 0}\; \langle P, C \rangle - \epsilon\, h(P) + \tau_1\, \mathrm{KL}(P \mathbf{1}, w^{(g)}) + \tau_2\, \mathrm{KL}(P^{\top} \mathbf{1}, w^{(f)})$$

Here $C$ quantifies pairwise costs between feature means, $P$ is a transport matrix, and $\tau_1, \tau_2$ control the marginal KL penalties. By selecting pre-training samples/classes with high aggregate transport mass, only the most similar pre-training data is reused (Liu et al., 2021); a minimal selection sketch is given after this list.

  • Distribution-shifting Selection for LLMs: In language domains, optimal transport dual gradients identify the data points that most effectively shift the model's effective distribution toward the target (via a one-shot OT dual-gradient computation), delivering high-efficiency screening at scale (Kang et al., 5 May 2024). This method selects samples from the massive, unlabeled pool which, when used for pre-finetuning, optimally nudge the pretraining distribution in the desired direction.
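A minimal sketch of similarity/UOT-based class selection follows. It assumes class prototypes (feature means) have already been extracted and uses the POT library's unbalanced Sinkhorn solver; the function name, hyperparameters, and top-k heuristic are illustrative assumptions rather than the exact procedure of Liu et al. (2021).

```python
# Hedged sketch: rank pre-training classes by aggregate transport mass toward target prototypes.
import numpy as np
import ot  # POT: Python Optimal Transport (pip install pot)

def select_pretrain_classes(target_means, source_means, top_k=50, reg=0.05, reg_m=1.0):
    """target_means: (T, d) target-class feature means; source_means: (S, d) pre-training means."""
    # Pairwise squared-Euclidean costs between prototypes, normalized for stability
    C = ot.dist(target_means, source_means, metric="sqeuclidean")
    C = C / C.max()

    # Uniform marginals over target and source classes
    w_target = np.full(target_means.shape[0], 1.0 / target_means.shape[0])
    w_source = np.full(source_means.shape[0], 1.0 / source_means.shape[0])

    # Entropy-regularized unbalanced OT plan (reg_m weights the marginal KL penalties)
    P = ot.unbalanced.sinkhorn_unbalanced(w_target, w_source, C, reg, reg_m)

    # Keep the pre-training classes that receive the most transport mass
    mass_per_source_class = P.sum(axis=0)
    return np.argsort(mass_per_source_class)[::-1][:top_k]
```

Cosine-similarity selection corresponds to the simpler special case of ranking source prototypes by their similarity to the nearest target prototype, without solving a transport problem.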

A comparative summary:

| Screening Method | Principle | Key Application |
| --- | --- | --- |
| Label-based | Overlapping classes or labels | Natural/classification domains |
| Random | Unbiased sampling; no domain knowledge | Baseline |
| Similarity-based / UOT | Feature/prototype similarity; OT alignment | Vision, classification, LLMs |
| OT gradient (LLM) | Minimize effective distributional gap | Large-scale language |

3. Impact of Data Quality, Domain Match, and Curation

The effect of screening is highly dependent on several factors:

  • Domain Overlap and Divergence: When the gap between pre-training and target data is large, reuse of mismatched pre-training data can degrade generalization due to a dominant $\delta^2$ term in the excess risk (Liu et al., 2021). Curation—either by label or via OT-based screening—substantially reduces $\delta^2$, allowing only domain-aligned samples to be included.
  • Noise and Data Volume: The effectiveness of screening is magnified as the quality of the pre-training data worsens. For example, roughly 2000× more noisy data (LAION) is required to match the transfer performance of well-curated data (supervised ImageNet), whereas screened/curated subsets reach comparable performance much faster and with less waste (Entezari et al., 2023). Quality filtering, clustering, and confidence-based methods systematically remove noisy or detrimental data (Longpre et al., 2023, Chen et al., 19 Mar 2024).
  • Low-data Regimes: In limited-sample settings, careful data selection is critical, as inappropriate reuse of pre-training data otherwise causes generalization collapse (Liu et al., 2021, Entezari et al., 2023). Conversely, as $n \to \infty$ for the target data, the benefit of pre-training reuse diminishes.
  • Data Redundancy and Diversity: Shapley value–based refinement and quality-proxy clustering ensure that only the most diverse, high-contribution samples (by empirical or value-function improvement) are included (He et al., 23 Apr 2024). This avoids redundancy, noise, and overfitting associated with overly large, unscreened datasets.

4. Practical Algorithms and Computational Considerations

Efficient implementation depends on screening method and scale. Representative techniques include:

  • Feature Extraction and Clustering: For vision tasks, compute deep representations for all data, then cluster (e.g., K-means) for prototype construction.
  • Optimal Transport Computation: Employ entropy-regularized Sinkhorn or similar solvers on class/cluster mean summaries. For very large data (LLMs), a single OT computation suffices; the OT dual form yields gradient-based sample selection with high efficiency in JAX or similar frameworks (Kang et al., 5 May 2024).
  • Dynamic Data Pruning: For NLP classification, dynamic EL2N-based example scoring allows pruning during training, reducing resource requirements by 41–66% while maintaining accuracy within 1% of full-data fine-tuning (Attendu et al., 2023); a scoring sketch follows this list.
  • Quality/Confidence Screening: Use self-consistency and NLI-derived confidence metrics for each (prompt, response) sample. Filtering based on thresholded confidence removes noisy samples, with candidate correction steps for borderline cases (Chen et al., 19 Mar 2024).
  • Shapley-based Proxy Sampling: Use embedding-based clustering to select proxies, estimate each proxy's contribution via Shapley value (by repeated ablation or score difference), then sample clusters proportional to their estimated impact for final fine-tuning (He et al., 23 Apr 2024); see the cluster-sampling sketch below.
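The EL2N score underlying dynamic pruning is simple to compute; the sketch below assumes a PyTorch-style classifier and data loader, and the keep-fraction heuristic is an illustrative assumption rather than the exact schedule of Attendu et al. (2023).

```python
# Hedged sketch: EL2N scoring for periodic (dynamic) example pruning during fine-tuning.
import torch
import torch.nn.functional as F

@torch.no_grad()
def el2n_scores(model, loader, num_classes, device="cpu"):
    """EL2N per example: L2 norm of (softmax probabilities - one-hot label)."""
    model.eval()
    scores = []
    for inputs, labels in loader:
        probs = F.softmax(model(inputs.to(device)), dim=-1)
        onehot = F.one_hot(labels.to(device), num_classes).float()
        scores.append(torch.norm(probs - onehot, dim=-1).cpu())
    return torch.cat(scores)

def keep_mask(scores, keep_fraction=0.5):
    """Keep the highest-scoring (hardest) examples for the next training period."""
    k = max(1, int(len(scores) * keep_fraction))
    threshold = torch.topk(scores, k).values.min()
    return scores >= threshold
```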

These pipelines are scalable to millions of instances and, depending on method, can be executed on a single modern GPU.
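As a rough illustration of the proxy-based pipeline above, the following sketch clusters instance embeddings, scores one proxy per cluster with a user-supplied contribution estimator (standing in for a Shapley-value approximation), and samples clusters in proportion to that estimate. The function names, the contribution interface, and the budget-allocation scheme are assumptions for illustration, not the exact SHED algorithm (He et al., 23 Apr 2024).

```python
# Hedged sketch: proxy-based, contribution-weighted cluster sampling for fine-tuning data.
import numpy as np
from sklearn.cluster import KMeans

def proxy_cluster_sample(embeddings, contribution_fn, n_clusters=100, budget=10_000, seed=0):
    """embeddings: (N, d) array; contribution_fn(index) -> estimated value of that proxy sample."""
    rng = np.random.default_rng(seed)
    km = KMeans(n_clusters=n_clusters, random_state=seed, n_init=10).fit(embeddings)

    # Proxy for each cluster = the sample closest to its centroid
    proxies = [int(np.argmin(np.linalg.norm(embeddings - c, axis=1))) for c in km.cluster_centers_]

    # Non-negative contribution estimates -> sampling probabilities over clusters
    scores = np.array([max(contribution_fn(p), 0.0) for p in proxies]) + 1e-12
    probs = scores / scores.sum()

    # Allocate the selection budget across clusters proportionally to estimated impact
    selected = []
    for cluster_id, n_take in enumerate(rng.multinomial(budget, probs)):
        members = np.where(km.labels_ == cluster_id)[0]
        take = min(n_take, len(members))
        selected.extend(rng.choice(members, size=take, replace=False).tolist())
    return np.array(selected)
```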

5. Empirical Outcomes and Generalization Gains

Empirical evaluation demonstrates:

  • Robust Performance Gains: On vision benchmarks (Dogs, Cars, CUB, etc.), UOT-based screening yields an average accuracy increase of ~2.93% over standard fine-tuning, with especially pronounced lift in scarce-data scenarios (Liu et al., 2021).
  • Noise Mitigation and Compression: Subsets containing only 10% of the original data, selected with SHED, can match or outperform the full dataset, indicating extensive redundancy and the importance of impact-aware screening (He et al., 23 Apr 2024). Confidence-based curation pipelines (e.g., CLEAR) improve structured output correctness and accuracy by >15 percentage points in some LLM settings (Chen et al., 19 Mar 2024).
  • Resource Efficiencies: Dynamic data pruning achieves up to 66% reduction in fine-tuning computation time with marginal or no accuracy degradation (Attendu et al., 2023). Distribution-shifting OT-gradient screening for LLM pre-finetuning processes millions of candidate samples within minutes and consistently improves zero-shot, domain adaptation, and detoxification performance (Kang et al., 5 May 2024).
  • Domain and Task Generalization: Transferability of screened subsets (e.g., SHED) has been validated across model architectures (e.g., LLaMA, Vicuna, GPT-2) and targets, with performance robustness for both NLU and generation downstream tasks (He et al., 23 Apr 2024, Kang et al., 5 May 2024).

6. Limitations, Challenges, and Best Practices

Current screening methods have several limitations and open issues:

  • Diversity and Representativeness: Clustering and proxy methods, if not carefully tuned, may overlook rare but important cases and reduce data diversity.
  • Objective and Bias: Value functions based solely on accuracy might not capture fairness or other domain desiderata, risking bias propagation.
  • Parameter Sensitivity: Methods relying on OT or Shapley value approximations often require hyperparameter adjustment (e.g., cluster numbers, entropy regularization, thresholding).
  • Scalability: Although computationally efficient relative to brute-force approaches, massive datasets may still pose practical limits for feature extraction and proxy definition.
  • Generalizability: Most theoretical results and empirical validations are shown for specific domains (vision, NLU); applicability to highly open-ended domains or arbitrary LLM tasks requires additional validation and evaluation.

Established best practices include aligning screening criteria with downstream task properties, validating data selection through both domain-agnostic metrics (accuracy, F1, BLEU) and task-specific objectives (e.g., semantic adherence, reduced bias), and iterative refinement via automated or semi-automated pipelines (Longpre et al., 2023, Liu et al., 2021, Chen et al., 19 Mar 2024).

7. Significance in Model Development Pipelines and Future Research

Pre-finetuning data screening has major implications for modern model development:

  • Resource Optimization: Selecting high-impact, low-noise examples amplifies accuracy per labeled datum and dramatically lowers computation/training burden.
  • Generalization and Robustness: Models trained on screened, task-aligned data generalize better—especially under data scarcity, distribution shift, or noise.
  • Ethical and Fair Model Behavior: Screening data with respect to bias, underrepresentation, toxicity, or domain appropriateness mitigates negative downstream behaviors and supports responsible AI deployment (Wang et al., 2023, Longpre et al., 2023).
  • Automated, Data-centric AI: Emerging paradigms (e.g., CLEAR, SHED) decouple data improvement from model architecture, enabling model-agnostic improvements directly through data pipeline refinement.

Ongoing research targets automated screening under partial labels, objective-aware proxy design, joint optimization of data and model parameters, and formal guarantees on downstream transferability and fairness properties. The increasing scale of model training, the prevalence of domain adaptation, and regulatory pressures around model provenance and data use all reinforce the centrality of pre-finetuning data screening in contemporary machine learning practice.