Data-Efficient Pre-training

Updated 1 September 2025
  • Data-efficient pre-training is a strategy that optimizes data selection, learning objectives, and model architecture to reduce training resources while maintaining robust performance.
  • Techniques like contrastive self-supervision, multi-view learning, and knowledge distillation extract rich supervisory signals from limited, high-quality samples.
  • Practical implementations demonstrate up to 99% performance retention using reduced datasets, enabling sustainable and domain-adaptive model pre-training.

Data-efficient pre-training refers to methods and strategies that enable neural models to achieve high downstream or transfer performance while minimizing the volume of data, computational cost, or annotation effort required during pre-training. This paradigm has grown in importance given the escalating scale and environmental costs of “conventional” large-model pre-training, and is especially crucial in problem domains or modalities where data collection is expensive, sensitive, or intrinsically scarce.

1. Underlying Concepts and Motivations

The canonical approach in contemporary deep learning is to pre-train large models on massive “task-external” or generic datasets (e.g., 160GB of text for RoBERTa, 400M image-text pairs for CLIP). While effective, this method faces several issues:

  • Prohibitive data and computation requirements.
  • Diminishing returns due to data redundancy or poorly informative samples.
  • Poor fit for low-resource domains or rare classes (“long-tail” settings).

Data-efficient pre-training aims to optimize learning under such constraints by increasing the “informativeness” per training sample, either by modifying learning objectives, curating or selecting data more effectively, or improving the learning architecture.

2. Methodological Advances

A variety of techniques have been developed for data efficiency across tasks and modalities:

2.1 Contrastive Self-Supervision

CLESS (“Contrastive Learning-Data Efficient Self-Supervision”) reframes NLP pre-training from token prediction to measuring similarity between dense text and (pseudo-)label embeddings in a shared space. This approach, rooted in noise contrastive estimation (NCE) rather than softmax/token-level prediction, enables models to efficiently mine supervision signals even from limited, domain-specific (task-internal) data (Rethmeier et al., 2020).
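A minimal sketch of such a dense text-versus-label contrastive (NCE/InfoNCE) objective is shown below; it assumes a batch of text embeddings and matching (pseudo-)label embeddings in a shared space, with illustrative tensor names and dimensions rather than the CLESS implementation.

```python
import torch
import torch.nn.functional as F

def nce_text_label_loss(text_emb: torch.Tensor,
                        label_emb: torch.Tensor,
                        temperature: float = 0.07) -> torch.Tensor:
    """InfoNCE-style loss between dense text and (pseudo-)label embeddings.

    text_emb:  (B, D) text representations
    label_emb: (B, D) embeddings of each text's positive label; the other
               B - 1 labels in the batch serve as negatives.
    """
    text_emb = F.normalize(text_emb, dim=-1)
    label_emb = F.normalize(label_emb, dim=-1)
    logits = text_emb @ label_emb.t() / temperature          # (B, B) similarities
    targets = torch.arange(text_emb.size(0), device=text_emb.device)
    return F.cross_entropy(logits, targets)                  # positives on the diagonal

# Example: 32 texts and their label embeddings in a 128-dim shared space
loss = nce_text_label_loss(torch.randn(32, 128), torch.randn(32, 128))
```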

2.2 Enhanced Supervision Signal and Multi-View Learning

DeCLIP extends vision-language contrastive learning by synergistically incorporating:

  • Self-supervision within each modality (SimSiam-style image self-supervision and masked language modeling losses).
  • Cross-modal “multi-view” supervision (augmenting both image and text, producing additional alignment tasks).
  • Nearest-neighbor supervision (introducing semantic signals from neighboring, but not identical, pairs).

Together, these techniques yield stronger representations and transferability using 7.1x less data than baseline CLIP (Li et al., 2021).
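A rough sketch of the multi-view supervision term alone (not the DeCLIP training code) is given below; `img_encoder`, `txt_encoder`, `aug_img`, `aug_txt`, and `contrastive_loss` are assumed, user-supplied components, and the within-modality self-supervision and nearest-neighbor terms are omitted.

```python
def multi_view_clip_loss(images, texts, img_encoder, txt_encoder,
                         aug_img, aug_txt, contrastive_loss):
    """Average the image-text contrastive loss over all pairings of two
    augmented views per modality (the cross-modal 'multi-view' supervision)."""
    img_views = [img_encoder(aug_img(images)) for _ in range(2)]   # two image views
    txt_views = [txt_encoder(aug_txt(texts)) for _ in range(2)]    # two text views
    losses = [contrastive_loss(i, t) for i in img_views for t in txt_views]
    return sum(losses) / len(losses)                               # 2 x 2 = 4 pairings
```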

2.3 Knowledge Distillation with Feature Alignment

KDEP leverages existing, robustly pre-trained models as teachers, directly distilling their feature space onto smaller or new student models. The approach aligns final-layer features using Singular Value Decomposition (SVD) and a Power Temperature Scaling (PTS) function for channel variance balance, allowing efficient transfer of representational power using only 10% of labeled data (He et al., 2022).
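The core distillation step can be sketched as follows, with the caveat that a single learned linear projection here stands in for the SVD-based dimension alignment and PTS scaling used in KDEP; the encoder modules and feature dimensions are placeholders.

```python
import torch
import torch.nn as nn

class FeatureDistiller(nn.Module):
    """Align a student's final-layer features to those of a frozen teacher."""
    def __init__(self, student: nn.Module, teacher: nn.Module,
                 student_dim: int, teacher_dim: int):
        super().__init__()
        self.student, self.teacher = student, teacher
        # Learned linear map standing in for KDEP's SVD/PTS feature alignment.
        self.align = nn.Linear(student_dim, teacher_dim)
        for p in self.teacher.parameters():
            p.requires_grad_(False)                    # teacher stays fixed

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        with torch.no_grad():
            f_t = self.teacher(x)                      # teacher features F^t(x)
        f_s = self.align(self.student(x))              # aligned student features F^s(x)
        return ((f_t - f_s) ** 2).mean()               # feature-alignment (MSE) loss
```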

2.4 Dynamic and Task-Aware Data Selection

  • Submodular optimization with “facility location” functions (Renduchintala et al., 2023) is employed to select diverse and representative pre-training subsets, yielding nearly 99% of full-data performance using only 25% of the data (a greedy selection sketch follows this list).
  • ATM (“Ask-LLM”) and Density Sampling explicitly optimize for data quality or feature-space coverage, often achieving better downstream results while rejecting over 90% of the data (Sachdeva et al., 15 Feb 2024).
  • Group-level data influence models (e.g., Group-MATES (Yu et al., 20 Feb 2025)) explicitly optimize the joint impact of subgroup data on loss reduction, rather than greedy individual scoring.
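A minimal greedy sketch of facility-location subset selection follows; it assumes a precomputed pairwise similarity matrix and omits the lazy-greedy and partitioning optimizations needed at scale, so names and sizes are illustrative only.

```python
import numpy as np

def facility_location_greedy(sim: np.ndarray, budget: int) -> list:
    """Greedily maximize f(S) = sum_i max_{j in S} sim[i, j].

    Because f is monotone submodular, the greedy solution enjoys the usual
    (1 - 1/e) approximation guarantee.
    """
    n = sim.shape[0]
    selected = []
    best_cover = np.zeros(n)                     # max similarity of each point to S
    for _ in range(budget):
        # Marginal gain of adding candidate j, computed for all j at once.
        gains = np.maximum(sim, best_cover[:, None]).sum(axis=0) - best_cover.sum()
        gains[selected] = -np.inf                # never re-select
        j = int(np.argmax(gains))
        selected.append(j)
        best_cover = np.maximum(best_cover, sim[:, j])
    return selected

# Example: choose 100 representative samples out of 1,000 candidates
emb = np.random.randn(1000, 64)
emb /= np.linalg.norm(emb, axis=1, keepdims=True)
subset = facility_location_greedy(emb @ emb.T, budget=100)
```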

2.5 Online and Dynamic Curation

SCAN exemplifies dynamic, loss-driven batch-level pruning for contrastive pre-training (e.g., CLIP or MoCo), iteratively culling both redundant (low-loss) and ill-matched (high-loss) samples using a cyclic, bootstrapping scheduler. This method reduces data required by 30–35% with <1% drop in accuracy (Guo et al., 14 Nov 2024).
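The batch-level idea can be sketched roughly as follows; the fixed pruning fractions and the absence of the cyclic bootstrapping schedule are simplifications, not SCAN's actual settings.

```python
import torch

def prune_batch_by_loss(per_sample_loss: torch.Tensor,
                        low_frac: float = 0.15,
                        high_frac: float = 0.15) -> torch.Tensor:
    """Return indices of samples to keep: drop the lowest-loss (redundant)
    and highest-loss (likely mismatched/noisy) fractions of the batch."""
    n = per_sample_loss.numel()
    order = torch.argsort(per_sample_loss)       # ascending loss
    lo, hi = int(n * low_frac), n - int(n * high_frac)
    return order[lo:hi]                          # keep the middle band

# Example: keep roughly the middle 70% of a 256-sample batch by loss
keep_idx = prune_batch_by_loss(torch.rand(256))
```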

2.6 Data-Selective Retrieval and Packing

Task-specific retrieval frameworks (e.g., SEPT) employ instance-level nearest-neighbor selection in a representation space, enabling models to pre-train on a reduced, distribution-aligned subset without label supervision, often with order-of-magnitude reductions in sample count for similar or better task performance (Lin et al., 2022).
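Instance-level retrieval of this kind can be sketched with plain cosine similarity, as below; production systems typically use an approximate nearest-neighbor index (e.g., FAISS), and the encoder producing the embeddings is assumed rather than taken from SEPT.

```python
import numpy as np

def retrieve_pretraining_subset(task_emb: np.ndarray,
                                corpus_emb: np.ndarray,
                                k: int) -> np.ndarray:
    """For each task example, retrieve its k nearest corpus examples by cosine
    similarity; the de-duplicated union forms the distribution-aligned subset."""
    t = task_emb / np.linalg.norm(task_emb, axis=1, keepdims=True)
    c = corpus_emb / np.linalg.norm(corpus_emb, axis=1, keepdims=True)
    sims = t @ c.T                               # (n_task, n_corpus) similarities
    topk = np.argsort(-sims, axis=1)[:, :k]      # nearest neighbors per task example
    return np.unique(topk)                       # unique corpus indices to pre-train on

# Example: 500 task samples each retrieving 50 neighbors from a 100k corpus
subset_ids = retrieve_pretraining_subset(np.random.randn(500, 256),
                                         np.random.randn(100_000, 256), k=50)
```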

2.7 Architectural and Curriculum Innovations

ELLE adaptively grows model width and depth in a function-preserving manner combined with domain prompt conditioning, allowing PLMs to incrementally ingest new data while reducing catastrophic forgetting and computational overhead (Qin et al., 2022). CASE-BERT and related models utilize high-quality curricular or expert-crafted data for initial pre-training in sensitive, low-resource domains (e.g., mental health), achieving high F1 with radically reduced dataset size (Harne et al., 1 Jun 2024).
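As an illustration of function-preserving growth in general (a Net2Net-style width expansion, not ELLE's specific expansion operator or its prompt conditioning), the sketch below widens a pair of stacked linear layers without changing the function they compute, assuming an elementwise activation between them.

```python
import torch
import torch.nn as nn

def widen_linear_pair(fc1: nn.Linear, fc2: nn.Linear, new_width: int):
    """Function-preserving width expansion: duplicate hidden units of fc1 and
    rescale the corresponding fan-out weights of fc2 so the composition
    fc2(act(fc1(x))) is unchanged for any elementwise activation `act`."""
    old_width = fc1.out_features
    extra = torch.randint(0, old_width, (new_width - old_width,))   # units to copy
    mapping = torch.cat([torch.arange(old_width), extra])

    new_fc1 = nn.Linear(fc1.in_features, new_width)
    new_fc1.weight.data = fc1.weight.data[mapping].clone()
    new_fc1.bias.data = fc1.bias.data[mapping].clone()

    counts = torch.bincount(mapping, minlength=old_width).float()   # replication counts
    new_fc2 = nn.Linear(new_width, fc2.out_features)
    new_fc2.weight.data = fc2.weight.data[:, mapping] / counts[mapping]
    new_fc2.bias.data = fc2.bias.data.clone()
    return new_fc1, new_fc2
```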

3. Quantitative Outcomes and Performance Trade-offs

Data-efficient pre-training strategies generally achieve the following (select examples):

| Method | Data Usage | Downstream/Transfer Performance | Speed/Compute Reduction |
|---|---|---|---|
| CLESS | 60 MB (“task-internal”) | Outperforms RoBERTa (160 GB) in zero/few-shot and generalization | 1/5th of RoBERTa fine-tuning time |
| DeCLIP | 56–88M pairs (~1/7 of CLIP) | +2.9% zero-shot Top-1 accuracy (ImageNet) vs. CLIP (400M) | More efficient scaling |
| KDEP | 10x less data (ImageNet-1K) | Matched/exceeded supervised baselines on 9 datasets | 5x less training time |
| Ingenious | 25% selected subset | ~99% of full-model GLUE/F1 performance | ~70% savings in cost/time |
| ATM/Density | 10% (ATM), 20% (Density) | ATM exceeds full-data; Density recovers full-data performance | 70% faster convergence |
| SCAN | Prunes 30%+ of data | <1% accuracy loss; often outperforms static coresets | ~25–30% reduction in wall time |
| MedDEL | 5% of images (medical) | mIoU matches or marginally lags full-data | Massive storage/speed gain |

Such outcomes indicate that distinguishing informative samples from redundant or poorly matched (“hard”) ones is key, and that model and architecture decisions should be coupled with data curation for optimal benefit.

4. Domain-Specific Adaptations and Modalities

Data-efficient pre-training approaches span a wide variety of domains, each tailored to modality-specific challenges:

  • Language: Submodular selection, curricula built from expert-curated texts, density/coverage in embedding space, and function-preserving model expansion all enable efficient scale-up and maintenance of LLMs as corpora grow.
  • Vision & Multimodal: Data-efficient CLIP variants (DeCLIP, FLAME) leverage enhanced supervision cues and prompt engineering to maximize transfer with less cross-modal data. Retrieval frameworks and dynamic pruning (SCAN) yield better generalization for the same or less data in image-text representation learning.
  • Event Cameras / Non-standard Sensing: SSL with semantic-uniform masking and disentangled decoders (local/global branches) overcome unique challenges of sparsity and non-uniformity in event-based data, avoiding wasteful conversion to 2D frames (Huang et al., 1 Mar 2024).
  • Medical & Sensitive Domains: Aggressive filtering, clustering, and domain expert curation (as with MedDEL or CASE-BERT) address both sample efficiency and data privacy constraints.
  • Code LMs: Obfuscation grounding (ObscuraCoder) and sequence-to-sequence translation between code and its obfuscated forms facilitate more effective disentanglement of syntax and semantics, reducing the reliance on massive code datasets and enhancing generalization (Paul et al., 27 Mar 2025).
  • Performance Modeling: In code performance learning, self-supervised pre-training of autoencoders drastically reduces labeled data requirements by learning useful program representations before downstream prediction (Liu et al., 24 Jan 2025); a minimal autoencoder sketch follows this list.
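A minimal sketch of that general recipe, unlabeled reconstruction pre-training of an encoder that is later reused for performance prediction, is shown below; the feature dimensions and architecture are illustrative, not those of the cited work.

```python
import torch
import torch.nn as nn

class ProgramAutoencoder(nn.Module):
    """Self-supervised pre-training: reconstruct program feature vectors so the
    encoder learns representations reusable for downstream performance models."""
    def __init__(self, in_dim: int, hidden_dim: int = 64):
        super().__init__()
        self.encoder = nn.Sequential(nn.Linear(in_dim, hidden_dim), nn.ReLU(),
                                     nn.Linear(hidden_dim, hidden_dim))
        self.decoder = nn.Sequential(nn.Linear(hidden_dim, hidden_dim), nn.ReLU(),
                                     nn.Linear(hidden_dim, in_dim))

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        z = self.encoder(x)                          # learned program representation
        return ((self.decoder(z) - x) ** 2).mean()   # reconstruction loss (no labels)

# One unlabeled pre-training step on illustrative 128-dim program features
model = ProgramAutoencoder(in_dim=128)
loss = model(torch.randn(256, 128))
loss.backward()
```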

5. Algorithmic and Theoretical Considerations

Several mathematical and algorithmic formalisms underpin data-efficient pre-training:

  • Contrastive Loss (NCE/InfoNCE):

\mathcal{L}_{clf} = -\frac{1}{|D_t|} \sum_{i=1}^{|D_t|} \log \frac{\exp(f(I_i)^T g(T_i)/\tau)}{\sum_{j=1}^{|D_t|} \exp(f(I_i)^T g(T_j)/\tau)}

Loss-based dynamic pruning (e.g., in SCAN) evaluates this per-sample loss to decide which samples to retain.

  • Submodularity:

f_{FL}(S) = \sum_{i \in V} \max_{j \in S} s_{ij}

Used to formalize the facility location subset selection problem (Renduchintala et al., 2023).

  • Feature Alignment Loss (KDEP):

L = \frac{1}{N_u} \sum_{i=1}^{N_u} \| F^t(x_{u_i}) - F^s(x_{u_i}) \|^2

  • Autoregressive and Translation Objectives in Code (ObscuraCoder):

The model is trained to predict one code form from the other, closely coupled with syntax-aware token manipulation and sentinel tokens to mark translation direction.
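As a purely illustrative example of a sentinel-marked translation objective at the data level (the actual tokens and formatting used by ObscuraCoder are not reproduced here; `<obf2src>`, `<src2obf>`, and `<sep>` are hypothetical):

```python
def build_translation_example(source: str, obfuscated: str, direction: str) -> str:
    """Concatenate the two code forms with hypothetical sentinel tokens so an
    autoregressive LM learns to map one form onto the other."""
    if direction == "obf2src":
        return f"<obf2src> {obfuscated} <sep> {source}"
    return f"<src2obf> {source} <sep> {obfuscated}"

# Example with illustrative identifier obfuscation
print(build_translation_example("def add(a, b):\n    return a + b",
                                "def f0(v0, v1):\n    return v0 + v1",
                                "obf2src"))
```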

These algorithmic bases allow theoretical analysis of efficiency/coverage trade-offs, set-function optimization guarantees, and inspection of learned representation structure (e.g., compactness, domain alignment).

6. Practical Implications and Deployment

Data-efficient pre-training offers several practical and operational advantages:

  • Resource Savings: Reduction of computation, memory, and environmental impact (CO₂ emissions) by orders of magnitude.
  • Speed: Faster convergence (70%+ improvements reported), enabling more frequent model updates and experimentation.
  • Deployability: Effective pre-training on low-resource or private/sensitive domains, including mental health, medicine, and specialized industrial settings.
  • Improved Generalization and Fairness: Enhanced long-tail and minority class generalization, preservation of crucial features in scarce data, and better model fairness.
  • Open-Source Ecosystem: Recent works (e.g., Open-Qwen2VL) emphasize full release of code, data curation methods, packing pipelines, and model checkpoints, lowering research and reproducibility barriers (Wang et al., 1 Apr 2025).

Potential trade-offs include modest drops in aggregate performance (typically <2%), which are often justified by the efficiency gains, and the upfront engineering effort required to implement non-standard data selection and management strategies.

7. Future Research Directions

Continued advances in data-efficient pre-training are expected in:

  • Adaptive/Evolving Selection: Real-time data selection and re-weighting via online bandits, sample-level or group-level influence modeling (Albalak et al., 2023, Yu et al., 20 Feb 2025).
  • Modality Expansion: Extension of core ideas to richer modalities (video, graph, multi-modal medical data).
  • Automated Curation Pipelines: Integration of similarity metrics, automatic clustering, and unsupervised or LLM-driven quality scoring to autonomously assemble compact, information-rich datasets suitable for continual or federated setups.
  • Theory and Analysis: Deeper study of the theoretical limits of self-supervision signal scaling and the formal properties of learned representations for robust, low-shot generalization.
  • Interplay with Model Scaling: Optimizing data efficiency as a function of model size, architecture, and loss shaping, alongside careful study of the scaling laws for data and compute in various domains.

Collaborative benchmarking and the open release of efficient yet high-performing models, data, and code are becoming standard, accelerating progress and democratizing access in this field.


In summary, data-efficient pre-training encompasses a set of architectural, objective-driven, and data-centric advances that optimize model quality per sample of data and compute. The field has demonstrated that, through the use of contrastive objectives, intelligent data selection, dynamic curation, and domain-adaptive techniques, state-of-the-art results in both general and specialized applications can be achieved with dramatically less data and energy expenditure. The continued evolution of these methods is expected to drive scalable, sustainable, and accessible machine learning research and deployment.