Data-Centric Pre-Training Recipe
- A data-centric pre-training recipe is a systematic approach to curating, filtering, and weighting data that maximizes model generalization and task performance.
- It integrates adaptive selection methods, such as heuristic and similarity-based filtering, with modular pipelines to process heterogeneous data sources.
- Empirical results demonstrate significant gains in accuracy and efficiency, with improvements observed across video-language, vision, and domain-specific benchmarks.
A data-centric pre-training recipe is a systematic approach to curating, filtering, and weighting input data prior to large-scale model training, with the objective of maximizing downstream generalization and task performance. This concept encompasses methodologies for selecting source corpora, adapting datasets to target domains, and integrating modular pipelines for flexible and efficient construction and evaluation of data mixtures (also termed "recipes" in recent systems). Data-centricity stands in contrast to model-centric paradigms by prioritizing dataset composition and structure over architecture or hyperparameter choices. Contemporary research spans video-language modeling, large language models, vision and robotics, procedural reasoning, code understanding, and radiology, with methods grounded in both heuristic and automatic techniques. This article presents the principal ideas, implementation details, technical challenges, and applications of data-centric pre-training recipes as established in the literature.
1. Motivation and Definition
Data-centric pre-training recipes arise from the observation that large-scale model efficacy is closely coupled with the structure, diversity, and relevance of training data. In contrast to indiscriminate use of exhaustive corpora, structured selection and adaptation can bridge domain gaps (e.g., cooking vs. movies (Zhou et al., 2021)), boost generalizability, and increase compute and sample efficiency. The recipes formalize dataset blending, curation, and weighting as an explicit, iterative design space—where the “mixture” of datasets (e.g., web text, dialogues, code, academic papers) and operator-driven transformation pipelines are modular and dynamically configurable (Chen et al., 2023). This paradigm is validated through empirical demonstrations across language, video-language, and vision domains, underscoring the importance of strategic data design over scale alone.
2. Adaptive Curation and Filtering Strategies
Recent frameworks such as CUPID (Zhou et al., 2021) systematize adaptive data curation via two main strategies:
- Heuristic Adaptive Pre-training (HAP): Filters based on meta-data (e.g., category, title overlap). Videos labeled “Food and Entertaining,” with metadata overlapping downstream targets, are selected for cooking applications.
- Similarity-based Adaptive Pre-training (SAP): Embeds all source and target videos into a common space and computes a similarity matrix $S = E_s E_t^{\top}$, where $E_s$ and $E_t$ are the clip embedding matrices of the source and target videos. Two selection variants are employed: averaged similarity (mean pooling of the columns of $S$) and K-nearest neighbors (selecting the highest-scoring source videos), as sketched in the code after this list.
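The selection step can be illustrated with a short sketch. The code below is a minimal, illustrative implementation of SAP-style selection, assuming pre-extracted, L2-normalized clip embeddings; the array names, the 1.2% budget, and the toy data are assumptions for illustration, not details from CUPID.

```python
# Minimal sketch of similarity-based adaptive pre-training (SAP) selection.
# Assumes clip embeddings are already extracted and L2-normalized; names,
# the budget fraction, and the toy data below are illustrative only.
import numpy as np

def select_source_videos(E_src, E_tgt, budget_frac=0.012, k=None):
    """Return indices of source videos most similar to the target domain.

    E_src: (n_src, d) clip embedding matrix for candidate source videos
    E_tgt: (n_tgt, d) clip embedding matrix for target-domain videos
    budget_frac: fraction of the source corpus to keep
    k: if set, use K-nearest-neighbor scoring instead of mean pooling
    """
    # Cosine similarity between every source/target pair (rows: source).
    S = E_src @ E_tgt.T                      # (n_src, n_tgt)

    if k is None:
        # Variant 1: averaged similarity -- mean-pool over target columns.
        scores = S.mean(axis=1)
    else:
        # Variant 2: KNN -- score each source video by its k highest
        # similarities to any target-domain video.
        scores = np.sort(S, axis=1)[:, -k:].mean(axis=1)

    n_keep = max(1, int(budget_frac * E_src.shape[0]))
    return np.argsort(scores)[::-1][:n_keep]

# Toy usage with random embeddings standing in for real video features.
rng = np.random.default_rng(0)
E_src = rng.normal(size=(10_000, 256)); E_src /= np.linalg.norm(E_src, axis=1, keepdims=True)
E_tgt = rng.normal(size=(500, 256));    E_tgt /= np.linalg.norm(E_tgt, axis=1, keepdims=True)
kept = select_source_videos(E_src, E_tgt, budget_frac=0.012, k=16)
print(len(kept), "source videos retained")
```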
Curating source sets down to as little as 1.2–1.3% of the full HowTo100M corpus yields consistent gains, with BLEU-3 increases from 11.57 to 14.42 and METEOR from 13.92 to 16.47 for video captioning. Recall@1 on YouCook2 retrieval rises from 15.8% to 17.67%. Out-of-domain QA and retrieval tasks see 3–7% improvements while using 80% fewer videos.
3. Modular Data Recipe Pipelines
Data-Juicer (Chen et al., 2023) extends data-centric curation with modular processing pipelines, building “recipes” from heterogeneous input sources. The architecture features:
| Component | Function | Example Operator |
|---|---|---|
| Formatter | Ingest/unify raw formats | .txt, .json, code |
| Mapper | Edit text in place | Clean headers |
| Filter | Select/discard samples | Toxicity, length |
| Deduplicator | Remove repeats | Hash/vector based |
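To make the operator taxonomy above concrete, the following is a hedged sketch of a tiny recipe pipeline. The function names and the JSONL ingestion format are hypothetical and do not correspond to Data-Juicer's actual API; only the Formatter → Mapper → Filter → Deduplicator composition mirrors the table.

```python
# Illustrative sketch of a modular recipe pipeline in the spirit of
# Data-Juicer's operator taxonomy. Names and formats are hypothetical.
import hashlib
import json
import re

def format_jsonl(path):
    """Formatter: ingest raw .jsonl and unify into {'text': ...} samples."""
    with open(path) as f:
        for line in f:
            yield {"text": json.loads(line)["text"]}

def clean_headers(sample):
    """Mapper: edit text in place, e.g. strip boilerplate header lines."""
    sample["text"] = re.sub(r"^(Subject|From|To):.*$", "", sample["text"], flags=re.M).strip()
    return sample

def length_filter(sample, min_chars=50, max_chars=20_000):
    """Filter: keep samples within a length band."""
    return min_chars <= len(sample["text"]) <= max_chars

def deduplicate(samples):
    """Deduplicator: hash-based exact dedup (vector-based dedup omitted)."""
    seen = set()
    for s in samples:
        h = hashlib.md5(s["text"].encode()).hexdigest()
        if h not in seen:
            seen.add(h)
            yield s

def run_recipe(path):
    """Compose the operators into one streaming recipe."""
    stream = (clean_headers(s) for s in format_jsonl(path))
    stream = (s for s in stream if length_filter(s))
    yield from deduplicate(stream)
```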
Weighted data mixing enables recipe optimization by source, with integration of hyperparameter optimization (e.g., Bayesian methods) over the mixture weights; a simplified search loop is sketched below. Recipes are iteratively refined to maximize target metrics defined over the total token budget $N$, the number of mixed tokens $N_{\text{mix}}$, and a classifier-derived mean quality score $\bar{q}$. Data-Juicer demonstrated a 7.45% average score improvement across 16 HELM benchmarks and a 17.5% higher GPT-4 win rate in instruction following.
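A simplified version of the mixture-weight search can be sketched as follows. Data-Juicer couples such loops with Bayesian optimizers; the random Dirichlet search, the source names, and the `evaluate_recipe` callable (standing in for proxy-model training plus benchmark evaluation) are assumptions for illustration.

```python
# Hedged sketch of mixture-weight search over data sources. The source pool,
# the Dirichlet proposal, and evaluate_recipe() are placeholders; real systems
# would plug in a Bayesian optimizer and a proxy-model evaluation.
import numpy as np

SOURCES = ["web_text", "dialogue", "code", "papers"]   # hypothetical pools

def random_search_weights(evaluate_recipe, n_trials=20, seed=0):
    """Search for mixture weights that maximize a downstream proxy metric."""
    rng = np.random.default_rng(seed)
    best_w, best_score = None, -np.inf
    for _ in range(n_trials):
        w = rng.dirichlet(np.ones(len(SOURCES)))        # candidate mixture
        score = evaluate_recipe(dict(zip(SOURCES, w)))  # proxy metric
        if score > best_score:
            best_w, best_score = w, score
    return dict(zip(SOURCES, best_w)), best_score
```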
4. Task-Sensitive and Domain-Focused Blending
Pre-training objectives are sensitive to domain discrepancies (Zhou et al., 2021): reconstructive objectives (MLM, HERO) drop from 54.51% accuracy (in-domain) to 31.17% (TV domain), while contrastive objectives degrade to roughly 0.5% recall off-domain. Data-centric recipes mitigate these drops by maximizing the match between source and target distributions, either via structure-aware filtering (metadata, similarity) or via modular recipe construction (as in Data-Juicer).
Specialized strategies emerge in non-language domains. SlotMIM (Wen et al., 10 Mar 2025) for robotics introduces a "semantic bottleneck" (fewer prototypes, e.g., 512–1024 vs. 8192) and cross-view consistency regularization to induce object-centric features, overcoming failures of DINO/iBOT on non-object-centric datasets. MedCutMix (Wang et al., 20 Sep 2025) augments radiology VLP by mixing disease-relevant sentences and using cross-attention maps to guide attentive manifold mixing within images; a schematic sketch follows.
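The following is a schematic sketch of cross-attention-guided feature mixing in the spirit of MedCutMix. The single-head attention, tensor shapes, and mixing rule are simplifications chosen for illustration and should not be read as the paper's exact algorithm.

```python
# Schematic sketch of cross-attention-guided manifold mixing between two
# images' patch features, weighted by where a disease-relevant sentence
# attends. Shapes and the mixing rule are illustrative simplifications.
import torch
import torch.nn.functional as F

def cross_attention_map(text_emb, patch_feats):
    """text_emb: (d,), patch_feats: (n_patches, d) -> attention over patches."""
    logits = patch_feats @ text_emb / patch_feats.shape[-1] ** 0.5
    return F.softmax(logits, dim=0)                      # (n_patches,)

def attention_guided_mix(patches_a, patches_b, text_emb_b, lam=0.5):
    """Blend image A's patches with image B's disease-relevant patches,
    weighting B's contribution by its cross-attention to B's sentence."""
    attn_b = cross_attention_map(text_emb_b, patches_b)  # where B's text attends
    w = lam * attn_b.unsqueeze(-1)                       # per-patch mix weight
    return (1 - w) * patches_a + w * patches_b

# Toy usage with random features standing in for encoder outputs.
d, n = 128, 196
patches_a, patches_b = torch.randn(n, d), torch.randn(n, d)
text_emb_b = torch.randn(d)
mixed = attention_guided_mix(patches_a, patches_b, text_emb_b, lam=0.7)
print(mixed.shape)  # torch.Size([196, 128])
```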
5. Evaluation Metrics and Performance Gains
Data-centric pre-training recipes are validated through robust gains on downstream benchmarks. For example, curated video–language data (Zhou et al., 2021) improved YouCook2 caption BLEU-3 by ~25%, retrieval Recall@1 by 1.87 absolute percentage points, and out-of-domain QA by 3–7%. Data-Juicer (Chen et al., 2023) models increased HELM average scores by 7.45% and GPT-4 win rates by 17.5%. MedCutMix (Wang et al., 20 Sep 2025) raised CheXpert 5-class AUC from 0.8452 to 0.8535 and F1 from 0.4491 to 0.4638.
6. Comparative Analysis and Trade-Offs
Data-centric recipes surpass random sampling and legacy pre-training approaches in both efficiency and generalization. Efficient continued pre-training (Parmar et al., 9 Jul 2024) switches from generic blends to QA-focused distributions at an adaptive point tied to the learning-rate schedule, resulting in a 9% accuracy improvement over naive baselines; a minimal sketch of such a switch appears below. The "reuse, don't retrain" strategy demonstrates compute and sample efficiency, validating the reuse of already pre-trained models as a basis for further improvement.
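A minimal sketch of a learning-rate-tied blend switch follows, assuming a decayed schedule. The 10%-of-peak threshold and the blend proportions are illustrative assumptions, not the exact schedule from Parmar et al.

```python
# Minimal sketch of switching the data blend at a learning-rate-tied point.
# The threshold and blend proportions below are illustrative assumptions.
GENERAL_BLEND = {"web": 0.7, "code": 0.2, "papers": 0.1}   # early-phase mixture
QA_BLEND      = {"qa": 0.6, "web": 0.3, "papers": 0.1}     # late-phase mixture

def pick_blend(current_lr, peak_lr, switch_frac=0.1):
    """Use the general blend early in training; once the decayed learning
    rate falls below switch_frac * peak_lr, shift sampling toward the
    QA-focused blend."""
    return QA_BLEND if current_lr < switch_frac * peak_lr else GENERAL_BLEND
```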
Methods such as Sieve-&-Swap (Batra et al., 2023) and order-based pre-training (Nandy et al., 6 Apr 2024) focus on filtering low-quality examples and training with structure-aware (e.g., permutation- or embedding-based) objectives, further reinforcing the data-centric paradigm. Obfuscation grounding in Code-LMs (Paul et al., 27 Mar 2025) demonstrates robustness to syntactic variation and improves library-oriented code generation relative to vanilla autoregressive or de-obfuscation objectives.
Limitations of data-centric approaches include dependence on the quality and granularity of metadata, the computational cost of embedding-based similarity computation, and potential underrepresentation of important variation if filtering thresholds are miscalibrated. Dynamic, feedback-driven recipe optimization is required to prevent excessive narrowing or loss of diversity.
7. Future Directions and Societal Implications
Recent position papers argue that scaling LLMs will increasingly depend on transparent, systematic data curation and benchmarking rather than exclusive focus on architectural improvements (Xu et al., 20 Jun 2024). Research directions outlined involve:
- Development of data-centric benchmarks that isolate gains due to data composition, curation, and diversity.
- Statistical optimization of domain selection (e.g., maximizing distributional divergence) and diversity within domains (with methods from DPPs, coreset selection).
- Data valuation for domain adaptation (influence functions, Shapley values).
- Structured knowledge transfer and inference contextualization (e.g., efficient retrieval-based generation, in-context example optimization).
- Enhanced attribution, privacy, and unlearning mechanisms for responsible AI deployment.
A plausible implication is that the discipline will move toward dynamic, reproducible data recipes where configuration and evaluation are accessible, iteration is rapid, and domain adaptation is informed by both quantitative and qualitative metrics.
References to Key Papers
- CUPID: Adaptive Curation of Pre-training Data for Video-and-Language Representation Learning (Zhou et al., 2021)
- Data-Juicer: A One-Stop Data Processing System for LLMs (Chen et al., 2023)
- MedCutMix: A Data-Centric Approach to Improve Radiology Vision-Language Pre-training with Disease Awareness (Wang et al., 20 Sep 2025)
- A Data-Centric Revisit of Pre-Trained Vision Models for Robot Learning (Wen et al., 10 Mar 2025)
- Reuse, Don't Retrain: A Recipe for Continued Pretraining of LLMs (Parmar et al., 9 Jul 2024)
- Data-Centric AI in the Age of LLMs (Xu et al., 20 Jun 2024)
This synthesis reflects the multi-disciplinary convergence toward data-centricity in pre-training, documenting modular, systematically optimized recipes as essential for advancing model robustness, generalizability, and efficiency in diverse AI domains.