Data-Centric Pre-Training Recipe

Updated 2 October 2025
  • A data-centric pre-training recipe is a systematic approach to curating, filtering, and weighting data so as to maximize model generalization and downstream task performance.
  • It integrates adaptive selection methods, such as heuristic and similarity-based filtering, with modular pipelines that process heterogeneous data sources.
  • Empirical results demonstrate significant gains in accuracy and efficiency across video-language, vision, and domain-specific applications.

A data-centric pre-training recipe is a systematic approach to curating, filtering, and weighting input data prior to large-scale model training, with the objective of maximizing downstream generalization and task performance. This concept encompasses methodologies for selecting source corpora, adapting datasets to target domains, and integrating modular pipelines for flexible and efficient construction and evaluation of data mixtures (also termed "recipes" in recent systems). Data-centricity stands in contrast to model-centric paradigms by prioritizing dataset composition and structure over architecture or hyperparameter choices. Contemporary research spans video–language modeling, LLMs, vision and robotics, procedural reasoning, code understanding, and radiology, with methods grounded in both heuristic and automatic techniques. This article presents the principal ideas, implementation details, technical challenges, and applications of data-centric pre-training recipes as established in the literature.

1. Motivation and Definition

Data-centric pre-training recipes arise from the observation that large-scale model efficacy is closely coupled with the structure, diversity, and relevance of training data. In contrast to indiscriminate use of exhaustive corpora, structured selection and adaptation can bridge domain gaps (e.g., cooking vs. movies (Zhou et al., 2021)), boost generalizability, and increase compute and sample efficiency. Such recipes formalize dataset blending, curation, and weighting as an explicit, iterative design space in which the "mixture" of datasets (e.g., web text, dialogues, code, academic papers) and the operator-driven transformation pipelines are modular and dynamically configurable (Chen et al., 2023). This paradigm is validated through empirical demonstrations across language, video-language, and vision domains, underscoring the importance of strategic data design over scale alone.

2. Adaptive Curation and Filtering Strategies

Recent frameworks such as CUPID (Zhou et al., 2021) systematize adaptive data curation via two main strategies:

  • Heuristic Adaptive Pre-training (HAP): Filters sources using metadata (e.g., category labels, title overlap with downstream targets). For cooking applications, videos labeled "Food and Entertaining" whose metadata overlaps the downstream targets are selected.
  • Similarity-based Adaptive Pre-training (SAP): Embeds all source and target videos into a common space and computes similarity via:

K_{ji} = \mathrm{Mean}\left( (M_t^j)^\top M_s^i \right)

where M_s^i and M_t^j are the clip-embedding matrices of source video i and target video j. Two selection variants are employed: averaged similarity (mean pooling over the columns of K) and K-nearest neighbors (selecting the highest-scoring source videos for each target).
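To make the selection step concrete, the following Python sketch computes the clip-level similarity matrix and both selection variants. It is a minimal illustration under stated assumptions: embeddings are stored as (clips × dim) NumPy arrays from a shared encoder, and the function and variable names are hypothetical rather than taken from the CUPID codebase.

```python
import numpy as np

def sap_scores(source_embs, target_embs):
    """Mean clip-level similarity K[j, i] between target video j and source video i.

    source_embs: list of (n_clips, dim) arrays, one per source video (M_s^i)
    target_embs: list of (n_clips, dim) arrays, one per target video (M_t^j)
    """
    K = np.zeros((len(target_embs), len(source_embs)))
    for j, Mt in enumerate(target_embs):
        for i, Ms in enumerate(source_embs):
            K[j, i] = (Mt @ Ms.T).mean()  # mean over all target-clip / source-clip pairs
    return K

def select_sources(K, budget, mode="avg"):
    """Choose source videos for adaptive pre-training.

    mode="avg": rank sources by similarity averaged over all target videos.
    mode="knn": take the union of each target's top-`budget` source videos.
    """
    if mode == "avg":
        return np.argsort(-K.mean(axis=0))[:budget].tolist()
    selected = set()
    for j in range(K.shape[0]):
        selected.update(np.argsort(-K[j])[:budget].tolist())
    return sorted(selected)
```

Either variant returns indices into the source corpus, which can then be subsampled to a small fraction of the full collection before pre-training.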

Curating source sets down to as little as 1.2–1.3% of the full HowTo100M corpus yields consistent gains, with BLEU-3 increases from 11.57 to 14.42 and METEOR from 13.92 to 16.47 for video captioning. Recall@1 on YouCook2 retrieval rises from 15.8% to 17.67%. Out-of-domain QA and retrieval tasks see 3–7% improvements while using 80% fewer videos.

3. Modular Data Recipe Pipelines

Data-Juicer (Chen et al., 2023) extends data-centric curation with modular processing pipelines, building “recipes” from heterogeneous input sources. The architecture features:

Component    | Function                 | Example Operator
-------------|--------------------------|------------------
Formatter    | Ingest/unify raw formats | .txt, .json, code
Mapper       | Edit text in place       | Clean headers
Filter       | Select/discard samples   | Toxicity, length
Deduplicator | Remove repeats           | Hash/vector based
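Read as software, the table describes a pipeline of composable operators. The sketch below illustrates that pattern in plain Python; it does not use the actual Data-Juicer API, and the operator names, signatures, and specific cleaning rules are illustrative assumptions.

```python
import hashlib
import json

def formatter(raw_records):
    """Ingest heterogeneous inputs (.txt lines, .json records) into {'text': ...} samples."""
    for rec in raw_records:
        if isinstance(rec, str):
            yield {"text": rec}
        else:
            yield {"text": rec["text"] if "text" in rec else json.dumps(rec)}

def mapper(samples):
    """Edit text in place, e.g. strip a boilerplate header and surrounding whitespace."""
    for s in samples:
        s["text"] = s["text"].replace("=== HEADER ===", "").strip()
        yield s

def length_filter(samples, min_chars=20, max_chars=20_000):
    """Discard samples outside a length window (a stand-in for toxicity/quality filters)."""
    for s in samples:
        if min_chars <= len(s["text"]) <= max_chars:
            yield s

def deduplicator(samples):
    """Drop exact repeats via content hashing (vector-based near-dedup would slot in here)."""
    seen = set()
    for s in samples:
        h = hashlib.md5(s["text"].encode("utf-8")).hexdigest()
        if h not in seen:
            seen.add(h)
            yield s

def run_recipe(raw_records):
    """Compose the operators into one data 'recipe'."""
    return list(deduplicator(length_filter(mapper(formatter(raw_records)))))
```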

Weighted data mixing enables recipe optimization by source, with hyperparameter optimization (e.g., Bayesian methods) applied to the mixture weights. Recipes are iteratively refined to maximize a target metric such as:

\text{Target metric} = \frac{n}{N} + s

where N is the total number of tokens, n the number of tokens in the mixture, and s a classifier-derived mean quality score. Data-Juicer demonstrated a 7.45% average score improvement across 16 HELM benchmarks and a 17.5% higher GPT-4 win rate in instruction following.
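As one concrete reading of this objective, the sketch below scores a candidate mixture by combining its token share with a quality term, yielding a scalar that a Bayesian hyperparameter search over mixture weights could maximize. The interfaces and the weight-averaged quality term are assumptions, not Data-Juicer's actual implementation.

```python
from typing import Dict

def target_metric(token_counts: Dict[str, int],
                  mixture_weights: Dict[str, float],
                  quality_scores: Dict[str, float]) -> float:
    """Score a data mixture as n/N + s.

    token_counts:    tokens available per source corpus
    mixture_weights: fraction of each corpus kept in the mix (0..1)
    quality_scores:  classifier-derived mean quality per corpus (assumed given)
    """
    N = sum(token_counts.values())  # total tokens across all sources
    n = sum(token_counts[k] * mixture_weights.get(k, 0.0) for k in token_counts)  # mixed tokens
    s = (sum(quality_scores[k] * mixture_weights.get(k, 0.0) for k in quality_scores)
         / max(sum(mixture_weights.values()), 1e-9))  # weight-averaged quality (assumed scheme)
    return n / N + s

# Example: this scalar can be handed to any hyperparameter optimizer
# (e.g., a Bayesian optimization library) that proposes mixture_weights.
score = target_metric(
    token_counts={"web": 8_000_000, "code": 1_500_000, "papers": 500_000},
    mixture_weights={"web": 0.4, "code": 0.8, "papers": 1.0},
    quality_scores={"web": 0.55, "code": 0.7, "papers": 0.85},
)
```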

4. Task-Sensitive and Domain-Focused Blending

Pre-training objectives are sensitive to domain discrepancies (Zhou et al., 2021); reconstructive objectives (MLM, HERO) drop from 54.51% accuracy (in-domain) to 31.17% (TV domain), with contrastive objectives showing near-0.5% recall off-domain. Data-centric recipes mitigate these drops by maximizing match between source and target distributions—either via structure-aware filtering (metadata, similarity) or by modular recipe construction (as in Data-Juicer).

Specialized strategies emerge in non-language domains. SlotMIM (Wen et al., 10 Mar 2025) for robotics introduces a “semantic bottleneck” (fewer prototypes, e.g., 512–1024 vs. 8192) and cross-view consistency regularization to induce object-centric features, overcoming failures of DINO/iBOT on non-object-centric datasets. MedCutMix (Wang et al., 20 Sep 2025) augments radiology VLP by mixing disease-relevant sentences and guiding attentive manifold mixing within images, using cross-attention maps:

C_i = \sum \mathrm{softmax}\!\left( \frac{E_{\mathrm{img},i} \, E_{\mathrm{sent},i}^{\top}}{\tau_1} \right)
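The sketch below shows one way such a cross-attention saliency map could be computed, assuming E_img holds image patch embeddings and E_sent holds sentence-token embeddings from the two encoders; the softmax axis and the downstream mixing steps are assumptions, and MedCutMix's actual procedure may differ.

```python
import numpy as np

def cross_attention_map(E_img: np.ndarray, E_sent: np.ndarray, tau: float = 0.07) -> np.ndarray:
    """Compute C_i: per-patch attention summed over sentence tokens.

    E_img:  (num_patches, dim) image patch embeddings for sample i
    E_sent: (num_tokens, dim) sentence token embeddings for sample i
    Returns a (num_patches,) saliency vector that could guide attentive manifold mixing.
    """
    logits = (E_img @ E_sent.T) / tau                       # (patches, tokens) similarity
    attn = np.exp(logits - logits.max(axis=0, keepdims=True))
    attn = attn / attn.sum(axis=0, keepdims=True)           # softmax over patches, per token
    return attn.sum(axis=1)                                 # sum over sentence tokens
```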

5. Evaluation Metrics and Performance Gains

Data-centric pre-training recipes are validated through robust gains on downstream benchmarks. For example, curated video–language data (Zhou et al., 2021) improved YouCook2 caption BLEU-3 by ~25%, retrieval Recall@1 by 1.87 absolute percentage points, and out-of-domain QA by 3–7%. Data-Juicer (Chen et al., 2023) models increased HELM average scores by 7.45% and GPT-4 win rates by 17.5%. MedCutMix (Wang et al., 20 Sep 2025) raised CheXpert 5-class AUC from 0.8452 to 0.8535 and F1 from 0.4491 to 0.4638.

6. Comparative Analysis and Trade-Offs

Data-centric recipes surpass random sampling and legacy pre-training approaches in both efficiency and generalization. Efficient continued pre-training (Parmar et al., 9 Jul 2024) switches from generic blends to QA-focused distributions at an adaptive point tied to the learning rate (e.g., η_switch = η_max,ct / 5), resulting in a 9% accuracy improvement over naive baselines. The "reuse, don't retrain" strategy demonstrates compute and sample efficiency, validating the reuse of existing pre-trained models as a basis for further improvement.
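A minimal sketch of the blend-switch logic, assuming a cosine-decay schedule from the continued-training peak learning rate; the exact schedule and switching mechanics in the cited work may differ.

```python
import math

def cosine_lr(step: int, total_steps: int, eta_max: float, eta_min: float = 0.0) -> float:
    """Cosine-decayed learning rate for continued pre-training."""
    progress = min(step / max(total_steps, 1), 1.0)
    return eta_min + 0.5 * (eta_max - eta_min) * (1 + math.cos(math.pi * progress))

def current_blend(step: int, total_steps: int, eta_max_ct: float) -> str:
    """Switch from the generic blend to the QA-focused blend once lr <= eta_max_ct / 5."""
    eta_switch = eta_max_ct / 5
    return "qa_focused" if cosine_lr(step, total_steps, eta_max_ct) <= eta_switch else "generic"
```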

Methods such as Sieve-&-Swap (Batra et al., 2023) and order-based pre-training (Nandy et al., 6 Apr 2024) focus on filtering low-quality examples and training with structure-aware (e.g., permutation- or embedding-based) objectives, further emphasizing the data-centric paradigm. Obfuscation grounding in Code-LMs (Paul et al., 27 Mar 2025) demonstrates robustness to syntactic variation and improves library-oriented code generation relative to vanilla autoregressive or de-obfuscation objectives.

Limitations of data-centric approaches include dependence on the quality and granularity of metadata, the computational cost of embedding-based similarity computation, and potential underrepresentation of important variation if filtering thresholds are miscalibrated. Dynamic, feedback-driven recipe optimization is required to prevent excessive narrowing or loss of diversity.

7. Future Directions and Societal Implications

Recent position papers argue that scaling LLMs will increasingly depend on transparent, systematic data curation and benchmarking rather than exclusive focus on architectural improvements (Xu et al., 20 Jun 2024). Research directions outlined involve:

  • Development of data-centric benchmarks that isolate gains due to data composition, curation, and diversity.
  • Statistical optimization of domain selection (e.g., maximizing distributional divergence) and diversity within domains (with methods from DPPs, coreset selection).
  • Data valuation for domain adaptation (influence functions, Shapley values).
  • Structured knowledge transfer and inference contextualization (e.g., efficient retrieval-based generation, in-context example optimization).
  • Enhanced attribution, privacy, and unlearning mechanisms for responsible AI deployment.

A plausible implication is that the discipline will move toward dynamic, reproducible data recipes where configuration and evaluation are accessible, iteration is rapid, and domain adaptation is informed by both quantitative and qualitative metrics.


This synthesis reflects the multi-disciplinary convergence toward data-centricity in pre-training, documenting modular, systematically optimized recipes as essential for advancing model robustness, generalizability, and efficiency in diverse AI domains.
