Data-Centric Scaling
- Data-centric scaling is a framework that leverages high-quality, diverse, and curated data to drive model performance in machine learning systems.
- Empirical studies show that effective data volume, i.e., raw tokens weighted by quality metrics such as coherence and readability, predicts achievable performance better than raw volume, with gains that are sublinear yet significant as data grows.
- Practical strategies include quality-attribute quantification, diversity-optimized sample selection, and token compression to optimize resource utilization and model accuracy.
Data-centric scaling refers to frameworks, methodologies, and mathematical laws that characterize how the quality, diversity, organization, and effective use of data determine or limit system performance as the scale of data grows, often independent of—or synergistic with—parameter or model-centric scaling. In contrast to approaches that focus solely on increasing model complexity or capacity, data-centric scaling emphasizes the pivotal role of data attributes (such as diversity, quality, composition, and curation strategy) in achieving efficient learning, generalization, and resource utilization across a wide range of machine learning, deep learning, and statistical modeling regimes.
1. Principles of Data-Centric Scaling
Data-centric scaling is anchored by the observation that, in many modern learning settings, increasing the volume or heterogeneity of input data is often at least as important as, and sometimes more effective than, increasing model size or architectural complexity. Several key principles have emerged in recent research:
- Effective Data Volume over Raw Volume: The concept of effective data volume—data tokens weighted by human-centric quality metrics such as coherence, readability, and task similarity—explains more of the achievable performance gain at fixed model size than raw data volume alone. In Ziya2, for a fixed-architecture LLM, loss falls as a power law of an effective-token count parameterized by interpretable data attributes (Gan et al., 2023).
- Diminishing Returns and Scaling Laws: Both empirical and theoretical findings indicate that loss/accuracy scaling with additional data is typically sublinear, either logarithmic or a low-exponent power law (loss ∝ D^{-α} with α ≪ 1), with saturating gains beyond a certain scale (Gan et al., 2023, Tang et al., 2024, Huang et al., 20 Dec 2025).
- Diversity Prioritization: For many transfer and detection tasks, diversity—across data sources, generators, and domains—exhibits a power-law effect on error reduction, often outperforming mere volume aggregation under fixed resources (Huang et al., 20 Dec 2025).
- Synergy with Model Scaling: Pure model scaling yields diminishing marginal returns if data scaling (volume, diversity, coverage) is neglected. Empirically, scaling both in concert yields a synergistic effect, with larger architectures extracting higher gains from increased data scope (Yu et al., 25 Mar 2026).
2. Mathematical Laws and Empirical Findings
2.1 Data-Weighted Scaling Laws for LLMs
In foundational LLM work, the conventional loss scaling law expresses loss as a power law in model size N and data volume D (e.g., L(N, D) = E + A/N^{α} + B/D^{β}). Ziya2 extends this to the setting where model size N is fixed and only "effective" data volume matters. Using composite data metrics, coherence (CH), readability (RA), and similarity (SIM), the raw token count D is reweighted into an effective data volume D_eff (schematically, D_eff = D · f(CH, RA, SIM) for a fitted attribute-weighting function f), and loss then falls as a power law in D_eff, loss ∝ D_eff^{-α}. The fitted attribute weights indicate that quality improvements in CH and RA far outstrip comparable improvements in SIM (Gan et al., 2023). This provides actionable guidance: data cleaning and curation that enhance coherence and readability should be prioritized.
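As a rough illustration of how such a data-weighted law could be applied, the sketch below computes a schematic effective-token count and the corresponding power-law loss; the weighting function, attribute weights, and constants are assumptions for illustration, not the fitted values reported in Ziya2.

```python
import numpy as np

def effective_tokens(d_raw, ch, ra, sim, w=(0.5, 0.35, 0.15)):
    """Schematic effective data volume: raw token count scaled by a
    weighted combination of quality attributes in [0, 1].
    The attribute weights `w` are illustrative, not Ziya2's fitted values."""
    quality = w[0] * ch + w[1] * ra + w[2] * sim
    return d_raw * quality

def predicted_loss(d_eff, d_c=1e12, alpha=0.09):
    """Power-law loss in effective tokens: loss = (d_c / d_eff) ** alpha.
    d_c and alpha are placeholder constants for illustration."""
    return (d_c / d_eff) ** alpha

# Example: the same raw token budget with better coherence/readability
# yields more effective tokens and hence lower predicted loss.
for ch, ra in [(0.6, 0.6), (0.9, 0.9)]:
    d_eff = effective_tokens(2e11, ch=ch, ra=ra, sim=0.7)
    print(f"CH={ch}, RA={ra}: D_eff={d_eff:.2e}, loss={predicted_loss(d_eff):.3f}")
```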
2.2 Logarithmic and Power-law Scaling in Instruction Tuning
For instruction-tuned visual LLMs, accuracy and convergence loss empirically follow a logarithmic law in the number of high-quality instruction samples D: accuracy grows roughly as a·log D + b, and convergence loss falls at a comparable logarithmic rate. Each tenfold data increase yields incremental but non-vanishing improvements, e.g., roughly a 6-percentage-point gain per decade in TextSquare (Tang et al., 2024).
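A minimal sketch of fitting such a logarithmic law from (sample count, accuracy) pairs; the data points below are synthetic placeholders, not TextSquare's reported results.

```python
import numpy as np

# Synthetic (instruction-sample count, accuracy) pairs for illustration only.
samples = np.array([1e4, 1e5, 1e6, 1e7])
accuracy = np.array([0.52, 0.58, 0.64, 0.70])

# Fit accuracy ≈ a * log10(D) + b; each tenfold increase in D adds ~a points.
a, b = np.polyfit(np.log10(samples), accuracy, deg=1)
print(f"gain per tenfold data increase: {a:.3f}")
print(f"extrapolated accuracy at 1e8 samples: {a * 8 + b:.3f}")
```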
2.3 Diversity Scaling Laws in Detection
In speech deepfake detection, generalization error (CDE) obeys a power law in the number of data sources S or forgery generators G, i.e., CDE ∝ S^{-γ} (and analogously in G), signifying that increasing diversity, not just total hours, is the primary driver of detector robustness (Huang et al., 20 Dec 2025).
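The sketch below illustrates fitting a diversity power law of this kind with scipy; the source counts, error values, and saturation term are synthetic assumptions rather than numbers from the cited paper.

```python
import numpy as np
from scipy.optimize import curve_fit

def power_law(n, a, gamma, c):
    """CDE(n) ≈ a * n**(-gamma) + c, with n the number of sources/generators."""
    return a * n ** (-gamma) + c

# Synthetic (number of sources, generalization error) pairs for illustration.
n_sources = np.array([1, 2, 4, 8, 16, 32], dtype=float)
cde = np.array([0.30, 0.22, 0.17, 0.13, 0.11, 0.10])

(a, gamma, c), _ = curve_fit(power_law, n_sources, cde, p0=(0.3, 0.5, 0.05))
print(f"fitted exponent gamma ≈ {gamma:.2f}; asymptotic error ≈ {c:.3f}")
```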
3. Methodologies and Strategies
3.1 Quality-Attribute Quantification and Data-weighted Pretraining
High-quality datasets are constructed by quantifying and maximizing explicit data attributes:
- Attribute annotation: Randomly sample pretraining tokens for human annotation; compute pass rates for coherence/readability, and cosine similarity for domain matching.
- Weighted loss fitting: Use regression or Huber loss to fit scaling exponents and attribute weights, guiding curriculum design (Gan et al., 2023).
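A minimal sketch of the second step, assuming a single-exponent power law in effective tokens and using scipy's built-in Huber ("robust") loss option; the synthetic data and starting values are illustrative only.

```python
import numpy as np
from scipy.optimize import least_squares

def residuals(params, log_d, log_loss):
    """Residuals of the power-law fit log(loss) = log(A) - alpha * log(D_eff)."""
    log_a, alpha = params
    return log_loss - (log_a - alpha * log_d)

# Synthetic effective-token counts and observed losses (illustrative only).
rng = np.random.default_rng(0)
d_eff = np.array([1e10, 3e10, 1e11, 3e11, 1e12])
loss = 2.8 * (d_eff / 1e10) ** -0.08 * np.exp(rng.normal(0, 0.01, d_eff.size))

# Huber loss down-weights outlier training runs relative to plain least squares.
fit = least_squares(
    residuals, x0=(1.0, 0.1), loss="huber", f_scale=0.05,
    args=(np.log(d_eff), np.log(loss)),
)
log_a, alpha = fit.x
print(f"fitted exponent alpha ≈ {alpha:.3f}, A ≈ {np.exp(log_a):.2f}")
```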
3.2 Diversity-Optimized Sample Selection
When aggregating from heterogeneous datasets, optimal strategies include:
- DOSS-Select: Cap the maximum per-domain sample count and enforce a fixed class ratio, selecting balanced subsets.
- DOSS-Weight: Assign all domains a sampling weight according to capped volume and a temperature parameter τ, normalizing to preserve diversity without discarding data (Huang et al., 20 Dec 2025).
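A rough sketch of both strategies as described above; the function names, defaults, and omission of class-ratio enforcement are simplifying assumptions, not the paper's reference implementation.

```python
import random
from collections import defaultdict

def doss_select(samples, cap_per_domain=1000, seed=0):
    """Cap the per-domain sample count to form a balanced subset.
    `samples` is a list of (domain, example) pairs.
    Class-ratio enforcement is omitted here for brevity."""
    rng = random.Random(seed)
    by_domain = defaultdict(list)
    for domain, example in samples:
        by_domain[domain].append(example)
    subset = []
    for domain, examples in by_domain.items():
        rng.shuffle(examples)
        subset.extend((domain, ex) for ex in examples[:cap_per_domain])
    return subset

def doss_weight(domain_sizes, cap=1000, tau=0.5):
    """Temperature-scaled sampling weights over capped domain volumes,
    normalized to sum to 1; tau < 1 flattens weights toward uniform."""
    capped = {d: min(n, cap) for d, n in domain_sizes.items()}
    scaled = {d: n ** tau for d, n in capped.items()}
    total = sum(scaled.values())
    return {d: v / total for d, v in scaled.items()}

print(doss_weight({"news": 50_000, "podcasts": 3_000, "youtube": 800}))
```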
3.3 Token Compression as Data-centric Scaling
For sequence models with prohibitive context lengths, token compression techniques (scoring and ranking tokens by importance, then pruning or merging) reduce the quadratic self-attention cost from O(n²) to O(m²), where n is the original and m the compressed sequence length. Random aggregation/truncation frequently outperforms sophisticated selection, likely because it avoids positional and semantic biases (Liu et al., 25 May 2025).
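A toy sketch contrasting importance-based top-k pruning with random position selection, reporting the resulting quadratic attention-cost ratio; the scoring rule and keep ratio are illustrative assumptions.

```python
import numpy as np

def compress_topk(scores, keep_ratio=0.25):
    """Keep the highest-scoring tokens (importance-based pruning),
    preserving their original order."""
    k = max(1, int(len(scores) * keep_ratio))
    return np.sort(np.argsort(scores)[-k:])

def compress_random(n_tokens, keep_ratio=0.25, seed=0):
    """Keep a uniformly random subset of positions (the simple baseline
    that the survey reports is often surprisingly competitive)."""
    rng = np.random.default_rng(seed)
    k = max(1, int(n_tokens * keep_ratio))
    return np.sort(rng.choice(n_tokens, size=k, replace=False))

n = 4096
scores = np.random.default_rng(1).random(n)   # stand-in importance scores
m = len(compress_topk(scores))
print(f"kept {m}/{n} tokens; self-attention cost ratio ≈ {(m / n) ** 2:.4f}")
```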
3.4 Data Source Utility Estimation
Combining curation costs and performance gains into a per-source utility estimate, with per-source scaling curves empirically fit over a grid of budgets, enables compute-aware resource allocation across competing data sources (Ostapenko et al., 29 Jul 2025).
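One hedged way to operationalize this: fit a per-source scaling curve from a few pilot budgets, then rank sources by estimated loss reduction per unit curation cost. The functional form, cost model, and numbers below are illustrative assumptions, not those of the cited paper.

```python
import numpy as np

def fit_source_curve(tokens, losses):
    """Fit loss ≈ A * tokens**(-alpha) per source via a log-log linear fit."""
    slope, intercept = np.polyfit(np.log(tokens), np.log(losses), deg=1)
    return np.exp(intercept), -slope  # A, alpha

def utility(A, alpha, budget_tokens, cost_per_token, baseline_loss):
    """Estimated loss reduction per unit curation cost at a given budget."""
    gain = baseline_loss - A * budget_tokens ** (-alpha)
    return gain / (cost_per_token * budget_tokens)

# Pilot-run budget grid for two hypothetical sources (synthetic numbers).
grid = np.array([1e8, 3e8, 1e9])
src_web = fit_source_curve(grid, np.array([3.1, 2.9, 2.7]))
src_cur = fit_source_curve(grid, np.array([3.0, 2.95, 2.9]))

for name, (A, alpha), cost in [("web", src_web, 1.0), ("curated", src_cur, 4.0)]:
    u = utility(A, alpha, 5e9, cost * 1e-9, baseline_loss=3.2)
    print(f"{name}: utility per unit cost ≈ {u:.3f}")
```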
4. Application Domains and Empirical Impact
- LLMs: Data-centric scaling governs pretraining regimes and continual training efficacy. Emphasis on quality attributes (as in Ziya2) yields loss reductions with smaller or equal compute compared to brute scaling (Gan et al., 2023).
- Multimodal/Instruction-Tuned Models: In VQA, dataset scale and fine-grained filtering substantially close the gap to SOTA MLLMs (Tang et al., 2024).
- Speech and Vision Detection: Power-law gains from source/generator diversity enable models trained on a small, balanced sample to surpass those trained on much larger, imbalanced aggregates (Huang et al., 20 Dec 2025).
- Efficient Inference/Training: Random token compression yields consistent speedup without performance loss across complex LLM, MLLM, video LLM, and diffusion transformer domains (Liu et al., 25 May 2025).
- Transfer Learning: The “distillation boundary” identifies cutoffs where knowledge distillation outperforms pure finite-sample learning, transitioning as more data becomes available (Yang et al., 17 Apr 2025).
5. Comparison with Model-Centric Scaling
Data-centric scaling frameworks complement traditional model-centric regimes. For instance, model scaling laws exhibit diminishing returns, and further improvements are often gated by data heterogeneity and quality (Yu et al., 25 Mar 2026). Synergistic scaling—expanding both data and model axes—unlocks higher marginal returns, validated by studies in large-scale search ranking and recommendation (Yu et al., 25 Mar 2026).
6. Challenges, Limitations, and Future Directions
- Measurement of Data Quality: Human-centered quality metrics are informative but expensive to collect; embedding-based proxies remain an active research topic (Gan et al., 2023, Tang et al., 2024).
- Empirical Saturation and Extrapolation: Scaling laws fit over practical data/compute budgets may not extrapolate to arbitrarily large scales; new phenomena can emerge (e.g., diminishing or negative returns in knowledge distillation above a critical data threshold) (Yang et al., 17 Apr 2025).
- Fair Benchmarking: Ensuring that speedup claims for data-centric strategies (e.g., token compression) translate to wall-clock and resource efficiency is an open standardization need (Liu et al., 25 May 2025).
- Co-optimization of Data and Model: Future frameworks seek to jointly optimize model weights and data selection/compression, requiring integrated pipelines and theoretical guarantees.
7. Practical Guidance and Methodological Best Practices
- Prioritize high-coherence, high-readability content over domain similarity unless targeting extreme domain adaptation (Gan et al., 2023).
- Use data-diversity optimized selection (e.g., DOSS-Select/Weight) in multi-source regimes (Huang et al., 20 Dec 2025).
- Construct scaling curves by running multiple resource allocations, not relying on point estimates, for robust data source selection (Ostapenko et al., 29 Jul 2025).
- Benchmark instruction/data-centric scaling using logarithmic or low-exponent power-law fits, spanning a wide range of data sizes (Tang et al., 2024).
- When resource-constrained, prefer assembling small, diverse, domain-capped multisource mixtures over large single-domain aggregates. Apply moderate temperature weighting to avoid overfitting to large domains (Huang et al., 20 Dec 2025).
Key References:
- Ziya2: Data-centric Learning is All LLMs Need (Gan et al., 2023)
- TextSquare: Scaling up Text-Centric Visual Instruction Tuning (Tang et al., 2024)
- A Data-Centric Approach to Generalizable Speech Deepfake Detection (Huang et al., 20 Dec 2025)
- Shifting AI Efficiency From Model-Centric to Data-Centric Compression (Liu et al., 25 May 2025)
- Scaling Laws for Data-Efficient Visual Transfer Learning (Yang et al., 17 Apr 2025)
- Using Scaling Laws for Data Source Utility Estimation in Domain-Specific Pre-Training (Ostapenko et al., 29 Jul 2025)
- UniScale: Synergistic Entire Space Data and Model Scaling for Search Ranking (Yu et al., 25 Mar 2026)