Data-Centric AI: Optimizing Data Quality

Updated 29 November 2025
  • Data-Centric AI is a paradigm that focuses on improving model performance by systematically refining data quality through cleaning, labeling, and augmentation.
  • It operationalizes a lifecycle framework that spans data discovery, integration, maintenance, and benchmarking, ensuring robust evaluation across modalities.
  • Advanced co-design practices merge model feedback with data interventions to enhance fairness, robustness, and scalability in diverse AI applications.

Data-Centric AI is a paradigm in artificial intelligence that treats the systematic design, engineering, and ongoing improvement of data as the central driver of model performance and reliability. Unlike the traditional model-centric approach—which seeks superior results through increasingly complex architectures given a static dataset—data-centric AI optimizes data quality, coverage, and relevance, often holding the model fixed during iterative cycles of cleaning, curation, and augmentation. This perspective is critical for domains where data imperfections (scarcity, noise, bias, incompleteness) are the limiting factors for real-world deployment, and has been advanced via comprehensive taxonomies, principled methodologies, benchmark frameworks, and emerging co-design strategies (Jakubik et al., 2022, Jarrahi et al., 2022, Zha et al., 2023, Park et al., 4 Mar 2024).

1. Foundational Concepts and Formal Definitions

Data-centric AI reconceptualizes an AI system as the tuple $\mathcal{S} = (D, M, \Theta)$, with $D$ as the data (features, labels), $M$ as the model class, and $\Theta$ as the model parameters (Jarrahi et al., 2022). Where model-centric AI pursues:

$$\min_{\theta} \mathcal{L}(M(x;\theta), y)$$

with $D$ fixed, the data-centric problem is formulated as an outer-loop optimization:

$$\min_{D' \in \mathcal{D}} \mathcal{L}(M(x;\theta_0), y)$$

searching for improvements in the data $D$ (or data interventions $\Delta$) that reduce task loss for a fixed model. This shift grounds the paradigm in systematic interventions: cleaning, labeling, augmentation, and refinement, validated by their direct impact on downstream metrics (accuracy, $F_1$, robustness, fairness) (Jakubik et al., 2022, Zha et al., 2023).
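
This objective can be read as a simple search loop: hold the model configuration fixed and score candidate data variants by downstream loss. Below is a minimal sketch, assuming scikit-learn and a hypothetical list of candidate training-set variants produced by upstream cleaning, relabeling, or augmentation steps.

```python
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import log_loss

def data_centric_search(candidate_datasets, X_val, y_val):
    """Outer-loop search over candidate data interventions D' while the model
    class and its hyperparameters stay fixed. Returns the data variant with the
    lowest validation loss. `candidate_datasets` is an assumed list of
    (X_train, y_train) pairs produced by upstream data interventions."""
    best_loss, best_data = float("inf"), None
    for X_train, y_train in candidate_datasets:
        model = LogisticRegression(max_iter=1000)  # fixed model configuration
        model.fit(X_train, y_train)
        loss = log_loss(y_val, model.predict_proba(X_val))
        if loss < best_loss:
            best_loss, best_data = loss, (X_train, y_train)
    return best_data, best_loss
```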

2. Life Cycle Taxonomies and Operational Pipelines

Comprehensive surveys structure data-centric AI along the machine learning pipeline, dividing it into three or more core stages (Zha et al., 2023, Jakubik et al., 2022):

  • Training Data Development: Includes dataset discovery, integration, labeling (manual, semi-supervised, active, weak supervision), data cleaning, feature engineering/transformation, reduction (feature/instance selection, dimensionality reduction), and augmentation (basic, generative).
  • Inference Data Development: Focused on synthetic or adversarial evaluation sets, data slicing for subgroup analysis, OOD robustness testing, and prompt engineering for LLMs.
  • Data Maintenance and Monitoring: Encompasses visualization, valuation, quality assurance, anomaly/drift detection, data versioning, storage, and efficient retrieval in production.

Advanced platforms and pipelines (e.g., DataCI for streaming data (Zhang et al., 2023), DC-Check for reliability (Seedat et al., 2022)) operationalize these stages, supporting continuous ingestion, versioning, lineage tracking, automated validation, and iterative updating.
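
At their core, these platform features reduce to a small amount of bookkeeping per data update. The following is a generic sketch of that bookkeeping (not the actual DataCI or DC-Check API), assuming records arrive as a list of dictionaries and validators are supplied as named check functions.

```python
import hashlib, json, datetime

def register_dataset_version(records, parent_version=None, validators=()):
    """Record a new dataset version with a content hash (versioning), a pointer
    to its parent (lineage), and results of automated validation checks.
    `validators` is an assumed iterable of (name, check_fn) pairs."""
    payload = json.dumps(records, sort_keys=True, default=str).encode()
    version_id = hashlib.sha256(payload).hexdigest()[:12]
    return {
        "version": version_id,
        "parent": parent_version,  # lineage pointer to the previous version
        "created": datetime.datetime.now(datetime.timezone.utc).isoformat(),
        "validation": {name: fn(records) for name, fn in validators},
    }
```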

3. Principal Data-Centric Techniques

Across modalities, representative data interventions are grounded in established technical methods:

  • Label Error Identification: Confident learning scores, instance hardness metrics, Data Shapley, and per-sample influence functions (Jakubik et al., 2022, Lee et al., 2021, Hansen et al., 2023).
  • Feature Selection and Generation: Filter, wrapper, and embedded methods for selection; engineered, generative-model–based, and RL-driven methods for generation in tabular and time-series data (Wang et al., 17 Jan 2025, Xu et al., 29 Jul 2024).
  • Data Augmentation: Mixup, AutoAugment policy search, GAN/VAE/diffusion synthesis for underrepresented classes, edge-case enrichment, and curriculum-based or adaptive strategies (Lee et al., 2021, Zha et al., 2023, Guo et al., 2023).
  • Data Profiling: Cleanlab, Data-IQ, Data Maps for categorizing data into “easy,” “ambiguous,” and “hard” buckets, guiding synthetic data pipelines and evaluation (Hansen et al., 2023); a minimal profiling sketch follows this list.
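
As referenced in the profiling bullet above, a Data Maps-style categorization can be computed from training dynamics alone. A minimal sketch, assuming per-epoch probabilities of the true label are logged during training; the bucket thresholds here are illustrative, not the published values.

```python
import numpy as np

def data_map_buckets(p_true, conf_hi=0.75, conf_lo=0.25, var_hi=0.2):
    """Bucket training examples by the mean (confidence) and standard deviation
    (variability) of the probability assigned to the true label across epochs.

    p_true : (num_epochs, num_examples) array of p(true label) per epoch
    returns: dict mapping 'easy' / 'ambiguous' / 'hard' to index arrays
    """
    confidence = p_true.mean(axis=0)
    variability = p_true.std(axis=0)
    easy = np.where((confidence >= conf_hi) & (variability < var_hi))[0]
    hard = np.where((confidence <= conf_lo) & (variability < var_hi))[0]
    rest = np.setdiff1d(np.arange(confidence.size), np.concatenate([easy, hard]))
    return {"easy": easy, "ambiguous": rest, "hard": hard}
```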

In graph learning, data-centric operations comprise topological edits (DropEdge, diffusion, sparsification), feature manipulations (corruption, mixup, position encoding), label operations (mixup, distillation, correction), and both pre-training and inference-time prompting (Guo et al., 2023).
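
A representative topological edit is DropEdge, which amounts to Bernoulli sampling over the edge set each training epoch. A minimal sketch, assuming the graph is stored as a (2, E) integer array of edge endpoints:

```python
import numpy as np

def drop_edge(edge_index, p=0.2, rng=None):
    """Randomly remove a fraction p of edges before a training epoch.
    edge_index : (2, E) array of source/target node indices
    returns    : (2, E') array with roughly (1 - p) * E edges kept"""
    rng = np.random.default_rng(rng)
    keep = rng.random(edge_index.shape[1]) >= p
    return edge_index[:, keep]
```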

4. Benchmarking, Evaluation, and Automation

Benchmarking suites (e.g., DataPerf (Mazumder et al., 2022)) and domain-specific taxonomies formalize the evaluation of data-centric interventions. DataPerf enables competition and comparability by fixing models and hyperparameters, attributing any performance gains strictly to dataset refinement. Metrics extend beyond statistical fidelity (KL, MMD, Wasserstein) to practical utility: AUROC for synthetic data, Spearman's $\rho$ between model or feature-selection rankings, slice-wise worst-group losses, and calibration and uncertainty scores (Hansen et al., 2023, Xu et al., 29 Jul 2024).
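
Slice-wise worst-group evaluation, in particular, is straightforward once per-example losses and slice assignments are available. A minimal sketch; the group identifiers are assumed to come from metadata or a slice-discovery step.

```python
import numpy as np

def worst_group_loss(losses, group_ids):
    """Aggregate per-example losses by data slice and report the worst group.
    losses    : (n,) array of per-example losses
    group_ids : (n,) array of slice labels (e.g., metadata or demographic slices)"""
    per_group = {g: float(losses[group_ids == g].mean()) for g in np.unique(group_ids)}
    worst = max(per_group, key=per_group.get)
    return per_group, worst
```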

Emerging automation levels contrast fully programmatic routines (imputation, outlier removal) with learning-based (RL for transformations, automated augmentation) and Human-in-the-Loop designs (active learning, validation, slice discovery) (Zha et al., 2023). Continuous pipeline orchestration and leaderboard-driven iteration are establishing a culture of reproducibility and rapid improvement in production systems (Zhang et al., 2023).
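
The human-in-the-loop end of this spectrum is often driven by simple acquisition rules. A minimal least-confidence sampling sketch, assuming out-of-sample predicted probabilities are available for an unlabeled pool:

```python
import numpy as np

def least_confidence_batch(pred_probs, budget=50):
    """Select the `budget` unlabeled examples the model is least confident about
    and route them to human annotators.
    pred_probs : (n, k) array of predicted class probabilities for the pool"""
    confidence = pred_probs.max(axis=1)
    return np.argsort(confidence)[:budget]
```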

5. Data-Model Co-Design and Model-Based Data-Centric AI

Recent research advocates tightly coupled co-evolution of data and model architectures. The Model-Based Data-Centric AI paradigm positions the target model as an active participant in data optimization, with iterative cycles of error-driven sampling, targeted labeling, model-guided synthesis, and live metric–driven feedback (Park et al., 4 Mar 2024). This moves beyond model-agnostic data curation and addresses industry–academia divides in annotation practices, metadata scope, and labeling granularity (e.g., inclusion of span indices, bounding boxes).
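
One concrete form of model-guided data optimization is error-driven sampling: let the current model's validation failures steer which pool examples are labeled next. A hypothetical sketch; the nearest-neighbor heuristic and the sklearn-style `predict` interface are assumptions, not the specific method of Park et al.

```python
import numpy as np

def error_driven_batch(model, X_pool, X_val, y_val, batch_size=100):
    """Score unlabeled pool points by distance to the model's misclassified
    validation examples and pick the closest ones for targeted labeling."""
    mistakes = X_val[model.predict(X_val) != y_val]
    if len(mistakes) == 0:
        return np.array([], dtype=int)  # no errors: nothing to target
    # distance from each pool point to its nearest misclassified validation point
    dists = np.linalg.norm(X_pool[:, None, :] - mistakes[None, :, :], axis=-1).min(axis=1)
    return np.argsort(dists)[:batch_size]
```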

In LLMs, systematic benchmarks now evaluate the impact of curation (e.g., filtration by perplexity), redundancy minimization, provenance tracking (influence scores), synthetic data via model distillation, and inference contextualization (RAG, ICL with optimal demonstration selection) (Xu et al., 20 Jun 2024).
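
Curation by perplexity filtration, for instance, reduces to scoring each document under a reference language model and keeping those inside a plausibility band. A minimal sketch; `nll_per_token` is a hypothetical callable returning mean negative log-likelihood per token, and the band bounds are illustrative.

```python
import math

def filter_by_perplexity(documents, nll_per_token, min_ppl=10.0, max_ppl=1000.0):
    """Keep documents whose perplexity under a reference LM falls inside a band:
    very low perplexity often signals boilerplate or near-duplicates, while very
    high perplexity often signals noise or mis-extracted text."""
    kept = []
    for doc in documents:
        ppl = math.exp(nll_per_token(doc))
        if min_ppl <= ppl <= max_ppl:
            kept.append(doc)
    return kept
```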

6. Challenges, Opportunities, and Future Directions

Key open problems include:

  • Scalability and Transferability: How to generalize automation across domains and modalities, particularly in high-dimensional or multimodal contexts (Wang et al., 17 Jan 2025, Xu et al., 29 Jul 2024).
  • Interpretability: Unpacking black-box generative or RL-based transformations for human validation and audit (Xu et al., 29 Jul 2024).
  • Fairness and Robustness: Systematic bias detection and repair, counterfactual augmentation, slice discovery, and resilience to distributional and adversarial drift (Zha et al., 2023, Jarrahi et al., 2022).
  • Continuous Data Maintenance: Enabling streaming pipelines that detect and adapt to drift, automate versioning, and optimize retraining cadence (Zhang et al., 2023, Seedat et al., 2022); a minimal drift-check sketch follows this list.
  • Ethical and Sociotechnical Integration: Embedding human-centered protocols, data governance, traceability, and collaborative annotation in routine deployment (Jarrahi et al., 2022).
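
As referenced in the maintenance item above, the detection half of this loop can be as simple as a per-feature two-sample test between a reference window and the live stream. A minimal sketch using SciPy's Kolmogorov-Smirnov test; the alpha threshold and the dict-of-columns layout are assumptions.

```python
from scipy.stats import ks_2samp

def detect_feature_drift(reference, live, alpha=0.01):
    """Flag per-feature distribution drift between two data windows.
    reference, live : dicts mapping feature name -> 1-D array of values
    returns         : dict of drifted features with test statistics"""
    drifted = {}
    for name in reference:
        stat, p_value = ks_2samp(reference[name], live[name])
        if p_value < alpha:  # drift here would trigger revalidation or retraining
            drifted[name] = {"ks_stat": float(stat), "p_value": float(p_value)}
    return drifted
```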

Best practices are converging on standardized data-centric checklists, automated forensics, closed-loop retraining systems, explicit open-source data documentation, and benchmarking methodologies that attribute improvements to data interventions rather than model changes (Seedat et al., 2022, Mazumder et al., 2022).

7. Modalities and Domain-Specific Perspectives

Data-centric AI methods are now being systematically extended to graphs (data-centric graph learning (Guo et al., 2023)), time series (data-centric transformer-based forecasting (Xu et al., 29 Jul 2024)), tabular data (feature and instance profiling (Wang et al., 17 Jan 2025, Hansen et al., 2023)), image classification (valuation/augmentation pipelines (Lee et al., 2021)), and text/LLM regimes (data-centric curation, attribution, synthetic generation (Xu et al., 20 Jun 2024)).

Workflow variants range from static, batch-centric processes to fully continuous streaming architectures, with versioned pipelines, lineage tracking, and orchestration frameworks (e.g., DataCI (Zhang et al., 2023)) supporting robust, reproducible deployments under data drift and evolving business requirements.


Systematic engineering of data—rather than continual escalation of model complexity—has emerged as the central discipline for unlocking reliable, performant, and fair AI systems in practice. Data-centric AI integrates iterative error analysis, human-in-the-loop refinement, statistical profiling, and adaptive co-design, delivering reproducible pipelines and benchmarked standards for real-world applications across all major data modalities (Jakubik et al., 2022, Zha et al., 2023, Guo et al., 2023, Mazumder et al., 2022, Park et al., 4 Mar 2024).
