Data-Centric AI: Quality-Driven AI
- Data-Centric AI (DCAI) is a systematic approach that focuses on enhancing data quality, consistency, and representativeness to drive AI performance.
- It encompasses all stages of the data lifecycle—from collection and annotation to augmentation and continuous evaluation—ensuring reproducibility and auditability.
- By shifting emphasis from complex models to robust data processes, DCAI enables more reliable, fair, and sustainable AI deployments.
Data-Centric AI (DCAI) is a systematic approach to ML and AI that prioritizes the quality, organization, and continuous improvement of data as the primary determinant of system performance. Moving beyond the traditional model-centric paradigm, which focuses on architectural or algorithmic advancements, DCAI recognizes that model accuracy, robustness, fairness, and usability in practical deployment critically depend on data fit, consistency, and representativeness. DCAI methodologies encompass all stages of the data lifecycle: collection, annotation, preparation, reduction, augmentation, evaluation, maintenance, and governance, with formal optimization of both the data and its supporting processes. The DCAI paradigm is extensively grounded in taxonomies, metrics, and pipelines that enable reproducibility, auditability, and systematic enhancement of AI systems (Jarrahi et al., 2022, Zha et al., 2023, Xu et al., 2024).
1. Conceptual Foundations and Definitions
Data-Centric AI (DCAI) is formally defined as the discipline of systematically engineering the data used to build and train AI systems, shifting the focus from designing ever-more complex models to ensuring higher-quality, richer, and more reliable data (Xu et al., 2024). The utility of an AI system is thus decomposed as
$$U(M, D) = g\big(P(M, D),\, Q(D)\big),$$
where $M$ is the model, $D$ the dataset, $P$ a standard evaluation metric (e.g., accuracy, AUC), and $Q$ a composite data quality function (e.g., fit, consistency), combined through some function $g$ (Jarrahi et al., 2022). In this framework:
- Data Fit reflects coverage and balance across relevant subpopulations and contexts.
- Data Consistency reflects annotation accuracy and adherence to labeling standards.
DCAI aims to iteratively increase $Q(D)$ in tandem with, or as a prerequisite for pursuing, improvements in $P(M, D)$, typically measured via delta improvements post-data interventions:
$$\Delta P = P(M, D') - P(M, D),$$
where $D'$ stems from data-centric actions such as cleaning, augmentation, or relabeling (Jakubik et al., 2022).
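The before/after comparison described above can be illustrated with a small sketch. It holds the model family fixed, applies one data-centric action (relabeling), and reports the change in test accuracy. The nearest-centroid classifier, the toy 1-D data, and all function names are assumptions made for illustration and do not come from the cited works.

```python
from statistics import mean

def fit_centroids(data):
    """Fit a nearest-centroid classifier: one mean feature value per label."""
    by_label = {}
    for x, y in data:
        by_label.setdefault(y, []).append(x)
    return {label: mean(xs) for label, xs in by_label.items()}

def accuracy(centroids, data):
    """Fraction of (x, y) pairs whose nearest centroid carries label y."""
    hits = 0
    for x, y in data:
        predicted = min(centroids, key=lambda label: abs(x - centroids[label]))
        hits += predicted == y
    return hits / len(data)

def delta_performance(d_before, d_after, test):
    """Performance after the data intervention minus performance before it."""
    before = accuracy(fit_centroids(d_before), test)
    after = accuracy(fit_centroids(d_after), test)
    return after - before

# Hypothetical toy data: three class-1 points (x >= 9) were mislabeled as class 0.
noisy = [(1, 0), (2, 0), (3, 0), (7, 1), (8, 1), (9, 1), (9, 0), (10, 0), (11, 0)]
# Data-centric action: relabel the three suspicious points.
cleaned = [(x, 1) if x >= 9 and y == 0 else (x, y) for x, y in noisy]
test = [(2, 0), (4, 0), (6, 1), (6.5, 1), (8, 1)]

gain = delta_performance(noisy, cleaned, test)
print(f"accuracy gain after relabeling: {gain:+.2f}")
```

Because the model and the test set are held constant, any change in the score is attributable to the data intervention alone, which is the point of measuring improvements this way.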
2. Systems Taxonomy and Life Cycle
The DCAI workflow organizes methods into a structured lifecycle, typically inspired by CRISP-DM and extended to support both static and streaming data (Zha et al., 2023, Xu et al., 2024, Zhang et al., 2023):
- Input Data Phase: Encompasses dataset acquisition (discovery, integration), formal splitting (train/validation/test), and extensive preprocessing, including normalization, covariate encoding, feature construction (lag, windowing), decomposition (seasonality/trend), reduction (PCA), and domain-specific augmentation (time-warping, noise injection).
- Data-Model Interaction Phase: Entails encoding data for model consumption, including learned or engineered input embeddings, positional encodings (critical for sequential data), and embedding architectures that optimize for cross-variable and temporal dependencies.
- Output Evaluation Phase: Uses multi-faceted metrics—predictive (MSE, MAE, quantile loss, log-likelihood), uncertainty measures (CRPS, energy score), and computational/sustainability indices (runtime, GPU memory, carbon footprint)—to guide and assess data-centric interventions (Xu et al., 2024).
- Maintenance and Governance: Focuses on version control (DVC, LakeFS), data drift detection (KL divergence, Wasserstein distance), data valuation (Shapley values, influence scores), and continuous curation. DCAI documentation is evolving from model-level “Model Cards” to pipeline-centric “DAG Cards” that trace all upstream and downstream data-centric operations (Tagliabue et al., 2021).
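As a hedged illustration of the drift-detection step in the maintenance phase, the sketch below computes the empirical 1-D Wasserstein distance between a reference sample and a newly arrived sample and flags drift above a threshold. The function names and the threshold value are hypothetical; a production system would typically rely on a tested library implementation and a calibrated threshold.

```python
def wasserstein_1d(reference, current):
    """Empirical 1-Wasserstein distance between two equal-size 1-D samples.

    For equal-size samples with uniform weights, this equals the mean
    absolute difference between the sorted values.
    """
    if len(reference) != len(current):
        raise ValueError("this sketch assumes equal sample sizes")
    if not reference:
        raise ValueError("samples must be non-empty")
    pairs = zip(sorted(reference), sorted(current))
    return sum(abs(a - b) for a, b in pairs) / len(reference)

def has_drifted(reference, current, threshold):
    """Flag drift when the distance exceeds a user-chosen threshold."""
    return wasserstein_1d(reference, current) > threshold

reference = [0.0, 1.0, 2.0, 3.0, 4.0]
reordered = [4.0, 3.0, 2.0, 1.0, 0.0]      # same values, different order
shifted = [x + 2.0 for x in reference]      # every value moved by +2

THRESHOLD = 0.5  # hypothetical; would be tuned per feature in practice
print(has_drifted(reference, reordered, THRESHOLD))
print(has_drifted(reference, shifted, THRESHOLD))
```

The distance depends only on the distribution of values, not on their order, so a reshuffled batch is not reported as drift while a uniformly shifted batch is.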
This staged lifecycle is tightly coupled with standardized artifacts (datasets, code, metadata, visualizations), rigorous audit trails, and orchestration tools for both batch and streaming scenarios (Zhang et al., 2023).
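Data valuation, listed under maintenance above, can be sketched with leave-one-out scoring, a simpler stand-in for the Shapley and influence-based methods named in the text. Each training point is scored by how much validation accuracy drops when it is removed, so a negative score suggests the point hurts the model. The classifier, the data, and the function names are illustrative assumptions.

```python
from statistics import mean

def fit_centroids(data):
    """Nearest-centroid classifier: one mean feature value per label."""
    by_label = {}
    for x, y in data:
        by_label.setdefault(y, []).append(x)
    return {label: mean(xs) for label, xs in by_label.items()}

def accuracy(centroids, data):
    """Fraction of (x, y) pairs whose nearest centroid carries label y."""
    hits = 0
    for x, y in data:
        predicted = min(centroids, key=lambda label: abs(x - centroids[label]))
        hits += predicted == y
    return hits / len(data)

def leave_one_out_values(train, validation):
    """Value of point i = accuracy with all data - accuracy without point i."""
    full_score = accuracy(fit_centroids(train), validation)
    values = []
    for i in range(len(train)):
        reduced = train[:i] + train[i + 1:]
        values.append(full_score - accuracy(fit_centroids(reduced), validation))
    return values

# Hypothetical toy data: the last training point (12, 0) is mislabeled.
train = [(1, 0), (2, 0), (3, 0), (7, 1), (8, 1), (9, 1), (12, 0)]
validation = [(2, 0), (4, 0), (5.5, 1), (8, 1)]

values = leave_one_out_values(train, validation)
worst = min(range(len(train)), key=lambda i: values[i])
print("lowest-valued training point:", train[worst], "value:", values[worst])
```

Leave-one-out requires one retraining per data point and ignores interactions between points, which is why Shapley-style approximations are often preferred at scale.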
3. Data-Centric Optimization Strategies
DCAI introduces programmatic methodologies for data selection, shaping, and evaluation:
- Data Acquisition: The DAM challenge formalizes acquisition as an optimization problem over candidate data items $i = 1, \dots, n$ with utility scores $u_i$ and costs $c_i$, seeking a subset $S$ that maximizes $\sum_{i \in S} u_i$ subject to a budget constraint $\sum_{i \in S} c_i \le B$ (Chen et al., 2023). Best-practice strategies employ two-stage clustering, probabilistic sampling (Bayesian optimization), and explicit diversity and fairness constraints.
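- Model-Based DCAI: Jointly optimizes over model parameters $\theta$ and data modifications $\delta$:
$$\min_{\theta,\,\delta}\ \sum_{i} w_i(\theta)\, \ell\big(f_\theta(x_i^{\delta}),\, y_i^{\delta}\big) + \lambda\, R(\delta),$$
where $(x_i^{\delta}, y_i^{\delta})$ denotes example $i$ after modification, incorporating data regularization ($R(\delta)$) and model-dependent example weights $w_i(\theta)$ to focus curation or augmentation on “hard” cases (Park et al., 2024).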
- Tabular Data Optimization: Employs RL-based feature selection (formulated as an MDP and solved with policy gradient or Q-learning) and generative feature synthesis (VAEs, GANs, diffusion models, DPPs), balancing interpretability, automation, and performance via fine-grained policies (Ying et al., 2025).
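The budgeted acquisition problem in the first item above, choosing data to maximize total utility under a cost budget, has the shape of a knapsack problem. The sketch below uses a greedy utility-per-cost heuristic; this is a common baseline chosen here for illustration, not the method of the DAM challenge, and the candidate names, utilities, and costs are invented placeholders.

```python
def greedy_acquire(candidates, budget):
    """Pick items by descending utility-per-cost while they still fit the budget.

    candidates: list of (name, utility, cost) tuples with cost > 0.
    Returns (chosen names, total utility, total cost).
    """
    ranked = sorted(candidates, key=lambda c: c[1] / c[2], reverse=True)
    chosen, total_utility, spent = [], 0.0, 0.0
    for name, utility, cost in ranked:
        if spent + cost <= budget:
            chosen.append(name)
            total_utility += utility
            spent += cost
    return chosen, total_utility, spent

# Hypothetical candidates: (name, estimated utility, cost).
candidates = [
    ("vendor_a", 6.0, 3.0),
    ("vendor_b", 5.0, 5.0),
    ("vendor_c", 4.0, 1.0),
    ("vendor_d", 1.0, 2.0),
]

chosen, utility, spent = greedy_acquire(candidates, budget=6.0)
print(chosen, utility, spent)
```

The greedy rule is not guaranteed to be optimal for the 0/1 knapsack problem in general; it is a cheap starting point before adding the diversity and fairness constraints mentioned above.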
4. Principles, Metrics, and Documentation
DCAI is underpinned by six guiding principles:
- Systematic improvement of data fit (coverage)
- Systematic improvement of data consistency (annotation quality)
- Mutual iteration of data and model—diagnosing model errors and correcting data accordingly
- Human-centeredness of data practices (provenance, annotation guidelines, “data vision” documentation)
- Socio-technical accountability (fairness, privacy, explainability)
- Continuous, substantive expert engagement in annotation and evaluation (Jarrahi et al., 2022)
Associated metrics integral to DCAI pipelines include:
- Data Bias: KL divergence between empirical and target distributions.
- Noise Rate: Fraction of label errors.
- Representation Coverage and Edge-Case Density: Minimum subgroup representation, rare/ambiguous case proportion.
- Data Valuation: Influence functions, Shapley values, Cleanlab scores, and epistemic/aleatoric uncertainty estimates (Hansen et al., 2023).
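Three of the metrics listed above can be computed with a few lines of code. The sketch below covers data bias (KL divergence between an empirical and a target distribution), noise rate, and minimum subgroup representation. It assumes that gold labels are available for an audited subset, which is needed to measure the noise rate directly; all names and the toy data are hypothetical.

```python
from collections import Counter
from math import log

def empirical_distribution(values):
    """Relative frequency of each distinct value."""
    counts = Counter(values)
    total = len(values)
    return {key: count / total for key, count in counts.items()}

def kl_divergence(p, q):
    """KL(p || q) in nats; q must be positive wherever p is positive."""
    return sum(pk * log(pk / q[key]) for key, pk in p.items() if pk > 0)

def noise_rate(given_labels, gold_labels):
    """Fraction of labels that disagree with the gold (audited) labels."""
    wrong = sum(g != t for g, t in zip(given_labels, gold_labels))
    return wrong / len(gold_labels)

def min_subgroup_share(groups):
    """Share of the least-represented subgroup in the dataset."""
    return min(empirical_distribution(groups).values())

# Hypothetical dataset: subgroup membership and labels for ten records.
groups = ["a"] * 6 + ["b"] * 3 + ["c"] * 1
target = {"a": 1 / 3, "b": 1 / 3, "c": 1 / 3}   # assumed target: uniform coverage
given = [1, 1, 0, 0, 1, 0, 1, 1, 0, 0]
gold = [1, 1, 0, 0, 1, 0, 1, 0, 0, 1]

observed = empirical_distribution(groups)
print("data bias (KL):", round(kl_divergence(observed, target), 4))
print("noise rate:", noise_rate(given, gold))
print("min subgroup share:", min_subgroup_share(groups))
```

Tracking these values per dataset version makes it possible to check whether a curation step improved coverage or label quality before any model is retrained.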
In documentation, pipeline-level cards (DAG Cards) that capture the full directed acyclic graph of data transformations, data sources, hyperparameters, and output metrics are recommended for full reproducibility and auditability (Tagliabue et al., 2021).
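A pipeline-level card of this kind can be approximated as a small data structure that records each step, its upstream dependencies, its parameters, and its metrics, then serializes them in a valid execution order. The sketch below is a minimal illustration; the field names, the example steps, the placeholder path, and the metric value are assumptions and do not reproduce the format proposed by Tagliabue et al.

```python
import json
from dataclasses import dataclass, field
from graphlib import TopologicalSorter  # standard library, Python 3.9+

@dataclass
class Step:
    """One node of the pipeline DAG (all field names are hypothetical)."""
    name: str
    upstream: list = field(default_factory=list)   # steps this one reads from
    source: str = ""                                 # external data source, if any
    params: dict = field(default_factory=dict)      # hyperparameters / settings
    metrics: dict = field(default_factory=dict)     # metrics recorded for the step

def build_dag_card(steps):
    """Return a JSON-serializable card with steps in dependency order."""
    by_name = {step.name: step for step in steps}
    for step in steps:
        missing = [u for u in step.upstream if u not in by_name]
        if missing:
            raise ValueError(f"{step.name} depends on unknown steps: {missing}")
    graph = {step.name: set(step.upstream) for step in steps}
    # static_order() raises graphlib.CycleError if the graph is not acyclic.
    order = list(TopologicalSorter(graph).static_order())
    return {
        "execution_order": order,
        "steps": [
            {
                "name": name,
                "upstream": by_name[name].upstream,
                "source": by_name[name].source,
                "params": by_name[name].params,
                "metrics": by_name[name].metrics,
            }
            for name in order
        ],
    }

# Placeholder pipeline, declared out of order on purpose.
steps = [
    Step("train_model", upstream=["dedupe"],
         params={"learning_rate": 0.01}, metrics={"val_accuracy": 0.91}),
    Step("dedupe", upstream=["load_raw"], params={"key": "record_id"}),
    Step("load_raw", source="s3://example-bucket/raw.csv"),
]

card = build_dag_card(steps)
print(json.dumps(card, indent=2))
```

Storing the card next to the versioned dataset and code gives an auditor one artifact from which the full chain of data transformations can be replayed.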
5. Key Applications, Case Studies, and Modalities
DCAI has been applied across several domains, each with its own design considerations:
- Transformer-based Time Series Forecasting: Performance is as dependent on preprocessing and feature construction (lag, seasonality, normalization, sliding windows) as on model structure; innovations in embedding and uncertainty quantification are active frontiers (Xu et al., 2024).
- Synthetic Data Generation: Data-centric profiling (e.g., Cleanlab, Data-IQ) provides “difficulty profiles” for guiding generator training, increasing downstream utility and feature selection fidelity beyond mere statistical fidelity (Hansen et al., 2023).
- Streaming Data: Platforms like DataCI enable continuous DCAI in streaming ML, supporting partitioned ingestion, on-the-fly pipeline evaluation, and drift tracking (Zhang et al., 2023).
- Agricultural Mapping and Remote Sensing: DCAI pipelines explicitly combine core-set selection, confident noise detection, active learning, augmentation (ChessMix, domain mixup), self-supervised pretrained features, and spatially aware evaluation (Silva et al., 2025).
Across all modalities, DCAI strategies are data-type specific: tabular data (RL and generative feature creation), images (label cleansing, augmentation), time series (temporal windowing and decomposition), and LLMs (data-centric benchmarks, attribution and unlearning, in-context learning, and RAG; Xu et al., 2024).
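For the time-series case, the preprocessing steps named above (normalization, lag features, sliding windows) can be sketched as follows. The sketch fits normalization statistics on the training portion only, to avoid leaking information from later observations, then turns the series into supervised (lags, target) pairs. Function names, the split point, and the toy series are assumptions.

```python
from statistics import mean, pstdev

def fit_scaler(train_series):
    """Compute normalization statistics on the training portion only."""
    return mean(train_series), pstdev(train_series)

def apply_scaler(series, mu, sigma):
    """Z-score a series with previously fitted statistics."""
    if sigma == 0:
        return [0.0 for _ in series]
    return [(x - mu) / sigma for x in series]

def sliding_windows(series, n_lags, horizon=1):
    """Build (inputs, target) pairs: n_lags past values -> value `horizon` steps ahead."""
    samples = []
    last_start = len(series) - n_lags - horizon
    for start in range(last_start + 1):
        window = series[start:start + n_lags]
        target = series[start + n_lags + horizon - 1]
        samples.append((window, target))
    return samples

# Hypothetical series; the first six points are treated as training data.
series = [10, 12, 14, 16, 18, 20, 22, 24]
train_part = series[:6]

mu, sigma = fit_scaler(train_part)
scaled = apply_scaler(series, mu, sigma)
samples = sliding_windows(scaled, n_lags=3, horizon=1)

for window, target in samples:
    print([round(v, 2) for v in window], "->", round(target, 2))
```

The window length and forecast horizon are data-level design choices; changing them alters what the model can learn without touching the model architecture.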
6. Current Challenges and Future Directions
Several technical and operational open challenges remain:
- Unified and Standardized Pipelines: Need for standardized protocols for data exploration, splitting, cleaning, and augmentation that are tailored to specific architectures, especially transformers (Xu et al., 2024, Zha et al., 2023).
- Uncertainty and Interpretability: Quantification and communication of both predictive uncertainty (CRPS, quantile loss) and data process uncertainty are underdeveloped (Xu et al., 2024).
- Sustainability: Monitoring and managing the carbon footprint of repeated (data-centric) training cycles is nascent; empirical measurement of emissions is recommended as a core evaluation axis (Xu et al., 2024).
- Automation and LLM Integration: Autonomous “data engineering agents” (e.g., Co-STEER) combining scheduling of data-centric methods with implementation feedback loops represent a frontier for automating data curation and transformation at scale (Yang et al., 2024).
- Multimodal and Dynamic Data: Extending DCAI tools across structured, unstructured, and streaming modalities, including open-ended LLM data pipelines with at-scale attribution and unlearning (Xu et al., 2024).
- Evaluation Benchmarks and Governance: There is a lack of comprehensive DCAI benchmarks akin to MLPerf, and insufficient dataset governance standards for multi-organizational data sharing and federated preparation (Zha et al., 2023, Jakubik et al., 2022).
7. Significance and Socio-Technical Impact
DCAI represents a paradigm shift in AI and machine learning, recentering engineering and research effort on the systematic, auditable, and value-driven management of data. In domains ranging from healthcare, finance, and agriculture to LLM research and deployment, DCAI practices are reported to improve the robustness, fairness, generalization, and transparency of AI systems, supporting both technical quality and alignment with societal and ethical imperatives (Jarrahi et al., 2022, Jakubik et al., 2022, Zha et al., 2023, Xu et al., 2024).
References
- (Jarrahi et al., 2022)
- (Zha et al., 2023)
- (Tagliabue et al., 2021)
- (Xu et al., 2024)
- (Park et al., 2024)
- (Chen et al., 2023)
- (Ying et al., 2025)
- (Silva et al., 2025)
- (Whang et al., 2021)
- (Polyzotis et al., 2021)
- (Jakubik et al., 2022)
- (Zhang et al., 2023)
- (Hansen et al., 2023)
- (Yang et al., 2024)