Data-centric Artificial Intelligence: A Survey (2303.10158v3)

Published 17 Mar 2023 in cs.LG, cs.AI, and cs.DB

Abstract: AI is making a profound impact in almost every domain. A vital enabler of its great success is the availability of abundant and high-quality data for building machine learning models. Recently, the role of data in AI has been significantly magnified, giving rise to the emerging concept of data-centric AI. The attention of researchers and practitioners has gradually shifted from advancing model design to enhancing the quality and quantity of the data. In this survey, we discuss the necessity of data-centric AI, followed by a holistic view of three general data-centric goals (training data development, inference data development, and data maintenance) and the representative methods. We also organize the existing literature from automation and collaboration perspectives, discuss the challenges, and tabulate the benchmarks for various tasks. We believe this is the first comprehensive survey that provides a global view of a spectrum of tasks across various stages of the data lifecycle. We hope it can help the readers efficiently grasp a broad picture of this field, and equip them with the techniques and further research ideas to systematically engineer data for building AI systems. A companion list of data-centric AI resources will be regularly updated on https://github.com/daochenzha/data-centric-AI

References (298)
  1. Rein: A comprehensive benchmark framework for data cleaning methods in ml pipelines. arXiv preprint arXiv:2302.04702 (2023).
  2. Principal component analysis. Wiley interdisciplinary reviews: computational statistics 2, 4 (2010), 433–459.
  3. A marketplace for data: An algorithmic solution. In EC (2019).
  4. Effect of data scaling methods on machine learning algorithms and model performance. Technologies 9, 3 (2021), 52.
  5. Data normalization and standardization: a technical report. Mach Learn Tech Rep 1, 1 (2014), 1–6.
  6. Apache Storm performance documentation. https://storm.apache.org/releases/current/Performance.html (2023).
  7. Lakehouse: a new generation of open platforms that unify data warehousing and advanced analytics. In CIDR (2021).
  8. Benchmarking data curation systems. IEEE Data Eng. Bull. 39, 2 (2016), 47–62.
  9. Data excellence for ai: why should you care? Interactions 29, 2 (2022), 66–69.
  10. Feature selection based on information gain. International Journal of Innovative Technology and Exploring Engineering (IJITEE) 2, 2 (2013), 18–21.
  11. Regularized learning for domain adaptation under label shifts. arXiv preprint arXiv:1903.09734 (2019).
  12. Bridging the semantic gap with sql query logs in natural language interfaces to databases. In ICDE (2019).
  13. Autoencoders. arXiv preprint arXiv:2003.05991 (2020).
  14. Tsfel: Time series feature extraction library. SoftwareX 11 (2020), 100456.
  15. Microsoft terraserver: a spatial data warehouse. In SIGMOD (2000).
  16. Barenstein, M. Propublica’s compas data revisited. arXiv preprint arXiv:1906.04711 (2019).
  17. Discovering implicit integrity constraints in rule bases using metagraphs. In HICSS (1995).
  18. Methodologies for data quality assessment and improvement. ACM computing surveys (CSUR) 41, 3 (2009), 1–52.
  19. Tfx: A tensorflow-based production-scale machine learning platform. In KDD (2017).
  20. A step towards global counterfactual explanations: Approximating the feature space through hierarchical division and graph search. Adv. Artif. Intell. Mach. Learn. 1, 2 (2021), 90–110.
  21. A study on the evaluation of generative models. arXiv preprint arXiv:2206.10935 (2022).
  22. Datahub: Collaborative data science & dataset version management at scale. In CIDR (2015).
  23. Evasion attacks against machine learning at test time. In ECMLPKDD (2013).
  24. Introduction to scikit-learn. Building Machine Learning and Deep Learning Models on Google Cloud Platform: A Comprehensive Guide for Beginners (2019), 215–229.
  25. Comparison of instance selection and construction methods with various classifiers. Applied Sciences 10, 11 (2020), 3933.
  26. Blanchart, P. An exact counterfactual-example-based approach to tree-ensemble models interpretability. arXiv preprint arXiv:2105.14820 (2021).
  27. Interactive weak supervision: Learning useful heuristics for data labeling. In ICLR (2021).
  28. Dataset discovery in data lakes. In ICDE (2020).
  29. Conditional functional dependencies for data cleaning. In ICDE (2007), pp. 746–755.
  30. Freebase: a collaboratively created graph database for structuring human knowledge. In SIGMOD (2008).
  31. Benchmark of filter methods for feature selection in high-dimensional gene expression survival data. Briefings in Bioinformatics 23, 1 (2022), bbab354.
  32. Integrating structured biological data by kernel maximum mean discrepancy. Bioinformatics 22, 14 (2006), e49–e57.
  33. What makes a visualization memorable? IEEE transactions on visualization and computer graphics 19, 12 (2013), 2306–2315.
  34. Language models are few-shot learners. NeurIPS (2020).
  35. A feature extraction & selection benchmark for structural health monitoring. Structural Health Monitoring (2022), 14759217221111141.
  36. Gender shades: Intersectional accuracy disparities in commercial gender classification. In FAccT (2018).
  37. On the benefits and drawbacks of radial diagrams. Handbook of human centric visualization (2014), 429–451.
  38. Counterfactual explanations for oblique decision trees: Exact, efficient algorithms. In AAAI (2021).
  39. An efficient, cost-driven index selection tool for microsoft sql server. In VLDB (1997).
  40. Dbridge: A program rewrite tool for set-oriented query execution. In ICDE (2011).
  41. Smote: synthetic minority over-sampling technique. Journal of artificial intelligence research 16 (2002), 321–357.
  42. Mixtext: Linguistically-informed interpolation of hidden space for semi-supervised text classification. In ACL (2020).
  43. Zoo: Zeroth order optimization based black-box attacks to deep neural networks without training substitute models. In AISec Workshop (2017).
  44. Building data curation processes with crowd intelligence. In CAiSE (2020).
  45. A review of medical image data augmentation techniques for deep learning applications. Journal of Medical Imaging and Radiation Oncology 65, 5 (2021), 545–563.
  46. Graph-based semi-supervised learning: A review. Neurocomputing 408 (2020), 216–230.
  47. Natural language processing. Fundamentals of artificial intelligence (2020), 603–649.
  48. Deep reinforcement learning from human preferences. In NeurIPS (2017).
  49. Discovering denial constraints. In VLDB (2013).
  50. Mitigating relational bias on knowledge graphs. arXiv preprint arXiv:2211.14489 (2022).
  51. Efficient xai techniques: A taxonomic survey. arXiv preprint arXiv:2302.03225 (2023).
  52. Cortx: Contrastive framework for real-time explanation. In ICLR (2023).
  53. Slice finder: Automated data slicing for model validation. In ICDE (2019).
  54. A publicly available benchmark for biomedical dataset retrieval: the reference standard for the 2016 biocaddie dataset retrieval challenge. Database 2017 (2017).
  55. Active learning with statistical models. Journal of artificial intelligence research 4 (1996), 129–145.
  56. Autoaugment: Learning augmentation policies from data. In CVPR (2019).
  57. Multi-objective counterfactual explanations. In PPSN (2020).
  58. Vox populi: Collecting high-quality labels from a crowd. In COLT (2009).
  59. Imagenet: A large-scale hierarchical image database. In CVPR (2009).
  60. A human-ml collaboration framework for improving video content reviews. arXiv preprint arXiv:2210.09500 (2022).
  61. Desnoyers, L. Toward a taxonomy of visuals in science communication. Technical Communication 58, 2 (2011), 119–134.
  62. Model agnostic contrastive explanations for structured data. arXiv preprint arXiv:1906.00117 (2019).
  63. Retiring adult: New datasets for fair machine learning. In NeurIPS (2021).
  64. Data augmentation for deep graph learning: A survey. ACM SIGKDD Explorations Newsletter 24, 2 (2022), 61–77.
  65. Fairly predicting graft failure in liver transplant for organ assigning. arXiv preprint arXiv:2302.09400 (2023).
  66. Active ensemble learning for knowledge graph error detection. In WSDM (2023).
  67. Benchmarking adversarial robustness on image classification. In CVPR (2020).
  68. Alphad3m: Machine learning pipeline synthesis. arXiv preprint arXiv:2111.02508 (2021).
  69. Tuning database configuration parameters with ituned. In VLDB (2009).
  70. Toward a quantitative survey of dimension reduction techniques. IEEE transactions on visualization and computer graphics 27, 3 (2019), 2153–2173.
  71. Robust physical-world attacks on deep learning visual classification. In CVPR (2018).
  72. A survey of clustering algorithms for big data: Taxonomy and empirical analysis. IEEE transactions on emerging topics in computing 2, 3 (2014), 267–279.
  73. A brief review of domain adaptation. Advances in Data Science and Information Engineering: Proceedings from ICDATA 2020 and IKE 2020 (2021), 877–894.
  74. A survey of data augmentation approaches for nlp. In ACL (2021).
  75. Aurum: A data discovery system. In ICDE (2018).
  76. Efficient and robust automated machine learning. In NeurIPS (2015).
  77. Apache Software Foundation. Hadoop. https://hadoop.apache.org (2023).
  78. The science of visual data communication: What works. Psychological Science in the public interest 22, 3 (2021), 110–161.
  79. Synthetic data augmentation using gan for improved liver lesion classification. In ISBI (2018).
  80. Adaptive rule discovery for labeling text data. In SIGMOD (2021).
  81. Human-ai collaboration for improving the identification of cars for autonomous driving. In CIKM Workshop (2022).
  82. Making pre-trained language models better few-shot learners. In ACL (2021).
  83. A distributional framework for data valuation. In ICML (2020).
  84. Data shapley: Equitable valuation of data for machine learning. In ICML (2019).
  85. Amlb: an automl benchmark. arXiv preprint arXiv:2207.12560 (2022).
  86. Generative adversarial networks. Communications of the ACM 63, 11 (2020), 139–144.
  87. Covariate shift by kernel mean matching. Dataset shift in machine learning 3, 4 (2009), 5.
  88. Benchmark development for the evaluation of visualization for data mining. Information visualization in data mining and knowledge discovery (2002), 129–176.
  89. Comparison of instance selection algorithms ii. results and comments. In ICAISC (2004).
  90. Using videos to evaluate image model robustness. arXiv preprint arXiv:1904.10076 (2019).
  91. Domain adaptation for medical image analysis: a survey. IEEE Transactions on Biomedical Engineering 69, 3 (2021), 1173–1185.
  92. Hamilton, J. D. Time series analysis. Princeton university press, 2020.
  93. G-mixup: Graph data augmentation for graph classification. In ICML (2022).
  94. Bertese: Learning to speak to bert. In EACL (2021).
  95. Adasyn: Adaptive synthetic sampling approach for imbalanced learning. In WCCI (2008).
  96. Learning to rewrite queries. In CIKM (2016).
  97. Deepline: Automl tool for pipelines generation using deep reinforcement learning and hierarchical actions filtering. In KDD (2020).
  98. Estimating the number and sizes of fuzzy-duplicate clusters. In CIKM (2014).
  99. Benchmarking neural network robustness to common corruptions and perturbations. arXiv preprint arXiv:1903.12261 (2019).
  100. Starfish: A self-tuning system for big data analytics. In CIDR (2011).
  101. Denoising diffusion probabilistic models. In NeurIPS (2020).
  102. Cascaded diffusion models for high fidelity image generation. J. Mach. Learn. Res. 23, 47 (2022), 1–33.
  103. Cut out the annotator, keep the cutout: better segmentation with weak supervision. In ICLR (2021).
  104. Unsupervised domain adaptation for robust speech recognition via variational autoencoder-based data augmentation. In ASRU (2017).
  105. An empirical survey of data augmentation for time series classification with neural networks. Plos one 16, 7 (2021), e0254841.
  106. A benchmark for data imputation methods. Frontiers in big Data 4 (2021), 693674.
  107. Overview and importance of data quality for machine learning tasks. In KDD (2020).
  108. Data-centric artificial intelligence. arXiv preprint arXiv:2212.11854 (2022).
  109. The principles of data-centric ai (dcai). arXiv preprint arXiv:2211.14611 (2022).
  110. Scalability vs. utility: Do we have to sacrifice one for the other in data importance quantification? In CVPR (2021).
  111. Weakly supervised anomaly detection: A survey. arXiv preprint arXiv:2302.04549 (2023).
  112. Fmp: Toward fair graph message passing against topology bias. arXiv preprint arXiv:2202.04187 (2022).
  113. Generalized demographic parity for group fairness. In ICLR (2022).
  114. Weight perturbation can help fairness under distribution shift. arXiv preprint arXiv:2303.03300 (2023).
  115. How can we know what language models know? Transactions of the Association for Computational Linguistics 8 (2020), 423–438.
  116. An information fusion approach to learning with instance-dependent label noise. In ICLR (2022).
  117. Highly accurate protein structure prediction with alphafold. Nature 596, 7873 (2021), 583–589.
  118. Dace: Distribution-aware counterfactual explanation by mixed-integer linear optimization. In IJCAI (2020).
  119. Chart-to-text: A large-scale benchmark for chart summarization. arXiv preprint arXiv:2203.06486 (2022).
  120. Algorithmic recourse: from counterfactual explanations to interventions. In FAccT (2021).
  121. Bert: Pre-training of deep bidirectional transformers for language understanding. In NAACL (2019).
  122. Feature engineering for predictive modeling using reinforcement learning. In AAAI (2018).
  123. Multiaccuracy: Black-box post-processing for fairness in classification. In AIES (2019).
  124. Variational diffusion models. In NeurIPS (2021).
  125. Wilds: A benchmark of in-the-wild distribution shifts. In ICML (2021).
  126. Alphaclean: Automatic generation of data cleaning pipelines. arXiv preprint arXiv:1904.11827 (2019).
  127. Imagenet classification with deep convolutional neural networks. Communications of the ACM 60, 6 (2017), 84–90.
  128. To join or not to join? thinking twice about joins before feature selection. In SIGMOD (2016).
  129. Adversarial examples in the physical world. In Artificial intelligence safety and security. Chapman and Hall/CRC, 2018, pp. 99–112.
  130. Annotator rationales for labeling tasks in crowdsourcing. Journal of Artificial Intelligence Research 69 (2020), 143–189.
  131. Dual policy distillation. In IJCAI (2020).
  132. Tods: An automated time series outlier detection system. In AAAI (2021).
  133. Revisiting time series outlier detection: Definitions and benchmarks. In NeurIPS (2021).
  134. Policy-gnn: Aggregation optimization for graph neural networks. In KDD (2020).
  135. Imputation of missing data using machine learning techniques. In KDD (1996).
  136. Comparison-based inverse classification for interpretability in machine learning. In IPMU (2018).
  137. Lenzerini, M. Data integration: A theoretical perspective. In PODS (2002).
  138. Feature selection: A data perspective. ACM computing surveys (CSUR) 50, 6 (2017), 1–45.
  139. Cleanml: A benchmark for joint data cleaning and machine learning [experiments and analysis]. arXiv preprint arXiv:1904.09483 (2019), 75.
  140. Tts-gan: A transformer-based time-series generative adversarial network. In AIME (2022).
  141. Towards learning disentangled representations for time series. In KDD (2022).
  142. Automated anomaly detection via curiosity-guided search and self-imitation learning. IEEE Transactions on Neural Networks and Learning Systems 33, 6 (2021), 2365–2377.
  143. Autood: Neural architecture search for outlier detection. In ICDE (2021).
  144. Pyodds: An end-to-end outlier detection system with automated machine learning. In WWW (2020).
  145. Detecting and correcting for label shift with black box predictors. In ICML (2018).
  146. Pre-train, prompt, and predict: A systematic survey of prompting methods in natural language processing. ACM Computing Surveys 55, 9 (2023), 1–35.
  147. Rsc: Accelerating graph neural networks training via randomized sparse computations. arXiv preprint arXiv:2210.10737 (2022).
  148. Mesa: boost ensemble imbalanced learning with meta-sampler. In NeurIPS (2020).
  149. Focus: Flexible optimizable counterfactual explanations for tree ensembles. In AAAI (2022).
  150. Deepeye: Towards automatic data visualization. In ICDE (2018), pp. 101–112.
  151. Towards deep learning models resistant to adversarial attacks. arXiv preprint arXiv:1706.06083 (2017).
  152. Cloudera. YARN tuning. https://docs.cloudera.com/documentation/enterprise/latest/topics/cdh_ig_yarn_tuning.html (2023).
  153. Benchmarking learned indexes. In VLDB (2020).
  154. Towards personalized preprocessing pipeline search. arXiv preprint arXiv:2302.14329 (2023).
  155. Dataperf: Benchmarks for data-centric ai development. arXiv preprint arXiv:2207.10062 (2022).
  156. A comprehensive benchmark framework for active learning methods in entity matching. In SIGMOD (2020).
  157. A survey on bias and fairness in machine learning. ACM Computing Surveys (CSUR) 54, 6 (2021), 1–35.
  158. Interpretability and fairness evaluation of deep learning models on mimic-iv dataset. Scientific Reports 12, 1 (2022), 7166.
  159. On evaluation of automl systems. In ICML Workshop (2020).
  160. Distant supervision for relation extraction without labeled data. In ACL (2009).
  161. Deep learning for healthcare: review, opportunities and challenges. Briefings in bioinformatics 19, 6 (2018), 1236–1246.
  162. Miranda, L. J. Towards data-centric machine learning: a short review. ljvmiranda921.github.io (2021).
  163. Uniclust databases of clustered and deeply annotated protein sequences and alignments. Nucleic acids research 45, D1 (2017), D170–D176.
  164. Playing atari with deep reinforcement learning. arXiv preprint arXiv:1312.5602 (2013).
  165. Deepfool: a simple and accurate method to fool deep neural networks. In CVPR (2016).
  166. Comparison of different image data augmentation approaches. Journal of imaging 7, 12 (2021), 254.
  167. Table union search on open data. In VLDB (2018).
  168. Ng, A. Data-centric ai resource hub. Snorkel AI. Available online: https://snorkel.ai/ (accessed on 8 February 2023) (2021).
  169. Ng, A. Landing ai. Landing AI. Available online: https://landing.ai/ (accessed on 8 February 2023) (2023).
  170. Data-centric ai competition. DeepLearning AI. Available online: https://https-deeplearning-ai.github.io/data-centric-comp/ (accessed on 8 December 2021) (2021).
  171. Quality assessment method for gan based on modified metrics inception score and fréchet inception distance. In CoMeSySo (2020).
  172. OpenAI. Gpt-4 technical report, 2023.
  173. Mind the performance gap: examining dataset shift during prospective validation. In MLHC (2021).
  174. Training language models to follow instructions with human feedback. In NeurIPS (2022).
  175. Deep learning for financial applications: A survey. Applied Soft Computing 93 (2020), 106384.
  176. Deep learning for anomaly detection: A review. ACM computing surveys (CSUR) 54, 2 (2021), 1–38.
  177. Practical black-box attacks against machine learning. In ASIACCS (2017).
  178. Carla: a python library to benchmark algorithmic recourse and counterfactual explanation algorithms. arXiv preprint arXiv:2108.00783 (2021).
  179. An adaptive approach for index tuning with learning classifier systems on hybrid storage environments. In HAIS (2018).
  180. Rodi: A benchmark for automatic mapping generation in relational-to-ontology data integration. In ESWC (2015).
  181. Data quality assessment. Communications of the ACM 45, 4 (2002), 211–218.
  182. Tpc-di: the first industry benchmark for data integration. In VLDB (2014).
  183. What can data-centric ai learn from data and ml engineering? arXiv preprint arXiv:2112.06439 (2021).
  184. Face: feasible and actionable counterfactual explanations. In AAAI (2020).
  185. Press, G. Cleaning big data: Most time-consuming, least enjoyable data science task, survey says, Oct 2022.
  186. Using random undersampling to alleviate class imbalance on tweet sentiment data. In IRI (2015).
  187. Improving language understanding by generative pre-training. OpenAI (2018).
  188. Language models are unsupervised multitask learners. OpenAI (2019).
  189. Ratner, A. Snorkel ai. Snorkel AI. Available online: https://snorkel.ai/ (accessed on 8 February 2023) (2023).
  190. Snorkel: Rapid training data creation with weak supervision. In VLDB (2017).
  191. Data programming: Creating large training sets, quickly. NeurIPS (2016).
  192. A survey of deep active learning. ACM computing surveys (CSUR) 54, 9 (2021), 1–40.
  193. Finding representative patterns with ordered projections. pattern recognition 36, 4 (2003), 1009–1018.
  194. High-resolution image synthesis with latent diffusion models. In CVPR (2022).
  195. Data quality: The role of empiricism. ACM SIGMOD Record 46, 4 (2018), 35–43.
  196. Online index selection using deep reinforcement learning for a cluster database. In ICDE Workshop (2020).
  197. Adapting visual category models to new domains. In ECCV (2010).
  198. Sliceline: Fast, linear-algebra-based slice finding for ml model debugging. In SIGMOD (2021).
  199. Feature extraction: a survey of the types, techniques, applications. In ICSC (2019).
  200. “everyone wants to do the model work, not the data work”: Data cascades in high-stakes ai. In CHI (2021).
  201. Scribbler: Controlling deep image synthesis with sketch and color. In CVPR (2017).
  202. Quantitative program slicing: Separating statements by relevance. In ICSE (2013).
  203. Saporta, G. Data fusion and data grafting. Computational statistics & data analysis 38, 4 (2002), 465–473.
  204. Automating large-scale data quality verification. In VLDB (2018).
  205. Exploiting cloze questions for few shot text classification and natural language inference. arXiv preprint arXiv:2001.07676 (2020).
  206. Few-shot text generation with pattern-exploiting training. arXiv preprint arXiv:2012.11926 (2020).
  207. It’s not just size that matters: Small language models are also few-shot learners. arXiv preprint arXiv:2009.07118 (2020).
  208. Active feature selection for the mutual information criterion. In AAAI (2021).
  209. Dc-check: A data-centric ai checklist to guide the development of reliable machine learning systems. arXiv preprint arXiv:2211.05764 (2022).
  210. Poison frogs! targeted clean-label poisoning attacks on neural networks. In NeurIPS (2018).
  211. Do image classifiers generalize across time? In ICCV (2021).
  212. Certifai: Counterfactual explanations for robustness, transparency, interpretability, and fairness of artificial intelligence models. arXiv preprint arXiv:1905.07857 (2019).
  213. Towards natural language interfaces for data visualization: A survey. arXiv preprint arXiv:2109.03506 (2021).
  214. Towards out-of-distribution generalization: A survey. arXiv preprint arXiv:2108.13624 (2021).
  215. A survey on image data augmentation for deep learning. Journal of big data 6, 1 (2019), 1–48.
  216. Text data augmentation for deep learning. Journal of big Data 8 (2021), 1–34.
  217. No subclass left behind: Fine-grained robustness in coarse-grained classification problems. In NeurIPS (2020).
  218. Data mining and machine learning to promote smart cities: A systematic review from 2000 to 2018. Sustainability 11, 4 (2019), 1077.
  219. Snowy: Recommending utterances for conversational visual analysis. In SIGCHI (2021).
  220. Beyond the imitation game: Quantifying and extrapolating the capabilities of language models. arXiv preprint arXiv:2206.04615 (2022).
  221. Data curation at scale: the data tamer system. In CIDR (2013).
  222. Data integration: The current status and the way forward. IEEE Data Eng. Bull. 41, 2 (2018), 3–9.
  223. Covariate shift adaptation by importance weighted cross validation. Journal of Machine Learning Research 8, 5 (2007).
  224. An end-to-end learning-based cost estimator. In VLDB (2019).
  225. Sutton, O. Introduction to k nearest neighbour classification and condensed nearest neighbour data reduction. University lectures, University of Leicester 1 (2012).
  226. Bring your own view: Graph neural networks for link prediction with personalized subgraph selection. In WSDM (2023).
  227. Semi-supervised consensus labeling for crowdsourcing. In SIGIR Workshop (2011).
  228. Benchmarking differentially private synthetic data generation algorithms. arXiv preprint arXiv:2112.09238 (2021).
  229. Intrusion detection model using fusion of chi-square feature selection and multi class svm. Journal of King Saud University-Computer and Information Sciences 29, 4 (2017), 462–472.
  230. Data curation with deep learning. In EDBT (2020).
  231. Data warehousing and analytics infrastructure at facebook. In SIGMOD (2010).
  232. Db2 advisor: An optimizer smart enough to recommend its own indexes. In ICDE (2000).
  233. Automatic database management system tuning through large-scale machine learning. In SIGMOD (2017).
  234. Overview of amazon web services. Amazon Web Services (2014).
  235. Blood-based transcriptomic signature panel identification for cancer diagnosis: benchmarking of feature extraction methods. Briefings in Bioinformatics 23, 5 (2022), bbac315.
  236. Deep learning for computer vision: A brief review. Computational intelligence and neuroscience 2018 (2018).
  237. Counterfactual explanations without opening the black box: Automated decisions and the gdpr. Harv. JL & Tech. 31 (2017), 841.
  238. A comparison of radial and linear charts for visualizing daily patterns. IEEE transactions on visualization and computer graphics 26, 1 (2019).
  239. Universal adversarial triggers for attacking and analyzing nlp. In IJCNLP (2019).
  240. In-processing modeling techniques for machine learning fairness: A survey. ACM Transactions on Knowledge Discovery from Data (TKDD) (2022).
  241. Wang, A. Scale ai. Scale AI. Available online: https://scale.com/ (accessed on 8 February 2023) (2023).
  242. Bed: A real-time object detection system for edge devices. In CIKM (2022), pp. 4994–4998.
  243. Accelerating shapley explanation via contributive cooperator selection. In ICML (2022).
  244. Crowder: crowdsourcing entity resolution. In VLDB (2012).
  245. Embedded unsupervised feature selection. In AAAI (2015).
  246. Usb: A unified semi-supervised learning benchmark for classification. In NeurIPS (2022).
  247. A crowdsourcing open platform for literature curation in uniprot. PLoS biology 19, 12 (2021), e3001464.
  248. Time series classification from scratch with deep neural networks: A strong baseline. In IJCNN (2017).
  249. Deep learning for biology. Nature 554, 7693 (2018), 555–557.
  250. Time series data augmentation for deep learning: A survey. In IJCAI (2021).
  251. Data collection and quality challenges in deep learning: A data-centric ai perspective. In VLDB (2023).
  252. White, T. Hadoop: The definitive guide. O'Reilly Media, Inc., 2012.
  253. Winston, P. H. Artificial intelligence. Addison-Wesley Longman Publishing Co., Inc., 1984.
  254. Voyager: Exploratory analysis via faceted browsing of visualization recommendations. IEEE transactions on visualization and computer graphics 22, 1 (2015), 649–658.
  255. Linear discriminant analysis. Robust data mining (2013), 27–33.
  256. Fairness-aware unsupervised feature selection. In CIKM (2021).
  257. Knowledge graph quality management: a comprehensive survey. IEEE Transactions on Knowledge and Data Engineering (2022).
  258. Feature selection and analysis on correlated gas sensor data with recursive feature elimination. Sensors and Actuators B: Chemical 212 (2015), 353–363.
  259. A benchmark and comparison of active learning for logistic regression. Pattern Recognition 83 (2018), 401–415.
  260. Ying, X. An overview of overfitting and its solutions. Journal of physics: Conference series 1168 (2019), 022022.
  261. Rethinking data augmentation for image super-resolution: A comprehensive analysis and a new strategy. In CVPR (2020).
  262. Searching for a search method: Benchmarking search algorithms for generating nlp adversarial examples. arXiv preprint arXiv:2009.06368 (2020).
  263. Gpt3mix: Leveraging large-scale language models for text augmentation. In EMNLP (2021).
  264. Bartscore: Evaluating generated text as text generation. In NeurIPS (2021).
  265. A survey of crowdsourcing systems. In PASSAT (2011).
  266. Apache Spark: A unified engine for big data processing. Communications of the ACM 59 (2016).
  267. Stratal slicing, part ii: Real 3-d seismic data. Geophysics 63, 2 (1998), 514–522.
  268. An evaluation-focused framework for visualization recommendation algorithms. IEEE Transactions on Visualization and Computer Graphics 28, 1 (2021), 346–356.
  269. Data-centric ai: Perspectives and challenges. arXiv preprint arXiv:2301.04819 (2023).
  270. Autoshard: Automated embedding table sharding for recommender systems. In KDD (2022).
  271. Dreamshard: Generalizable embedding table placement for recommender systems. In NeurIPS (2022).
  272. Rlcard: a platform for reinforcement learning in card games. In IJCAI (2021).
  273. Towards automated imbalanced learning with deep hierarchical reinforcement learning. In CIKM (2022).
  274. Meta-aad: Active anomaly detection with deep reinforcement learning. In ICDM (2020).
  275. Experience replay optimization. In IJCAI (2019).
  276. Simplifying deep reinforcement learning via self-supervision. arXiv preprint arXiv:2106.05526 (2021).
  277. Towards similarity-aware time-series classification. In SDM (2022).
  278. Multi-label dataless text classification with topic modeling. Knowledge and Information Systems 61 (2019), 137–160.
  279. Rank the episodes: A simple approach for exploration in procedurally-generated environments. In ICLR (2021).
  280. Autovideo: An automated video action recognition system. In IJCAI (2022).
  281. Douzero: Mastering doudizhu with self-play deep reinforcement learning. In ICML (2021).
  282. mixup: Beyond empirical risk minimization. In ICLR (2018).
  283. Self-attention generative adversarial networks. In ICML (2019).
  284. A survey on programmatic weak supervision. arXiv preprint arXiv:2202.05433 (2022).
  285. Deep learning based recommender system: A survey and new perspectives. ACM computing surveys (CSUR) 52, 1 (2019), 1–38.
  286. Facilitating database tuning with hyper-parameter optimization: a comprehensive experimental evaluation. In VLDB (2022).
  287. Active incremental feature selection using a fuzzy-rough-set-based information entropy. IEEE Transactions on Fuzzy Systems 28, 5 (2019), 901–915.
  288. Character-level convolutional networks for text classification. In NeurIPS (2015).
  289. Zhang, Z. Missing data imputation: focusing on single imputation. Annals of translational medicine 4, 1 (2016).
  290. Graph neural networks: A review of methods and applications. AI open 1 (2020), 57–81.
  291. Towards deeper graph neural networks with differentiable group normalization. In NeurIPS (2020).
  292. Dirichlet energy constrained learning for deep graph neural networks. In NeurIPS (2021).
  293. Multi-channel graph neural networks. In IJCAI (2021).
  294. Dbmind: A self-driving platform in opengauss. In VLDB (2021).
  295. Democratic co-learning. In ICTAI (2004).
  296. Aligning books and movies: Towards story-like visual explanations by watching movies and reading books. In CVPR (2015).
  297. Benchmark and survey of automated machine learning frameworks. Journal of artificial intelligence research 70 (2021), 409–472.
  298. Rethinking pre-training and self-training. In NeurIPS (2020).
Authors (7)
  1. Daochen Zha (56 papers)
  2. Zaid Pervaiz Bhat (5 papers)
  3. Kwei-Herng Lai (24 papers)
  4. Fan Yang (878 papers)
  5. Zhimeng Jiang (33 papers)
  6. Shaochen Zhong (15 papers)
  7. Xia Hu (186 papers)
Citations (143)

Summary

  • The paper presents a novel taxonomy for Data-centric AI by categorizing tasks into Training Data Development, Inference Data Development, and Data Maintenance.
  • It demonstrates that systematic data engineering—supported by automation and human collaboration—is crucial for enhancing AI performance and deployment speed.
  • The survey reviews 36 benchmarks and outlines future challenges, emphasizing the shift from model-centric to data-centric strategies in AI systems.

This paper, "Data-centric Artificial Intelligence: A Survey" (Zha et al., 2023), provides a comprehensive overview of the emerging field of Data-centric AI (DCAI). It highlights a shift in focus from solely improving machine learning models (model-centric AI) to systematically engineering the data used to build AI systems. The authors argue that while model advancements have been significant, the quality and quantity of data are vital enablers of AI success. DCAI emphasizes the systematic iteration and improvement of data throughout the AI lifecycle to achieve better performance, faster deployment, and more reliable systems.

The survey proposes a goal-driven taxonomy for DCAI, dividing tasks into three main goals: Training Data Development, Inference Data Development, and Data Maintenance. It also analyzes existing methods from the perspectives of automation and human participation (collaboration).

Training Data Development

This goal focuses on collecting and producing high-quality data for model training. Key sub-goals and tasks include:

  • Data Collection: Gathering raw data. Efficient strategies include dataset discovery (finding relevant datasets in data lakes), data integration (combining data from different sources, often involving schema matching and value transformation), and raw data synthesis (generating data with desired patterns, e.g., synthetic anomalies). Domain knowledge is crucial here. Practical implementations leverage graph-based methods or machine learning for discovery and integration, and programmatic or learning-based techniques for synthesis.
  • Data Labeling: Assigning labels to data. This is essential for supervised learning and for fine-tuning models pre-trained without supervision. Efficient strategies reduce human effort:
    • Crowdsourced labeling: Distributing tasks to many annotators, with methods to improve consistency and quality (e.g., consensus labeling, iterative refinement). Requires full human participation but with technological assistance.
    • Semi-supervised labeling: Using small labeled sets to infer labels for large unlabeled sets (e.g., self-training, graph-based methods, reinforcement learning from human feedback). Requires partial human participation for initial labels or feedback.
    • Active Learning: Iteratively selecting the most informative unlabeled samples for human annotation, often focusing on samples where the model is uncertain (an uncertainty-sampling sketch follows this list). Requires continuous, partial human participation.
    • Data Programming: Inferring labels using human-defined heuristic functions (labeling functions). Can require minimal or partial human participation depending on the need for interactive refinement. Snorkel [ratner2017snorkel] is a notable system for this; a simplified labeling-function sketch follows this list.
    • Distant Supervision: Automatically assigning labels based on external knowledge sources. An automated approach but can result in noisy labels.
  • Data Preparation: Cleaning and transforming raw data.
    • Data Cleaning: Identifying and correcting errors (missing values, duplicates, inconsistencies). Ranges from programmatic heuristics (mean/median imputation) to learning-based methods (predictive imputation, duplicate estimation) and collaborative approaches involving human-machine workflows. Automated search for optimal cleaning strategies exists.
    • Feature Extraction: Deriving relevant features from raw data. Can be domain-specific and programmatic (e.g., texture features for images) or automated using deep learning models (e.g., CNNs). Deep learning extractors blur the data/model boundary but can be uninterpretable or amplify bias.
    • Feature Transformation: Converting features into a suitable format (e.g., normalization, standardization, log transformation). Can be programmatic or learning-based (e.g., using reinforcement learning to search for optimal transformations).
  • Data Reduction: Decreasing data complexity while retaining essential information.
    • Feature Selection: Choosing a subset of relevant features (filter, wrapper, embedded methods). Can be programmatic, learning-based, or collaborative (active feature selection). Reduces dimensionality, improves efficiency, and can enhance interpretability.
    • Dimensionality Reduction: Transforming high-dimensional features into a lower-dimensional space (e.g., PCA [abdi2010principal], LDA [xanthopoulos2013linear], autoencoders [bank2020autoencoders]). Typically automated learning-based methods.
    • Instance Selection: Selecting a representative subset of samples (filter or wrapper methods). Can be programmatic (e.g., random undersampling [prusa2015using]) or learning-based (e.g., using reinforcement learning for undersampling [liu2020mesa]). Useful for efficiency and handling class imbalance.
  • Data Augmentation: Artificially increasing data size and diversity.
    • Basic Manipulation: Making minor changes to existing data (e.g., rotation, scaling, Mixup [zhang2018mixup] for images, permutation/jittering for time series). Can be programmatic or learning-based (e.g., AutoAugment [cubuk2019autoaugment] searches for policies). A minimal Mixup sketch follows this list.
    • Augmentation Data Synthesis: Generating new samples by learning the data distribution (e.g., GANs [goodfellow2020generative], VAEs [hsu2017unsupervised], diffusion models [ho2020denoising, ho2022cascaded]). Typically learning-based.
    • Upsampling: Specifically augmenting minority classes to address imbalance (e.g., SMOTE [chawla2002smote], ADASYN [he2008adasyn], learning-based methods like AutoSMOTE [zha2022towards]). Can be programmatic or learning-based.
  • Pipeline Search: Automatically searching for optimal combinations of sequential data processing tasks (e.g., AutoSklearn [feurer2015efficient], AlphaD3M [drori2021alphad3m], Deepline [heffetz2020deepline]). A trend towards automating the end-to-end data preparation workflow.
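
To make the data-programming idea concrete, here is a minimal sketch in the spirit of labeling functions: a few hand-written heuristics vote on unlabeled text, and the votes are combined by simple majority. The heuristics, label constants, and majority-vote combiner are illustrative assumptions; systems such as Snorkel additionally learn a model of labeling-function accuracies and correlations rather than taking a plain majority vote.

```python
# Minimal data-programming sketch: weak heuristics vote on unlabeled text.
# The labeling functions and the majority-vote combiner are illustrative;
# real systems (e.g., Snorkel) also model labeling-function accuracies.
from collections import Counter

ABSTAIN, SPAM, HAM = -1, 1, 0

def lf_contains_link(text):            # heuristic: links often indicate spam
    return SPAM if "http" in text.lower() else ABSTAIN

def lf_money_words(text):              # heuristic: money-related keywords
    return SPAM if any(w in text.lower() for w in ("free", "winner", "$$$")) else ABSTAIN

def lf_short_reply(text):              # heuristic: very short replies are usually benign
    return HAM if len(text.split()) <= 3 else ABSTAIN

LABELING_FUNCTIONS = [lf_contains_link, lf_money_words, lf_short_reply]

def weak_label(text):
    """Aggregate labeling-function votes by majority, ignoring abstentions."""
    votes = [lf(text) for lf in LABELING_FUNCTIONS if lf(text) != ABSTAIN]
    return Counter(votes).most_common(1)[0][0] if votes else ABSTAIN

unlabeled = ["Click http://win.example for a FREE prize", "ok thanks", "see you tomorrow at noon"]
print([weak_label(t) for t in unlabeled])   # e.g., [1, 0, -1]
```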
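
Uncertainty sampling, one common active-learning acquisition strategy, can likewise be sketched in a few lines. The toy data pool, the entropy criterion, and the scikit-learn classifier below are illustrative choices rather than specifics from the survey.

```python
# Uncertainty-based active learning sketch: repeatedly query the pool items
# the current model is least sure about, ask a human for labels, retrain.
import numpy as np
from sklearn.linear_model import LogisticRegression

rng = np.random.default_rng(0)
X_pool = rng.normal(size=(500, 5))                          # unlabeled pool (toy data)
true_w = np.array([1.5, -2.0, 0.5, 0.0, 1.0])
y_pool = (X_pool @ true_w + 0.3 * rng.normal(size=500) > 0).astype(int)

labeled_idx = list(rng.choice(500, size=20, replace=False))  # small labeled seed set

def oracle(i):                                               # stand-in for a human annotator
    return y_pool[i]

model = LogisticRegression()
for _ in range(5):                                           # 5 acquisition rounds
    model.fit(X_pool[labeled_idx], [oracle(i) for i in labeled_idx])
    proba = model.predict_proba(X_pool)
    entropy = -(proba * np.log(proba + 1e-12)).sum(axis=1)   # prediction uncertainty
    entropy[labeled_idx] = -np.inf                           # never re-query labeled items
    labeled_idx.append(int(np.argmax(entropy)))              # query the most uncertain sample

print(f"labeled {len(labeled_idx)} samples after 5 rounds")
```

Entropy is only one acquisition function; margin- or committee-based criteria slot into the same loop.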
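
Basic manipulation methods are often only a few lines of code. The following sketch implements Mixup-style interpolation [zhang2018mixup] for a toy batch with one-hot labels; the batch construction and the Beta parameter are illustrative.

```python
# Mixup sketch: convex combinations of random example pairs and their labels.
import numpy as np

def mixup_batch(X, y_onehot, alpha=0.2, rng=np.random.default_rng(0)):
    """Return a mixed batch (X', y') of the same size as the input batch."""
    lam = rng.beta(alpha, alpha, size=(len(X), 1))      # one mixing ratio per pair
    perm = rng.permutation(len(X))                      # random partner for each sample
    X_mix = lam * X + (1.0 - lam) * X[perm]
    y_mix = lam * y_onehot + (1.0 - lam) * y_onehot[perm]
    return X_mix, y_mix

X = np.random.rand(8, 4)                                # toy batch: 8 samples, 4 features
y = np.eye(2)[np.random.randint(0, 2, size=8)]          # one-hot labels for 2 classes
X_mix, y_mix = mixup_batch(X, y)
print(X_mix.shape, y_mix.shape)                         # (8, 4) (8, 2)
```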

Inference Data Development

This goal involves creating data to evaluate trained models or unlock model capabilities.

  • In-distribution Evaluation: Generating samples conforming to the training distribution for detailed model assessment.
    • Data Slicing: Partitioning data into sub-populations (slices) to evaluate performance on specific groups. Can be manual (based on predefined criteria) or automated (e.g., SliceFinder [chung2019slice] discovers problematic slices where the model performs poorly). A per-slice accuracy sketch follows this list.
    • Algorithmic Recourse: Generating hypothetical samples that would change a model's decision (counterfactuals). Helps understand decision boundaries and fairness. Methods vary based on model access (white-box vs. black-box) and often involve optimization or search. Requires minimal human participation (user specifies desired outcome).
  • Out-of-distribution Evaluation: Generating samples differing from the training distribution to assess robustness and generalizability.
    • Generating Adversarial Samples: Creating inputs intentionally designed to cause incorrect predictions (e.g., adding perturbations). Ranges from manual perturbations to automated white-box, black-box, or poisoning attacks using optimization or learning-based methods. Crucial for understanding model security. A signed-gradient (FGSM-style) sketch follows this list.
    • Generating Samples with Distribution Shift: Creating evaluation sets where the data distribution changes (e.g., covariate shift, label shift, or general shift). Can involve collecting real-world data with shifts [koh2021wilds] or synthesizing data with specific shifts.
  • Prompt Engineering: Designing effective input prompts for large models to achieve desired outputs without model fine-tuning. Can be manual template creation or automated using programmatic methods (mining corpora) or learning-based methods (gradient-based search, generative models).
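
Manual data slicing amounts to computing the evaluation metric per sub-population. The sketch below groups a toy evaluation set by a categorical attribute and reports per-slice accuracy; the column names and the flagging rule are illustrative and are not taken from SliceFinder.

```python
# Data-slicing sketch: per-group accuracy on an evaluation set held in pandas.
import pandas as pd

eval_df = pd.DataFrame({
    "age_group": ["<30", "<30", "30-60", "30-60", "60+", "60+", "60+"],
    "label":     [1, 0, 1, 1, 0, 1, 0],
    "pred":      [1, 0, 1, 0, 1, 1, 0],
})

slice_acc = (
    eval_df.assign(correct=lambda d: (d.label == d.pred).astype(float))
           .groupby("age_group")["correct"].mean()
           .sort_values()
)
print(slice_acc)                                        # lowest-accuracy slices surface first
underperforming = slice_acc[slice_acc < slice_acc.mean()]
print("flagged slices:", list(underperforming.index))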
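
White-box adversarial perturbations can be illustrated without an autodiff framework when the model is simple. The sketch below applies an FGSM-style signed-gradient step to a hand-specified logistic-regression model; the weights, input, and perturbation budget are illustrative, and FGSM is named here as one representative attack rather than a method prescribed by the survey.

```python
# FGSM-style white-box attack sketch on a logistic-regression "model".
# For logistic regression, the gradient of the loss w.r.t. the input is analytic,
# so no autodiff library is needed for this toy example.
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

w = np.array([2.0, -1.0, 0.5])          # fixed "trained" weights (illustrative)
b = -0.1
x = np.array([0.2, 0.4, 0.1])           # clean input, true label y = 1
y = 1.0

p = sigmoid(w @ x + b)
grad_x = (p - y) * w                    # d(binary cross-entropy)/dx for this model

eps = 0.1                               # perturbation budget
x_adv = x + eps * np.sign(grad_x)       # step in the direction that increases the loss

print("clean prob:", round(float(p), 3),
      "adversarial prob:", round(float(sigmoid(w @ x_adv + b)), 3))
```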

Data Maintenance

This goal focuses on ensuring data quality and reliability in dynamic environments.

  • Data Understanding: Gaining insights into complex data.
    • Data Visualization: Presenting data graphically for human comprehension (visual summarization like charts, clustering for visualization, automated visualization recommendation). Can be manual or automated with varying degrees of human feedback.
    • Data Valuation: Quantifying the contribution of individual data points to model performance (e.g., using Shapley values). Typically involves learning-based algorithms for efficient estimation. A Monte Carlo approximation sketch follows this list.
  • Data Quality Assurance: Monitoring and improving data quality.
    • Quality Assessment: Developing metrics to measure data quality (objective metrics such as accuracy, timeliness, consistency, and completeness; subjective metrics such as trustworthiness and understandability). Objective metrics are typically collected with minimal human input, while subjective ones require more human participation. A metric-computation sketch follows this list.
    • Quality Improvement: Strategies to enhance data quality (e.g., enforcing constraints, correcting errors). Ranges from programmatic automation to learning-based validation modules and pipeline automation. Collaborative approaches involve human feedback for continuous improvement.
  • Data Storage & Retrieval: Building efficient systems for data access.
    • Resource Allocation: Managing memory and computational resources (e.g., optimizing throughput, latency). Can be programmatic (rule-based tuning) or learning-based (e.g., self-tuning systems like Starfish [herodotou2011starfish], OtterTune [van2017automatic]).
    • Query Acceleration: Speeding up data retrieval. Includes query index selection (choosing optimal indexing schemes using programmatic or learning-based search) and query rewriting (optimizing queries by identifying repeated parts, using rule-based or learning-based methods).
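
Shapley-based data valuation is typically approximated with Monte Carlo sampling over permutations of the training set. The sketch below retrains a tiny logistic-regression model along random permutations and averages marginal contributions; the toy dataset, the 0.5 utility for unusable subsets, and the permutation count are illustrative simplifications of methods such as Data Shapley.

```python
# Monte Carlo Shapley-style data valuation sketch.
# Each point's value: its average marginal contribution to validation accuracy
# when appended to a random prefix of the other training points.
import numpy as np
from sklearn.linear_model import LogisticRegression

rng = np.random.default_rng(0)
X_train = rng.normal(size=(30, 2))
y_train = (X_train[:, 0] + X_train[:, 1] > 0).astype(int)
y_train[:3] = 1 - y_train[:3]                      # corrupt three points (should get low value)
X_val = rng.normal(size=(200, 2))
y_val = (X_val[:, 0] + X_val[:, 1] > 0).astype(int)

def utility(idx):
    """Validation accuracy of a model trained on subset idx (0.5 if unusable)."""
    if len(idx) < 2 or len(set(y_train[list(idx)])) < 2:
        return 0.5
    model = LogisticRegression().fit(X_train[list(idx)], y_train[list(idx)])
    return model.score(X_val, y_val)

n, n_perm = len(X_train), 50
values = np.zeros(n)
for _ in range(n_perm):                            # average marginal contributions
    perm = rng.permutation(n)
    prev_u, prefix = 0.5, []
    for i in perm:
        prefix.append(i)
        u = utility(prefix)
        values[i] += u - prev_u
        prev_u = u
values /= n_perm

print("lowest-valued points:", np.argsort(values)[:3])  # typically the corrupted ones
```

Truncating each permutation once the utility stops changing, as done in practice, keeps this tractable for larger datasets.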
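
Objective quality metrics can usually be computed programmatically straight from the stored table. The sketch below computes completeness, key uniqueness, and a timeliness ratio on a toy DataFrame; the column names and the freshness cutoff are illustrative assumptions.

```python
# Objective data-quality metrics sketch: completeness, uniqueness, timeliness.
import pandas as pd

df = pd.DataFrame({
    "user_id":    [1, 2, 2, 4, None],
    "email":      ["a@x.com", None, "b@x.com", "d@x.com", "e@x.com"],
    "updated_at": pd.to_datetime(
        ["2024-01-02", "2023-06-01", "2024-01-10", "2022-03-15", "2024-01-11"]),
})

completeness = 1.0 - df.isna().mean().mean()                 # share of non-missing cells
uniqueness = df["user_id"].dropna().is_unique                # duplicate key check
fresh_cutoff = pd.Timestamp("2024-01-01")                    # illustrative freshness window
timeliness = (df["updated_at"] >= fresh_cutoff).mean()       # share of recently updated rows

print(f"completeness={completeness:.2f}, unique keys={uniqueness}, timeliness={timeliness:.2f}")
```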

Data Benchmarks

The paper surveys existing data benchmarks across these tasks, distinguishing them from model benchmarks. It analyzes 36 collected benchmarks, noting that most come from the AI domain, that tabular and image data are the most benchmarked modalities, and that training data development has received the most benchmarking attention.

Discussion and Future Directions

The survey answers its initial research questions, confirming the necessity of various DCAI tasks, the importance of automation (from programmatic to pipeline levels), and the essential role of human participation (from full to minimal) for aligning AI systems with human intentions. It highlights significant progress but also identifies open challenges.

Future directions include:

  • Cross-task Automation: Developing unified frameworks to automate tasks across different DCAI goals.
  • Data-Model Co-design: Jointly designing data strategies and models, recognizing the blurring boundary between data and models (especially with foundation models) and their co-evolution.
  • Debiasing Data: More research on mitigating biases in data through training data methods, creating evaluation data to expose unfairness, and maintaining fairness dynamically.
  • Tackling Data in Various Modalities: Focusing more research on modalities beyond tabular and image data, like time-series and graph data, which have unique challenges.
  • Data Benchmarks Development: Creating more unified and comprehensive benchmarks to accelerate research, similar to how model benchmarks have driven model-centric AI.

In conclusion, the paper posits that data will increasingly be central to building effective AI systems, but significant challenges remain, encouraging further research and collaborative initiatives in this field.
