Data-centric Artificial Intelligence: A Survey (2303.10158v3)

Published 17 Mar 2023 in cs.LG, cs.AI, and cs.DB

Abstract: AI is making a profound impact in almost every domain. A vital enabler of its great success is the availability of abundant and high-quality data for building machine learning models. Recently, the role of data in AI has been significantly magnified, giving rise to the emerging concept of data-centric AI. The attention of researchers and practitioners has gradually shifted from advancing model design to enhancing the quality and quantity of the data. In this survey, we discuss the necessity of data-centric AI, followed by a holistic view of three general data-centric goals (training data development, inference data development, and data maintenance) and the representative methods. We also organize the existing literature from automation and collaboration perspectives, discuss the challenges, and tabulate the benchmarks for various tasks. We believe this is the first comprehensive survey that provides a global view of a spectrum of tasks across various stages of the data lifecycle. We hope it can help the readers efficiently grasp a broad picture of this field, and equip them with the techniques and further research ideas to systematically engineer data for building AI systems. A companion list of data-centric AI resources will be regularly updated on https://github.com/daochenzha/data-centric-AI

References (298)
  1. Rein: A comprehensive benchmark framework for data cleaning methods in ml pipelines. arXiv preprint arXiv:2302.04702 (2023).
  2. Principal component analysis. Wiley interdisciplinary reviews: computational statistics 2, 4 (2010), 433–459.
  3. A marketplace for data: An algorithmic solution. In EC (2019).
  4. Effect of data scaling methods on machine learning algorithms and model performance. Technologies 9, 3 (2021), 52.
  5. Data normalization and standardization: a technical report. Mach Learn Tech Rep 1, 1 (2014), 1–6.
  6. Apache Storm performance documentation. https://storm.apache.org/releases/current/Performance.html (2023).
  7. Lakehouse: a new generation of open platforms that unify data warehousing and advanced analytics. In CIDR (2021).
  8. Benchmarking data curation systems. IEEE Data Eng. Bull. 39, 2 (2016), 47–62.
  9. Data excellence for ai: why should you care? Interactions 29, 2 (2022), 66–69.
  10. Feature selection based on information gain. International Journal of Innovative Technology and Exploring Engineering (IJITEE) 2, 2 (2013), 18–21.
  11. Regularized learning for domain adaptation under label shifts. arXiv preprint arXiv:1903.09734 (2019).
  12. Bridging the semantic gap with sql query logs in natural language interfaces to databases. In ICDE (2019).
  13. Autoencoders. arXiv preprint arXiv:2003.05991 (2020).
  14. Tsfel: Time series feature extraction library. SoftwareX 11 (2020), 100456.
  15. Microsoft terraserver: a spatial data warehouse. In SIGMOD (2000).
  16. Barenstein, M. Propublica’s compas data revisited. arXiv preprint arXiv:1906.04711 (2019).
  17. Discovering implicit integrity constraints in rule bases using metagraphs. In HICSS (1995).
  18. Methodologies for data quality assessment and improvement. ACM computing surveys (CSUR) 41, 3 (2009), 1–52.
  19. Tfx: A tensorflow-based production-scale machine learning platform. In KDD (2017).
  20. A step towards global counterfactual explanations: Approximating the feature space through hierarchical division and graph search. Adv. Artif. Intell. Mach. Learn. 1, 2 (2021), 90–110.
  21. A study on the evaluation of generative models. arXiv preprint arXiv:2206.10935 (2022).
  22. Datahub: Collaborative data science & dataset version management at scale. In CIDR (2015).
  23. Evasion attacks against machine learning at test time. In ECMLPKDD (2013).
  24. Introduction to scikit-learn. Building Machine Learning and Deep Learning Models on Google Cloud Platform: A Comprehensive Guide for Beginners (2019), 215–229.
  25. Comparison of instance selection and construction methods with various classifiers. Applied Sciences 10, 11 (2020), 3933.
  26. Blanchart, P. An exact counterfactual-example-based approach to tree-ensemble models interpretability. arXiv preprint arXiv:2105.14820 (2021).
  27. Interactive weak supervision: Learning useful heuristics for data labeling. In ICLR (2021).
  28. Dataset discovery in data lakes. In ICDE (2020).
  29. Conditional functional dependencies for data cleaning. In ICDE (2007), pp. 746–755.
  30. Freebase: a collaboratively created graph database for structuring human knowledge. In SIGMOD (2008).
  31. Benchmark of filter methods for feature selection in high-dimensional gene expression survival data. Briefings in Bioinformatics 23, 1 (2022), bbab354.
  32. Integrating structured biological data by kernel maximum mean discrepancy. Bioinformatics 22, 14 (2006), e49–e57.
  33. What makes a visualization memorable? IEEE transactions on visualization and computer graphics 19, 12 (2013), 2306–2315.
  34. Language models are few-shot learners. NeurIPS (2020).
  35. A feature extraction & selection benchmark for structural health monitoring. Structural Health Monitoring (2022), 14759217221111141.
  36. Gender shades: Intersectional accuracy disparities in commercial gender classification. In FAccT (2018).
  37. On the benefits and drawbacks of radial diagrams. Handbook of human centric visualization (2014), 429–451.
  38. Counterfactual explanations for oblique decision trees: Exact, efficient algorithms. In AAAI (2021).
  39. An efficient, cost-driven index selection tool for microsoft sql server. In VLDB (1997).
  40. Dbridge: A program rewrite tool for set-oriented query execution. In ICDE (2011).
  41. Smote: synthetic minority over-sampling technique. Journal of artificial intelligence research 16 (2002), 321–357.
  42. Mixtext: Linguistically-informed interpolation of hidden space for semi-supervised text classification. In ACL (2020).
  43. Zoo: Zeroth order optimization based black-box attacks to deep neural networks without training substitute models. In AISec Workshop (2017).
  44. Building data curation processes with crowd intelligence. In CAiSE (2020).
  45. A review of medical image data augmentation techniques for deep learning applications. Journal of Medical Imaging and Radiation Oncology 65, 5 (2021), 545–563.
  46. Graph-based semi-supervised learning: A review. Neurocomputing 408 (2020), 216–230.
  47. Natural language processing. Fundamentals of artificial intelligence (2020), 603–649.
  48. Deep reinforcement learning from human preferences. In NeurIPS (2017).
  49. Discovering denial constraints. In VLDB (2013).
  50. Mitigating relational bias on knowledge graphs. arXiv preprint arXiv:2211.14489 (2022).
  51. Efficient xai techniques: A taxonomic survey. arXiv preprint arXiv:2302.03225 (2023).
  52. Cortx: Contrastive framework for real-time explanation. In ICLR (2023).
  53. Slice finder: Automated data slicing for model validation. In ICDE (2019).
  54. A publicly available benchmark for biomedical dataset retrieval: the reference standard for the 2016 biocaddie dataset retrieval challenge. Database 2017 (2017).
  55. Active learning with statistical models. Journal of artificial intelligence research 4 (1996), 129–145.
  56. Autoaugment: Learning augmentation policies from data. In CVPR (2019).
  57. Multi-objective counterfactual explanations. In PPSN (2020).
  58. Vox populi: Collecting high-quality labels from a crowd. In COLT (2009).
  59. Imagenet: A large-scale hierarchical image database. In CVPR (2009).
  60. A human-ml collaboration framework for improving video content reviews. arXiv preprint arXiv:2210.09500 (2022).
  61. Desnoyers, L. Toward a taxonomy of visuals in science communication. Technical Communication 58, 2 (2011), 119–134.
  62. Model agnostic contrastive explanations for structured data. arXiv preprint arXiv:1906.00117 (2019).
  63. Retiring adult: New datasets for fair machine learning. In NeurIPS (2021).
  64. Data augmentation for deep graph learning: A survey. ACM SIGKDD Explorations Newsletter 24, 2 (2022), 61–77.
  65. Fairly predicting graft failure in liver transplant for organ assigning. arXiv preprint arXiv:2302.09400 (2023).
  66. Active ensemble learning for knowledge graph error detection. In WSDM (2023).
  67. Benchmarking adversarial robustness on image classification. In CVPR (2020).
  68. Alphad3m: Machine learning pipeline synthesis. arXiv preprint arXiv:2111.02508 (2021).
  69. Tuning database configuration parameters with ituned. In VLDB (2009).
  70. Toward a quantitative survey of dimension reduction techniques. IEEE transactions on visualization and computer graphics 27, 3 (2019), 2153–2173.
  71. Robust physical-world attacks on deep learning visual classification. In CVPR (2018).
  72. A survey of clustering algorithms for big data: Taxonomy and empirical analysis. IEEE transactions on emerging topics in computing 2, 3 (2014), 267–279.
  73. A brief review of domain adaptation. Advances in Data Science and Information Engineering: Proceedings from ICDATA 2020 and IKE 2020 (2021), 877–894.
  74. A survey of data augmentation approaches for nlp. In ACL (2021).
  75. Aurum: A data discovery system. In ICDE (2018).
  76. Efficient and robust automated machine learning. In NeurIPS (2015).
  77. Apache Software Foundation. Hadoop. https://hadoop.apache.org (2023).
  78. The science of visual data communication: What works. Psychological Science in the public interest 22, 3 (2021), 110–161.
  79. Synthetic data augmentation using gan for improved liver lesion classification. In ISBI (2018).
  80. Adaptive rule discovery for labeling text data. In SIGMOD (2021).
  81. Human-ai collaboration for improving the identification of cars for autonomous driving. In CIKM Workshop (2022).
  82. Making pre-trained language models better few-shot learners. In ACL (2021).
  83. A distributional framework for data valuation. In ICML (2020).
  84. Data shapley: Equitable valuation of data for machine learning. In ICML (2019).
  85. Amlb: an automl benchmark. arXiv preprint arXiv:2207.12560 (2022).
  86. Generative adversarial networks. Communications of the ACM 63, 11 (2020), 139–144.
  87. Covariate shift by kernel mean matching. Dataset shift in machine learning 3, 4 (2009), 5.
  88. Benchmark development for the evaluation of visualization for data mining. Information visualization in data mining and knowledge discovery (2002), 129–176.
  89. Comparison of instance selection algorithms ii. results and comments. In ICAISC (2004).
  90. Using videos to evaluate image model robustness. arXiv preprint arXiv:1904.10076 (2019).
  91. Domain adaptation for medical image analysis: a survey. IEEE Transactions on Biomedical Engineering 69, 3 (2021), 1173–1185.
  92. Hamilton, J. D. Time series analysis. Princeton university press, 2020.
  93. G-mixup: Graph data augmentation for graph classification. In ICML (2022).
  94. Bertese: Learning to speak to bert. In EACL (2021).
  95. Adasyn: Adaptive synthetic sampling approach for imbalanced learning. In WCCI (2008).
  96. Learning to rewrite queries. In CIKM (2016).
  97. Deepline: Automl tool for pipelines generation using deep reinforcement learning and hierarchical actions filtering. In KDD (2020).
  98. Estimating the number and sizes of fuzzy-duplicate clusters. In CIKM (2014).
  99. Benchmarking neural network robustness to common corruptions and perturbations. arXiv preprint arXiv:1903.12261 (2019).
  100. Starfish: A self-tuning system for big data analytics. In CIDR (2011).
  101. Denoising diffusion probabilistic models. In NeurIPS (2020).
  102. Cascaded diffusion models for high fidelity image generation. J. Mach. Learn. Res. 23, 47 (2022), 1–33.
  103. Cut out the annotator, keep the cutout: better segmentation with weak supervision. In ICLR (2021).
  104. Unsupervised domain adaptation for robust speech recognition via variational autoencoder-based data augmentation. In ASRU (2017).
  105. An empirical survey of data augmentation for time series classification with neural networks. Plos one 16, 7 (2021), e0254841.
  106. A benchmark for data imputation methods. Frontiers in big Data 4 (2021), 693674.
  107. Overview and importance of data quality for machine learning tasks. In KDD (2020).
  108. Data-centric artificial intelligence. arXiv preprint arXiv:2212.11854 (2022).
  109. The principles of data-centric ai (dcai). arXiv preprint arXiv:2211.14611 (2022).
  110. Scalability vs. utility: Do we have to sacrifice one for the other in data importance quantification? In CVPR (2021).
  111. Weakly supervised anomaly detection: A survey. arXiv preprint arXiv:2302.04549 (2023).
  112. Fmp: Toward fair graph message passing against topology bias. arXiv preprint arXiv:2202.04187 (2022).
  113. Generalized demographic parity for group fairness. In ICLR (2022).
  114. Weight perturbation can help fairness under distribution shift. arXiv preprint arXiv:2303.03300 (2023).
  115. How can we know what language models know? Transactions of the Association for Computational Linguistics 8 (2020), 423–438.
  116. An information fusion approach to learning with instance-dependent label noise. In ICLR (2022).
  117. Highly accurate protein structure prediction with alphafold. Nature 596, 7873 (2021), 583–589.
  118. Dace: Distribution-aware counterfactual explanation by mixed-integer linear optimization. In IJCAI (2020).
  119. Chart-to-text: A large-scale benchmark for chart summarization. arXiv preprint arXiv:2203.06486 (2022).
  120. Algorithmic recourse: from counterfactual explanations to interventions. In FAccT (2021).
  121. Bert: Pre-training of deep bidirectional transformers for language understanding. In NAACL (2019).
  122. Feature engineering for predictive modeling using reinforcement learning. In AAAI (2018).
  123. Multiaccuracy: Black-box post-processing for fairness in classification. In AIES (2019).
  124. Variational diffusion models. In NeurIPS (2021).
  125. Wilds: A benchmark of in-the-wild distribution shifts. In ICML (2021).
  126. Alphaclean: Automatic generation of data cleaning pipelines. arXiv preprint arXiv:1904.11827 (2019).
  127. Imagenet classification with deep convolutional neural networks. Communications of the ACM 60, 6 (2017), 84–90.
  128. To join or not to join? thinking twice about joins before feature selection. In SIGMOD (2016).
  129. Adversarial examples in the physical world. In Artificial intelligence safety and security. Chapman and Hall/CRC, 2018, pp. 99–112.
  130. Annotator rationales for labeling tasks in crowdsourcing. Journal of Artificial Intelligence Research 69 (2020), 143–189.
  131. Dual policy distillation. In IJCAI (2020).
  132. Tods: An automated time series outlier detection system. In AAAI (2021).
  133. Revisiting time series outlier detection: Definitions and benchmarks. In NeurIPS (2021).
  134. Policy-gnn: Aggregation optimization for graph neural networks. In KDD (2020).
  135. Imputation of missing data using machine learning techniques. In KDD (1996).
  136. Comparison-based inverse classification for interpretability in machine learning. In IPMU (2018).
  137. Lenzerini, M. Data integration: A theoretical perspective. In PODS (2002).
  138. Feature selection: A data perspective. ACM computing surveys (CSUR) 50, 6 (2017), 1–45.
  139. Cleanml: A benchmark for joint data cleaning and machine learning [experiments and analysis]. arXiv preprint arXiv:1904.09483 (2019), 75.
  140. Tts-gan: A transformer-based time-series generative adversarial network. In AIME (2022).
  141. Towards learning disentangled representations for time series. In KDD (2022).
  142. Automated anomaly detection via curiosity-guided search and self-imitation learning. IEEE Transactions on Neural Networks and Learning Systems 33, 6 (2021), 2365–2377.
  143. Autood: Neural architecture search for outlier detection. In ICDE (2021).
  144. Pyodds: An end-to-end outlier detection system with automated machine learning. In WWW (2020).
  145. Detecting and correcting for label shift with black box predictors. In ICML (2018).
  146. Pre-train, prompt, and predict: A systematic survey of prompting methods in natural language processing. ACM Computing Surveys 55, 9 (2023), 1–35.
  147. Rsc: Accelerating graph neural networks training via randomized sparse computations. arXiv preprint arXiv:2210.10737 (2022).
  148. Mesa: boost ensemble imbalanced learning with meta-sampler. In NeurIPS (2020).
  149. Focus: Flexible optimizable counterfactual explanations for tree ensembles. In AAAI (2022).
  150. Deepeye: Towards automatic data visualization. In ICDE (2018), pp. 101–112.
  151. Towards deep learning models resistant to adversarial attacks. arXiv preprint arXiv:1706.06083 (2017).
  152. Cloudera. YARN tuning. https://docs.cloudera.com/documentation/enterprise/latest/topics/cdh_ig_yarn_tuning.html (2023).
  153. Benchmarking learned indexes. In VLDB (2020).
  154. Towards personalized preprocessing pipeline search. arXiv preprint arXiv:2302.14329 (2023).
  155. Dataperf: Benchmarks for data-centric ai development. arXiv preprint arXiv:2207.10062 (2022).
  156. A comprehensive benchmark framework for active learning methods in entity matching. In SIGMOD (2020).
  157. A survey on bias and fairness in machine learning. ACM Computing Surveys (CSUR) 54, 6 (2021), 1–35.
  158. Interpretability and fairness evaluation of deep learning models on mimic-iv dataset. Scientific Reports 12, 1 (2022), 7166.
  159. On evaluation of automl systems. In ICML Workshop (2020).
  160. Distant supervision for relation extraction without labeled data. In ACL (2009).
  161. Deep learning for healthcare: review, opportunities and challenges. Briefings in bioinformatics 19, 6 (2018), 1236–1246.
  162. Miranda, L. J. Towards data-centric machine learning: a short review. ljvmiranda921.github.io (2021).
  163. Uniclust databases of clustered and deeply annotated protein sequences and alignments. Nucleic acids research 45, D1 (2017), D170–D176.
  164. Playing atari with deep reinforcement learning. arXiv preprint arXiv:1312.5602 (2013).
  165. Deepfool: a simple and accurate method to fool deep neural networks. In CVPR (2016).
  166. Comparison of different image data augmentation approaches. Journal of imaging 7, 12 (2021), 254.
  167. Table union search on open data. In VLDB (2018).
  168. Ng, A. Data-centric ai resource hub. Snorkel AI. Available online: https://snorkel.ai/ (accessed on 8 February 2023) (2021).
  169. Ng, A. Landing ai. Landing AI. Available online: https://landing.ai/ (accessed on 8 February 2023) (2023).
  170. Data-centric ai competition. DeepLearning AI. Available online: https://https-deeplearning-ai.github.io/data-centric-comp/ (accessed on 8 December 2021) (2021).
  171. Quality assessment method for gan based on modified metrics inception score and fréchet inception distance. In CoMeSySo (2020).
  172. OpenAI. Gpt-4 technical report, 2023.
  173. Mind the performance gap: examining dataset shift during prospective validation. In MLHC (2021).
  174. Training language models to follow instructions with human feedback. In NeurIPS (2022).
  175. Deep learning for financial applications: A survey. Applied Soft Computing 93 (2020), 106384.
  176. Deep learning for anomaly detection: A review. ACM computing surveys (CSUR) 54, 2 (2021), 1–38.
  177. Practical black-box attacks against machine learning. In ASIACCS (2017).
  178. Carla: a python library to benchmark algorithmic recourse and counterfactual explanation algorithms. arXiv preprint arXiv:2108.00783 (2021).
  179. An adaptive approach for index tuning with learning classifier systems on hybrid storage environments. In HAIS (2018).
  180. Rodi: A benchmark for automatic mapping generation in relational-to-ontology data integration. In ESWC (2015).
  181. Data quality assessment. Communications of the ACM 45, 4 (2002), 211–218.
  182. Tpc-di: the first industry benchmark for data integration. In VLDB (2014).
  183. What can data-centric ai learn from data and ml engineering? arXiv preprint arXiv:2112.06439 (2021).
  184. Face: feasible and actionable counterfactual explanations. In AAAI (2020).
  185. Press, G. Cleaning big data: Most time-consuming, least enjoyable data science task, survey says, Oct 2022.
  186. Using random undersampling to alleviate class imbalance on tweet sentiment data. In IRI (2015).
  187. Improving language understanding by generative pre-training. OpenAI (2018).
  188. Language models are unsupervised multitask learners. OpenAI (2019).
  189. Ratner, A. Snorkel ai. Snorkel AI. Available online: https://snorkel.ai/ (accessed on 8 February 2023) (2023).
  190. Snorkel: Rapid training data creation with weak supervision. In VLDB (2017).
  191. Data programming: Creating large training sets, quickly. NeurIPS (2016).
  192. A survey of deep active learning. ACM computing surveys (CSUR) 54, 9 (2021), 1–40.
  193. Finding representative patterns with ordered projections. pattern recognition 36, 4 (2003), 1009–1018.
  194. High-resolution image synthesis with latent diffusion models. In CVPR (2022).
  195. Data quality: The role of empiricism. ACM SIGMOD Record 46, 4 (2018), 35–43.
  196. Online index selection using deep reinforcement learning for a cluster database. In ICDE Workshop (2020).
  197. Adapting visual category models to new domains. In ECCV (2010).
  198. Sliceline: Fast, linear-algebra-based slice finding for ml model debugging. In SIGMOD (2021).
  199. Feature extraction: a survey of the types, techniques, applications. In ICSC (2019).
  200. “everyone wants to do the model work, not the data work”: Data cascades in high-stakes ai. In CHI (2021).
  201. Scribbler: Controlling deep image synthesis with sketch and color. In CVPR (2017).
  202. Quantitative program slicing: Separating statements by relevance. In ICSE (2013).
  203. Saporta, G. Data fusion and data grafting. Computational statistics & data analysis 38, 4 (2002), 465–473.
  204. Automating large-scale data quality verification. In VLDB (2018).
  205. Exploiting cloze questions for few shot text classification and natural language inference. arXiv preprint arXiv:2001.07676 (2020).
  206. Few-shot text generation with pattern-exploiting training. arXiv preprint arXiv:2012.11926 (2020).
  207. It’s not just size that matters: Small language models are also few-shot learners. arXiv preprint arXiv:2009.07118 (2020).
  208. Active feature selection for the mutual information criterion. In AAAI (2021).
  209. Dc-check: A data-centric ai checklist to guide the development of reliable machine learning systems. arXiv preprint arXiv:2211.05764 (2022).
  210. Poison frogs! targeted clean-label poisoning attacks on neural networks. In NeurIPS (2018).
  211. Do image classifiers generalize across time? In ICCV (2021).
  212. Certifai: Counterfactual explanations for robustness, transparency, interpretability, and fairness of artificial intelligence models. arXiv preprint arXiv:1905.07857 (2019).
  213. Towards natural language interfaces for data visualization: A survey. arXiv preprint arXiv:2109.03506 (2021).
  214. Towards out-of-distribution generalization: A survey. arXiv preprint arXiv:2108.13624 (2021).
  215. A survey on image data augmentation for deep learning. Journal of big data 6, 1 (2019), 1–48.
  216. Text data augmentation for deep learning. Journal of big Data 8 (2021), 1–34.
  217. No subclass left behind: Fine-grained robustness in coarse-grained classification problems. In NeurIPS (2020).
  218. Data mining and machine learning to promote smart cities: A systematic review from 2000 to 2018. Sustainability 11, 4 (2019), 1077.
  219. Snowy: Recommending utterances for conversational visual analysis. In SIGCHI (2021).
  220. Beyond the imitation game: Quantifying and extrapolating the capabilities of language models. arXiv preprint arXiv:2206.04615 (2022).
  221. Data curation at scale: the data tamer system. In CIDR (2013).
  222. Data integration: The current status and the way forward. IEEE Data Eng. Bull. 41, 2 (2018), 3–9.
  223. Covariate shift adaptation by importance weighted cross validation. Journal of Machine Learning Research 8, 5 (2007).
  224. An end-to-end learning-based cost estimator. In VLDB (2019).
  225. Sutton, O. Introduction to k nearest neighbour classification and condensed nearest neighbour data reduction. University lectures, University of Leicester 1 (2012).
  226. Bring your own view: Graph neural networks for link prediction with personalized subgraph selection. In WSDM (2023).
  227. Semi-supervised consensus labeling for crowdsourcing. In SIGIR Workshop (2011).
  228. Benchmarking differentially private synthetic data generation algorithms. arXiv preprint arXiv:2112.09238 (2021).
  229. Intrusion detection model using fusion of chi-square feature selection and multi class svm. Journal of King Saud University-Computer and Information Sciences 29, 4 (2017), 462–472.
  230. Data curation with deep learning. In EDBT (2020).
  231. Data warehousing and analytics infrastructure at facebook. In SIGMOD (2010).
  232. Db2 advisor: An optimizer smart enough to recommend its own indexes. In ICDE (2000).
  233. Automatic database management system tuning through large-scale machine learning. In SIGMOD (2017).
  234. Overview of amazon web services. Amazon Web Services (2014).
  235. Blood-based transcriptomic signature panel identification for cancer diagnosis: benchmarking of feature extraction methods. Briefings in Bioinformatics 23, 5 (2022), bbac315.
  236. Deep learning for computer vision: A brief review. Computational intelligence and neuroscience 2018 (2018).
  237. Counterfactual explanations without opening the black box: Automated decisions and the gdpr. Harv. JL & Tech. 31 (2017), 841.
  238. A comparison of radial and linear charts for visualizing daily patterns. IEEE transactions on visualization and computer graphics 26, 1 (2019).
  239. Universal adversarial triggers for attacking and analyzing nlp. In IJCNLP (2019).
  240. In-processing modeling techniques for machine learning fairness: A survey. ACM Transactions on Knowledge Discovery from Data (TKDD) (2022).
  241. Wang, A. Scale ai. Scale AI. Available online: https://scale.com/ (accessed on 8 February 2023) (2023).
  242. Bed: A real-time object detection system for edge devices. In CIKM (2022), pp. 4994–4998.
  243. Accelerating shapley explanation via contributive cooperator selection. In ICML (2022).
  244. Crowder: crowdsourcing entity resolution. In VLDB (2012).
  245. Embedded unsupervised feature selection. In AAAI (2015).
  246. Usb: A unified semi-supervised learning benchmark for classification. In NeurIPS (2022).
  247. A crowdsourcing open platform for literature curation in uniprot. PLoS biology 19, 12 (2021), e3001464.
  248. Time series classification from scratch with deep neural networks: A strong baseline. In IJCNN (2017).
  249. Deep learning for biology. Nature 554, 7693 (2018), 555–557.
  250. Time series data augmentation for deep learning: A survey. In IJCAI (2021).
  251. Data collection and quality challenges in deep learning: A data-centric ai perspective. In VLDB (2023).
  252. White, T. Hadoop: The definitive guide. O'Reilly Media, Inc., 2012.
  253. Winston, P. H. Artificial intelligence. Addison-Wesley Longman Publishing Co., Inc., 1984.
  254. Voyager: Exploratory analysis via faceted browsing of visualization recommendations. IEEE transactions on visualization and computer graphics 22, 1 (2015), 649–658.
  255. Linear discriminant analysis. Robust data mining (2013), 27–33.
  256. Fairness-aware unsupervised feature selection. In CIKM (2021).
  257. Knowledge graph quality management: a comprehensive survey. IEEE Transactions on Knowledge and Data Engineering (2022).
  258. Feature selection and analysis on correlated gas sensor data with recursive feature elimination. Sensors and Actuators B: Chemical 212 (2015), 353–363.
  259. A benchmark and comparison of active learning for logistic regression. Pattern Recognition 83 (2018), 401–415.
  260. Ying, X. An overview of overfitting and its solutions. Journal of physics: Conference series 1168 (2019), 022022.
  261. Rethinking data augmentation for image super-resolution: A comprehensive analysis and a new strategy. In CVPR (2020).
  262. Searching for a search method: Benchmarking search algorithms for generating nlp adversarial examples. arXiv preprint arXiv:2009.06368 (2020).
  263. Gpt3mix: Leveraging large-scale language models for text augmentation. In EMNLP (2021).
  264. Bartscore: Evaluating generated text as text generation. In NeurIPS (2021).
  265. A survey of crowdsourcing systems. In PASSAT (2011).
  266. Apache Spark: A unified engine for big data processing. Communications of the ACM 59 (2016).
  267. Stratal slicing, part ii: Real 3-d seismic data. Geophysics 63, 2 (1998), 514–522.
  268. An evaluation-focused framework for visualization recommendation algorithms. IEEE Transactions on Visualization and Computer Graphics 28, 1 (2021), 346–356.
  269. Data-centric ai: Perspectives and challenges. arXiv preprint arXiv:2301.04819 (2023).
  270. Autoshard: Automated embedding table sharding for recommender systems. In KDD (2022).
  271. Dreamshard: Generalizable embedding table placement for recommender systems. In NeurIPS (2022).
  272. Rlcard: a platform for reinforcement learning in card games. In IJCAI (2021).
  273. Towards automated imbalanced learning with deep hierarchical reinforcement learning. In CIKM (2022).
  274. Meta-aad: Active anomaly detection with deep reinforcement learning. In ICDM (2020).
  275. Experience replay optimization. In IJCAI (2019).
  276. Simplifying deep reinforcement learning via self-supervision. arXiv preprint arXiv:2106.05526 (2021).
  277. Towards similarity-aware time-series classification. In SDM (2022).
  278. Multi-label dataless text classification with topic modeling. Knowledge and Information Systems 61 (2019), 137–160.
  279. Rank the episodes: A simple approach for exploration in procedurally-generated environments. In ICLR (2021).
  280. Autovideo: An automated video action recognition system. In IJCAI (2022).
  281. Douzero: Mastering doudizhu with self-play deep reinforcement learning. In ICML (2021).
  282. mixup: Beyond empirical risk minimization. In ICLR (2018).
  283. Self-attention generative adversarial networks. In ICML (2019).
  284. A survey on programmatic weak supervision. arXiv preprint arXiv:2202.05433 (2022).
  285. Deep learning based recommender system: A survey and new perspectives. ACM computing surveys (CSUR) 52, 1 (2019), 1–38.
  286. Facilitating database tuning with hyper-parameter optimization: a comprehensive experimental evaluation. In VLDB (2022).
  287. Active incremental feature selection using a fuzzy-rough-set-based information entropy. IEEE Transactions on Fuzzy Systems 28, 5 (2019), 901–915.
  288. Character-level convolutional networks for text classification. In NeurIPS (2015).
  289. Zhang, Z. Missing data imputation: focusing on single imputation. Annals of translational medicine 4, 1 (2016).
  290. Graph neural networks: A review of methods and applications. AI open 1 (2020), 57–81.
  291. Towards deeper graph neural networks with differentiable group normalization. In NeurIPS (2020).
  292. Dirichlet energy constrained learning for deep graph neural networks. In NeurIPS (2021).
  293. Multi-channel graph neural networks. In IJCAI (2021).
  294. Dbmind: A self-driving platform in opengauss. In VLDB (2021).
  295. Democratic co-learning. In ICTAI (2004).
  296. Aligning books and movies: Towards story-like visual explanations by watching movies and reading books. In CVPR (2015).
  297. Benchmark and survey of automated machine learning frameworks. Journal of artificial intelligence research 70 (2021), 409–472.
  298. Rethinking pre-training and self-training. In NeurIPS (2020).
Authors (7)
  1. Daochen Zha (56 papers)
  2. Zaid Pervaiz Bhat (5 papers)
  3. Kwei-Herng Lai (24 papers)
  4. Fan Yang (878 papers)
  5. Zhimeng Jiang (33 papers)
  6. Shaochen Zhong (15 papers)
  7. Xia Hu (186 papers)
Citations (143)

Summary

  • The paper presents a novel taxonomy for Data-centric AI by categorizing tasks into Training Data Development, Inference Data Development, and Data Maintenance.
  • It demonstrates that systematic data engineering—supported by automation and human collaboration—is crucial for enhancing AI performance and deployment speed.
  • The survey reviews 36 benchmarks and outlines future challenges, emphasizing the shift from model-centric to data-centric strategies in AI systems.

This paper, "Data-centric Artificial Intelligence: A Survey" (Zha et al., 2023), provides a comprehensive overview of the emerging field of Data-centric AI (DCAI). It highlights a shift in focus from solely improving machine learning models (model-centric AI) to systematically engineering the data used to build AI systems. The authors argue that while model advancements have been significant, the quality and quantity of data are vital enablers of AI success. DCAI emphasizes the systematic iteration and improvement of data throughout the AI lifecycle to achieve better performance, faster deployment, and more reliable systems.

The survey proposes a goal-driven taxonomy for DCAI, dividing tasks into three main goals: Training Data Development, Inference Data Development, and Data Maintenance. It also analyzes existing methods from the perspectives of automation and human participation (collaboration).

Training Data Development

This goal focuses on collecting and producing high-quality data for model training. Key sub-goals and tasks include:

  • Data Collection: Gathering raw data. Efficient strategies include dataset discovery (finding relevant datasets in data lakes), data integration (combining data from different sources, often involving schema matching and value transformation), and raw data synthesis (generating data with desired patterns, e.g., synthetic anomalies). Domain knowledge is crucial here. Practical implementations leverage graph-based methods or machine learning for discovery and integration, and programmatic or learning-based techniques for synthesis.
  • Data Labeling: Assigning labels to data. This is essential for supervised learning and for fine-tuning models pre-trained without supervision. Efficient strategies reduce human effort:
    • Crowdsourced labeling: Distributing tasks to many annotators, with methods to improve consistency and quality (e.g., consensus labeling, iterative refinement). Requires full human participation but with technological assistance.
    • Semi-supervised labeling: Using small labeled sets to infer labels for large unlabeled sets (e.g., self-training, graph-based methods, reinforcement learning from human feedback). Requires partial human participation for initial labels or feedback.
    • Active Learning: Iteratively selecting the most informative unlabeled samples for human annotation, often focusing on samples where the model is uncertain (an uncertainty-sampling sketch follows this list). Requires continuous, partial human participation.
    • Data Programming: Inferring labels using human-defined heuristic functions (labeling functions). Can require minimal or partial human participation depending on the need for interactive refinement. Snorkel [ratner2017snorkel] is a notable system for this; a simplified labeling-function sketch follows this list.
    • Distant Supervision: Automatically assigning labels based on external knowledge sources. An automated approach but can result in noisy labels.
  • Data Preparation: Cleaning and transforming raw data.
    • Data Cleaning: Identifying and correcting errors (missing values, duplicates, inconsistencies). Ranges from programmatic heuristics (mean/median imputation) to learning-based methods (predictive imputation, duplicate estimation) and collaborative approaches involving human-machine workflows. Automated search for optimal cleaning strategies exists.
    • Feature Extraction: Deriving relevant features from raw data. Can be domain-specific and programmatic (e.g., texture features for images) or automated using deep learning models (e.g., CNNs). Deep learning extractors blur the data/model boundary but can be uninterpretable or amplify bias.
    • Feature Transformation: Converting features into a suitable format (e.g., normalization, standardization, log transformation). Can be programmatic or learning-based (e.g., using reinforcement learning to search for optimal transformations).
  • Data Reduction: Decreasing data complexity while retaining essential information.
    • Feature Selection: Choosing a subset of relevant features (filter, wrapper, embedded methods). Can be programmatic, learning-based, or collaborative (active feature selection). Reduces dimensionality, improves efficiency, and can enhance interpretability.
    • Dimensionality Reduction: Transforming high-dimensional features into a lower-dimensional space (e.g., PCA [abdi2010principal], LDA [xanthopoulos2013linear], autoencoders [bank2020autoencoders]). Typically automated learning-based methods.
    • Instance Selection: Selecting a representative subset of samples (filter or wrapper methods). Can be programmatic (e.g., random undersampling [prusa2015using]) or learning-based (e.g., using reinforcement learning for undersampling [liu2020mesa]). Useful for efficiency and handling class imbalance.
  • Data Augmentation: Artificially increasing data size and diversity.
    • Basic Manipulation: Making minor changes to existing data (e.g., rotation, scaling, Mixup [zhang2018mixup] for images, permutation/jittering for time series). Can be programmatic or learning-based (e.g., AutoAugment [cubuk2019autoaugment] searches for policies). A minimal Mixup sketch follows this list.
    • Augmentation Data Synthesis: Generating new samples by learning the data distribution (e.g., GANs [goodfellow2020generative], VAEs [hsu2017unsupervised], diffusion models [ho2020denoising, ho2022cascaded]). Typically learning-based.
    • Upsampling: Specifically augmenting minority classes to address imbalance (e.g., SMOTE [chawla2002smote], ADASYN [he2008adasyn], learning-based methods like AutoSMOTE [zha2022towards]). Can be programmatic or learning-based.
  • Pipeline Search: Automatically searching for optimal combinations of sequential data processing tasks (e.g., AutoSklearn [feurer2015efficient], AlphaD3M [drori2021alphad3m], Deepline [heffetz2020deepline]). A trend towards automating the end-to-end data preparation workflow.
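
To make the data-programming idea concrete, here is a minimal sketch in the spirit of labeling functions: a few hand-written heuristics vote on unlabeled text, and the votes are combined by simple majority. The heuristics, label constants, and majority-vote combiner are illustrative assumptions; systems such as Snorkel additionally learn a model of labeling-function accuracies and correlations rather than taking a plain majority vote.

```python
# Minimal data-programming sketch: weak heuristics vote on unlabeled text.
# The labeling functions and the majority-vote combiner are illustrative;
# real systems (e.g., Snorkel) also model labeling-function accuracies.
from collections import Counter

ABSTAIN, SPAM, HAM = -1, 1, 0

def lf_contains_link(text):            # heuristic: links often indicate spam
    return SPAM if "http" in text.lower() else ABSTAIN

def lf_money_words(text):              # heuristic: money-related keywords
    return SPAM if any(w in text.lower() for w in ("free", "winner", "$$$")) else ABSTAIN

def lf_short_reply(text):              # heuristic: very short replies are usually benign
    return HAM if len(text.split()) <= 3 else ABSTAIN

LABELING_FUNCTIONS = [lf_contains_link, lf_money_words, lf_short_reply]

def weak_label(text):
    """Aggregate labeling-function votes by majority, ignoring abstentions."""
    votes = [lf(text) for lf in LABELING_FUNCTIONS if lf(text) != ABSTAIN]
    return Counter(votes).most_common(1)[0][0] if votes else ABSTAIN

unlabeled = ["Click http://win.example for a FREE prize", "ok thanks", "see you tomorrow at noon"]
print([weak_label(t) for t in unlabeled])   # e.g., [1, 0, -1]
```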
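
Uncertainty sampling, one common active-learning acquisition strategy, can likewise be sketched in a few lines. The toy data pool, the entropy criterion, and the scikit-learn classifier below are illustrative choices rather than specifics from the survey.

```python
# Uncertainty-based active learning sketch: repeatedly query the pool items
# the current model is least sure about, ask a human for labels, retrain.
import numpy as np
from sklearn.linear_model import LogisticRegression

rng = np.random.default_rng(0)
X_pool = rng.normal(size=(500, 5))                          # unlabeled pool (toy data)
true_w = np.array([1.5, -2.0, 0.5, 0.0, 1.0])
y_pool = (X_pool @ true_w + 0.3 * rng.normal(size=500) > 0).astype(int)

labeled_idx = list(rng.choice(500, size=20, replace=False))  # small labeled seed set

def oracle(i):                                               # stand-in for a human annotator
    return y_pool[i]

model = LogisticRegression()
for _ in range(5):                                           # 5 acquisition rounds
    model.fit(X_pool[labeled_idx], [oracle(i) for i in labeled_idx])
    proba = model.predict_proba(X_pool)
    entropy = -(proba * np.log(proba + 1e-12)).sum(axis=1)   # prediction uncertainty
    entropy[labeled_idx] = -np.inf                           # never re-query labeled items
    labeled_idx.append(int(np.argmax(entropy)))              # query the most uncertain sample

print(f"labeled {len(labeled_idx)} samples after 5 rounds")
```

Entropy is only one acquisition function; margin- or committee-based criteria slot into the same loop.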
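
Basic manipulation methods are often only a few lines of code. The following sketch implements Mixup-style interpolation [zhang2018mixup] for a toy batch with one-hot labels; the batch construction and the Beta parameter are illustrative.

```python
# Mixup sketch: convex combinations of random example pairs and their labels.
import numpy as np

def mixup_batch(X, y_onehot, alpha=0.2, rng=np.random.default_rng(0)):
    """Return a mixed batch (X', y') of the same size as the input batch."""
    lam = rng.beta(alpha, alpha, size=(len(X), 1))      # one mixing ratio per pair
    perm = rng.permutation(len(X))                      # random partner for each sample
    X_mix = lam * X + (1.0 - lam) * X[perm]
    y_mix = lam * y_onehot + (1.0 - lam) * y_onehot[perm]
    return X_mix, y_mix

X = np.random.rand(8, 4)                                # toy batch: 8 samples, 4 features
y = np.eye(2)[np.random.randint(0, 2, size=8)]          # one-hot labels for 2 classes
X_mix, y_mix = mixup_batch(X, y)
print(X_mix.shape, y_mix.shape)                         # (8, 4) (8, 2)
```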

Inference Data Development

This goal involves creating data to evaluate trained models or unlock model capabilities.

  • In-distribution Evaluation: Generating samples conforming to the training distribution for detailed model assessment.
    • Data Slicing: Partitioning data into sub-populations (slices) to evaluate performance on specific groups. Can be manual (based on predefined criteria) or automated (e.g., SliceFinder [chung2019slice] discovers problematic slices where the model performs poorly). A per-slice accuracy sketch follows this list.
    • Algorithmic Recourse: Generating hypothetical samples that would change a model's decision (counterfactuals). Helps understand decision boundaries and fairness. Methods vary based on model access (white-box vs. black-box) and often involve optimization or search. Requires minimal human participation (user specifies desired outcome).
  • Out-of-distribution Evaluation: Generating samples differing from the training distribution to assess robustness and generalizability.
    • Generating Adversarial Samples: Creating inputs intentionally designed to cause incorrect predictions (e.g., adding perturbations). Ranges from manual perturbations to automated white-box, black-box, or poisoning attacks using optimization or learning-based methods. Crucial for understanding model security. A signed-gradient (FGSM-style) sketch follows this list.
    • Generating Samples with Distribution Shift: Creating evaluation sets where the data distribution changes (e.g., covariate shift, label shift, or general shift). Can involve collecting real-world data with shifts [koh2021wilds] or synthesizing data with specific shifts.
  • Prompt Engineering: Designing effective input prompts for large models to achieve desired outputs without model fine-tuning. Can be manual template creation or automated using programmatic methods (mining corpora) or learning-based methods (gradient-based search, generative models).
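
Manual data slicing amounts to computing the evaluation metric per sub-population. The sketch below groups a toy evaluation set by a categorical attribute and reports per-slice accuracy; the column names and the flagging rule are illustrative and are not taken from SliceFinder.

```python
# Data-slicing sketch: per-group accuracy on an evaluation set held in pandas.
import pandas as pd

eval_df = pd.DataFrame({
    "age_group": ["<30", "<30", "30-60", "30-60", "60+", "60+", "60+"],
    "label":     [1, 0, 1, 1, 0, 1, 0],
    "pred":      [1, 0, 1, 0, 1, 1, 0],
})

slice_acc = (
    eval_df.assign(correct=lambda d: (d.label == d.pred).astype(float))
           .groupby("age_group")["correct"].mean()
           .sort_values()
)
print(slice_acc)                                        # lowest-accuracy slices surface first
underperforming = slice_acc[slice_acc < slice_acc.mean()]
print("flagged slices:", list(underperforming.index))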
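
White-box adversarial perturbations can be illustrated without an autodiff framework when the model is simple. The sketch below applies an FGSM-style signed-gradient step to a hand-specified logistic-regression model; the weights, input, and perturbation budget are illustrative, and FGSM is named here as one representative attack rather than a method prescribed by the survey.

```python
# FGSM-style white-box attack sketch on a logistic-regression "model".
# For logistic regression, the gradient of the loss w.r.t. the input is analytic,
# so no autodiff library is needed for this toy example.
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

w = np.array([2.0, -1.0, 0.5])          # fixed "trained" weights (illustrative)
b = -0.1
x = np.array([0.2, 0.4, 0.1])           # clean input, true label y = 1
y = 1.0

p = sigmoid(w @ x + b)
grad_x = (p - y) * w                    # d(binary cross-entropy)/dx for this model

eps = 0.1                               # perturbation budget
x_adv = x + eps * np.sign(grad_x)       # step in the direction that increases the loss

print("clean prob:", round(float(p), 3),
      "adversarial prob:", round(float(sigmoid(w @ x_adv + b)), 3))
```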

Data Maintenance

This goal focuses on ensuring data quality and reliability in dynamic environments.

  • Data Understanding: Gaining insights into complex data.
    • Data Visualization: Presenting data graphically for human comprehension (visual summarization like charts, clustering for visualization, automated visualization recommendation). Can be manual or automated with varying degrees of human feedback.
    • Data Valuation: Quantifying the contribution of individual data points to model performance (e.g., using Shapley values). Typically involves learning-based algorithms for efficient estimation. A Monte Carlo approximation sketch follows this list.
  • Data Quality Assurance: Monitoring and improving data quality.
    • Quality Assessment: Developing metrics to measure data quality (objective metrics such as accuracy, timeliness, consistency, and completeness; subjective metrics such as trustworthiness and understandability). Objective metrics are typically collected with minimal human input, while subjective ones require more human participation. A metric-computation sketch follows this list.
    • Quality Improvement: Strategies to enhance data quality (e.g., enforcing constraints, correcting errors). Ranges from programmatic automation to learning-based validation modules and pipeline automation. Collaborative approaches involve human feedback for continuous improvement.
  • Data Storage & Retrieval: Building efficient systems for data access.
    • Resource Allocation: Managing memory and computational resources (e.g., optimizing throughput, latency). Can be programmatic (rule-based tuning) or learning-based (e.g., self-tuning systems like Starfish [herodotou2011starfish], OtterTune [van2017automatic]).
    • Query Acceleration: Speeding up data retrieval. Includes query index selection (choosing optimal indexing schemes using programmatic or learning-based search) and query rewriting (optimizing queries by identifying repeated parts, using rule-based or learning-based methods).
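
Shapley-based data valuation is typically approximated with Monte Carlo sampling over permutations of the training set. The sketch below retrains a tiny logistic-regression model along random permutations and averages marginal contributions; the toy dataset, the 0.5 utility for unusable subsets, and the permutation count are illustrative simplifications of methods such as Data Shapley.

```python
# Monte Carlo Shapley-style data valuation sketch.
# Each point's value: its average marginal contribution to validation accuracy
# when appended to a random prefix of the other training points.
import numpy as np
from sklearn.linear_model import LogisticRegression

rng = np.random.default_rng(0)
X_train = rng.normal(size=(30, 2))
y_train = (X_train[:, 0] + X_train[:, 1] > 0).astype(int)
y_train[:3] = 1 - y_train[:3]                      # corrupt three points (should get low value)
X_val = rng.normal(size=(200, 2))
y_val = (X_val[:, 0] + X_val[:, 1] > 0).astype(int)

def utility(idx):
    """Validation accuracy of a model trained on subset idx (0.5 if unusable)."""
    if len(idx) < 2 or len(set(y_train[list(idx)])) < 2:
        return 0.5
    model = LogisticRegression().fit(X_train[list(idx)], y_train[list(idx)])
    return model.score(X_val, y_val)

n, n_perm = len(X_train), 50
values = np.zeros(n)
for _ in range(n_perm):                            # average marginal contributions
    perm = rng.permutation(n)
    prev_u, prefix = 0.5, []
    for i in perm:
        prefix.append(i)
        u = utility(prefix)
        values[i] += u - prev_u
        prev_u = u
values /= n_perm

print("lowest-valued points:", np.argsort(values)[:3])  # typically the corrupted ones
```

Truncating each permutation once the utility stops changing, as done in practice, keeps this tractable for larger datasets.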
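
Objective quality metrics can usually be computed programmatically straight from the stored table. The sketch below computes completeness, key uniqueness, and a timeliness ratio on a toy DataFrame; the column names and the freshness cutoff are illustrative assumptions.

```python
# Objective data-quality metrics sketch: completeness, uniqueness, timeliness.
import pandas as pd

df = pd.DataFrame({
    "user_id":    [1, 2, 2, 4, None],
    "email":      ["a@x.com", None, "b@x.com", "d@x.com", "e@x.com"],
    "updated_at": pd.to_datetime(
        ["2024-01-02", "2023-06-01", "2024-01-10", "2022-03-15", "2024-01-11"]),
})

completeness = 1.0 - df.isna().mean().mean()                 # share of non-missing cells
uniqueness = df["user_id"].dropna().is_unique                # duplicate key check
fresh_cutoff = pd.Timestamp("2024-01-01")                    # illustrative freshness window
timeliness = (df["updated_at"] >= fresh_cutoff).mean()       # share of recently updated rows

print(f"completeness={completeness:.2f}, unique keys={uniqueness}, timeliness={timeliness:.2f}")
```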

Data Benchmarks

The paper surveys existing data benchmarks across these tasks, distinguishing them from model benchmarks. It analyzes 36 collected benchmarks, noting that most come from the AI domain, that tabular and image data are the most benchmarked modalities, and that training data development has received the most benchmarking attention.

Discussion and Future Directions

The survey answers its initial research questions, confirming the necessity of various DCAI tasks, the importance of automation (from programmatic to pipeline levels), and the essential role of human participation (from full to minimal) for aligning AI systems with human intentions. It highlights significant progress but also identifies open challenges.

Future directions include:

  • Cross-task Automation: Developing unified frameworks to automate tasks across different DCAI goals.
  • Data-Model Co-design: Jointly designing data strategies and models, recognizing the blurring boundary between data and models (especially with foundation models) and their co-evolution.
  • Debiasing Data: More research on mitigating biases in data through training data methods, creating evaluation data to expose unfairness, and maintaining fairness dynamically.
  • Tackling Data in Various Modalities: Focusing more research on modalities beyond tabular and image data, like time-series and graph data, which have unique challenges.
  • Data Benchmarks Development: Creating more unified and comprehensive benchmarks to accelerate research, similar to how model benchmarks have driven model-centric AI.

In conclusion, the paper posits that data will increasingly be central to building effective AI systems, but significant challenges remain, encouraging further research and collaborative initiatives in this field.
