Data Collection and Quality Challenges in Deep Learning: A Data-Centric AI Perspective (2112.06409v3)

Published 13 Dec 2021 in cs.LG
Abstract: Data-centric AI is at the center of a fundamental shift in software engineering where machine learning becomes the new software, powered by big data and computing infrastructure. Here software engineering needs to be re-thought where data becomes a first-class citizen on par with code. One striking observation is that a significant portion of the machine learning process is spent on data preparation. Without good data, even the best machine learning algorithms cannot perform well. As a result, data-centric AI practices are now becoming mainstream. Unfortunately, many datasets in the real world are small, dirty, biased, and even poisoned. In this survey, we study the research landscape for data collection and data quality primarily for deep learning applications. Data collection is important because there is lesser need for feature engineering for recent deep learning approaches, but instead more need for large amounts of data. For data quality, we study data validation, cleaning, and integration techniques. Even if the data cannot be fully cleaned, we can still cope with imperfect data during model training using robust model training techniques. In addition, while bias and fairness have been less studied in traditional data management research, these issues become essential topics in modern machine learning applications. We thus study fairness measures and unfairness mitigation techniques that can be applied before, during, or after model training. We believe that the data management community is well poised to solve these problems.

Data Collection and Quality Challenges in Deep Learning: A Data-Centric AI Perspective

This paper presents a comprehensive examination of the data collection and quality challenges in deep learning, focusing on a data-centric AI paradigm. The authors argue that, in this paradigm, data assumes a primary role equivalent to code in traditional software development, necessitating a reevaluation of software engineering principles.

Core Challenges in Data-Centric AI

The paper identifies several key challenges in data-centric AI, particularly for deep learning: data collection, data quality, and bias and fairness, each of which affects how well machine learning models perform. The authors emphasize that the quality of input data strongly influences the performance of deep learning models and that a considerable portion of the machine learning process is spent on data preparation.

Data Collection, Quality, and Fairness

  1. Data Collection: The paper argues that data collection matters more for deep learning because recent approaches need less feature engineering but much larger amounts of data. The challenge lies in gathering datasets large enough to train these models. The strategies covered include data discovery, data augmentation with generative techniques such as GANs, and synthetic data creation.
  2. Data Quality: The paper covers data validation, cleaning, and integration, which remain central to the robustness and reliability of deep learning models. It discusses validation techniques based on schema and statistical checks, cleaning methods aimed at improving model performance, and robust model training techniques for coping with data that cannot be fully cleaned.
  3. Bias and Fairness: Biases in datasets can propagate into model predictions, so modern machine learning applications need to address fairness explicitly. This has spurred research on fairness measures and on unfairness mitigation techniques that can be applied before, during, or after model training. The paper acknowledges the growing importance of ethical AI practices and the potential role of the data management community in addressing these challenges.
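
To make the idea of a fairness measure concrete, the sketch below computes a demographic parity gap: the difference between the highest and lowest rate of positive predictions across groups. This is one commonly used measure, chosen here as an assumption; the summary does not say which measures the paper covers. The function names and the toy data are hypothetical.

```python
from collections import defaultdict


def positive_rates(predictions, groups):
    """Return the fraction of positive (1) predictions for each group."""
    totals = defaultdict(int)
    positives = defaultdict(int)
    for pred, group in zip(predictions, groups):
        totals[group] += 1
        positives[group] += int(pred == 1)
    return {g: positives[g] / totals[g] for g in totals}


def demographic_parity_gap(predictions, groups):
    """Largest difference in positive-prediction rate between any two groups.

    A gap of 0 means every group receives positive predictions at the same rate.
    """
    rates = positive_rates(predictions, groups)
    return max(rates.values()) - min(rates.values())


# Hypothetical model outputs and a hypothetical sensitive attribute.
preds = [1, 0, 1, 1, 0, 0, 1, 0]
groups = ["a", "a", "a", "a", "b", "b", "b", "b"]

rates = positive_rates(preds, groups)
gap = demographic_parity_gap(preds, groups)
print(rates, gap)
```

A measure like this can be evaluated on the labels of a dataset before training or on model predictions after training, which corresponds to the different stages at which mitigation can be applied.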

Significant Contributions

In addition to outlining these challenges, the paper surveys existing research and techniques that address data quality issues for deep learning. It explores specific methods for data validation, cleaning, and robust model training, and it highlights fairness as essential to dealing with data bias. The authors argue that the data management community should be involved in solving these problems and present a data-centric approach as a strategic necessity in contemporary AI research.
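
As a rough illustration of schema-based and statistical validation, the following sketch checks incoming records against an expected schema and flags a shift in a numeric feature relative to the training data. It is a minimal example under stated assumptions, not the method of any particular system in the survey: the schema, the field names, and the three-standard-error threshold are all hypothetical.

```python
import statistics

# Hypothetical schema: field name -> expected Python type.
SCHEMA = {"age": int, "income": float}


def schema_errors(rows, schema):
    """Return (row index, field, problem) for each missing or mistyped field."""
    errors = []
    for i, row in enumerate(rows):
        for field, expected in schema.items():
            if field not in row:
                errors.append((i, field, "missing"))
            elif not isinstance(row[field], expected):
                errors.append((i, field, "wrong type"))
    return errors


def mean_drift(train_values, new_values, max_z=3.0):
    """Flag drift when the new batch mean is more than max_z standard errors
    away from the training mean (a simple z-test style check)."""
    mu = statistics.mean(train_values)
    sigma = statistics.stdev(train_values)
    if sigma == 0:
        # Constant training feature: any change in the mean counts as drift.
        return statistics.mean(new_values) != mu
    stderr = sigma / len(new_values) ** 0.5
    return abs(statistics.mean(new_values) - mu) / stderr > max_z


# Hypothetical records: the second has a string age, the third lacks age.
rows = [
    {"age": 30, "income": 50000.0},
    {"age": "31", "income": 42000.0},
    {"income": 1.0},
]
print(schema_errors(rows, SCHEMA))

train_ages = [10, 12, 11, 13, 9, 10, 12, 11]
print(mean_drift(train_ages, [11, 10, 12, 11]))   # batch similar to training data
print(mean_drift(train_ages, [20, 21, 19, 22]))   # batch with a shifted mean
```

The schema check catches structural problems in individual records, while the statistical check catches distribution changes that no single record would reveal.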

Implications and Future Directions

In considering the practical and theoretical implications of their findings, the authors speculate on future trends in AI, such as the convergence of robust and fair training methods and holistic frameworks that orchestrate the various techniques for improving data quality. They expect that future progress in data-centric AI will require integrating data management strategies into the entire AI lifecycle, from data collection to model deployment, with fairness and robustness considered at every stage.

This paper underscores the critical role of high-quality data in deep learning and surveys data-centric strategies and methods for addressing quality, bias, and robustness concerns in AI applications. By treating data as a first-class entity in AI development, it encourages research and practice that integrate data management with deep learning.

Authors (4)
  1. Steven Euijong Whang (27 papers)
  2. Yuji Roh (11 papers)
  3. Hwanjun Song (44 papers)
  4. Jae-Gil Lee (25 papers)
Citations (241)