A Survey on Data Collection for Machine Learning: a Big Data -- AI Integration Perspective (1811.03402v2)

Published 8 Nov 2018 in cs.LG and stat.ML

Abstract: Data collection is a major bottleneck in machine learning and an active research topic in multiple communities. There are largely two reasons data collection has recently become a critical issue. First, as machine learning is becoming more widely-used, we are seeing new applications that do not necessarily have enough labeled data. Second, unlike traditional machine learning, deep learning techniques automatically generate features, which saves feature engineering costs, but in return may require larger amounts of labeled data. Interestingly, recent research in data collection comes not only from the machine learning, natural language, and computer vision communities, but also from the data management community due to the importance of handling large amounts of data. In this survey, we perform a comprehensive study of data collection from a data management point of view. Data collection largely consists of data acquisition, data labeling, and improvement of existing data or models. We provide a research landscape of these operations, provide guidelines on which technique to use when, and identify interesting research challenges. The integration of machine learning and data management for data collection is part of a larger trend of Big data and AI integration and opens many opportunities for new research.

A Survey on Data Collection for Machine Learning from a Big Data and AI Integration Perspective

The paper "A Survey on Data Collection for Machine Learning: A Big Data - AI Integration Perspective" surveys the challenges and techniques of data collection for ML, with an emphasis on the integration of Big Data and AI. This integration matters increasingly because modern ML models, especially deep learning models, require large volumes of data.

Overview

Data collection is identified as a bottleneck in the ML pipeline. The scarcity of labeled data for newer ML applications, coupled with the growing need for larger datasets to fuel deep learning models, underscores this challenge. Traditional applications benefit from decades of data accumulation, but emerging applications often begin without adequate data. The paper categorizes data collection into three primary operations: data acquisition, data labeling, and the enhancement of existing data and models.

Data Acquisition

Data acquisition involves discovering, augmenting, or generating datasets. The paper explores:

Data Discovery: Involves sharing and searching datasets. Platforms like DataHub and Google Fusion Tables illustrate collaborative and web-based data sharing, while systems like IBM's data lakes and Google's GOODS exemplify data searching in corporate and web environments.
Data Augmentation: Enhances datasets through external additions, such as pretrained embeddings and data integration.
Data Generation: Covers synthetic data generation techniques and the role of crowdsourcing in creating new datasets.
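
As an illustration of the data integration form of augmentation mentioned above, the following is a minimal sketch, not a method taken from the paper. It enriches a base dataset with attributes from an external table by joining on a shared key. The function name `augment_with_external`, the field names, and the toy records are all hypothetical.

```python
from typing import Any, Dict, List

Record = Dict[str, Any]


def augment_with_external(base: List[Record], external: List[Record], key: str) -> List[Record]:
    """Left-join external attributes onto base records that share the same key value."""
    # Index the external table by key for constant-time lookups.
    lookup = {row[key]: row for row in external}
    augmented = []
    for row in base:
        extra = lookup.get(row[key], {})
        # Base values take precedence when both tables define the same field.
        augmented.append({**extra, **row})
    return augmented


# Toy data, invented purely for illustration.
base = [{"item": "a", "clicks": 12}, {"item": "b", "clicks": 3}]
external = [{"item": "a", "category": "books"}]

result = augment_with_external(base, external, "item")
print(result)
```

Records without a match in the external table are kept unchanged, so the augmented dataset never has fewer rows than the base dataset.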

Data Labeling

Data labeling is critical for transforming raw data into valuable training material. The survey explores:

Existing Labels Utilization: Techniques like self-labeling (semi-supervised learning) use the labels that are already available to predict labels for the remaining unlabeled examples.
Crowd-based Techniques: Active learning and crowdsourcing are highlighted for their roles in efficient data labeling, though they differ in their reliance on user expertise and task scalability.
Weak Supervision: Methods like data programming generate large quantities of less accurate labels, for example by combining multiple labeling functions, as a cheaper alternative to labeling every example manually.
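
The weak supervision idea can be sketched with a few hand-written labeling functions whose votes are combined. This is a simplified illustration under stated assumptions: data programming as usually described learns how accurate each labeling function is, whereas the sketch below swaps in a plain majority vote. The labeling functions, the `spam`/`ham` task, and the name `weak_label` are hypothetical.

```python
from collections import Counter
from typing import Optional

ABSTAIN = None  # a labeling function returns this when it has no opinion


def lf_contains_free(text: str) -> Optional[str]:
    return "spam" if "free" in text.lower() else ABSTAIN


def lf_contains_meeting(text: str) -> Optional[str]:
    return "ham" if "meeting" in text.lower() else ABSTAIN


def lf_many_exclamations(text: str) -> Optional[str]:
    return "spam" if text.count("!") >= 3 else ABSTAIN


LABELING_FUNCTIONS = [lf_contains_free, lf_contains_meeting, lf_many_exclamations]


def weak_label(text: str) -> Optional[str]:
    """Combine labeling-function votes by majority; abstain on ties or when no function fires."""
    votes = [lf(text) for lf in LABELING_FUNCTIONS]
    votes = [v for v in votes if v is not ABSTAIN]
    if not votes:
        return ABSTAIN
    ranked = Counter(votes).most_common()
    if len(ranked) > 1 and ranked[0][1] == ranked[1][1]:
        return ABSTAIN  # conflicting votes with no clear winner
    return ranked[0][0]


for message in ["Free prize!!!", "Meeting at noon", "hello", "Free meeting"]:
    print(repr(message), "->", weak_label(message))
```

Examples on which every function abstains, or on which the votes tie, stay unlabeled. That is the trade-off weak supervision accepts: noisier and partial coverage in exchange for labels that cost far less than manual annotation.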

Improving Existing Data and Models

The paper also discusses methods to improve existing datasets and models when new data acquisition may not be feasible:

Data Improvements: Involves cleaning data to address noise and inaccuracies, so that models trained on the data are more reliable.
Model Enhancements: Discusses making models resilient to data noise and leveraging transfer learning to adapt existing models for new applications.
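
To make the data cleaning step concrete, here is a minimal sketch assuming tabular records stored as dictionaries. It drops records with missing required fields and removes exact duplicates. These two rules are only illustrative; they are not the specific cleaning techniques the survey covers, and the name `clean_records` and the toy data are hypothetical.

```python
from typing import Any, Dict, List

Record = Dict[str, Any]


def clean_records(records: List[Record], required_fields: List[str]) -> List[Record]:
    """Drop records with missing required fields, then drop exact duplicates (first copy wins)."""
    seen = set()
    cleaned = []
    for rec in records:
        # Treat None and the empty string as missing values.
        if any(rec.get(field) in (None, "") for field in required_fields):
            continue
        # An order-independent fingerprint of the record; values must be hashable.
        fingerprint = tuple(sorted(rec.items()))
        if fingerprint in seen:
            continue
        seen.add(fingerprint)
        cleaned.append(rec)
    return cleaned


# Toy data, invented purely for illustration.
raw = [
    {"text": "good product", "label": "pos"},
    {"text": "good product", "label": "pos"},  # exact duplicate
    {"text": "bad service", "label": None},    # missing label
    {"text": "okay", "label": "neu"},
]

cleaned = clean_records(raw, required_fields=["text", "label"])
print(cleaned)
```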

Implications and Future Directions

The convergence of machine learning and data management opens avenues for more integrated approaches to data collection. Future challenges include:

Developing better methods to evaluate the sufficiency and selection of data.
Balancing accuracy and scalability in labeling techniques.
Improving human-computer collaboration in data labeling and programming.
Expanding empirical evaluations to effectively compare techniques across applications.
Generalizing specific methods to wider contexts while fostering integration across collection processes.

Conclusion

The survey presents the integration of Big Data and AI as part of a larger trend in how the data demands of modern machine learning are met. It provides a research landscape of data collection techniques and guidelines on which technique to use when, with attention to the balance between efficiency, scalability, and accuracy across applications.

Authors

Yuji Roh
Geon Heo
Steven Euijong Whang

Citations (628)