A Survey on Data Collection for Machine Learning from a Big Data and AI Integration Perspective
The paper "A Survey on Data Collection for Machine Learning: A Big Data - AI Integration Perspective" surveys the challenges and techniques of data collection for ML, with an emphasis on the integration of Big Data and AI. This integration matters more and more because modern ML models, especially deep learning models, require large volumes of data.
Overview
The paper identifies data collection as a bottleneck in the ML pipeline, for two reasons: newer ML applications have little labeled data, and deep learning models need ever larger datasets. Traditional applications benefit from decades of data accumulation, but emerging applications often begin without adequate data. The paper groups data collection into three primary operations: data acquisition, data labeling, and the improvement of existing data and models.
Data Acquisition
Data acquisition involves discovering, augmenting, or generating datasets. The paper explores:
- Data Discovery: Involves sharing and searching datasets. Platforms like DataHub and Google Fusion Tables illustrate collaborative and web-based data sharing, while systems like IBM's data lakes and Google's GOODS exemplify data searching in corporate and web environments.
- Data Augmentation: Enhances datasets through external additions, such as pretrained embeddings and data integration.
- Data Generation: Covers synthetic data generation techniques and the role of crowdsourcing in creating new datasets.
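As an illustration of data augmentation through data integration, the sketch below joins a small training table with an external table on a shared key. It is a minimal example under stated assumptions, not a method from the survey: the function name `augment_with_external`, the column names, and all values are hypothetical toy data.

```python
def augment_with_external(rows, external, key):
    """Left-join each training row with external attributes sharing `key`.

    Rows without a match in `external` are kept unchanged.
    """
    # Index the external records by key for constant-time lookup.
    index = {record[key]: record for record in external}
    augmented = []
    for row in rows:
        extra = index.get(row[key], {})
        # Values from the original row win if both sides define a column.
        augmented.append({**extra, **row})
    return augmented


# Hypothetical toy data: a labeled table and an external attribute table.
train = [
    {"city": "Austin", "label": 1},
    {"city": "Reno", "label": 0},
]
external_table = [{"city": "Austin", "population": 1000}]

result = augment_with_external(train, external_table, "city")
```

A left join is used so that no labeled example is lost when the external source lacks a match. Real data integration also has to handle entity matching and schema differences, which this sketch ignores.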
Data Labeling
Data labeling is critical for transforming raw data into valuable training material. The survey explores:
- Using Existing Labels: Semi-supervised techniques such as self-labeling use the labels that are already available to predict labels for the unlabeled examples.
- Crowd-based Techniques: Active learning selects the most informative unlabeled examples for annotators, while crowdsourcing spreads labeling tasks across many workers. The two differ in how much annotator expertise they assume and in how well they scale.
- Weak Supervision: Methods such as data programming generate large quantities of less accurate labels, trading per-label accuracy for volume.
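The weak supervision idea can be sketched with a few labeling functions whose noisy votes are combined into one label. This is a simplified illustration: data programming as usually described fits a model of the labeling functions' accuracies, whereas the sketch below swaps in a plain majority vote. The spam-detection task, the function names, and the heuristics are all hypothetical.

```python
from collections import Counter

ABSTAIN = None  # a labeling function may decline to vote


def lf_contains_free(text):
    # Heuristic: the word "free" suggests spam (label 1).
    return 1 if "free" in text.lower() else ABSTAIN


def lf_has_greeting(text):
    # Heuristic: a personal greeting suggests a normal message (label 0).
    return 0 if text.lower().startswith(("hi", "hello")) else ABSTAIN


def lf_many_exclamations(text):
    # Heuristic: three or more exclamation marks suggest spam (label 1).
    return 1 if text.count("!") >= 3 else ABSTAIN


LABELING_FUNCTIONS = [lf_contains_free, lf_has_greeting, lf_many_exclamations]


def weak_label(text):
    """Combine labeling-function votes by majority; abstain on ties or no votes."""
    votes = []
    for lf in LABELING_FUNCTIONS:
        vote = lf(text)
        if vote is not ABSTAIN:
            votes.append(vote)
    if not votes:
        return ABSTAIN
    ranked = Counter(votes).most_common()
    if len(ranked) > 1 and ranked[0][1] == ranked[1][1]:
        return ABSTAIN  # conflicting votes with no majority
    return ranked[0][0]
```

Examples that receive no label can be left out of training or routed to human annotators, which is one place where weak supervision and crowd-based labeling can be combined.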
Improving Existing Data and Models
The paper also discusses methods to improve existing datasets and models when new data acquisition may not be feasible:
- Data Improvements: Data cleaning addresses noise and inaccuracies in the data so that models trained on it are more reliable.
- Model Enhancements: Discusses making models resilient to data noise and leveraging transfer learning to adapt existing models for new applications.
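A minimal sketch of the data cleaning step, assuming a table of dictionaries with one numeric feature and one label column: it drops unlabeled rows, removes exact duplicates, and fills missing feature values with the median. The function name `clean_rows`, the column names, and the choice of median imputation are assumptions made for illustration, not steps prescribed by the survey.

```python
from statistics import median


def clean_rows(rows, feature, label):
    """Drop unlabeled rows and exact duplicates, then impute missing features."""
    seen = set()
    kept = []
    for row in rows:
        if row.get(label) is None:
            continue  # a row without a label cannot be used for supervised training
        fingerprint = tuple(sorted(row.items()))
        if fingerprint in seen:
            continue  # exact duplicate of an earlier row
        seen.add(fingerprint)
        kept.append(dict(row))  # copy so the input is not mutated

    # Median of the observed values; stays None if nothing was observed.
    observed = [r[feature] for r in kept if r.get(feature) is not None]
    fill = median(observed) if observed else None
    for r in kept:
        if r.get(feature) is None:
            r[feature] = fill
    return kept


# Hypothetical toy data with a duplicate, a missing feature, and a missing label.
raw = [
    {"x": 1.0, "y": 0},
    {"x": 1.0, "y": 0},
    {"x": None, "y": 1},
    {"x": 3.0, "y": None},
    {"x": 5.0, "y": 1},
]
cleaned = clean_rows(raw, "x", "y")
```

The median is used here because it is less sensitive to outliers than the mean. Whether imputing or dropping incomplete rows is the better choice depends on the application.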
Implications and Future Directions
The convergence of machine learning and data management opens avenues for more integrated approaches in data collection. Future challenges include:
- Developing better methods to evaluate the sufficiency and selection of data.
- Balancing accuracy and scalability in labeling techniques.
- Improving human-computer collaboration in data labeling and programming.
- Expanding empirical evaluations to effectively compare techniques across applications.
- Generalizing specific methods to wider contexts while fostering integration across collection processes.
Conclusion
The integration of Big Data and AI changes how the data demands of modern machine learning are met. The survey argues that data collection needs approaches suited to each application, and it proposes a framework for balancing efficiency, scalability, and accuracy across them.