
A Survey on Deep Active Learning: Recent Advances and New Frontiers (2405.00334v2)

Published 1 May 2024 in cs.LG

Abstract: Active learning seeks to achieve strong performance with fewer training samples. It does this by iteratively asking an oracle to label new selected samples in a human-in-the-loop manner. This technique has gained increasing popularity due to its broad applicability, yet its survey papers, especially for deep learning-based active learning (DAL), remain scarce. Therefore, we conduct an advanced and comprehensive survey on DAL. We first introduce reviewed paper collection and filtering. Second, we formally define the DAL task and summarize the most influential baselines and widely used datasets. Third, we systematically provide a taxonomy of DAL methods from five perspectives, including annotation types, query strategies, deep model architectures, learning paradigms, and training processes, and objectively analyze their strengths and weaknesses. Then, we comprehensively summarize main applications of DAL in NLP, Computer Vision (CV), and Data Mining (DM), etc. Finally, we discuss challenges and perspectives after a detailed analysis of current studies. This work aims to serve as a useful and quick guide for researchers in overcoming difficulties in DAL. We hope that this survey will spur further progress in this burgeoning field.

A Comprehensive Survey on Deep Active Learning

Understanding Deep Active Learning (DAL)

Deep Active Learning (DAL) blends the strengths of deep learning (DL) with the cost-efficiency of active learning (AL). AL builds strong models from less labeled data by iteratively querying the most informative samples for labeling; adding DL to the loop brings the representation-learning power of deep networks.

Core Concepts and Definitions

At its heart, DAL involves training initial models on pre-labeled data, applying querying strategies to select uncertain or informative samples from a pool of unlabeled data, and iteratively refining the model with newly labeled data. This cycle enhances model performance without requiring vast labeled datasets. It's crucial in fields like medical imaging or speech recognition, where data labeling is excessively costly or impractical.
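The cycle described above — train on a seed set, query the most informative unlabeled point, ask the oracle, retrain — can be sketched minimally. This is a toy illustration, not the survey's method: the `oracle`, `fit`, and `query` functions are hypothetical stand-ins (a 1-D threshold classifier, with the human annotator simulated by a rule), chosen only to make the loop runnable.

```python
import random

# Toy oracle: the "human annotator" labels x as 1 iff x >= 0.5.
def oracle(x):
    return 1 if x >= 0.5 else 0

# Toy "model": a single decision threshold, fit as the midpoint between
# the smallest positive and the largest negative example seen so far.
def fit(labeled):
    ones = [x for x, y in labeled if y == 1]
    zeros = [x for x, y in labeled if y == 0]
    return (min(ones) + max(zeros)) / 2 if ones and zeros else 0.0

# Uncertainty-style query: pick the unlabeled point closest to the boundary.
def query(threshold, pool):
    return min(pool, key=lambda x: abs(x - threshold))

random.seed(0)
pool = [random.random() for _ in range(200)]          # unlabeled pool
labeled = [(0.1, oracle(0.1)), (0.9, oracle(0.9))]    # small pre-labeled seed set

for _ in range(10):                                   # the DAL loop
    threshold = fit(labeled)                          # (re)train
    x = query(threshold, pool)                        # select informative sample
    pool.remove(x)
    labeled.append((x, oracle(x)))                    # human-in-the-loop labeling

threshold = fit(labeled)
print(f"learned threshold: {threshold:.3f}")          # converges toward 0.5
```

Because each query lands near the current decision boundary, ten labels are enough to pin the threshold down far more precisely than ten randomly labeled points would.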

Methodological Framework

The survey categorizes DAL techniques along five perspectives:

  • Annotation Types: This includes traditional hard annotations (clear-cut categorical labels), soft annotations (probabilistic labels or continuous variables), and hybrid annotations (combining automated and manual labeling methods).
  • Query Strategies: These are tactics to select data points from the unlabeled pool. Common strategies include uncertainty sampling (selecting data points the model is least certain about) and diversity sampling (choosing samples that are diverse from those already labeled).
  • Model Architectures: From traditional machine learning models to more complex architectures like CNNs and Transformers, the choice of model significantly influences the active learning process.
  • Learning Paradigms: Settings such as semi-supervised learning and transfer learning can be adapted to exploit unlabeled data more effectively.
  • Training Processes: Various training schemes like traditional training, curriculum learning, and fine-tuning pre-trained models are applicable, each with specific advantages in the DAL context.
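Of the query strategies above, uncertainty sampling is the most common. One standard formulation (a sketch, not the survey's specific algorithm) ranks unlabeled samples by the Shannon entropy of the model's predictive distribution and queries the top-k; the `pool_probs` values below are hypothetical softmax outputs.

```python
import math

def entropy(probs):
    # Shannon entropy of a predictive distribution; higher = more uncertain.
    return -sum(p * math.log(p) for p in probs if p > 0)

def uncertainty_sample(pool_probs, k):
    # Rank unlabeled examples by predictive entropy, most uncertain first,
    # and return the indices of the top-k candidates to send to the oracle.
    ranked = sorted(range(len(pool_probs)),
                    key=lambda i: entropy(pool_probs[i]),
                    reverse=True)
    return ranked[:k]

# Hypothetical softmax outputs for four unlabeled samples (3 classes).
pool_probs = [
    [0.98, 0.01, 0.01],   # confident -> low entropy
    [0.34, 0.33, 0.33],   # near-uniform -> most uncertain
    [0.70, 0.20, 0.10],
    [0.50, 0.49, 0.01],
]
print(uncertainty_sample(pool_probs, 2))  # -> [1, 2]
```

Diversity sampling would instead score candidates by their distance to already-labeled points (e.g., in feature space); hybrid strategies combine both criteria.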

Practical Applications

DAL has been effectively applied across several domains:

  • NLP: In tasks like text classification and summarization, DAL helps to efficiently annotate textual data, which can be voluminous and complex.
  • Computer Vision (CV): DAL is invaluable in scenarios like image classification and object recognition, where manual labeling of vast image datasets is cumbersome.
  • Data Mining: Techniques applied in node classification or link prediction in networked data benefit from DAL by reducing the need for exhaustive labeling.

Addressing Challenges and Future Perspectives

Despite its benefits, DAL faces challenges such as:

  • Data Scarcity and Imbalance: DAL must efficiently handle scenarios where relevant data is scarce or imbalanced across categories.
  • Dependency on Initial Data: The reliance on initial labeled data can affect the efficacy of the AL model, often requiring strategic data selection from the outset.
  • Integration with Advanced DL Models: Adapting DAL strategies to work seamlessly with complex DL models remains an ongoing research area.

Conclusion

As DAL continues to evolve, it remains a key area of research due to its potential to reduce the annotation burden significantly while leveraging the growing computational power and sophistication of deep learning techniques. Future developments may focus on more robust integration strategies between DL and AL, addressing challenges in data diversity and computational efficiency.

Authors (6)
  1. Dongyuan Li
  2. Zhen Wang
  3. Yankai Chen
  4. Renhe Jiang
  5. Weiping Ding
  6. Manabu Okumura
Citations (4)