Consistency-based Semi-supervised Active Learning: Towards Minimizing Labeling Cost
This paper proposes an approach to Active Learning (AL) that integrates Semi-supervised Learning (SSL), aiming to minimize labeling cost while improving model performance. The authors highlight two main contributions: an SSL training procedure that exploits both labeled and unlabeled data, and a novel consistency-based sample selection metric that aligns with the training objective.
Methodology and Contributions
The authors propose a framework called consistency-based semi-supervised active learning (CSSL-AL) to pursue these goals. Unlike conventional AL methods, which typically disregard the available unlabeled data during training, this method leverages it directly, reducing the number of labels needed. The framework unifies model training and sample selection from the unlabeled pool, iteratively refining the selection as model predictions improve.
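To make this loop structure concrete, here is a minimal sketch of the unified rounds, written in plain Python. The callables `train_ssl` (the SSL training step) and `score` (the inconsistency metric) are hypothetical stand-ins, sketched after the contribution list below; the paper's exact procedure may differ in details such as batching and stopping criteria.

```python
def active_learning_loop(model, labeled, pool, oracle, train_ssl, score,
                         rounds=5, budget=100):
    """Run `rounds` of unified SSL training + consistency-based selection.

    labeled: list of (x, y) pairs; pool: list of unlabeled inputs x;
    oracle(x) -> y stands in for the human annotator.
    """
    for _ in range(rounds):
        # 1. Semi-supervised training on all available data.
        train_ssl(model, labeled, pool)
        # 2. Rank the pool by prediction inconsistency under augmentation
        #    (higher score = harder for SSL to resolve without a label).
        ranked = sorted(range(len(pool)),
                        key=lambda i: score(model, pool[i]), reverse=True)
        # 3. Query labels for the most inconsistent samples and move them
        #    into the labeled set.
        picked, rest = ranked[:budget], ranked[budget:]
        labeled = labeled + [(pool[i], oracle(pool[i])) for i in picked]
        pool = [pool[i] for i in rest]
    return model, labeled, pool
```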
Key Contributions:
- Semi-supervised AL framework: A unified approach in which the model is optimized with an SSL objective that uses both labeled and unlabeled data. The integration relies on consistency-based loss terms during training that encourage the model to make coherent predictions on differently augmented versions of the same input (see the training-step sketch after this list).
- Consistency-based sample selection metric: A novel metric that measures the variability of predictions across augmentations of each unlabeled sample. Samples with highly inconsistent predictions are selected for labeling, on the grounds that SSL alone cannot resolve them (see the scoring sketch after this list).
- Analysis of start size in AL: The authors address the often-overlooked cold-start problem, introducing a measure that empirically correlates with the AL target loss and can help practitioners choose a suitable initial label set size without a predefined validation set.
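As an illustration of the first contribution, the following is a minimal PyTorch sketch of a consistency-regularized training step. The specific loss form (mean squared deviation between predictions on augmented views) and the names `augment` and `lam` are illustrative assumptions, not the paper's exact formulation.

```python
import torch
import torch.nn.functional as F

def consistency_loss(model, x_unlab, augment, n_views=2):
    # Predictions on several randomly augmented views of the same batch.
    probs = [F.softmax(model(augment(x_unlab)), dim=1) for _ in range(n_views)]
    mean_p = torch.stack(probs).mean(dim=0)
    # Penalize each view's squared deviation from the mean prediction.
    return sum(((p - mean_p) ** 2).sum(dim=1).mean() for p in probs) / n_views

def ssl_step(model, optimizer, x_lab, y_lab, x_unlab, augment, lam=1.0):
    """One training step: supervised cross-entropy on labeled data plus a
    weighted consistency term on unlabeled data."""
    optimizer.zero_grad()
    loss = (F.cross_entropy(model(x_lab), y_lab)
            + lam * consistency_loss(model, x_unlab, augment))
    loss.backward()
    optimizer.step()
    return loss.item()
```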
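And for the second contribution, a matching sketch of the selection score: the variance of predicted class probabilities across random augmentations, summed over classes. Again, the exact functional form in the paper may differ; this shows the general shape of a consistency-based acquisition score.

```python
import torch
import torch.nn.functional as F

@torch.no_grad()
def inconsistency(model, x, augment, n_views=8):
    """Per-sample acquisition score: variance of the predicted class
    probabilities across `n_views` random augmentations, summed over
    classes. Unstable predictions -> high score -> query a label."""
    model.eval()
    probs = torch.stack([F.softmax(model(augment(x)), dim=1)
                         for _ in range(n_views)])   # (n_views, batch, classes)
    return probs.var(dim=0).sum(dim=1)               # (batch,)
```

Because this score reuses the same augmentation-consistency signal as the training loss, selection is directly aligned with the training objective, which is the central design point of the paper. A version of this function bound to a fixed `augment` (e.g., via `functools.partial`) could serve as the `score` callable in the earlier loop sketch.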
Experimental Results and Discussion
Extensive experiments are conducted on the CIFAR-10, CIFAR-100, and ImageNet benchmarks. The CSSL-AL framework significantly outperforms both straightforward combinations of SSL and AL baselines and recent advanced AL methods.
Notably, the proposed method reaches comparable or better performance with far less labeled data, in some cases halving the labeling requirement: on CIFAR-10, for example, it matches the performance of baselines trained on 4,000 labeled images using only 2,000.
The authors further investigate the properties of their sample selection metric, demonstrating that the consistency-based criterion inherently balances sample uncertainty and diversity, two critical properties that most AL methods target separately.
Implications and Future Directions
The integration of SSL within the AL process opens possibilities for more efficient learning, aligning with practical scenarios where labeling costs are prohibitively high. This is especially valuable in domains that require expert annotation, such as medical imaging, where it can substantially reduce labor and expense.
The paper invites further work on determining the optimal start size in CSSL-AL, prompting new studies of robust ways to track model convergence and guide active sample selection. Future research could also scale the framework to broader tasks within AI, potentially combining it with advances in unsupervised representation learning that align with the consistency-based criterion.
In light of these findings, CSSL-AL stands as a promising direction for the application of advanced machine learning techniques in resource-constrained environments, advocating for further exploration and validation across diverse domains.