Active Learning Through a Covering Lens (2205.11320v3)

Published 23 May 2022 in cs.LG

Abstract: Deep active learning aims to reduce the annotation cost for the training of deep models, which is notoriously data-hungry. Until recently, deep active learning methods were ineffectual in the low-budget regime, where only a small number of examples are annotated. The situation has been alleviated by recent advances in representation and self-supervised learning, which impart the geometry of the data representation with rich information about the points. Taking advantage of this progress, we study the problem of subset selection for annotation through a "covering" lens, proposing ProbCover - a new active learning algorithm for the low budget regime, which seeks to maximize Probability Coverage. We then describe a dual way to view the proposed formulation, from which one can derive strategies suitable for the high budget regime of active learning, related to existing methods like Coreset. We conclude with extensive experiments, evaluating ProbCover in the low-budget regime. We show that our principled active learning strategy improves the state-of-the-art in the low-budget regime in several image recognition benchmarks. This method is especially beneficial in the semi-supervised setting, allowing state-of-the-art semi-supervised methods to match the performance of fully supervised methods, while using much fewer labels nonetheless. Code is available at https://github.com/avihu111/TypiClust.

Authors (4)
  1. Ofer Yehuda (1 paper)
  2. Avihu Dekel (8 papers)
  3. Guy Hacohen (12 papers)
  4. Daphna Weinshall (31 papers)
Citations (38)

Summary

Active Learning Through a Covering Lens: Analysis and Applications

The paper focuses on enhancing deep active learning (AL), with particular emphasis on the challenges of the low-budget regime, where only a small subset of the data can be annotated. This regime is significant because querying strategies tend to be ineffectual when very few labels are available, a failure often termed the "cold start" problem. The authors introduce ProbCover, a novel AL strategy that selects data points for annotation by maximizing probability coverage under specific geometric assumptions.

In recent years, representation and self-supervised learning have provided robust tools for analyzing the geometric structure of data, enabling new and efficient subset-selection strategies for AL. The paper leverages these advances to propose a theoretical framework that analyzes AL strategies within embedding spaces and yields dual formulations suited to the low- and high-budget regimes respectively.

At the core of the ProbCover method is the Max Probability Cover problem: choose a budget-sized set of points such that the union of fixed-radius balls centered at them covers the largest possible probability mass of the data distribution. This objective arises from minimizing an upper bound on the generalization error of a nearest-neighbor classifier, under the assumption that classes are reasonably separated in the data's semantic embedding space.
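Concretely, the objective can be written as follows (a sketch reconstructed from the description above; $\mathbb{X}$ denotes the unlabeled pool, $b$ the annotation budget, and $\delta$ the fixed ball radius):

$$\max_{L \subseteq \mathbb{X},\; |L| = b} \;\; \Pr_{x \sim \mathcal{D}}\Big[\, x \in \bigcup_{c \in L} B_\delta(c) \,\Big], \qquad B_\delta(c) = \{\, x : \lVert x - c \rVert \le \delta \,\},$$

with the probability estimated empirically as the fraction of pool points that fall inside the union of balls.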

The paper argues that methods like Coreset, which seek complete data coverage while minimizing the cover's size, are better suited to high-budget AL, whereas the low-budget setting is better served by maximizing coverage probability under a fixed budget, as ProbCover does. This duality highlights the different optimization problems inherent in high- versus low-budget strategies, each with its own computational considerations.
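One plausible formalization of this duality, in the same notation as above:

$$\text{high budget (Coreset-like):}\quad \min_{L \subseteq \mathbb{X}} |L| \;\; \text{s.t.}\;\; \Pr\Big[\, x \in \bigcup_{c \in L} B_\delta(c) \,\Big] = 1; \qquad \text{low budget (ProbCover):}\quad \max_{|L| = b} \Pr\Big[\, x \in \bigcup_{c \in L} B_\delta(c) \,\Big].$$

Fixing full coverage and minimizing the number of centers is a set-cover problem; fixing the number of centers and maximizing coverage is a max-coverage problem, and greedy algorithms with well-known approximation guarantees exist for both.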

ProbCover queries samples in a self-supervised embedding space using a greedy algorithm, trading exact optimization of the (NP-hard) coverage objective for tractable computation. Empirically, it yields significant improvements over contemporary AL strategies in low-budget scenarios on CIFAR-10, CIFAR-100, Tiny-ImageNet, and ImageNet, particularly when combined with semi-supervised learning frameworks.
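A minimal sketch of such a greedy selection loop is shown below. It assumes precomputed self-supervised embeddings and follows the high-level description above (build a δ-neighborhood relation, repeatedly pick the point whose ball covers the most still-uncovered points); function and variable names are illustrative, not taken from the authors' released code.

```python
import numpy as np

def greedy_prob_cover(embeddings: np.ndarray, budget: int, delta: float) -> list[int]:
    """Greedily select `budget` indices whose delta-balls cover the most points.

    embeddings: (n, d) array of self-supervised features (assumed L2-normalized).
    """
    n = embeddings.shape[0]
    # Dense pairwise distances: O(n^2) memory, fine for small pools;
    # large pools would use batching or an approximate nearest-neighbor index.
    dists = np.linalg.norm(embeddings[:, None, :] - embeddings[None, :, :], axis=-1)
    covered_by = dists <= delta          # covered_by[i, j]: ball around i covers j
    uncovered = np.ones(n, dtype=bool)   # points not yet inside any selected ball
    selected = []
    for _ in range(budget):
        # Pick the center whose ball covers the most still-uncovered points.
        gains = covered_by[:, uncovered].sum(axis=1)
        best = int(np.argmax(gains))
        selected.append(best)
        uncovered &= ~covered_by[best]   # mark the newly covered points
    return selected
```

The ball radius δ is the method's key hyperparameter; the paper ties its choice to how pure the resulting balls are in the embedding space, consistent with the separation assumption discussed above.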

The paper makes a compelling case that careful initial labeling, guided by where the data distribution concentrates its probability mass in a learned representation, can substantially mitigate the cold-start problem. It also points to future directions: principled selection of the ball-radius hyperparameter, soft-coverage metrics, and problem-specific adaptations of the Max Probability Cover framework.

Overall, the research offers a strong theoretical basis and practical guidance for solving AL challenges, especially vital for domains relying on costly expert annotations, and contributes meaningfully to advancing active learning methodologies in low-resource settings.
