
RETRIEVE: Coreset Selection for Efficient and Robust Semi-Supervised Learning (2106.07760v2)

Published 14 Jun 2021 in cs.LG and cs.AI

Abstract: Semi-supervised learning (SSL) algorithms have had great success in recent years in limited labeled data regimes. However, the current state-of-the-art SSL algorithms are computationally expensive and entail significant compute time and energy requirements. This can prove to be a huge limitation for many smaller companies and academic groups. Our main insight is that training on a subset of unlabeled data instead of entire unlabeled data enables the current SSL algorithms to converge faster, significantly reducing computational costs. In this work, we propose RETRIEVE, a coreset selection framework for efficient and robust semi-supervised learning. RETRIEVE selects the coreset by solving a mixed discrete-continuous bi-level optimization problem such that the selected coreset minimizes the labeled set loss. We use a one-step gradient approximation and show that the discrete optimization problem is approximately submodular, enabling simple greedy algorithms to obtain the coreset. We empirically demonstrate on several real-world datasets that existing SSL algorithms like VAT, Mean-Teacher, FixMatch, when used with RETRIEVE, achieve a) faster training times, b) better performance when unlabeled data consists of Out-of-Distribution (OOD) data and imbalance. More specifically, we show that with minimal accuracy degradation, RETRIEVE achieves a speedup of around $3\times$ in the traditional SSL setting and achieves a speedup of $5\times$ compared to state-of-the-art (SOTA) robust SSL algorithms in the case of imbalance and OOD data. RETRIEVE is available as a part of the CORDS toolkit: https://github.com/decile-team/cords.

Citations (68)

Summary

  • The paper introduces Retrieve as a framework for coreset selection that reduces computational costs in semi-supervised learning.
  • It models coreset selection as a mixed discrete-continuous bi-level optimization problem solved efficiently via greedy algorithms.
  • Experiments on CIFAR-10 and SVHN demonstrate speedups of up to 5× with only about a 0.7% accuracy drop, along with improved robustness to OOD and imbalanced unlabeled data.

Overview of "Retrieve: Coreset Selection for Efficient and Robust Semi-Supervised Learning"

The paper "Retrieve: Coreset Selection for Efficient and Robust Semi-Supervised Learning" presents a framework for improving the efficiency and robustness of semi-supervised learning (SSL). Its central idea is to train on a carefully chosen subset of the unlabeled data rather than the full set, reducing the compute time and energy requirements that are often prohibitive in current state-of-the-art SSL methods.

Framework and Methodology

Retrieve is predicated on coreset selection, where the goal is to select a representative subset of the unlabeled dataset such that training on this subset results in minimal loss over the labeled dataset. The authors model coreset selection as a mixed discrete-continuous bi-level optimization problem. To efficiently solve this, they leverage a one-step gradient approximation and prove the problem to be approximately submodular, which allows for efficient solutions via greedy algorithms. Such an approach significantly reduces training times while maintaining performance levels close to those obtained with the full dataset.
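In outline, the bi-level objective can be written as follows; the notation here is a paraphrase of the paper's formulation rather than a verbatim reproduction:

$$
S^{*} \in \operatorname*{argmin}_{S \subseteq \mathcal{U},\ |S| \le k} \mathcal{L}_s\bigl(\theta_S\bigr),
\qquad
\theta_S = \operatorname*{argmin}_{\theta}\ \mathcal{L}_s(\theta) + \lambda \sum_{j \in S} \ell_u(x_j, \theta),
$$

where $\mathcal{L}_s$ is the labeled-set loss, $\ell_u$ is the unsupervised loss on unlabeled point $x_j$, $\mathcal{U}$ is the unlabeled pool, and $k$ is the coreset budget. The one-step gradient approximation replaces the inner optimization with a single gradient step, $\theta_S \approx \theta - \alpha \nabla_\theta \bigl(\mathcal{L}_s(\theta) + \lambda \sum_{j \in S} \ell_u(x_j, \theta)\bigr)$, which makes the outer set function cheap enough to optimize greedily.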

Experimental Results

Extensive experiments conducted on real-world datasets like CIFAR-10 and SVHN demonstrate that integrating Retrieve with prominent SSL algorithms (such as VAT, Mean-Teacher, and FixMatch) results in substantial computational gains. More specifically, Retrieve provides a speed-up of approximately $3\times$ in traditional SSL settings and $5\times$ in scenarios involving imbalanced and Out-of-Distribution (OOD) data, with negligible accuracy degradation (around 0.7%).
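To make the selection mechanism described in the methodology concrete, here is a minimal, hypothetical sketch of greedy coreset selection under a one-step gradient approximation. All function and variable names are ours, and the gradients are stand-in NumPy vectors rather than real model gradients; the actual RETRIEVE implementation lives in the CORDS toolkit.

```python
import numpy as np

def greedy_coreset(grad_labeled, unlabeled_grads, k, lr=0.1):
    """Greedily pick k unlabeled examples whose one-step gradient
    updates most reduce the labeled-set loss (first-order estimate).

    grad_labeled   : gradient of the labeled loss w.r.t. parameters
    unlabeled_grads: list of per-example unsupervised-loss gradients
    """
    selected = []
    remaining = set(range(len(unlabeled_grads)))
    g_sum = np.zeros_like(grad_labeled)  # accumulated update direction
    for _ in range(k):
        best_j, best_gain = None, -np.inf
        for j in remaining:
            # One SGD step on S ∪ {j} moves theta by -lr * (g_sum + g_j),
            # so the labeled loss drops by roughly lr * <grad_labeled, g_sum + g_j>.
            gain = lr * float(np.dot(grad_labeled, g_sum + unlabeled_grads[j]))
            if gain > best_gain:
                best_j, best_gain = j, gain
        selected.append(best_j)
        g_sum += unlabeled_grads[best_j]
        remaining.remove(best_j)
    return selected

# Toy example: the example whose gradient aligns with the labeled
# gradient is picked first, the orthogonal one second.
grad_l = np.array([1.0, 0.0])
grads_u = [np.array([1.0, 0.0]), np.array([0.0, 1.0]), np.array([-1.0, 0.0])]
print(greedy_coreset(grad_l, grads_u, k=2))  # [0, 1]
```

Because the first-order gain is linear in the selected gradients, each greedy step just ranks candidates by their alignment with the labeled-loss gradient; the paper's approximate-submodularity result is what justifies trusting this greedy procedure on the true (nonlinear) objective.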

Implications and Future Directions

Retrieve's ability to reduce computational expense while maintaining robustness under various conditions makes it particularly appealing for applications where data labeling is costly or impractical, such as in specialized domains like medical imaging. Additionally, the energy and cost savings from reduced training times could democratize SSL for smaller organizations with limited resources, fostering broader adoption.

Theoretically, Retrieve advances the understanding of how submodularity can be leveraged in SSL, suggesting potential avenues for applying submodular optimization in other machine learning tasks. Future work could explore further enhancements to coreset selection methods, such as incorporating diversity criteria or dynamic subset adjustments during training to further improve performance metrics or adapt to evolving data distributions.

In conclusion, this paper introduces Retrieve as a powerful framework that addresses both efficiency and robustness in SSL, making it a valuable contribution to the field with wide-ranging applications. Continued exploration in coreset selection methodologies could yield even greater advancements in the domain of semi-supervised learning.