
Coverage-centric Coreset Selection for High Pruning Rates

Published 28 Oct 2022 in cs.LG and cs.AI (arXiv:2210.15809v2)

Abstract: One-shot coreset selection aims to select a representative subset of the training data, given a pruning rate, that can later be used to train future models while retaining high accuracy. State-of-the-art coreset selection methods pick the highest importance examples based on an importance metric and are found to perform well at low pruning rates. However, at high pruning rates, they suffer from a catastrophic accuracy drop, performing worse than even random sampling. This paper explores the reasons behind this accuracy drop both theoretically and empirically. We first propose a novel metric to measure the coverage of a dataset on a specific distribution by extending the classical geometric set cover problem to a distribution cover problem. This metric helps explain why coresets selected by SOTA methods at high pruning rates perform poorly compared to random sampling because of worse data coverage. We then propose a novel one-shot coreset selection method, Coverage-centric Coreset Selection (CCS), that jointly considers overall data coverage upon a distribution as well as the importance of each example. We evaluate CCS on five datasets and show that, at high pruning rates (e.g., 90%), it achieves significantly better accuracy than previous SOTA methods (e.g., at least 19.56% higher on CIFAR10) as well as random selection (e.g., 7.04% higher on CIFAR10) and comparable accuracy at low pruning rates. We make our code publicly available at https://github.com/haizhongzheng/Coverage-centric-coreset-selection.


Summary

  • The paper introduces a novel coverage-centric coreset selection algorithm that optimizes data coverage to maintain high accuracy at extreme pruning rates.
  • It extends the classical geometric set cover problem to a density-based distribution cover problem, establishing the new AUC_pr metric as a predictor of model loss.
  • Empirical results show that at high pruning rates (e.g., 90%), CCS outperforms prior state-of-the-art methods and random sampling, achieving at least 19.56% and 7.04% higher accuracy, respectively, on CIFAR10.

Overview of Coverage-centric Coreset Selection for High Pruning Rates

This paper addresses the challenge of one-shot coreset selection, which involves identifying a representative subset of a large training dataset to maintain high model accuracy even at high pruning rates. Existing state-of-the-art (SOTA) methods focus on selecting examples based on certain importance metrics. While effective at low pruning rates, these methods struggle significantly when the pruning rate is high, often underperforming compared to random sampling.
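The contrast between the two baseline strategies can be made concrete with a minimal sketch. The function names here are hypothetical, not from the paper's code; `scores` stands for any per-example importance metric (e.g., forgetting events or EL2N scores):

```python
import numpy as np

def topk_coreset(scores, budget):
    """Importance-based one-shot selection: keep the `budget` examples with
    the highest importance scores (the strategy that degrades at high
    pruning rates)."""
    return np.argsort(scores)[-budget:]

def random_coreset(n, budget, rng=None):
    """Random-sampling baseline: a uniform subset of the same size."""
    return np.random.default_rng(rng).choice(n, size=budget, replace=False)
```

At a 90% pruning rate, `budget` is only 10% of `n`, and the top-k strategy concentrates that small budget on the hardest examples, which is exactly the failure mode the paper analyzes.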

Theoretical Contributions

The authors provide both theoretical and empirical analysis to understand why traditional methods fail at higher pruning rates. They extend the classical geometric set cover problem to a density-based distribution cover problem, leading to the development of a new metric for data coverage. This theoretical framework is essential as it reveals the limitations of current SOTA methods—namely, their failure to ensure adequate data coverage, a critical issue exacerbated under high pruning conditions.

The paper introduces a novel metric termed AUC_pr, which quantifies the degree to which a selected coreset covers the data distribution. AUC_pr acts as a predictive measure of model loss, offering a more detailed understanding of which data sub-samples help maintain accuracy across varying pruning rates.
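A simplified sketch can illustrate the intuition behind coverage and an AUC_pr-like summary. This uses plain Euclidean balls rather than the paper's density-based distribution cover, and the function names are assumptions for illustration only:

```python
import numpy as np

def coverage(data, coreset_idx, radius):
    """Fraction of all points within `radius` of their nearest coreset point.

    A Euclidean stand-in for the paper's coverage notion; the paper
    generalizes geometric set cover to a density-based distribution cover,
    which this sketch does not reproduce.
    """
    coreset = data[np.asarray(coreset_idx)]
    dists = np.linalg.norm(data[:, None, :] - coreset[None, :, :], axis=-1)
    return float((dists.min(axis=1) <= radius).mean())

def coverage_auc(data, coreset_idx, radii):
    """Area under the coverage-vs-radius curve, an AUC_pr-like summary:
    larger values mean the coreset covers the data at smaller radii."""
    curve = np.array([coverage(data, coreset_idx, r) for r in radii])
    radii = np.asarray(radii, dtype=float)
    # Trapezoidal rule by hand to avoid NumPy version differences.
    return float(np.sum((curve[1:] + curve[:-1]) / 2 * np.diff(radii)))
```

Under this view, an importance-only coreset that clusters near the decision boundary leaves large regions uncovered, so its coverage curve (and the area under it) is lower than that of a well-spread coreset of the same size.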

Methodological Innovations

To address the shortcomings identified with current coreset selection methods, the authors propose the Coverage-centric Coreset Selection (CCS) algorithm. CCS differs from existing methods by focusing on optimizing coverage alongside data importance. By using stratified sampling, CCS ensures that even under high pruning rates, coresets maintain adequate data coverage.

The CCS algorithm employs a dual strategy: it first prunes a small fraction of the hardest examples, which can be mislabeled or otherwise harmful to training, and then applies stratified sampling over the importance scores of the remainder. This combination contributes to its superior performance, maintaining significantly higher accuracy than importance-based methods and random sampling at high pruning rates.
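The dual strategy above can be sketched as follows. This is a hypothetical re-implementation for illustration, not the authors' released code; the default values of `n_strata` and `hard_cutoff` are assumptions, and ties at bin boundaries follow NumPy's `digitize` convention:

```python
import numpy as np

def ccs_select(scores, budget, n_strata=10, hard_cutoff=0.1, rng=None):
    """Sketch of Coverage-centric Coreset Selection (CCS).

    1. Drop the `hard_cutoff` fraction of hardest (highest-score) examples.
    2. Partition the rest into `n_strata` equal-width score bins.
    3. Spread the budget across bins, filling the smallest bins first so
       leftover budget flows to larger bins and every difficulty level
       stays represented in the coreset.
    """
    scores = np.asarray(scores, dtype=float)
    rng = np.random.default_rng(rng)
    order = np.argsort(scores)                       # ascending difficulty
    keep = order[: int(len(scores) * (1 - hard_cutoff))]
    edges = np.linspace(scores[keep].min(), scores[keep].max(), n_strata + 1)
    bin_of = np.digitize(scores[keep], edges[1:-1])  # bin index 0..n_strata-1
    bins = [keep[bin_of == b] for b in range(n_strata)]
    selected, remaining = [], budget
    for i, b in enumerate(sorted(bins, key=len)):    # smallest strata first
        quota = remaining // (n_strata - i)          # even split of what's left
        take = min(len(b), quota)
        if take:
            selected.extend(rng.choice(b, size=take, replace=False))
        remaining -= take
    return np.array(selected, dtype=int)
```

Filling the smallest strata first is the key design choice: a stratum with fewer examples than its quota is kept in full, and its unused budget is redistributed to the larger strata, so no region of the score distribution is silently dropped.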

Empirical Results and Implications

Empirical evaluations highlight CCS's effectiveness across five distinct datasets compared against six baseline methods. At a 90% pruning rate, CCS outperforms prior SOTA methods by substantial margins: on CIFAR10 it achieves at least 19.56% higher accuracy than importance-based SOTA methods and 7.04% higher than random selection, with a 5.02% gain reported on ImageNet. The robustness of CCS across diverse datasets underscores its practical utility and scalability.

The study also points out that the proposed approach maintains its competitive edge without performance trade-offs at lower pruning rates, establishing it as a consistently viable strategy for one-shot coreset selection.

Future Directions

The insights garnered from the coverage-centric approach open new avenues for research in AI, particularly in developing more sophisticated methods for coreset selection across various machine learning tasks and model architectures. The research suggests potential exploration into more adaptive methods that dynamically balance between coverage and complexity based on the characteristics of specific datasets and model types.

In essence, this paper sets the groundwork for a shift from importance-centric to coverage-informed coreset selection strategies, challenging future algorithms to reconcile data coverage and importance harmoniously. The CCS method offers a promising baseline for developing more refined coreset selection techniques, especially in scenarios demanding highly efficient data utilization.
