- The paper introduces a novel coverage-centric coreset selection algorithm that optimizes data coverage to maintain high accuracy at extreme pruning rates.
- It extends the classical geometric set cover problem to a density-based distribution cover problem, establishing the new AUCpr metric as a predictor of model loss.
- Empirical results show that CCS outperforms state-of-the-art methods at a 90% pruning rate, achieving at least 5.02% higher accuracy on ImageNet and 7.04% higher on CIFAR10.
Overview of the Paper: Coverage-centric Coreset Selection for High Pruning Rates
This paper addresses the challenge of one-shot coreset selection: identifying a representative subset of a large training dataset such that a model trained on the subset retains high accuracy. Existing state-of-the-art (SOTA) methods select examples according to importance metrics. While effective at low pruning rates, these methods degrade sharply when the pruning rate is high, often underperforming plain random sampling.
Theoretical Contributions
The authors provide both theoretical and empirical analysis of why importance-based methods fail at high pruning rates. They extend the classical geometric set cover problem to a density-based distribution cover problem, which yields a new metric for data coverage. This framework exposes the key limitation of current SOTA methods: by spending the entire selection budget on the most important examples, they fail to ensure adequate coverage of the data distribution, a problem that worsens as the pruning rate grows.
The paper introduces a metric termed AUCpr, which quantifies how well a selected coreset covers the data distribution. AUCpr acts as a predictive measure of model loss, explaining which sub-samples of the data preserve accuracy across varying pruning rates.
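The summary does not reproduce the paper's formal definitions, but one plausible formalization of the coverage idea, assuming a distance metric $d$ on the feature space, runs as follows (the symbols $p(r)$, $d$, and $r_{\max}$ are illustrative notation, not necessarily the paper's own):

```latex
% r-cover probability of a coreset S under the data distribution P:
% the probability mass of points within distance r of some selected example.
p(r) \;=\; \Pr_{x \sim P}\!\left[\, \min_{s \in S} d(x, s) \le r \,\right]

% A coverage score can then aggregate p(r) over radii up to some r_max,
% i.e., the area under the cover-probability curve:
\mathrm{AUC_{pr}}(S) \;=\; \int_{0}^{r_{\max}} p(r)\, \mathrm{d}r
```

Under this reading, a coreset concentrated on only the hardest examples leaves large regions of the distribution far from any selected point, so $p(r)$ stays low over most radii and the coverage score drops, which is consistent with the failure mode the paper identifies.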
Methodological Innovations
To address these shortcomings, the authors propose the Coverage-centric Coreset Selection (CCS) algorithm. CCS differs from existing methods by optimizing coverage alongside data importance. By using stratified sampling over importance scores, CCS ensures that coresets maintain adequate data coverage even at high pruning rates.
The CCS algorithm combines two ideas: a stratified sampling technique over the range of importance scores, and a preemptive pruning step that discards a fraction of the hardest examples before sampling. This combination lets CCS maintain significantly higher accuracy than both importance-based methods and random sampling at high pruning rates.
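A minimal sketch of how such a stratified, coverage-oriented selection could look, given per-example difficulty scores. All names and defaults here (`ccs_select`, `n_strata`, `hard_cutoff`) are illustrative assumptions, not the paper's reference implementation:

```python
import numpy as np

def ccs_select(scores, budget, n_strata=50, hard_cutoff=0.1, seed=0):
    """Sketch of coverage-centric coreset selection (illustrative only).

    scores: per-example difficulty scores (higher = harder)
    budget: total number of examples to keep
    hard_cutoff: fraction of the hardest examples pruned up front
    Returns an array of selected example indices.
    """
    rng = np.random.default_rng(seed)
    order = np.argsort(scores)                           # easy -> hard
    keep = order[: int(len(order) * (1 - hard_cutoff))]  # drop hardest tail
    s = scores[keep]

    # Split the remaining score range into equal-width strata.
    edges = np.linspace(s.min(), s.max(), n_strata + 1)
    bins = np.digitize(s, edges[1:-1])                   # stratum id per example
    strata = [keep[bins == b] for b in range(n_strata) if np.any(bins == b)]

    # Fill the budget evenly across strata, smallest first: strata that
    # cannot meet their share are taken whole, and the leftover budget
    # is redistributed among the larger strata.
    strata.sort(key=len)
    selected, remaining = [], budget
    for i, stratum in enumerate(strata):
        share = remaining // (len(strata) - i)
        if len(stratum) <= share:
            selected.extend(stratum.tolist())
            remaining -= len(stratum)
        else:
            picked = rng.choice(stratum, size=share, replace=False)
            selected.extend(picked.tolist())
            remaining -= share
    return np.array(selected)
```

Taking small strata whole before sampling from large ones is what drives coverage here: sparse regions of the score distribution survive pruning intact instead of being crowded out by the densest regions.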
Empirical Results and Implications
Empirical evaluations demonstrate CCS's effectiveness across five datasets against six baseline methods. At a high pruning rate (90%), CCS outperforms SOTA methods by substantial margins: at least 5.02% higher accuracy on ImageNet and 7.04% higher on CIFAR10. This robustness across diverse datasets underscores its practical utility and scalability.
The study also shows that CCS remains competitive at lower pruning rates, with no performance trade-off, establishing it as a consistently viable strategy for one-shot coreset selection.
Future Directions
The insights garnered from the coverage-centric approach open new avenues for research in AI, particularly in developing more sophisticated methods for coreset selection across various machine learning tasks and model architectures. The research suggests potential exploration into more adaptive methods that dynamically balance between coverage and complexity based on the characteristics of specific datasets and model types.
In essence, this paper lays the groundwork for a shift from importance-centric to coverage-informed coreset selection, challenging future algorithms to balance data coverage against example importance. The CCS method offers a strong baseline for developing more refined selection techniques, especially in scenarios demanding highly efficient data utilization.