
SubZeroCore: Unified Coreset Selection

Updated 1 October 2025
  • SubZeroCore is a training-free data selection method that integrates submodular coverage and density weighting to construct efficient, representative coresets.
  • It employs a unified objective with a closed-form estimate to balance global coverage and local density via a single interpretable hyperparameter, gamma.
  • Empirical evaluations on benchmarks like CIFAR-10 and ImageNet-1K demonstrate its robust performance, scalability, and resistance to label noise.

SubZeroCore is a training-free coreset selection method designed to construct representative subsets of data for efficient model training by integrating submodular coverage and density into a single unified objective. The method explicitly balances the trade-off between global coverage and local density using a closed-form solution and a single, interpretable hyperparameter, making it computationally efficient, robust to label noise, and highly scalable for real-world deep learning applications (Moser et al., 26 Sep 2025).

1. Unified Submodular Objective: Coverage and Density

The core principle of SubZeroCore is the combination of coverage—a measure of how well the subset “covers” the original data distribution—with a density-based weighting that prefers samples from regions of typical feature density. The facility location function, a well-studied submodular set function, is adapted as follows:

$$f_{\text{SubZeroCore}}(\mathcal{S}) = \sum_{x_i \in \mathcal{T}} \max_{x_j \in \mathcal{S}} \left( s_j \cdot \operatorname{sim}(x_i, x_j) \right)$$

Here:

  • $\mathcal{T}$ denotes the full dataset,
  • $\mathcal{S} \subseteq \mathcal{T}$ is the candidate coreset,
  • $\operatorname{sim}(x_i, x_j)$ is a similarity measure (e.g., cosine similarity on features),
  • $s_j$ is a density-based weight for $x_j$.

The density weight $s_j$ is defined using a Gaussian kernel on the $K$-nearest-neighbor radius $r_j$:

$$s_j = \exp\left( -\frac{(r_j - \mu)^2}{2\sigma^2} \right)$$

$\mu$ and $\sigma$ are, respectively, the empirical mean and standard deviation of the $K$-NN radii across the dataset. This form ensures points in “average”-density regions are emphasized, while those in outlier (low-density) or overly redundant (high-density) zones are downweighted. The submodularity of the objective ensures that a simple greedy selection algorithm achieves a $(1 - 1/e)$-approximation to the optimum.
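The sketch below illustrates how such a density-weighted facility-location objective can be greedily maximized. It is not the authors' released implementation: the function names (`density_weights`, `greedy_subzerocore`) are placeholders, cosine similarity on raw features is assumed, and the dense O(n²) distance matrix is for illustration only.

```python
import numpy as np

def density_weights(features, K):
    """Gaussian weight on each point's K-NN radius: points in 'average'-density
    regions get weights near 1; outliers and overly redundant points are downweighted."""
    # Dense pairwise distances are O(n^2); a K-NN index would be used at scale.
    d = np.linalg.norm(features[:, None, :] - features[None, :, :], axis=-1)
    radii = np.sort(d, axis=1)[:, K]          # distance to the K-th neighbor (column 0 is self)
    mu, sigma = radii.mean(), radii.std()
    return np.exp(-((radii - mu) ** 2) / (2 * sigma ** 2 + 1e-12))

def greedy_subzerocore(features, budget, K):
    """Greedy (1 - 1/e)-approximate maximization of
    f(S) = sum_i max_{j in S} s_j * sim(x_i, x_j)."""
    X = features / (np.linalg.norm(features, axis=1, keepdims=True) + 1e-12)
    gains = (X @ X.T) * density_weights(features, K)[None, :]   # gains[i, j] = s_j * sim(x_i, x_j)
    best = np.zeros(len(X))                   # best match per point so far (empty-set value 0)
    selected = []
    for _ in range(budget):
        # Marginal gain of candidate j: sum_i max(0, gains[i, j] - best[i]).
        marginal = np.clip(gains - best[:, None], 0.0, None).sum(axis=0)
        marginal[selected] = -np.inf          # never re-pick an element
        j = int(np.argmax(marginal))
        selected.append(j)
        best = np.maximum(best, gains[:, j])
    return selected
```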

2. Closed-Form Sampling and Automatic Neighborhood Scale

A central innovation is the use of a closed-form estimate of coverage to determine the optimal value of $K$ (the neighborhood size) for both the density computation and the coreset coverage measurement. The expected coverage when selecting a coreset of size $|\mathcal{S}|$ is given by:

$$\mathbb{E}[\operatorname{coverage}_K(\mathcal{S}, \mathcal{T})] = 1 - \prod_{k=0}^{K} \frac{|\mathcal{T}| - |\mathcal{S}| - k}{|\mathcal{T}| - k}$$

Given a user-specified coverage target $\gamma \in (0, 1)$, the method numerically inverts this formula to select the minimal $K$ satisfying

$$\mathbb{E}[\operatorname{coverage}_K(\mathcal{S}, \mathcal{T})] \geq \gamma, \quad \text{equivalently} \quad \prod_{k=0}^{K} \frac{|\mathcal{T}| - |\mathcal{S}| - k}{|\mathcal{T}| - k} \leq 1 - \gamma$$

This adaptive choice links the density estimate intrinsically to the desired coverage, obviating the need for manual tuning of $K$ and eliminating the disconnect between coverage and density scales.
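A minimal sketch of this inversion is given below; the names `expected_coverage` and `solve_K` are placeholders, not from the paper, and the linear scan over $K$ is purely illustrative.

```python
import numpy as np

def expected_coverage(n, m, K):
    """E[coverage_K] = 1 - prod_{k=0}^{K} (n - m - k) / (n - k),
    where n = |T| is the dataset size and m = |S| the coreset size."""
    k = np.arange(K + 1)
    return 1.0 - np.prod((n - m - k) / (n - k))

def solve_K(n, m, gamma):
    """Smallest neighborhood size K whose expected coverage reaches the target gamma."""
    for K in range(n - m + 1):
        if expected_coverage(n, m, K) >= gamma:
            return K
    return n - m   # coverage reaches 1 once every non-coreset point is exhausted
```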

3. Hyperparameterization and Control

SubZeroCore requires only a single hyperparameter, $\gamma$, which directly encodes the desired minimum expected coverage (e.g., $\gamma = 0.6$). The method’s data-driven, closed-form mapping from $\gamma$ to $K$ leverages dataset statistics, yielding principled control of the coverage-density balance:

  • Lower $\gamma$: smaller $K$, leading to emphasis on tight, local densities (risking undercoverage).
  • Higher $\gamma$: larger $K$, favoring global coverage (at possible cost to local detail).

This “single knob” design simplifies practical tuning and improves interpretability.
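Reusing the `solve_K` sketch from the previous section, the effect of the knob can be illustrated; the dataset and coreset sizes below are illustrative, not values from the paper.

```python
# Illustrative sizes only: |T| = 50,000 points, |S| = 500 coreset elements.
for gamma in (0.3, 0.6, 0.9):
    print(gamma, solve_K(n=50_000, m=500, gamma=gamma))
# K grows with gamma: looser coverage targets keep the density estimate local,
# while tighter targets push K toward a more global, coverage-oriented scale.
```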

4. Empirical Performance and Benchmarking

SubZeroCore has been tested on standard benchmarks including CIFAR-10 and ImageNet-1K. Reported performance shows:

  • At low pruning rates (large coresets), it matches training-based methods in test accuracy.
  • At extreme pruning rates (90–99.9% of samples removed), it matches or outperforms gradient-based or loss-dynamics-based coreset construction methods, which require full or partial training.
  • In these high-pruning regimes, SubZeroCore achieves superior test accuracy and generalization relative to gradient- and boundary-based selection methods.
  • Selection is fast: on CIFAR-10 it completes in minutes, with no need to compute per-sample gradients or losses.

5. Robustness to Label Noise and Scalability

Robustness is a key advantage. The density weighting mechanism naturally reduces the effect of outliers and mislabeled instances, since such points often occupy isolated, low-density regions and receive low $s_j$ weights, limiting their impact. Under randomized label corruption (e.g., 10% of labels flipped), SubZeroCore’s selected coreset preserves test accuracy substantially better than training-based coresets, whose selection criteria may overfit to spurious patterns in noisy data.
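As a toy illustration of this effect, the snippet below reuses the `density_weights` sketch from Section 1 on synthetic data (the data and sizes are purely illustrative):

```python
import numpy as np

rng = np.random.default_rng(0)
# 500 typical points plus one isolated point standing in for a mislabeled outlier.
X = np.vstack([rng.normal(size=(500, 16)), np.full((1, 16), 8.0)])

s = density_weights(X, K=10)      # density_weights as sketched in Section 1
print(s[:-1].mean(), s[-1])       # typical points: weight near 1; outlier: weight near 0
```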

The approach is also highly scalable:

  • It avoids any dependency on model training, making it suitable for very large datasets or rapid deployment.
  • The computational burden is dominated by nearest-neighbor and similarity computations in the feature space, which are efficiently parallelizable (see the sketch below).
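For large datasets, the $K$-NN radii behind $s_j$ can be obtained with an off-the-shelf neighbor index rather than a dense distance matrix. A minimal sketch using scikit-learn, which is an assumed dependency rather than one mandated by the paper:

```python
import numpy as np
from sklearn.neighbors import NearestNeighbors

def knn_radii(features, K):
    """K-NN radius of every point without materializing the O(n^2) distance matrix;
    n_jobs=-1 parallelizes the neighbor queries across cores."""
    nn = NearestNeighbors(n_neighbors=K + 1, n_jobs=-1).fit(features)
    dist, _ = nn.kneighbors(features)   # column 0 is each point itself (distance 0)
    return dist[:, K]
```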

6. Practical and Methodological Implications

SubZeroCore’s training-free, model-agnostic construction admits application across tasks—from image classification to neural architecture search and active learning—without adaptation to specific loss landscapes or model behaviors. The deployment process is simpler, with lower energy and hardware costs, supporting environmentally sustainable machine learning practices. The interpretability of the balance between density and coverage, combined with robustness under noise, offers an appealing foundation for future research in data-efficient and resilient ML pipelines.

| Feature | SubZeroCore | Training-based baselines |
| --- | --- | --- |
| Training dependency | None | Full/partial training required |
| Hyperparameters | Single $\gamma$ (coverage target) | Multiple (training, boundary, etc.) |
| Robustness to noise | High (density downweighting) | Lower |
| Scalability | High | Limited by training compute |
| Performance (pruned) | Matches/outperforms | Variable (degrades at high pruning) |

7. Theoretical and Algorithmic Foundation

SubZeroCore’s design merges classical facility location (submodular optimization) with recent advances in scalable, training-free coreset construction. The method’s $(1 - 1/e)$-approximate greedy solution and its explicit, closed-form balancing of global and local geometric criteria offer a mathematically principled approach. The density-based weighting reflects modern understanding of feature-space occupancy, where moderate density typically aligns with reliable, informative samples for learning.

Empirical results across multiple domains substantiate the efficacy of this approach. The framework’s modularity and lack of model dependence suggest natural extensions into new areas of large-scale data summarization, privacy-constrained learning, and unsupervised representation selection.
