
SubZeroCore: Unified Coreset Selection

Updated 1 October 2025
  • SubZeroCore is a training-free data selection method that integrates submodular coverage and density weighting to construct efficient, representative coresets.
  • It employs a unified objective with a closed-form estimate to balance global coverage and local density via a single interpretable hyperparameter, gamma.
  • Empirical evaluations on benchmarks like CIFAR-10 and ImageNet-1K demonstrate its robust performance, scalability, and resistance to label noise.

SubZeroCore is a training-free coreset selection method designed to construct representative subsets of data for efficient model training by integrating submodular coverage and density into a single unified objective. The method explicitly balances the trade-off between global coverage and local density using a closed-form solution and a single, interpretable hyperparameter, making it computationally efficient, robust to label noise, and highly scalable for real-world deep learning applications (Moser et al., 26 Sep 2025).

1. Unified Submodular Objective: Coverage and Density

The core principle of SubZeroCore is the combination of coverage—a measure of how well the subset “covers” the original data distribution—with a density-based weighting that prefers samples from regions of typical feature density. The facility location function, a well-studied submodular set function, is adapted as follows:

$$f_{\text{SubZeroCore}}(\mathcal{S}) = \sum_{x_i \in \mathcal{T}} \max_{x_j \in \mathcal{S}} \left( s_j \cdot \operatorname{sim}(x_i, x_j) \right)$$

Here:

  • $\mathcal{T}$ denotes the full dataset,
  • $\mathcal{S} \subseteq \mathcal{T}$ is the candidate coreset,
  • $\operatorname{sim}(x_i, x_j)$ is a similarity measure (e.g., cosine similarity on features),
  • $s_j$ is a density-based weight for $x_j$.

The density weight $s_j$ is defined using a Gaussian kernel on the $K$-nearest-neighbor radius $r_j$:

$$s_j = \exp\left( -\frac{(r_j - \mu)^2}{2\sigma^2} \right)$$

$\mu$ and $\sigma$ are, respectively, the empirical mean and standard deviation of the $K$-NN radii across the dataset. This form ensures points in “average”-density regions are emphasized, while those in outlier (low-density) or overly redundant (high-density) zones are downweighted. The submodularity of the objective ensures that a simple greedy selection algorithm achieves a $(1 - 1/e)$-approximation to the optimum.
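The sketch below illustrates how such a density-weighted facility-location objective can be greedily maximized. It is not the authors' released implementation: the function names (`density_weights`, `greedy_subzerocore`) are placeholders, cosine similarity on raw features is assumed, and the dense O(n²) distance matrix is for illustration only.

```python
import numpy as np

def density_weights(features, K):
    """Gaussian weight on each point's K-NN radius: points in 'average'-density
    regions get weights near 1; outliers and overly redundant points are downweighted."""
    # Dense pairwise distances are O(n^2); a K-NN index would be used at scale.
    d = np.linalg.norm(features[:, None, :] - features[None, :, :], axis=-1)
    radii = np.sort(d, axis=1)[:, K]          # distance to the K-th neighbor (column 0 is self)
    mu, sigma = radii.mean(), radii.std()
    return np.exp(-((radii - mu) ** 2) / (2 * sigma ** 2 + 1e-12))

def greedy_subzerocore(features, budget, K):
    """Greedy (1 - 1/e)-approximate maximization of
    f(S) = sum_i max_{j in S} s_j * sim(x_i, x_j)."""
    X = features / (np.linalg.norm(features, axis=1, keepdims=True) + 1e-12)
    gains = (X @ X.T) * density_weights(features, K)[None, :]   # gains[i, j] = s_j * sim(x_i, x_j)
    best = np.zeros(len(X))                   # best match per point so far (empty-set value 0)
    selected = []
    for _ in range(budget):
        # Marginal gain of candidate j: sum_i max(0, gains[i, j] - best[i]).
        marginal = np.clip(gains - best[:, None], 0.0, None).sum(axis=0)
        marginal[selected] = -np.inf          # never re-pick an element
        j = int(np.argmax(marginal))
        selected.append(j)
        best = np.maximum(best, gains[:, j])
    return selected
```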

2. Closed-Form Sampling and Automatic Neighborhood Scale

A central innovation is the use of a closed-form estimate of coverage to determine the optimal value of $K$ (the neighborhood size) for both the density computation and the coreset coverage measurement. The expected coverage when selecting a coreset of size $|\mathcal{S}|$ is given by:

$$\mathbb{E}[\operatorname{coverage}_K(\mathcal{S}, \mathcal{T})] = 1 - \prod_{k=0}^{K} \frac{|\mathcal{T}| - |\mathcal{S}| - k}{|\mathcal{T}| - k}$$

Given a user-specified coverage target $\gamma \in (0, 1)$, the method numerically inverts this formula to select the minimal $K$ satisfying

$$\mathbb{E}[\operatorname{coverage}_K(\mathcal{S}, \mathcal{T})] \geq \gamma, \quad \text{equivalently} \quad \prod_{k=0}^{K} \frac{|\mathcal{T}| - |\mathcal{S}| - k}{|\mathcal{T}| - k} \leq 1 - \gamma$$

This adaptive choice links the density estimate intrinsically to the desired coverage, obviating the need for manual tuning of $K$ and eliminating the disconnect between coverage and density scales.
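A minimal sketch of this inversion is given below; the names `expected_coverage` and `solve_K` are placeholders, not from the paper, and the linear scan over $K$ is purely illustrative.

```python
import numpy as np

def expected_coverage(n, m, K):
    """E[coverage_K] = 1 - prod_{k=0}^{K} (n - m - k) / (n - k),
    where n = |T| is the dataset size and m = |S| the coreset size."""
    k = np.arange(K + 1)
    return 1.0 - np.prod((n - m - k) / (n - k))

def solve_K(n, m, gamma):
    """Smallest neighborhood size K whose expected coverage reaches the target gamma."""
    for K in range(n - m + 1):
        if expected_coverage(n, m, K) >= gamma:
            return K
    return n - m   # coverage reaches 1 once every non-coreset point is exhausted
```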

3. Hyperparameterization and Control

SubZeroCore requires only a single hyperparameter, $\gamma$, which directly encodes the desired minimum expected coverage (e.g., $\gamma = 0.6$). The method’s data-driven, closed-form mapping from $\gamma$ to $K$ leverages dataset statistics, yielding principled control of the coverage-density balance:

  • Lower $\gamma$: smaller $K$, leading to emphasis on tight, local densities (risking undercoverage).
  • Higher $\gamma$: larger $K$, favoring global coverage (at possible cost to local detail).

This “single knob” design simplifies practical tuning and improves interpretability.
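Reusing the `solve_K` sketch from the previous section, the effect of the knob can be illustrated; the dataset and coreset sizes below are illustrative, not values from the paper.

```python
# Illustrative sizes only: |T| = 50,000 points, |S| = 500 coreset elements.
for gamma in (0.3, 0.6, 0.9):
    print(gamma, solve_K(n=50_000, m=500, gamma=gamma))
# K grows with gamma: looser coverage targets keep the density estimate local,
# while tighter targets push K toward a more global, coverage-oriented scale.
```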

4. Empirical Performance and Benchmarking

SubZeroCore has been tested on standard benchmarks including CIFAR-10 and ImageNet-1K. Reported performance shows:

  • At low pruning rates (large coresets), it matches training-based methods in test accuracy.
  • At extreme pruning rates (90–99.9% of samples removed), it matches or outperforms gradient-based or loss-dynamics-based coreset construction methods, which require full or partial training.
  • In these high-pruning regimes, SubZeroCore achieves superior test accuracy and generalization relative to gradient- and boundary-based selection methods.
  • Selection is fast: on CIFAR-10 it completes in minutes, with no need to compute per-sample gradients or losses.

5. Robustness to Label Noise and Scalability

Robustness is a key advantage. The density weighting mechanism naturally reduces the effect of outliers and mislabeled instances, since such points often occupy isolated, low-density regions and receive low $s_j$ weights, limiting their impact. Under randomized label corruption (e.g., 10% of labels flipped), SubZeroCore’s selected coreset preserves test accuracy substantially better than training-based coresets, whose selection criteria may overfit to spurious patterns in noisy data.
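As a toy illustration of this effect, the snippet below reuses the `density_weights` sketch from Section 1 on synthetic data (the data and sizes are purely illustrative):

```python
import numpy as np

rng = np.random.default_rng(0)
# 500 typical points plus one isolated point standing in for a mislabeled outlier.
X = np.vstack([rng.normal(size=(500, 16)), np.full((1, 16), 8.0)])

s = density_weights(X, K=10)      # density_weights as sketched in Section 1
print(s[:-1].mean(), s[-1])       # typical points: weight near 1; outlier: weight near 0
```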

The approach is also highly scalable:

  • It avoids any dependency on model training, making it suitable for very large datasets or rapid deployment.
  • The computational burden is dominated by nearest-neighbor and similarity computations in the feature space, which are efficiently parallelizable (see the sketch below).
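For large datasets, the $K$-NN radii behind $s_j$ can be obtained with an off-the-shelf neighbor index rather than a dense distance matrix. A minimal sketch using scikit-learn, which is an assumed dependency rather than one mandated by the paper:

```python
import numpy as np
from sklearn.neighbors import NearestNeighbors

def knn_radii(features, K):
    """K-NN radius of every point without materializing the O(n^2) distance matrix;
    n_jobs=-1 parallelizes the neighbor queries across cores."""
    nn = NearestNeighbors(n_neighbors=K + 1, n_jobs=-1).fit(features)
    dist, _ = nn.kneighbors(features)   # column 0 is each point itself (distance 0)
    return dist[:, K]
```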

6. Practical and Methodological Implications

SubZeroCore’s training-free, model-agnostic construction admits application across tasks—from image classification to neural architecture search and active learning—without adaptation to specific loss landscapes or model behaviors. The deployment process is simpler, with lower energy and hardware costs, supporting environmentally sustainable machine learning practices. The interpretability of the balance between density and coverage, combined with robustness under noise, offers an appealing foundation for future research in data-efficient and resilient ML pipelines.

| Feature | SubZeroCore | Training-based baselines |
| --- | --- | --- |
| Training dependency | None | Full/partial training required |
| Hyperparameters | Single $\gamma$ (coverage target) | Multiple (training, boundary, etc.) |
| Robustness to noise | High (density downweighting) | Lower |
| Scalability | High | Limited by training compute |
| Performance (pruned) | Matches/outperforms | Variable (degrades at high pruning) |

7. Theoretical and Algorithmic Foundation

SubZeroCore’s design merges classical facility location (submodular optimization) with recent advances in scalable, training-free coreset construction. The method’s $(1 - 1/e)$-approximate greedy solution and its explicit, closed-form balancing of global and local geometric criteria offer a mathematically principled approach. The density-based weighting reflects modern understanding of feature-space occupancy, where moderate density typically aligns with reliable, informative samples for learning.

Empirical results across multiple domains substantiate the efficacy of this approach. The framework’s modularity and lack of model dependence suggest natural extensions into new areas of large-scale data summarization, privacy-constrained learning, and unsupervised representation selection.
