Optimized Coreset Selection
- Optimized coreset selection is a technique that builds compact, representative data subsets to yield full-data performance with reduced computational cost.
- It leverages methods such as hardness-based scoring, training-free submodular selection, and adaptive pruning to enhance robustness against noise and adversarial attacks.
- State-of-the-art approaches like EasyCore, SubZeroCore, HyperCore, and FAST offer formal guarantees, improved scalability, and significant accuracy gains in diverse learning tasks.
Optimized coreset selection refers to the design and application of algorithms that construct compact, representative subsets (“coresets”) of large datasets, such that training on the coreset efficiently yields model performance comparable to full-data training—even under adverse, noisy, or adversarial regimes. Recent advances formalize, analyze, and empirically validate coreset selection methods with strong theoretical guarantees and state-of-the-art empirical performance across robustness, scalability, and label-noise resilience, leading to new paradigms for efficient data-centric learning (Ramesh et al., 13 Oct 2025, Moser et al., 26 Sep 2025, Moser et al., 26 Sep 2025).
1. Hardness-based Coreset Selection and Adversarial Robustness
A principal recent breakthrough is the explicit linking of sample-wise “hardness”—quantified via the Average Input Gradient Norm (AIGN)—to adversarial vulnerability and model robustness (Ramesh et al., 13 Oct 2025). The AIGN of sample $x_i$ for a model trained over $T$ epochs is defined as
$$\mathrm{AIGN}(x_i) = \frac{1}{T}\sum_{t=1}^{T}\big\lVert \nabla_{x_i}\,\ell\big(f_{\theta_t}(x_i),\, y_i\big)\big\rVert_2,$$
where $\ell$ is the per-sample loss and $\theta_t$ denotes the model parameters at epoch $t$. Sorting samples by low AIGN yields “easy” examples, which lie farthest from the decision boundary, exhibit high adversarial robustness, and induce wider margins when used preferentially in training.
The EasyCore algorithm proceeds by (a) accumulating per-sample input gradients over training epochs, (b) computing AIGN per sample, and (c) statically selecting the lowest-AIGN fraction of the dataset. Training on this coreset, under both standard and adversarial pipelines (e.g., PGD, TRADES), consistently improves adversarial accuracy over uniform, gradient-matching, or k-center baselines. Theoretical analysis attributes the gains to increased decision-boundary margins and reduced boundary curvature, with the removal of high-AIGN examples freeing model capacity to learn generalizable, robust features (Ramesh et al., 13 Oct 2025).
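Below is a minimal PyTorch-style sketch of the AIGN scoring-and-selection loop under the definition above; the index-yielding data loader and the combined scoring/training pass are illustrative assumptions, not the authors' reference implementation.

```python
import torch
import torch.nn.functional as F

def accumulate_aign(model, loader, optimizer, epochs, n_samples, device="cuda"):
    """Accumulate per-sample input-gradient norms while training normally.

    Assumes `loader` yields (x, y, idx) so each sample's running score can be
    indexed; this is a sketch of the AIGN idea, not EasyCore's reference code.
    """
    aign = torch.zeros(n_samples)
    model.train()
    for _ in range(epochs):
        for x, y, idx in loader:
            x, y = x.to(device), y.to(device)

            # score: L2 norm of the loss gradient w.r.t. the input, per sample
            x.requires_grad_(True)
            loss = F.cross_entropy(model(x), y, reduction="sum")
            grad_x = torch.autograd.grad(loss, x)[0]
            aign[idx] += grad_x.flatten(1).norm(dim=1).detach().cpu()

            # ordinary training step on the same batch
            optimizer.zero_grad()
            F.cross_entropy(model(x.detach()), y).backward()
            optimizer.step()
    return aign / epochs

def easycore_select(aign_scores, fraction):
    """Statically keep the lowest-AIGN ('easy') fraction of the dataset."""
    k = int(len(aign_scores) * fraction)
    return torch.argsort(aign_scores)[:k]  # dataset indices of the coreset
```

The returned indices define a static coreset that can then be fed to any standard or adversarial training pipeline.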
2. Submodular, Training-Free, and Density-Aware Selection
Optimized coreset methods have advanced beyond gradient- or label-centric criteria to entirely training-free selection. SubZeroCore achieves this by combining submodular coverage with density weighting in a single monotone objective (Moser et al., 26 Sep 2025): each candidate is assigned a density score, computed via a closed-form expectation over its $k$-nearest-neighbor radii, and a single hyperparameter controls the density–coverage trade-off.
SubZeroCore is fully label- and training-agnostic, requiring only embedding similarities; greedy maximization of its monotone submodular objective yields the standard $(1-1/e)$-approximation to the optimal subset. Empirically, SubZeroCore matches or surpasses gradient-based coreset methods at high pruning ratios, with far lower runtime (on the order of $1$ min for CIFAR-10 versus $10$–$50$ min for methods requiring training-based signals) and robustness under label-noise injection. These properties make SubZeroCore highly scalable and robust for large, noisy, or web-scale data (Moser et al., 26 Sep 2025).
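A rough sketch of training-free, density-weighted greedy selection in this spirit is shown below; the clipped cosine-similarity coverage term, the inverse $k$-NN-radius density score, and the exponent `alpha` are illustrative stand-ins rather than the exact SubZeroCore objective.

```python
import numpy as np

def knn_density(emb, k=10):
    """Illustrative density score: inverse mean distance to the k nearest neighbors."""
    d = np.linalg.norm(emb[:, None, :] - emb[None, :, :], axis=-1)  # (n, n) pairwise distances
    np.fill_diagonal(d, np.inf)                                     # ignore self-distances
    knn_radii = np.sort(d, axis=1)[:, :k]
    return 1.0 / (knn_radii.mean(axis=1) + 1e-12)

def greedy_density_coverage(emb, budget, alpha=0.5, k=10):
    """Greedily maximize a monotone, density-weighted coverage (facility-location) objective.

    emb:    (n, d) embeddings from any pretrained encoder; no labels or training required.
    alpha:  illustrative hyperparameter trading density against coverage.
    """
    n = emb.shape[0]
    unit = emb / (np.linalg.norm(emb, axis=1, keepdims=True) + 1e-12)
    sim = np.clip(unit @ unit.T, 0.0, None)        # nonnegative similarities keep the objective monotone
    rho = knn_density(emb, k) ** alpha             # per-candidate density weights
    selected, covered = [], np.zeros(n)            # covered[i] = best weighted similarity reached so far
    for _ in range(budget):
        # marginal gain of candidate j: sum_i max(sim[i, j] * rho[j], covered[i]) - current objective value
        gains = np.maximum(sim * rho[None, :], covered[:, None]).sum(axis=0) - covered.sum()
        if selected:
            gains[selected] = -np.inf              # never re-pick an element
        best = int(np.argmax(gains))
        selected.append(best)
        covered = np.maximum(covered, sim[:, best] * rho[best])
    return selected
```

Because the objective is monotone submodular, the greedy loop inherits the usual $(1-1/e)$ guarantee, mirroring the structure of SubZeroCore's analysis.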
3. Robust and Adaptive Coreset Selection Under Annotation Noise
HyperCore introduces class-wise hypersphere models for robust coreset selection in environments with noisy or ambiguous labels (Moser et al., 26 Sep 2025). For each class $c$, a small neural network maps samples into an embedding space in which in-class samples (those labeled $c$) cluster near the origin inside a hypersphere, while outliers (mislabeled, corrupted, or out-of-class points) fall outside it.
Pruning is performed adaptively by maximizing Youden’s $J$ statistic over the distribution of conformity distances within and outside each class:
$$J(r) = \mathrm{TPR}(r) - \mathrm{FPR}(r),$$
where $\mathrm{TPR}(r)$ is the true positive rate for in-class points at radius $r$ and $\mathrm{FPR}(r)$ is the false positive rate for outliers. The optimal retention threshold is $r^{*} = \arg\max_r J(r)$. This process discards ambiguous and noisy examples while keeping the informative core.
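A small sketch of this adaptive threshold search, assuming conformity distances to a class's hypersphere center have already been computed (the grid search and helper names are hypothetical):

```python
import numpy as np

def youden_threshold(in_class_dist, outlier_dist, n_grid=200):
    """Pick the retention radius r* that maximizes Youden's J(r) = TPR(r) - FPR(r).

    in_class_dist: conformity distances for points labeled with the class.
    outlier_dist:  conformity distances for points from other classes or known corruptions.
    """
    radii = np.linspace(0.0, max(in_class_dist.max(), outlier_dist.max()), n_grid)
    tpr = np.array([(in_class_dist <= r).mean() for r in radii])  # in-class points retained
    fpr = np.array([(outlier_dist <= r).mean() for r in radii])   # outliers wrongly retained
    return radii[int(np.argmax(tpr - fpr))]

def prune_class(distances, r_star):
    """Keep only samples whose conformity distance lies inside the hypersphere."""
    return np.where(distances <= r_star)[0]
```

Because the threshold is chosen per class from the data itself, the effective coreset size adapts to the noise level rather than being fixed in advance.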
Experiments demonstrate HyperCore’s superiority under both moderate and severe label noise, in low-data regimes, and under class imbalance, with accuracy gains over the best baselines at the reported retention levels (Moser et al., 26 Sep 2025). The method is highly parallelizable and scales efficiently.
4. Distribution Matching, Topological Constraints, and Frequency-Domain Methods
FAST leverages graph spectral theory and distribution matching in the frequency domain for DNN-free optimized coreset selection (Cui et al., 22 Nov 2025). The task is formulated as minimizing the characteristic function distance (CFD) between the empirical distributions of the full dataset $\mathcal{D}$ and a candidate coreset $\mathcal{S}$:
$$\mathrm{CFD}(\mathcal{D},\mathcal{S}) = \Big(\mathbb{E}_{t\sim\omega}\,\big|\varphi_{\mathcal{D}}(t) - \varphi_{\mathcal{S}}(t)\big|^{2}\Big)^{1/2},$$
where $\varphi_{\mathcal{D}}(t) = \frac{1}{|\mathcal{D}|}\sum_{x\in\mathcal{D}} e^{\,i\,t^{\top}x}$ is the empirical characteristic function evaluated at frequency $t$ ($\varphi_{\mathcal{S}}$ is defined analogously for the coreset), and $\omega$ is a sampling distribution over frequencies.
To overcome the “vanishing phase gradient” in mid/high frequency regimes, FAST introduces an attenuated phase-decoupled CFD loss, with a targeted penalty in regions where only the phase conveys essential distributional information. Topology is preserved by extracting Laplacian eigenvectors from a multi-scale graph built on the dataset, and the selection is constrained to align in manifold space via Hungarian matching and Laplacian regularization. Progressive Discrepancy-Aware Sampling schedules frequency sampling from low to high, matching global structure before local refinement.
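The sketch below estimates the CFD between a dataset and a candidate coreset by Monte Carlo sampling of frequencies; the Gaussian frequency distribution and its scale parameter are assumptions standing in for FAST's frequency library and its low-to-high Progressive Discrepancy-Aware Sampling schedule.

```python
import numpy as np

def empirical_cf(X, freqs):
    """Empirical characteristic function of samples X (n, d) at frequencies freqs (m, d)."""
    phase = X @ freqs.T                      # (n, m) inner products t^T x
    return np.exp(1j * phase).mean(axis=0)   # (m,) complex values

def cfd(X_full, X_core, n_freq=512, scale=1.0, seed=None):
    """Monte Carlo estimate of the characteristic function distance.

    `scale` controls the probed frequency band: small values capture global
    (low-frequency) structure, larger values capture local detail, so a
    progressive schedule would increase it across selection rounds.
    """
    rng = np.random.default_rng(seed)
    freqs = rng.normal(0.0, scale, size=(n_freq, X_full.shape[1]))
    diff = empirical_cf(X_full, freqs) - empirical_cf(X_core, freqs)
    return float(np.sqrt(np.mean(np.abs(diff) ** 2)))
```

A selection routine would evaluate this distance (or its phase-decoupled variant) for candidate swaps and keep those that reduce it.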
Benchmarks show that FAST achieves accuracy gains over SOTA DNN-free methods alongside substantial CPU speedups and reduced power consumption, with little cross-model generalization loss (Cui et al., 22 Nov 2025).
5. Coreset Selection for Structured and Non-classification Tasks
Optimized coreset selection has been adapted to object detection, continual learning, and dynamic systems identification. The CSOD framework for object detection aggregates classwise and imagewise RoI features, then applies a submodular facility-location objective balancing representativeness and diversity (Lee et al., 14 Apr 2024). Greedy maximization improves AP over random selection on Pascal VOC and scales to BDD100k and COCO2017.
Online coreset selection in dynamic systems uses geometric penetration criteria and polyhedral volume-reduction guarantees for system identification, maintaining a selection rate that vanishes as the number of timesteps grows while ensuring convergence of the feasible parameter set (Li et al., 28 Jun 2025).
For rehearsal-based continual learning, OCS selects samples online by maximizing gradient-similarity scores that balance adaptation to the current task with affinity to the rehearsal memory, yielding strong empirical results across standard, imbalanced, and noisy benchmarks (Yoon et al., 2021).
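A toy sketch of gradient-similarity scoring for online rehearsal selection in this spirit is given below; the per-sample gradient loop and the additive combination of current-batch and memory similarities are simplifications, not the exact OCS criterion.

```python
import torch
import torch.nn.functional as F

def per_sample_grads(model, loss_fn, x, y):
    """Flattened per-sample parameter gradients (simple loop version for clarity)."""
    params = [p for p in model.parameters() if p.requires_grad]
    grads = []
    for xi, yi in zip(x, y):
        loss = loss_fn(model(xi.unsqueeze(0)), yi.unsqueeze(0))
        g = torch.autograd.grad(loss, params)
        grads.append(torch.cat([gi.flatten() for gi in g]))
    return torch.stack(grads)                # (batch, n_params)

def selection_scores(batch_grads, memory_grad, tau=1.0):
    """Score each sample by similarity to the current mini-batch mean gradient
    plus affinity to an (assumed precomputed) average memory gradient."""
    mean_g = batch_grads.mean(dim=0, keepdim=True)
    sim_current = F.cosine_similarity(batch_grads, mean_g)                   # adaptation to current task
    sim_memory = F.cosine_similarity(batch_grads, memory_grad.unsqueeze(0))  # replay-memory affinity
    return sim_current + tau * sim_memory

# Highest-scoring samples of each incoming mini-batch are added to the rehearsal buffer.
```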
6. Theoretical Guarantees and Empirical Performance
State-of-the-art optimized coreset selection methods exhibit formal approximation guarantees and efficiency bounds. SubZeroCore and CSOD achieve provable approximations to their objectives via greedy selection (Moser et al., 26 Sep 2025, Lee et al., 14 Apr 2024). Distribution-matching methods (Cui et al., 22 Nov 2025, Kokot et al., 28 Apr 2025) show poly-logarithmic or exponential sample-size reductions for lossless compression under smooth divergences and RKHS spectral decay, with error rates matching those of random sampling.
Hardness-based selection (Ramesh et al., 13 Oct 2025) demonstrates statistically significant robustness gains under adversarial training, outperforming model-dependent, gradient-based, and dynamic coreset methods in adversarial accuracy (CIFAR-100, PGD-20). Robust selection methods such as HyperCore adaptively prune noisy data and outperform fixed-ratio selectors as label noise rises (Moser et al., 26 Sep 2025). FAST and SubZeroCore offer large reductions in compute and power costs while preserving or improving downstream accuracy (Cui et al., 22 Nov 2025, Moser et al., 26 Sep 2025).
7. Integration and Deployment Considerations
Optimized coreset selection algorithms are highly modular. EasyCore requires only a single pre-training gradient accumulation run for AIGN computation; its scores are transferable across architectures and training protocols (Ramesh et al., 13 Oct 2025). SubZeroCore can operate on raw embeddings from any pretrained model, needing no labels or network training, and is suitable for large-scale noisy or unlabeled data (Moser et al., 26 Sep 2025). HyperCore trains small MLPs per class and requires no tuning of coreset size, automatically adjusting via Youden’s statistic (Moser et al., 26 Sep 2025). FAST only needs Laplacian graph construction and frequency library sampling, enabling efficient execution on edge devices (Cui et al., 22 Nov 2025).
Practical guidance recommends tuning the coreset fraction to the task and compute budget, with class-balancing and feature-space diversity constraints applied in high-imbalance or multi-modal settings. All leading approaches match or exceed baseline performance across classification, detection, continual learning, and regression tasks, while providing substantial computational savings.
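As a concrete illustration of the class-balancing guidance above, a hypothetical helper that enforces a per-class quota on top of any per-sample score from the methods discussed might look like this:

```python
import numpy as np

def class_balanced_coreset(scores, labels, fraction):
    """Select the top-scoring `fraction` of samples separately within each class.

    scores: per-sample selection scores from any method above (higher = keep first).
    labels: integer class labels; per-class quotas enforce class balance.
    """
    selected = []
    for c in np.unique(labels):
        idx = np.where(labels == c)[0]
        k = max(1, int(len(idx) * fraction))
        selected.extend(idx[np.argsort(-scores[idx])[:k]].tolist())  # best-scoring samples of class c
    return np.array(selected)
```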
Optimized coreset selection now offers rigorous, scalable, and robust frameworks for reducing data and computational burdens in deep learning, with theoretical guarantees, strong resilience to noise, and ease of integration into diverse data-centric pipelines (Ramesh et al., 13 Oct 2025, Moser et al., 26 Sep 2025, Moser et al., 26 Sep 2025, Cui et al., 22 Nov 2025).