Coreset Subsampling Overview
- Coreset subsampling is a technique that constructs a small, weighted subset of data to approximate full-data performance with provable accuracy.
- It leverages sensitivity, importance, and diversity-based sampling to significantly reduce computational costs in various statistical and machine learning tasks.
- Advanced methods, including submodular maximization and Bayesian coresets, offer scalable solutions for high-dimensional clustering, regression, and inference.
Coreset subsampling is a central framework for dataset reduction across computational statistics, machine learning, signal processing, Bayesian inference, and numerical linear algebra. The core idea is to construct a small, weighted subset (the "coreset") of a much larger dataset such that key model or inference tasks performed on the coreset approximate those on the full data according to predefined guarantees. Coreset subsampling strategies leverage probabilistic, combinatorial, geometric, convex, or submodular properties of the data and the corresponding optimization problems, enabling algorithmic and statistical speedups, scalability to massive data, memory and energy savings, and new theoretical insights into data summarization.
1. Coreset Fundamentals: Definitions, Problem Statements, and Applicability
A coreset for a problem is a (typically small) weighted subset whose objective, or cost, approximates that of the full data to provable accuracy. Let $P$ be a dataset with weights $w$, $\mathcal{Q}$ a family of queries (e.g., model parameters, clusters), and $f$ a loss/cost function. A weighted subset $(S, u)$ of $P$ is an $\varepsilon$-coreset if for all $q \in \mathcal{Q}$: $\left|\sum_{p \in S} u(p)\, f(p, q) - \sum_{p \in P} w(p)\, f(p, q)\right| \le \varepsilon \sum_{p \in P} w(p)\, f(p, q)$. This structure covers mean/variance estimation (Maalouf et al., 2021), $k$-means/median clustering, $k$-line clustering, subspace approximation, SVMs (Tukan et al., 2020), kernel density estimation (Zheng et al., 2017), low-rank factorization (Maalouf et al., 2019, Li et al., 2022), and Bayesian inference (Chen et al., 2023, Manousakas et al., 2022, Naik et al., 2022).
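To make the definition concrete, the following minimal NumPy sketch empirically checks the $\varepsilon$-coreset property for the 1-mean (sum-of-squared-distances) cost, using a plain uniform subsample with rescaled weights as the candidate coreset; the dataset, queries, tolerance, and sizes are illustrative assumptions, not constructions from the cited works.

```python
import numpy as np

rng = np.random.default_rng(0)

# Full dataset: n points in R^d, all with unit weight (illustrative).
n, d, eps = 100_000, 5, 0.05
P = rng.normal(size=(n, d))

def cost(points, weights, q):
    """Weighted sum of squared distances to a query point q."""
    return np.sum(weights * np.sum((points - q) ** 2, axis=1))

# A simple (non-optimal) coreset: uniform sample with rescaled weights n/m.
m = 2_000
idx = rng.choice(n, size=m, replace=False)
S, w_S = P[idx], np.full(m, n / m)

# Empirically check the eps-coreset property on a few random queries.
for _ in range(5):
    q = rng.normal(size=d)
    full, approx = cost(P, np.ones(n), q), cost(S, w_S, q)
    rel_err = abs(approx - full) / full
    print(f"relative error {rel_err:.4f} (target <= {eps})")
```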
Core applications include:
- Efficient model training and validation with reduced data
- Fast hyperparameter and architecture sweeps
- Streaming, distributed, and federated learning
- Accelerated optimization and Bayesian inference
- Robustness to noisy or adversarial data
Coreset size typically depends on data dimension, model complexity, error tolerance, and the "sensitivity" structure of the specific problem.
2. Sensitivity, Importance, and Diversity Sampling Frameworks
Sensitivity sampling is foundational for many coreset constructions (Braverman et al., 2016, Maalouf et al., 2019). For each data point $p$, its sensitivity $s(p) = \sup_{q \in \mathcal{Q}} \frac{w(p)\, f(p, q)}{\sum_{p' \in P} w(p')\, f(p', q)}$ quantifies its maximal relative influence on the objective. The total sensitivity $\mathfrak{S} = \sum_{p \in P} s(p)$ governs sample complexity: sampling points with probability proportional to (an upper bound on) $s(p)$, and rescaling their weights by the inverse sampling probability, yields an $\varepsilon$-coreset whose size scales with $\mathfrak{S}$, $\varepsilon^{-2}$, and the VC (or pseudo-) dimension of the induced query space (Braverman et al., 2016). This framework unifies theoretical guarantees for $k$-clustering (Huang et al., 2020), SVM (Tukan et al., 2020), regression (Li et al., 2022), density estimation (Turner et al., 2020), and many "near-convex" problems (Tukan et al., 2020).
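As a concrete illustration, the sketch below performs sensitivity-proportional sampling for the 1-mean cost, for which the sensitivities admit a simple closed-form upper bound proportional to $1/n + \|p - \mu\|^2 / \sum_{p'} \|p' - \mu\|^2$ (up to constant factors); the sample size, data, and query are illustrative, and tighter bounds apply to other objectives.

```python
import numpy as np

rng = np.random.default_rng(1)
n, d, m = 100_000, 5, 2_000          # m: coreset size (illustrative)
P = rng.standard_normal((n, d)) * rng.uniform(0.5, 3.0, size=d)

# Sensitivity upper bound (up to constants) for the 1-mean cost:
# s(p) ~ 1/n + ||p - mu||^2 / sum_j ||p_j - mu||^2.
mu = P.mean(axis=0)
sq_dist = np.sum((P - mu) ** 2, axis=1)
s = 1.0 / n + sq_dist / sq_dist.sum()
prob = s / s.sum()                   # sampling distribution

# Sample proportionally to sensitivity, rescale weights by 1/(m * prob).
idx = rng.choice(n, size=m, replace=True, p=prob)
S, w_S = P[idx], 1.0 / (m * prob[idx])

# Unbiasedness check: weighted coreset cost vs. full cost at a random query.
q = rng.standard_normal(d)
full = np.sum(np.sum((P - q) ** 2, axis=1))
approx = np.sum(w_S * np.sum((S - q) ** 2, axis=1))
print(f"full={full:.3e}  coreset={approx:.3e}  rel_err={abs(approx - full) / full:.4f}")
```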
Importance sampling generalizes sensitivity sampling with heuristic or problem-specific weights (leverage scores, gradient magnitudes, combined influence metrics). Diversity-based sampling (e.g., Determinantal Point Processes, DPPs) introduces negative correlations to reduce redundancy (Tremblay et al., 2018), strictly lowering estimator variance and often yielding superior subsample efficiency, especially in clustering and regression.
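The following sketch illustrates diversity-based selection with a greedy MAP-style approximation to a DPP under an RBF kernel; the kernel choice, bandwidth, and greedy selection rule are assumptions made for illustration, not the exact sampler of (Tremblay et al., 2018).

```python
import numpy as np

def greedy_diverse_subset(X, k, bandwidth=1.0):
    """Greedy MAP-style approximation to a k-DPP: iteratively add the point
    that most increases log det of the RBF kernel submatrix (diversity)."""
    n = X.shape[0]
    sq = np.sum((X[:, None, :] - X[None, :, :]) ** 2, axis=-1)
    K = np.exp(-sq / (2 * bandwidth ** 2))        # likelihood kernel
    selected = []
    for _ in range(k):
        best_i, best_score = -1, -np.inf
        for i in range(n):
            if i in selected:
                continue
            sub = K[np.ix_(selected + [i], selected + [i])]
            _, logdet = np.linalg.slogdet(sub + 1e-9 * np.eye(len(sub)))
            if logdet > best_score:
                best_i, best_score = i, logdet
        selected.append(best_i)
    return np.array(selected)

rng = np.random.default_rng(2)
X = rng.standard_normal((300, 2))
print(greedy_diverse_subset(X, k=20))
```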
3. Submodular, Geometric, and Modern Non-Sensitivity-Based Coresets
Recent advances address the empirical and computational limitations of sensitivity and importance sampling for high-dimensional, nonconvex, or deep learning settings. Submodular maximization, specifically facility location and related functions, yields robust, streaming-compatible greedy algorithms with $(1 - 1/e)$-optimality guarantees for set selection, as used in SubZeroCore (Moser et al., 26 Sep 2025) and deep coreset libraries (Guo et al., 2022). These methods synthesize density, coverage, and diversity criteria in the coreset objective and leverage scalable $k$-nearest-neighbor search and lazy greedy maximization.
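A minimal sketch of lazy greedy facility-location maximization is given below; the dense similarity matrix and budget are illustrative, and practical implementations typically replace the dense matrix with approximate $k$-nearest-neighbor graphs.

```python
import heapq
import numpy as np

def facility_location_lazy_greedy(sim, k):
    """Lazy greedy maximization of the facility-location function
    f(S) = sum_i max_{j in S} sim[i, j], with the usual (1 - 1/e) guarantee."""
    n = sim.shape[0]
    coverage = np.zeros(n)                     # current max similarity per point
    # Max-heap of (negated) upper bounds on marginal gains.
    heap = [(-sim[:, j].sum(), j) for j in range(n)]
    heapq.heapify(heap)
    selected = []
    while len(selected) < k and heap:
        _, j = heapq.heappop(heap)
        # Recompute the true marginal gain for the top candidate.
        gain = np.maximum(sim[:, j], coverage).sum() - coverage.sum()
        # Lazy check: still at least as good as the next candidate's bound?
        if not heap or gain >= -heap[0][0]:
            selected.append(j)
            coverage = np.maximum(coverage, sim[:, j])
        else:
            heapq.heappush(heap, (-gain, j))
    return selected

rng = np.random.default_rng(3)
X = rng.standard_normal((500, 8))
sim = X @ X.T
sim -= sim.min()                               # facility location needs nonnegative similarities
print(facility_location_lazy_greedy(sim, k=25))
```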
Geometric partition/aggregation approaches (such as ring decomposition for clustering (Braverman et al., 2022)) can, sometimes surprisingly, enable pure uniform sampling or VC-dimension-based approximations with coreset size independent of the dataset size $n$. These methods are particularly effective for constrained clustering (capacitated, fair, or Wasserstein barycenter), and yield smaller, and in some regimes provably optimal, coresets in low dimensions.
4. Specialized and Advanced Coreset Constructions
Bayesian coresets recast posterior inference as a data summarization problem, optimizing a weighted KL divergence between the coreset posterior and the full-data posterior (Naik et al., 2022, Chen et al., 2023, Manousakas et al., 2022). Greedy variational, quasi-Newton, and even MCMC-based joint sample-and-weight schemes are established, with explicit high-probability guarantees, control of approximation error in total variation or two-moment KL, and extensions to BNNs and other intractable models.
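The sketch below conveys the flavor of these constructions rather than any specific cited algorithm: nonnegative weights on a random candidate subsample are fit so that the weighted log-likelihood matches the full-data log-likelihood at a few parameter draws, a crude least-squares surrogate for the KL objective. The Gaussian model, proposal distribution, and sizes are assumptions chosen for illustration.

```python
import numpy as np
from scipy.optimize import nnls

rng = np.random.default_rng(4)

# Toy model: x_i ~ N(theta, 1) with unknown mean theta.
n = 20_000
x = rng.normal(loc=1.5, scale=1.0, size=n)

# Log-likelihood (up to constants) of each point at S draws from a crude proposal.
S = 50
thetas = rng.normal(loc=x.mean(), scale=0.5, size=S)
L = -0.5 * (x[:, None] - thetas[None, :]) ** 2            # shape (n, S)

# Restrict weights to a random subsample of m candidates, then fit nonnegative
# weights so the weighted log-likelihood matches the full one at every draw:
# minimize || L[idx].T @ w - L.sum(axis=0) ||_2  subject to  w >= 0.
m = 200
idx = rng.choice(n, size=m, replace=False)
w, residual = nnls(L[idx].T, L.sum(axis=0))

print(f"coreset size (nonzero weights): {np.count_nonzero(w)}")
print(f"total weight {w.sum():.1f}  vs  n = {n}")
print(f"relative residual {residual / np.linalg.norm(L.sum(axis=0)):.3e}")
```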
For kernel density estimation and general smooth divergences (including Sinkhorn), Carathéodory- or kernel-quadrature-based strategies (Turner et al., 2020, Kokot et al., 28 Apr 2025) enable coreset construction via moment or maximum mean discrepancy (MMD) minimization. These methods attain minimax-optimal risk and, for the Sinkhorn divergence in particular, sublinear ($o(n)$) coreset sizes with rigorous statistical control.
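As an illustration of greedy MMD minimization, the following kernel-herding sketch selects points whose empirical kernel mean embedding tracks that of the full data; the RBF kernel, bandwidth, and without-replacement restriction are illustrative choices rather than those of the cited works.

```python
import numpy as np

def kernel_herding(X, m, bandwidth=1.0):
    """Greedy MMD minimization (kernel herding): at each step pick the point
    maximizing the mean kernel to the data minus the mean kernel to the
    already-selected points."""
    n = X.shape[0]
    sq = np.sum((X[:, None, :] - X[None, :, :]) ** 2, axis=-1)
    K = np.exp(-sq / (2 * bandwidth ** 2))
    mean_embed = K.mean(axis=1)            # <phi(x_i), empirical mean embedding>
    selected = []
    sel_sum = np.zeros(n)                  # sum over selected j of K[:, j]
    for t in range(m):
        scores = mean_embed - sel_sum / (t + 1)
        scores[selected] = -np.inf         # without replacement (illustrative)
        j = int(np.argmax(scores))
        selected.append(j)
        sel_sum += K[:, j]
    return np.array(selected)

rng = np.random.default_rng(5)
X = rng.standard_normal((1000, 3))
print(kernel_herding(X, m=50)[:10])
```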
Element-wise coresets (Li et al., 2022, Xue et al., 22 Sep 2025) select large-magnitude entries per column (rather than whole rows), optimally exploiting numerical sparsity in regression or matrix factorization (e.g., ALS for recommender systems), and provide notable speed and accuracy gains in very high dimensions.
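A minimal sketch of the element-wise idea follows: keep only the largest-magnitude entries in each column and operate on the resulting sparse matrix. The hard top-$k$ truncation, matrix shapes, and budget are simplifications; the cited constructions combine entry selection with sampling and explicit error guarantees.

```python
import numpy as np
from scipy import sparse

def elementwise_coreset(A, keep_per_col):
    """Keep the `keep_per_col` largest-magnitude entries in each column of A
    and drop the rest, returning a sparse element-wise summary of A."""
    n, d = A.shape
    rows, cols, vals = [], [], []
    for j in range(d):
        top = np.argpartition(np.abs(A[:, j]), -keep_per_col)[-keep_per_col:]
        rows.extend(top.tolist())
        cols.extend([j] * keep_per_col)
        vals.extend(A[top, j].tolist())
    return sparse.csr_matrix((vals, (rows, cols)), shape=(n, d))

rng = np.random.default_rng(6)
# Numerically sparse matrix: a few large entries plus small dense noise.
A = rng.standard_normal((5000, 50)) * (rng.random((5000, 50)) < 0.1)
A += 0.01 * rng.standard_normal((5000, 50))
A_core = elementwise_coreset(A, keep_per_col=600)
err = np.linalg.norm(A_core.toarray() - A) / np.linalg.norm(A)
print(f"kept {A_core.nnz / A.size:.1%} of entries, relative Frobenius error {err:.3f}")
```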
5. Empirical Performance, Complexity, and Practical Choices
Empirical evaluations consistently support the theoretical speedup and compression of coreset subsampling, but reveal context-dependent trade-offs and the need for careful baseline testing:
- For mean, clustering, regression, and graphical estimation, sensitivity- or diversity-based coresets produce 5–50× lower error than uniform sampling at equal size, especially in high-variance or highly redundant data regimes (Maalouf et al., 2019, Tukan et al., 2020, Zheng et al., 2017, Tremblay et al., 2018, Li et al., 2022, Vahidian et al., 2019).
- In deep learning, especially under moderate budgets and robust architectures, random or stratified sampling is often surprisingly competitive (Guo et al., 2022, Lu et al., 2023). Training-free geometric or submodular schemes (e.g., SubZeroCore (Moser et al., 26 Sep 2025)) increasingly outperform gradient or error-based baselines at extreme pruning.
- Quasi-Newton and variational refinement for Bayesian coresets can attain near full-data posterior accuracy with speedups of $10\times$ or more for moderate $n$, but communication/storage costs remain challenging at extreme scales (Naik et al., 2022, Chen et al., 2023, Manousakas et al., 2022).
Generic construction cost is at most a small multiple of a few passes over the original data (near-linear in the dataset size for most sampling-based schemes, plus per-iteration passes for Bayesian MCMC/VI variants), unless advanced approximate nearest-neighbor or randomized linear-algebra routines are used to accelerate it further.
6. Extensions: Streaming, Distributed, Budget-Aware, and Robust Coresets
Merge-and-reduce paradigms (Braverman et al., 2016, Maalouf et al., 2019) enable scalable streaming and distributed coresets: per-block summaries are compressed into local coresets, then recursively merged and re-compressed, maintaining polylogarithmic overhead in size, communication, and error (see the sketch below). Robust variants, such as median-of-means aggregation in linear regression (Li et al., 2022) or cost-aware greedy schemes for graph summarization (Vahidian et al., 2019), adapt to outliers, nonuniform costs, and adversarial contamination. Element- or block-wise selection is particularly effective in networks, tensor decompositions, and large-scale collaborative filtering (Xue et al., 22 Sep 2025).
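A minimal sketch of the merge-and-reduce tree, with a placeholder uniform-compression step standing in for a problem-specific $\varepsilon$-coreset routine; block and summary sizes are illustrative.

```python
import numpy as np

def compress(P, w, m, rng):
    """Placeholder compression step: uniform subsample with rescaled weights.
    A real implementation would plug in a problem-specific eps-coreset here."""
    if len(P) <= m:
        return P, w
    idx = rng.choice(len(P), size=m, replace=False)
    return P[idx], w[idx] * (w.sum() / w[idx].sum())   # preserve total weight

def merge_and_reduce(stream, m, rng):
    """Streaming merge-and-reduce: compress each incoming block, then merge
    same-level summaries pairwise and re-compress, like binary-counter carries,
    so only O(log n) buckets are ever held in memory."""
    levels = {}                                        # level -> (points, weights)
    for block in stream:
        cur = compress(block, np.ones(len(block)), m, rng)
        lvl = 0
        while lvl in levels:                           # carry to the next level
            Q, v = levels.pop(lvl)
            cur = compress(np.concatenate([cur[0], Q]),
                           np.concatenate([cur[1], v]), m, rng)
            lvl += 1
        levels[lvl] = cur
    P = np.concatenate([pts for pts, _ in levels.values()])
    w = np.concatenate([wts for _, wts in levels.values()])
    return compress(P, w, m, rng)                      # final summary

rng = np.random.default_rng(7)
stream = (rng.standard_normal((10_000, 4)) for _ in range(32))
S, w_S = merge_and_reduce(stream, m=1_000, rng=rng)
print(S.shape, round(w_S.sum()))                       # total weight ~ stream length
```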
7. Limitations, Open Problems, and Future Directions
Despite theoretical and empirical success, practical deployment of coresets reveals several gaps:
- Sensitivity bounds are sometimes too loose to outperform uniform sampling, particularly in loosely regularized or low-variance models (Lu et al., 2023).
- Many advanced coresets have substantial model/hyperparameter dependency or require pretraining or nontrivial feature engineering (Moser et al., 26 Sep 2025, Guo et al., 2022).
- Finite-sample and sharp minimax bounds for high-dimensional, nonconvex, or composite objectives remain open (Kokot et al., 28 Apr 2025, Manousakas et al., 2022).
- Fully automatic, adaptive, or data-driven coreset-size selection and deeper integration into iterative machine learning pipelines (e.g., coreset MCMC, dataset distillation) remain active lines of research.
Further connections of coreset construction to kernel quadrature, moment and score matching, discrepancy theory, and randomized numerical linear algebra continue to deepen and broaden the scope of efficient and theoretically principled data summarization.
Selected references: (Braverman et al., 2016, Tremblay et al., 2018, Zheng et al., 2017, Tukan et al., 2020, Huang et al., 2020, Maalouf et al., 2019, Naik et al., 2022, Li et al., 2022, Chen et al., 2023, Manousakas et al., 2022, Turner et al., 2020, Braverman et al., 2022, Vahidian et al., 2019, Moser et al., 26 Sep 2025, Guo et al., 2022, Lu et al., 2023, Tukan et al., 2020, Xue et al., 22 Sep 2025, Kokot et al., 28 Apr 2025).