Greedy Coreset Subsampling
- Greedy coreset subsampling is an algorithmic framework that selects a small, weighted subset from large datasets by iteratively maximizing marginal gains under fidelity constraints.
- It offers theoretical guarantees and approximation bounds across tasks such as matrix approximation, clustering, and Bayesian inference, ensuring near-optimal performance.
- Practical implementations use single-machine, distributed, and randomized strategies to reduce computation time and maintain robustness in high-dimensional and complex data settings.
Greedy coreset subsampling refers to a class of algorithmic frameworks that use greedy selection strategies to construct compact, representative subsets ("coresets") for large-scale data problems, typically under cardinality or fidelity constraints. These methods aim to preserve a quantitative objective (such as data coverage, statistical error, or optimization loss) while operating at a fraction of the cost and size of the original dataset. The greedy paradigm is characterized by iterative selection: at each step, the item that most improves the objective (i.e., has the largest marginal gain) is added to the coreset, sometimes with further local or distributed refinements. Recent research establishes theoretical approximation guarantees, distributed variants, and practical scalability for greedy coreset subsampling across numerous domains, including matrix approximation, clustering, determinant maximization, Bayesian inference, and training data pruning.
1. Greedy Coreset Subsampling: Core Principles and Problem Formulations
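Greedy coreset subsampling arises in diverse data summarization tasks, all unified by the need to select a small weighted subset that suffices for approximating an algorithmic objective over the full dataset. In the prototypical Column Subset Selection (CSS) problem, given a data matrix $A \in \mathbb{R}^{m \times n}$ and a target rank $k$, the task is to choose a set $S$ of columns maximizing
$$f(S) = \lVert \Pi_S A \rVert_F^2,$$
where $\Pi_S$ is the orthogonal projection onto the span of the selected columns.
The selected set acts as a coreset enabling downstream linear tasks with controlled error. A $(1-\epsilon)$-approximate coreset of size $r$ satisfies $f(S) \ge (1-\epsilon)\, f(\mathrm{OPT}_k)$, with $r = O\!\left(k / (\epsilon\, \sigma_{\min})\right)$, where $\sigma_{\min}$ is the smallest squared singular value of the optimal $k$-column submatrix (Altschuler et al., 2016).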
Analogous greedy formulations are central in:
- Determinant maximization (selecting $k$ vectors maximizing the volume they span, i.e. the determinant of their Gram matrix) (Gollapudi et al., 2023),
- $k$-center clustering with outliers (selecting centers minimizing max-radius for inlier covering) (Ding et al., 2019, Ding et al., 2023),
- Mean estimation on graphs or sensor networks (Vahidian et al., 2019),
- Data pruning for deep learning or instruction tuning (Zhang et al., 2024, Moser et al., 2025).
In each case, the coreset is designed so any algorithm solving the full problem instance can be applied to the subsample with theoretical fidelity guarantees.
2. Greedy Algorithms: Single-Machine, Distributed, Randomized, and Submodular Extensions
The canonical greedy routine iteratively selects the candidate that yields maximum marginal gain in the target objective. In the CSS problem, starting with $S_0 = \emptyset$, for steps $t = 1, \dots, r$,
$$c_t = \arg\max_{c \notin S_{t-1}} f(S_{t-1} \cup \{c\}),$$
and $c_t$ is added to $S_{t-1}$ to give $S_t$, where $f$ is the CSS objective (Altschuler et al., 2016). The determinant maximization problem uses a similar loop, maximizing the volume of the current set (Gollapudi et al., 2023). For $k$-center clustering with $z$ outliers, a farthest-point greedy augmented to ignore extreme outliers is employed (Ding et al., 2019, Ding et al., 2023).
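The greedy loop for CSS can be sketched in a few lines of NumPy. This is a minimal illustration under stated assumptions, not the implementation of Altschuler et al. (2016): the function name `greedy_css`, the tolerance `tol`, and the residual-matrix bookkeeping are choices made here. It relies on the fact that adding a column with residual $r_j$ increases the captured Frobenius mass by $\lVert R^\top r_j \rVert^2 / \lVert r_j \rVert^2$, where $R$ is the residual of the data after projecting out the columns selected so far.

```python
import numpy as np

def greedy_css(A, r, tol=1e-12):
    """Greedily pick up to r column indices of A maximizing captured Frobenius mass.

    Illustrative sketch: the name, signature and tolerance are assumptions.
    """
    R = np.asarray(A, dtype=float).copy()   # residual of A w.r.t. selected columns
    selected = []
    for _ in range(r):
        norms = np.sum(R * R, axis=0)       # squared residual norm of each column
        G = R.T @ R                         # Gram matrix of the residual columns
        gains = np.zeros(R.shape[1])
        ok = norms > tol                    # skip columns already in the span
        # Marginal gain of column j: sum_i <r_i, r_j>^2 / ||r_j||^2
        gains[ok] = np.sum(G[:, ok] ** 2, axis=0) / norms[ok]
        j = int(np.argmax(gains))
        if gains[j] <= tol:                 # nothing left to capture
            break
        q = R[:, j] / np.sqrt(norms[j])     # new orthonormal direction
        R -= np.outer(q, q @ R)             # project it out of every column
        selected.append(j)
    return selected

# Small example: columns (3,0), (0,1), (1,1)
A = np.array([[3.0, 0.0, 1.0],
              [0.0, 1.0, 1.0]])
cols = greedy_css(A, 2)
```

Forming the full Gram matrix at every step costs $O(mn^2)$ and is only reasonable for small inputs; the cost-reduction techniques discussed later in the article address exactly this bottleneck.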
Recent research advances include:
- Randomized or "lazier-than-lazy" variants, which sample a subset of candidates per iteration to reduce computational cost at a slight cost in accuracy (expectation bounds still hold) (Altschuler et al., 2016).
- Composable/distributed algorithms, partitioning data across multiple workers, running local greedy coresets, and then merging via further rounds of greedy selection, with provable approximation factors depending on condition numbers of the objective matrix or local optimality gaps (Altschuler et al., 2016, Gollapudi et al., 2023, Ding et al., 2023).
- Submodular maximization context, in which the greedy method achieves a $(1 - 1/e)$-approximation for monotone submodular objectives under cardinality constraints. Many practical objectives (e.g., facility location, coverage) are amenable to this structure, as are unifications of coverage/density (Moser et al., 2025).
- Weakly submodular/approximate-submodular cases: in complex problems such as semi-supervised learning, the greedy selection function is shown to be approximately submodular, so meaningful approximation bounds are retained (Killamsetty et al., 2021, Zhang et al., 2024).
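The randomized variant in the list above can be illustrated on a facility-location objective. The following is a sketch under assumptions: the names `facility_location` and `stochastic_greedy`, the nonnegative similarity matrix `sim`, and the per-step sample size $\lceil (n/k)\log(1/\epsilon) \rceil$ (the usual choice in the stochastic-greedy literature) are introduced here for illustration and are not taken from the cited papers.

```python
import numpy as np

def facility_location(sim, S):
    """f(S) = sum over points of their best similarity to a selected item."""
    if len(S) == 0:
        return 0.0
    return float(sim[:, list(S)].max(axis=1).sum())

def stochastic_greedy(sim, k, eps=0.1, seed=0):
    """Pick k columns of a nonnegative similarity matrix `sim` (points x candidates).

    Each step scores only a random sample of the remaining candidates.
    """
    rng = np.random.default_rng(seed)
    n = sim.shape[1]
    sample_size = min(n, int(np.ceil(n / k * np.log(1.0 / eps))))
    available = np.ones(n, dtype=bool)
    best = np.zeros(sim.shape[0])            # current coverage of each point
    selected = []
    for _ in range(k):
        pool = np.flatnonzero(available)
        cand = rng.choice(pool, size=min(sample_size, pool.size), replace=False)
        # Marginal gain of each sampled candidate over the current coverage
        gains = np.maximum(sim[:, cand] - best[:, None], 0.0).sum(axis=0)
        j = int(cand[np.argmax(gains)])
        selected.append(j)
        available[j] = False
        best = np.maximum(best, sim[:, j])
    return selected

# With a tiny eps the sample covers every candidate, so this reduces to plain greedy.
sim = np.diag([4.0, 3.0, 2.0, 1.0])
picked = stochastic_greedy(sim, k=2, eps=1e-9)
```

Larger values of `eps` shrink the sample and therefore the number of gain evaluations per step, which is the accuracy-for-speed trade described above.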
3. Theoretical Guarantees and Approximation Bounds
State-of-the-art greedy coreset subsampling provides:
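- CSS (Frobenius mass): Theorem 3.1 of (Altschuler et al., 2016) states that greedy selection with $r = O\!\left(k / (\epsilon\, \sigma_{\min})\right)$ steps yields $f(S_r) \ge (1-\epsilon)\, f(\mathrm{OPT}_k)$, where $\sigma_{\min}$ is the smallest squared singular value of the optimal $k$-column submatrix; this bound is tight up to a constant.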
- Determinant maximization: Greedy yields an $O(k)^{3k}$-composable coreset; any set that is approximately locally optimal under single swaps produces such a coreset. The key local-optimality lemma shows that the volume increase from a single swap is at most a factor of $(1 + \sqrt{k})$ (Gollapudi et al., 2023).
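- $k$-center clustering: In doubling metrics (bounded doubling dimension), greedy selection with a suitably enlarged number of centers yields a coreset with provable approximation quality (Ding et al., 2019, Ding et al., 2023). For general metrics, the method gives a 2-approximation, with the size bound stated in the cited works.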
- Bayesian inference: Greedy geodesic ascent (GIGA) achieves geometric convergence of the approximation error: at coreset size $M$, the residual error decays as $O(\nu^{M})$ for some $0 < \nu < 1$ (Campbell et al., 2018); information-geometric and Riemannian variants further optimize the KL divergence to the exact posterior (Campbell et al., 2019).
- Semi-supervised learning and other weakly submodular settings: An approximate-submodularity parameter controls the approximation factor achieved by the stochastic greedy approach (Killamsetty et al., 2021).
Greedy methods are often (up to logarithmic or condition number factors) near-optimal among polynomial-time algorithms for these classes of objectives.
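For monotone submodular objectives under a cardinality constraint $k$, the classical greedy guarantee can be recovered in a few lines. The derivation below is the standard textbook argument rather than a result specific to the works cited here; $S_t$ (the greedy set after $t$ steps), $\mathrm{OPT}$ (an optimal set of size $k$) and $\delta_t$ are notation introduced for this illustration, and $f(\emptyset) = 0$ is assumed.

```latex
\begin{align*}
f(\mathrm{OPT}) - f(S_t)
  &\le f(S_t \cup \mathrm{OPT}) - f(S_t)
     && \text{(monotonicity)} \\
  &\le \sum_{o \in \mathrm{OPT}} \bigl[ f(S_t \cup \{o\}) - f(S_t) \bigr]
     && \text{(submodularity)} \\
  &\le k \,\bigl[ f(S_{t+1}) - f(S_t) \bigr]
     && \text{(greedy choice)}.
\end{align*}
Writing $\delta_t = f(\mathrm{OPT}) - f(S_t)$, this gives
$\delta_{t+1} \le (1 - 1/k)\,\delta_t$, hence
\[
\delta_k \le \Bigl(1 - \tfrac{1}{k}\Bigr)^{k} \delta_0 \le e^{-1} f(\mathrm{OPT})
\quad\Longrightarrow\quad
f(S_k) \ge \bigl(1 - 1/e\bigr)\, f(\mathrm{OPT}).
\]
```

The same recursion, with the factor $1/k$ weakened by an approximate-submodularity parameter, is the usual route to guarantees in the weakly submodular setting.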
4. Practical Implementations, Scaling, and Empirical Results
Efficient greedy coreset construction employs both algorithmic and numerical optimization:
- Marginal gain computation: Classical greedy runs $O(nr)$ marginal gain evaluations for $n$ candidates and $r$ selection steps, but random projections, projection-cost presketching, and lazy evaluations reduce this cost substantially (Altschuler et al., 2016).
- Maintenance of orthogonal bases: In determinant maximization or CSS, maintaining a QR factorization of the selected vectors enables efficient incremental implementations (Gollapudi et al., 2023).
- Distributed and composable strategies: Each partition builds a local coreset; further greedy aggregation is performed centrally, controlling overall communication and memory footprints (Altschuler et al., 2016, Ding et al., 2023).
- Selection in non-vectorial domains: In graphs, greedy selection operates on spectral/diffusion embeddings or random-walk projections, and can incorporate cost constraints, as in sensor placement (Vahidian et al., 2019, Ding et al., 2024).
- Gradient-based objectives: For instruction tuning or SSL, gradients or model update directions parameterize data utility; greedy routines operate in the projected gradient space, sometimes after clustering (Zhang et al., 2024).
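Two of the ideas above, incremental orthogonalization and two-round composable aggregation, can be combined in a short sketch for determinant maximization. This is an illustration under assumptions, not the procedure of Gollapudi et al. (2023): the function names, the random partitioning, and the Gram-Schmidt style residual update (used here in place of an explicit QR factorization) are choices made for this example.

```python
import numpy as np

def greedy_volume(X, k, tol=1e-12):
    """Greedy volume maximization over rows of X.

    Each step picks the row farthest from the span of the rows chosen so far,
    tracked through a residual matrix that is updated incrementally.
    """
    R = np.asarray(X, dtype=float).copy()
    chosen = []
    for _ in range(min(k, R.shape[0])):
        norms = np.einsum("ij,ij->i", R, R)   # squared distance to current span
        j = int(np.argmax(norms))
        if norms[j] <= tol:                   # remaining rows are already spanned
            break
        q = R[j] / np.sqrt(norms[j])
        R -= np.outer(R @ q, q)               # remove the new direction from all rows
        chosen.append(j)
    return chosen

def composable_greedy(X, k, num_parts, seed=0):
    """Round 1: local greedy per partition. Round 2: greedy on the union."""
    rng = np.random.default_rng(seed)
    parts = np.array_split(rng.permutation(X.shape[0]), num_parts)
    local = [part[greedy_volume(X[part], k)] for part in parts]
    union = np.concatenate(local)
    return union[greedy_volume(X[union], k)].tolist()

def log_volume_sq(X, idx):
    """Log of the squared volume (Gram determinant) of the selected rows."""
    _, logdet = np.linalg.slogdet(X[idx] @ X[idx].T)
    return float(logdet)

X = np.array([[10.0, 0.0], [0.0, 5.0], [1.0, 1.0], [9.0, 0.1]])
single = greedy_volume(X, 2)
merged = composable_greedy(X, 2, num_parts=2)
```

Only the local coresets (at most `k` rows per partition) are sent to the aggregation round, which is what keeps communication and memory bounded in the distributed setting.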
Empirically, greedy coreset methods have been shown to:
- Match or closely approximate full-data performance in regression, SVM training, clustering, and GNN training using only a small fraction of the data (Altschuler et al., 2016, Ding et al., 2024, Zhang et al., 2024).
- Yield 3–10× reductions in overall wall-clock time for large-scale learning tasks (Altschuler et al., 2016, Killamsetty et al., 2021).
- Remain robust in the presence of outliers, class imbalance, and low homophily in graph data (Ding et al., 2023, Ding et al., 2024).
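To make the outlier-robust behaviour concrete, here is a deliberately simplified, deterministic farthest-point greedy that skips the `z` farthest points at each step. It is an assumption-laden sketch and not the algorithm of Ding et al. (2019, 2023); the function name, the `first` seed index, and the rule of picking the $(z+1)$-th farthest point are illustrative choices.

```python
import numpy as np

def greedy_kcenter_outliers(X, k, z, first=0):
    """Farthest-point greedy for k centers that ignores the z farthest points.

    Returns the chosen center indices and the radius that covers all but z points.
    Requires 0 <= z < number of points.
    """
    n = X.shape[0]
    centers = [first]
    dist = np.linalg.norm(X - X[first], axis=1)    # distance to nearest center
    for _ in range(k - 1):
        order = np.argsort(dist)                   # ascending by distance
        j = int(order[n - 1 - z])                  # skip the z farthest points
        centers.append(j)
        dist = np.minimum(dist, np.linalg.norm(X - X[j], axis=1))
    radius = float(np.sort(dist)[n - 1 - z])       # covers all but z points
    return centers, radius

# Two tight clusters plus one far-away outlier
X = np.array([[0.0, 0.0], [0.0, 1.0], [10.0, 0.0], [10.0, 1.0], [100.0, 100.0]])
centers, radius = greedy_kcenter_outliers(X, k=2, z=1)
```

With `z=0` this reduces to the classical farthest-point heuristic, which would spend its second center on the outlier at (100, 100) instead of the second cluster.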
5. Variations: Submodular, Information-Geometric, and Weakly Submodular Greedy Coreset Methods
Beyond classic greedy maximization, several structural variations exist:
- Submodular greedy: Facility location or coverage objectives for deep learning pruning use submodular greedy algorithms, with extensions for integrating density or representativeness (e.g., SubZeroCore) (Moser et al., 2025).
- Geometric and Riemannian greedy: GIGA and its Riemannian extensions operate by greedy alignment on the unit sphere or under the Fisher information metric, crucial for Bayesian coresets and mean estimation on graphs (Campbell et al., 2018, Campbell et al., 2019, Vahidian et al., 2019).
- Weakly submodular maximization: For objectives not strictly submodular but satisfying approximate submodularity, e.g., in RETRIEVE for SSL, the greedy algorithm retains controlled approximation factors (Killamsetty et al., 2021).
- Clustering in feature or gradient space before greedy selection: Improves balance and coverage in heterogeneous data (e.g., TAGCOS) (Zhang et al., 2024).
These generalizations maintain the greedy character while addressing objective smoothness, curvature, or domain-specific structural constraints.
6. Limitations, Extensions, and Open Challenges
While greedy coreset subsampling is well-understood in several cases, limitations and future directions include:
- Dependence on objective curvature/condition number: In CSS, bounds scale inversely with the smallest singular value of the optimal submatrix, and in determinant maximization, the approximation gap can be exponential in $k$ (Altschuler et al., 2016, Gollapudi et al., 2023).
- Computational cost in high-dimensional settings: All-pairs similarity computations or large-scale QR updates may dominate unless structure is exploited (e.g., random projections, sketching) (Altschuler et al., 2016, Moser et al., 2025).
- Model-based vs. model-free selection: Some methods assume access to model gradients or feature spaces (e.g., TAGCOS, RETRIEVE), while others are fully data-driven or training-free (e.g., SubZeroCore).
- Non-monotone or highly non-modular objectives: Theory for greedy maximization in non-monotone or more general supermodular regimes remains open.
Empirical evidence and ablation studies suggest that, with appropriate design, greedy coresets are both scalable and robust to a range of real-world complexities.
7. Applications Across Domains
Contemporary greedy coreset subsampling is deployed in:
- Feature selection and dimensionality reduction via CSS and volume maximization (Altschuler et al., 2016, Gollapudi et al., 2023).
- Clustering and anomaly detection, including distributed and outlier-robust $k$-center clustering (Ding et al., 2019, Ding et al., 2023).
- Graph-based learning and mean estimation in sensor networks (Vahidian et al., 2019, Ding et al., 2024).
- Deep learning dataset pruning, model-in-the-loop data selection, and instruction tuning (Zhang et al., 2024, Killamsetty et al., 2021, Moser et al., 2025).
- Bayesian inference acceleration via summarizing log-likelihoods or sufficient statistics (Campbell et al., 2018, Campbell et al., 2019).
The greedy coreset paradigm provides a modular, theoretically grounded toolkit for principled data reduction in large-scale data analysis, statistical learning, and distributed computing.