
Core-Based Greedy Algorithms

Updated 5 April 2026
  • Core-based greedy algorithms are methods that construct composable core-sets from large datasets to preserve optimization objectives with strong theoretical guarantees.
  • They apply greedy selection strategies in tasks like determinant maximization, column subset selection, and k-center clustering with outliers, ensuring scalability in distributed environments.
  • Empirical evaluations reveal that these methods offer improved solution quality and computational efficiency compared to LP-based or SVD approaches in high-dimensional settings.

Core-based greedy algorithms refer to a class of methods that construct small, representative summaries (“core-sets”) of large datasets using greedy selection rules, with the aim of preserving optimization objectives of interest under composable or distributed frameworks. These algorithms have gained prominence in settings such as determinant maximization, column subset selection, and clustering with outliers, especially where parallelization, streaming, and scalability are required. They enable effective approximation guarantees by combining the structure of greedy algorithms with provably composable summaries, often exhibiting strong empirical performance and theoretical bounds in both centralized and distributed regimes.

1. Composable Core-sets: Foundations and Definitions

A composable core-set is a mapping $c$ assigning to any dataset $P$ a small subset $c(P) \subseteq P$, such that for any collection $\{P_i\}$, evaluating the global objective on the merged core-sets $\bigcup_i c(P_i)$ achieves a constant-factor approximation (or better) to the same objective on the full union $\bigcup_i P_i$. Formally, for an $\alpha$-composable core-set, it holds that for all $\{P_i\}$,

$$\mathrm{OPT}\left(\bigcup_i c(P_i)\right) \geq \frac{1}{\alpha} \cdot \mathrm{OPT}\left(\bigcup_i P_i\right)$$

where $\mathrm{OPT}(\cdot)$ is the optimization objective (the inequality is stated for a maximization objective), and a small core-set size $|c(P)|$ is desirable for computational and communication efficiency.

Randomized composable core-sets extend this to randomized constructions and expectations, particularly in distributed environments where data is partitioned and local summaries are aggregated. The key property is that good local coverage accumulates globally after union and an optional further greedy or local search refinement, enabling near-linear distributed computation for otherwise intractable objectives (Altschuler et al., 2016).
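
The partition, summarize, merge, and solve pattern can be sketched as follows. This is a minimal illustration, assuming a toy objective (the sum of the $k$ largest squared norms) chosen only because its optimum is easy to verify; the names `objective`, `local_coreset`, and `compose` are hypothetical and do not come from the cited work. For this toy objective the core-set happens to be exactly 1-composable, which is much stronger than what holds for the problems discussed below.

```python
import numpy as np

def objective(points, k):
    """Toy objective (an assumption): sum of the k largest squared norms."""
    sq = np.sort(np.sum(points**2, axis=1))[::-1]
    return float(np.sum(sq[:k]))

def local_coreset(part, k):
    """Summary of one part: its k points of largest norm."""
    order = np.argsort(-np.sum(part**2, axis=1))
    return part[order[:k]]

def compose(parts, k):
    """Merge the local core-sets and evaluate the objective on their union."""
    merged = np.vstack([local_coreset(p, k) for p in parts])
    return objective(merged, k)

rng = np.random.default_rng(0)
data = rng.normal(size=(200, 5))
parts = np.array_split(data, 4)          # stand-in for a distributed partition

full_value = objective(data, k=3)        # optimum on the full union
coreset_value = compose(parts, k=3)      # optimum on the merged core-sets
```

Here `coreset_value` equals `full_value` because every globally top-ranked point is also top-ranked inside its own part; only 12 of the 200 points ever leave their partition.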

2. Greedy Algorithms for Determinant Maximization and DPP MAP Inference

In determinant maximization (MAXDETPP1), given PP2 and PP3, the goal is to select PP4, PP5, maximizing PP6, which encodes the squared volume of the parallelepiped spanned by PP7. This problem coincides with MAP inference for PP8-DPPs when the kernel is PP9.

A greedy approach initializes c(P)Pc(P) \subseteq P0 and iteratively appends the point c(P)Pc(P) \subseteq P1 maximizing its distance from c(P)Pc(P) \subseteq P2, repeating c(P)Pc(P) \subseteq P3 times. This strategy preserves a geometric “directional height” guarantee: for any c(P)Pc(P) \subseteq P4-subspace c(P)Pc(P) \subseteq P5, c(P)Pc(P) \subseteq P6. Via a reduction (Corollary 3.2) any c(P)Pc(P) \subseteq P7-approximate coreset for c(P)Pc(P) \subseteq P8-directional height yields an c(P)Pc(P) \subseteq P9-composable core-set for MAXDET{Pi}\{P_i\}0.
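
The greedy rule described above can be sketched as follows, assuming the distance is measured to the linear span of the points selected so far and that the selection starts from the empty set. The function names `greedy_maxdet` and `squared_volume` are hypothetical, and the tolerance `1e-12` is an arbitrary choice for this sketch.

```python
import numpy as np

def greedy_maxdet(points, k):
    """Pick up to k row indices; each step takes the row furthest from the span so far."""
    residual = np.asarray(points, dtype=float).copy()
    chosen = []
    for _ in range(k):
        dist2 = np.sum(residual**2, axis=1)     # squared distance to the current span
        j = int(np.argmax(dist2))
        if dist2[j] <= 1e-12:                   # every remaining point lies in the span
            break
        q = residual[j] / np.sqrt(dist2[j])     # new orthonormal direction
        residual -= np.outer(residual @ q, q)   # project q out of every row
        chosen.append(j)
    return chosen

def squared_volume(points, idx):
    """Gram determinant: squared volume of the parallelepiped spanned by the rows."""
    V = np.asarray(points, dtype=float)[idx]
    return float(np.linalg.det(V @ V.T))

pts = np.array([[3.0, 0.0], [0.0, 2.0], [1.0, 0.0]])
S = greedy_maxdet(pts, k=2)
vol = squared_volume(pts, S)
```

On this small input the first pick is the longest vector and the second is the only point with a component orthogonal to it, so `S` is `[0, 1]` and the squared volume is 36.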

The greedy core-set thus provides an {Pi}\{P_i\}1 approximation, where {Pi}\{P_i\}2 is {Pi}\{P_i\}3. A subsequent local search algorithm—starting from the greedy solution and iteratively performing improving swaps—achieves a tighter {Pi}\{P_i\}4 bound, with improved directional height preservation {Pi}\{P_i\}5. These methods are practical and more memory/computation efficient than prior LP-based constructions, while achieving strong empirical and theoretical guarantees on standard datasets (Indyk et al., 2019).
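
A swap-based local search of the kind described above can be sketched as follows. This is an illustration under stated assumptions: the improvement threshold `eps` and the round cap `max_rounds` are hypothetical parameters, and the example starts from a deliberately poor solution rather than from a greedy one so that the swaps are visible.

```python
import itertools
import numpy as np

def squared_volume(points, idx):
    """Gram determinant of the selected rows."""
    V = points[list(idx)]
    return float(np.linalg.det(V @ V.T))

def local_search(points, start, eps=0.01, max_rounds=1000):
    """Swap one chosen point for one unchosen point while the volume grows by (1 + eps)."""
    current = list(start)
    best = squared_volume(points, current)
    for _ in range(max_rounds):
        improved = False
        outside = [j for j in range(len(points)) if j not in current]
        for pos, j in itertools.product(range(len(current)), outside):
            trial = current.copy()
            trial[pos] = j
            value = squared_volume(points, trial)
            if value > (1.0 + eps) * best:      # accept the first improving swap
                current, best, improved = trial, value, True
                break
        if not improved:                        # local optimum reached
            break
    return current, best

pts = np.array([[1.0, 0.0], [0.9, 0.1], [0.0, 1.0]])
solution, value = local_search(pts, start=[0, 1])
```

Starting from the nearly parallel pair `[0, 1]` (squared volume 0.01), two swaps lead to the orthogonal pair of rows 0 and 2 with squared volume 1. Requiring a multiplicative gain per swap is a common way to bound the number of iterations.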

Algorithm Core-set Size Approx. Factor Principle
Greedy {Pi}\{P_i\}6 {Pi}\{P_i\}7 Span-maximization
Local Search {Pi}\{P_i\}8 {Pi}\{P_i\}9 Swaps for det. gain
LP-based (ref) ic(Pi)\bigcup_i c(P_i)0 Near-optimal Linear programming

The effectiveness of these approaches is validated by experiments in which Local Search consistently improves determinant values over Greedy (5–13% higher when both are run offline; 1.9–9.6% higher when both are used as core-set constructions), at moderate additional runtime (Indyk et al., 2019).

3. Greedy Core-Set Methods in Column Subset Selection

Column Subset Selection (CSS) involves selecting ic(Pi)\bigcup_i c(P_i)1 columns from ic(Pi)\bigcup_i c(P_i)2 maximizing the explained variance ic(Pi)\bigcup_i c(P_i)3, with ic(Pi)\bigcup_i c(P_i)4 the projector onto ic(Pi)\bigcup_i c(P_i)5. The standard greedy (GCSS) iteratively picks the column offering the greatest incremental explained variance. Improved analysis yields a guarantee dependent only on the condition number ic(Pi)\bigcup_i c(P_i)6 of the optimal ic(Pi)\bigcup_i c(P_i)7-subset, as opposed to the worst-case over all ic(Pi)\bigcup_i c(P_i)8-subsets.
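
The standard greedy rule can be sketched as follows, assuming explained variance is the squared Frobenius norm of the matrix projected onto the span of the chosen columns. The function name `greedy_css` and the tolerance are hypothetical. The sketch keeps a residual matrix `E`, so the gain of a candidate column with residual $r_j$ is $\|E^\top r_j\|^2 / \|r_j\|^2$.

```python
import numpy as np

def greedy_css(A, k):
    """Pick up to k columns, each maximizing the increase in explained variance."""
    A = np.asarray(A, dtype=float)
    E = A.copy()                                 # residual after projecting out chosen columns
    chosen, explained = [], 0.0
    for _ in range(k):
        norms = np.sum(E**2, axis=0)             # squared residual norm per column
        gains = np.sum((E.T @ E)**2, axis=1)     # ||E^T r_j||^2 per column
        scores = np.where(norms > 1e-12, gains / np.maximum(norms, 1e-12), 0.0)
        j = int(np.argmax(scores))
        if scores[j] <= 0.0:                     # nothing left to explain
            break
        q = E[:, j] / np.sqrt(norms[j])          # unit vector along the residual column
        E -= np.outer(q, q @ E)                  # deflate: remove the q-component
        chosen.append(j)
        explained += float(scores[j])
    return chosen, explained

A = np.diag([3.0, 2.0, 1.0])
cols, explained = greedy_css(A, k=2)
```

For this diagonal matrix the total variance is 14, and the two selected columns (indices 0 and 1) explain 13 of it.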

Composable randomized greedy core-sets for CSS operate by partitioning data, running local greedy algorithms with “overshoot” (selecting ic(Pi)\bigcup_i c(P_i)9 columns per partition), merging summaries, then performing a final greedy pass to extract iPi\bigcup_i P_i0 columns. The guarantee is that, in expectation, the output achieves at least iPi\bigcup_i P_i1 of the optimum, which can be boosted to iPi\bigcup_i P_i2-approximation with iPi\bigcup_i P_i3 passes (Altschuler et al., 2016).
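
The partition, local greedy with overshoot, merge, and final greedy pipeline can be sketched as follows. Several simplifications are assumptions of this sketch: the overshoot is exposed as a free parameter `overshoot`, every partition scores its candidates against the full matrix (a real distributed deployment would not hold the full matrix on every worker), and the partitions are processed in a loop rather than in parallel. The names `greedy_css` and `dist_greedy` are hypothetical.

```python
import numpy as np

def greedy_css(A, B, candidates, k):
    """Greedily pick up to k columns of A (restricted to candidates) that best explain B."""
    candidates = list(candidates)
    E_A = np.asarray(A, dtype=float)[:, candidates].copy()   # residual candidate columns
    E_B = np.asarray(B, dtype=float).copy()                  # residual target
    chosen = []
    for _ in range(min(k, len(candidates))):
        norms = np.sum(E_A**2, axis=0)
        gains = np.sum((E_A.T @ E_B)**2, axis=1)
        scores = np.where(norms > 1e-12, gains / np.maximum(norms, 1e-12), 0.0)
        j = int(np.argmax(scores))
        if scores[j] <= 0.0:
            break
        q = E_A[:, j] / np.sqrt(norms[j])
        E_A -= np.outer(q, q @ E_A)
        E_B -= np.outer(q, q @ E_B)
        chosen.append(int(candidates[j]))
    return chosen

def dist_greedy(A, k, overshoot, n_parts, seed=0):
    """Random partition -> local greedy with overshoot -> merge -> final greedy."""
    rng = np.random.default_rng(seed)
    cols = rng.permutation(A.shape[1])
    merged = []
    for part in np.array_split(cols, n_parts):
        merged.extend(greedy_css(A, A, part, overshoot))     # local core-set
    return greedy_css(A, A, merged, k)                       # final pass on the union

A = np.diag([8.0, 7.0, 6.0, 5.0, 4.0, 3.0, 2.0, 1.0])
selected = dist_greedy(A, k=2, overshoot=3, n_parts=2)
```

With an overshoot of 3 per partition, the two globally best columns survive local selection regardless of how the random partition falls, so the final pass recovers columns 0 and 1.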

Experimentally, distributed greedy (DistGreedy) methods match or outperform classic baselines (e.g., SVD-based 2-Phase, PCA) in explained variance and downstream classification, particularly in high-dimensional or sparse settings, with 10–70× speedups on large datasets (Altschuler et al., 2016).

4. Core-based Greedy Algorithms for $k$-Center Clustering with Outliers

For the $k$-center problem with outliers, the objective is to select $k$ centers and discard a prescribed number of points as outliers so as to minimize the maximum cluster radius over the remaining points. Greedy core-set-based algorithms here are variants of Gonzalez’s algorithm, adapted for outlier robustness.

The bi-criteria greedy approach constructs a set α\alpha1 by:

  1. Randomly sampling initial points,
  2. Iteratively identifying the α\alpha2 furthest points to the current set,
  3. Sampling a small batch from these and adding them to α\alpha3.

After α\alpha4 rounds, α\alpha5 (of size α\alpha6) provides a α\alpha7-approximation in the relaxed α\alpha8-center sense, w.h.p. (Ding et al., 2019, Ding et al., 2023). Single-criterion versions exist for α\alpha9 with similar approximation.
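
The three steps above can be sketched as follows. All parameters are assumptions of this sketch rather than values taken from the cited work: `n_far` is the size of the pool of furthest points, `n_init` the number of initial random points, `batch` the number sampled per round, and `rounds` the number of rounds. The function names are hypothetical, and Euclidean distance is assumed.

```python
import numpy as np

def dist_to_set(X, idx):
    """Distance from every row of X to its nearest row among X[idx]."""
    diff = X[:, None, :] - X[idx][None, :, :]
    return np.sqrt(np.sum(diff**2, axis=2)).min(axis=1)

def radius_excluding(X, idx, m):
    """Largest distance to the selected set after discarding the m furthest points."""
    d = np.sort(dist_to_set(X, idx))
    return float(d[len(X) - m - 1])

def bicriteria_greedy(X, n_far, n_init=1, batch=2, rounds=5, seed=0):
    rng = np.random.default_rng(seed)
    # Step 1: random initial points.
    E = [int(i) for i in rng.choice(len(X), size=n_init, replace=False)]
    for _ in range(rounds):
        d = dist_to_set(X, E)
        far = np.argsort(d)[-n_far:]                       # Step 2: the n_far furthest points
        picks = rng.choice(far, size=batch, replace=False) # Step 3: sample a small batch
        E.extend(int(i) for i in picks)
    return E

rng = np.random.default_rng(1)
centers = np.array([[0.0, 0.0], [10.0, 0.0], [0.0, 10.0]])
clusters = [c + 0.5 * rng.normal(size=(100, 2)) for c in centers]
outliers = rng.uniform(-50.0, 50.0, size=(5, 2))
X = np.vstack(clusters + [outliers])

E = bicriteria_greedy(X, n_far=10)
r_start = radius_excluding(X, E[:1], m=10)   # radius with the initial point only
r_final = radius_excluding(X, E, m=10)       # radius with the full selected set
```

If the pool is taken somewhat larger than the number of outliers, part of every pool consists of inliers, which is why sampling from the pool is more robust than always taking the single furthest point. Adding points can only shrink each point's distance to the set, so `r_final` never exceeds `r_start`.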

Coreset construction in doubling metrics leverages the covering property: the inlier subset with doubling dimension {Pi}\{P_i\}0 can be covered by {Pi}\{P_i\}1 balls of radius {Pi}\{P_i\}2. The core-set comprises a weighted summary of representatives plus the furthest {Pi}\{P_i\}3 points, achieving {Pi}\{P_i\}4 additive error for any solution.
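
A weighted summary of this shape can be sketched as follows. This is only an illustration under several assumptions: the furthest points are measured from a preliminary center set passed in as `centers`, the cover is built as a simple greedy net of radius `r`, and the parameters `n_keep` and `r` are free inputs. The function name `weighted_coreset` is hypothetical.

```python
import numpy as np

def weighted_coreset(X, centers, n_keep, r):
    """Keep the n_keep points furthest from the centers; cover the rest by a greedy r-net."""
    diff = X[:, None, :] - X[centers][None, :, :]
    d = np.sqrt(np.sum(diff**2, axis=2)).min(axis=1)
    order = np.argsort(d)
    rest, kept = order[:len(X) - n_keep], order[len(X) - n_keep:]
    reps, weights = [], []
    for i in rest:
        if reps:
            dr = np.linalg.norm(X[reps] - X[i], axis=1)
            j = int(np.argmin(dr))
            if dr[j] <= r:                 # covered: fold the point into its representative
                weights[j] += 1
                continue
        reps.append(int(i))                # uncovered: becomes a new representative
        weights.append(1)
    idx = reps + [int(i) for i in kept]    # far points are kept individually
    w = weights + [1] * len(kept)
    return idx, w

rng = np.random.default_rng(1)
means = np.array([[0.0, 0.0], [10.0, 0.0], [0.0, 10.0]])
X = np.vstack([m + 0.5 * rng.normal(size=(100, 2)) for m in means]
              + [rng.uniform(-50.0, 50.0, size=(5, 2))])

idx, w = weighted_coreset(X, centers=[0], n_keep=5, r=1.0)
```

Every folded point lies within `r` of its representative, so by the triangle inequality the radius of any candidate solution changes by at most `r` when evaluated on the summary, which is the sense in which the error is additive. The weights sum to the original number of points.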

The resulting core-set size is {Pi}\{P_i\}5, enabling orders-of-magnitude reductions in downstream optimization time (5–30{Pi}\{P_i\}6 faster, preserving {Pi}\{P_i\}7 loss in radius at {Pi}\{P_i\}8). Distributed compositions use two communication rounds with {Pi}\{P_i\}9 points per site (Ding et al., 2023).

Variant Core-set Size Approx. Factor Metric Requirements
Bi-criteria OPT(ic(Pi))1αOPT(iPi)\mathrm{OPT} \left( \bigcup_i c(P_i) \right) \geq \frac{1}{\alpha} \cdot \mathrm{OPT} \left( \bigcup_i P_i \right)0 OPT(ic(Pi))1αOPT(iPi)\mathrm{OPT} \left( \bigcup_i c(P_i) \right) \geq \frac{1}{\alpha} \cdot \mathrm{OPT} \left( \bigcup_i P_i \right)1 General metrics
Doubling-core OPT(ic(Pi))1αOPT(iPi)\mathrm{OPT} \left( \bigcup_i c(P_i) \right) \geq \frac{1}{\alpha} \cdot \mathrm{OPT} \left( \bigcup_i P_i \right)2 Additive OPT(ic(Pi))1αOPT(iPi)\mathrm{OPT} \left( \bigcup_i c(P_i) \right) \geq \frac{1}{\alpha} \cdot \mathrm{OPT} \left( \bigcup_i P_i \right)3 Doubling dim. OPT(ic(Pi))1αOPT(iPi)\mathrm{OPT} \left( \bigcup_i c(P_i) \right) \geq \frac{1}{\alpha} \cdot \mathrm{OPT} \left( \bigcup_i P_i \right)4

5. Computational Complexity and Scalability

Core-based greedy algorithms are designed for strong computational efficiency, particularly as data cardinality or dimension grows.

  • Determinant maximization: Greedy runs in OPT(ic(Pi))1αOPT(iPi)\mathrm{OPT} \left( \bigcup_i c(P_i) \right) \geq \frac{1}{\alpha} \cdot \mathrm{OPT} \left( \bigcup_i P_i \right)5 total time, with OPT(ic(Pi))1αOPT(iPi)\mathrm{OPT} \left( \bigcup_i c(P_i) \right) \geq \frac{1}{\alpha} \cdot \mathrm{OPT} \left( \bigcup_i P_i \right)6 storage; Local Search incurs higher worst-case cost OPT(ic(Pi))1αOPT(iPi)\mathrm{OPT} \left( \bigcup_i c(P_i) \right) \geq \frac{1}{\alpha} \cdot \mathrm{OPT} \left( \bigcup_i P_i \right)7; kernelized versions require only inner-product queries (Indyk et al., 2019).
  • CSS: Standard greedy and its distributed variants exploit partitioned data and parallelism, with communication proportional to total core-set size, and critical dependence on the minimum singular value of the optimal subset (Altschuler et al., 2016).
  • OPT(ic(Pi))1αOPT(iPi)\mathrm{OPT} \left( \bigcup_i c(P_i) \right) \geq \frac{1}{\alpha} \cdot \mathrm{OPT} \left( \bigcup_i P_i \right)8-center with outliers: Each round of the greedy clustering costs OPT(ic(Pi))1αOPT(iPi)\mathrm{OPT} \left( \bigcup_i c(P_i) \right) \geq \frac{1}{\alpha} \cdot \mathrm{OPT} \left( \bigcup_i P_i \right)9 or OPT()\mathrm{OPT}(\cdot)0 in Euclidean OPT()\mathrm{OPT}(\cdot)1, total OPT()\mathrm{OPT}(\cdot)2. Coreset construction in doubling metrics is OPT()\mathrm{OPT}(\cdot)3; sublinear variants avoid dependency on OPT()\mathrm{OPT}(\cdot)4 (Ding et al., 2023, Ding et al., 2019).

Empirically, all surveyed algorithms demonstrate speed and memory advantages over global optimization or LP-based regimes, with the core-set approach yielding particular gains in downstream problem-solving stages.

6. Theoretical Guarantees and Proof Techniques

Core-based greedy algorithms leverage a range of theoretical foundations:

  • Directional height preservation enables reductions from geometric coverage objectives to volume/determinant maximization, with tight approximation factors analyzable via inductive and projection arguments (Indyk et al., 2019).
  • Martingale concentration and Azuma–Hoeffding inequalities are used in $k$-center analyses to show that all optimal clusters are covered rapidly with high probability (Ding et al., 2023, Ding et al., 2019).
  • Random partitioning analysis ensures that composable core-sets accumulate global coverage in distributed or streaming models; oversampling via “overshoot” parameters guarantees that optimal solutions are not lost during partition-local selection (Altschuler et al., 2016).
  • Core-set size and approximation guarantees for doubling metrics are obtained via successive covers: each optimal cluster is covered by bounded-size sets whose union forms the compressive summary, with error controlled additively or multiplicatively in the doubling dimension (Ding et al., 2023, Ding et al., 2019).

A plausible implication is that such structural proof techniques could be generalized to other non-submodular or high-dimensional objectives, leveraging metric or spectral properties unique to problem classes.

7. Experimental Evaluation and Applications

Extensive empirical studies across determinant maximization (Indyk et al., 2019), column subset selection (Altschuler et al., 2016), and $k$-center with outliers (Ding et al., 2023, Ding et al., 2019) demonstrate:

  • Consistent improvements in solution quality (e.g., determinant maximization, $k$-center radius) versus baseline greedy or LP methods at competitive or lower runtime.
  • Strong scaling in high-dimensional and sparse regimes, with distributed core-set aggregation matching or exceeding centralized solution quality with near-linear cost.
  • Robustness to outliers and flexibility in real-world data, with summary sizes as low as 10–25% of dataset size preserving objective value within OPT()\mathrm{OPT}(\cdot)8.
  • Practicality on both synthetic and real datasets, including MNIST, GENES, news20.binary, Shuttle, Covertype, KDD-99, and Poker.

These findings indicate that core-based greedy algorithms are well-suited for large-scale, distributed, and high-dimensional data analysis tasks wherever composable summaries with strong theoretical guarantees are required.
