Core-Based Greedy Algorithms

Updated 5 April 2026

Core-based greedy algorithms are methods that construct composable core-sets from large datasets to preserve optimization objectives with strong theoretical guarantees.
They apply greedy selection strategies in tasks like determinant maximization, column subset selection, and k-center clustering with outliers, ensuring scalability in distributed environments.
Empirical evaluations reveal that these methods offer improved solution quality and computational efficiency compared to LP-based or SVD approaches in high-dimensional settings.

Core-based greedy algorithms refer to a class of methods that construct small, representative summaries (“core-sets”) of large datasets using greedy selection rules, with the aim of preserving optimization objectives of interest under composable or distributed frameworks. These algorithms have gained prominence in settings such as determinant maximization, column subset selection, and clustering with outliers, especially where parallelization, streaming, and scalability are required. They enable effective approximation guarantees by combining the structure of greedy algorithms with provably composable summaries, often exhibiting strong empirical performance and theoretical bounds in both centralized and distributed regimes.

1. Composable Core-sets: Foundations and Definitions

A composable core-set is a mapping $c$ assigning to any dataset $P$ a small subset $c(P) \subseteq P$, such that for any collection $\{P_i\}$, evaluating the global objective on the merged core-sets $\bigcup_i c(P_i)$ achieves a constant-factor approximation (or better) to the same objective on the full union $\bigcup_i P_i$. Formally, for an $\alpha$-composable core-set of a maximization objective, it holds that for all $\{P_i\}$,

$\mathrm{OPT} \left( \bigcup_i c(P_i) \right) \geq \frac{1}{\alpha} \cdot \mathrm{OPT} \left( \bigcup_i P_i \right)$

where $\mathrm{OPT}(\cdot)$ is the optimal value of the objective on the given point set, and $|c(P)| \ll |P|$ is desirable for computational and communication efficiency.

Randomized composable core-sets extend this to randomized constructions and expectations, particularly in distributed environments where data is partitioned and local summaries are aggregated. The key property is that good local coverage accumulates globally after union and an optional further greedy or local search refinement, enabling near-linear distributed computation for otherwise intractable objectives (Altschuler et al., 2016).
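The summarize, merge, and refine pattern described above can be sketched as follows. This is a minimal illustration, not the construction of any cited paper: the function names and the choice of farthest-point selection as the local summary are assumptions made for the example.

```python
import numpy as np

def farthest_point_summary(points, size):
    """Hypothetical local summary: greedy farthest-point selection."""
    chosen = [0]
    dist = np.linalg.norm(points - points[0], axis=1)
    while len(chosen) < min(size, len(points)):
        nxt = int(np.argmax(dist))                 # point farthest from the summary so far
        chosen.append(nxt)
        dist = np.minimum(dist, np.linalg.norm(points - points[nxt], axis=1))
    return points[chosen]

def solve_via_coresets(partitions, size, summarize=farthest_point_summary):
    """Summarize each partition independently, merge, then refine on the union."""
    local = [summarize(part, size) for part in partitions]   # one summary per site
    merged = np.vstack(local)                                # union of the core-sets
    return summarize(merged, size), merged                   # final pass sees only the union
```

Only the merged summaries are communicated, so the final pass works on a number of points proportional to the number of partitions times `size`, not on the full dataset.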

2. Greedy Algorithms for Determinant Maximization and DPP MAP Inference

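In determinant maximization (MAXDET), given a point set $P \subset \mathbb{R}^d$ and an integer $k$, the goal is to select $S \subseteq P$ with $|S| = k$ maximizing $\det(G_S)$, where $G_S$ is the $k \times k$ Gram matrix of inner products between the points of $S$. This determinant equals the squared volume of the parallelepiped spanned by $S$. The problem coincides with MAP inference for $k$-DPPs when the kernel is the Gram matrix of $P$.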

A greedy approach initializes $C = \emptyset$ and iteratively appends the point $p \in P$ that maximizes the distance from the span of $C$, repeating $k$ times. This strategy preserves a geometric “directional height” guarantee: for any subspace $H$ of the relevant dimension, the largest distance from a core-set point to $H$ approximates the largest distance from any point of $P$ to $H$, up to a factor stated in Indyk et al. (2019). Via a reduction (Corollary 3.2), an approximate core-set for directional height yields a composable core-set for MAXDET, with the approximation factor determined by the quality of the height guarantee.
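The greedy rule can be sketched in a few lines of NumPy. This is an illustrative sketch that assumes the input has rank at least `k`; the function names are hypothetical, and the deflation step (subtracting the component along each chosen direction) is one common way to track distance from the span.

```python
import numpy as np

def greedy_maxdet(points, k):
    """Pick k rows; each step takes the row farthest from the span of earlier picks."""
    residual = np.array(points, dtype=float)   # components orthogonal to the current span
    chosen = []
    for _ in range(k):
        norms = np.linalg.norm(residual, axis=1)    # distance of each row from the span
        j = int(np.argmax(norms))
        chosen.append(j)
        q = residual[j] / norms[j]                  # assumes norms[j] > 0 (rank >= k)
        residual = residual - np.outer(residual @ q, q)
    return chosen

def squared_volume(points, idx):
    """Determinant of the Gram matrix of the selected rows."""
    S = np.asarray(points, dtype=float)[idx]
    return float(np.linalg.det(S @ S.T))
```

For the rows `[[3, 0], [0, 2], [1, 1]]` with `k = 2`, the first pick is the longest row and the second is the row with the largest component orthogonal to it, giving a squared volume of 36.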

The greedy core-set thus provides an $O(C^{k^2})$ approximation, where $C$ is a constant. A subsequent local search algorithm, starting from the greedy solution and iteratively performing improving swaps, achieves a tighter $O(k)^{2k}$ bound through improved directional height preservation. These methods are practical and more memory- and computation-efficient than prior LP-based constructions, while achieving strong empirical and theoretical guarantees on standard datasets (Indyk et al., 2019).

Algorithm	Core-set Size	Approx. Factor	Principle
Greedy	$k$	$O(C^{k^2})$	Span-maximization
Local Search	$k$	$O(k)^{2k}$	Swaps for det. gain
LP-based (ref)	see Indyk et al. (2019)	Near-optimal	Linear programming

The effectiveness of these approaches is validated experimentally: Local Search consistently improves determinant values over Greedy, by 5–13% when run offline on the full data and by 1.9–9.6% when used to build core-sets, at moderate additional runtime (Indyk et al., 2019).
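The swap-based refinement can be sketched as below. This is a simplified illustration rather than the analyzed algorithm: the improvement threshold `eps`, the iteration cap `max_iters`, and the brute-force recomputation of each determinant are assumptions chosen for clarity, not efficiency.

```python
import numpy as np
from itertools import product

def gram_det(points, idx):
    """Determinant of the Gram matrix of the selected rows."""
    S = points[list(idx)]
    return float(np.linalg.det(S @ S.T))

def local_search_maxdet(points, start, eps=1e-3, max_iters=1000):
    """Swap one selected row for an unselected one while the determinant improves."""
    points = np.asarray(points, dtype=float)
    current = list(start)
    best = gram_det(points, current)
    for _ in range(max_iters):
        improved = False
        for pos, cand in product(range(len(current)), range(len(points))):
            if cand in current:
                continue
            trial = current.copy()
            trial[pos] = cand
            val = gram_det(points, trial)
            if val > (1 + eps) * best:      # accept only a clear improvement
                current, best, improved = trial, val, True
                break
        if not improved:                    # no improving swap left: local optimum
            break
    return current, best
```

The `start` argument would typically be the output of a greedy pass, so the search begins from a solution that is already reasonable.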

3. Greedy Core-Set Methods in Column Subset Selection

Column Subset Selection (CSS) involves selecting a set $S$ of $k$ columns from a matrix $A \in \mathbb{R}^{m \times n}$ to maximize the explained variance $\|\Pi_S A\|_F^2$, with $\Pi_S$ the orthogonal projector onto the span of the selected columns. The standard greedy algorithm (GCSS) iteratively picks the column offering the greatest incremental explained variance. Improved analysis yields a guarantee dependent only on the condition number of the optimal $k$-subset, as opposed to the worst case over all $k$-subsets.
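The incremental gain used by the greedy rule has a closed form, which the following short derivation makes explicit. The symbols $a_j$ (a candidate column of $A$), $r_j$ (its residual), and $S$ (the current selection, with orthogonal projector $\Pi_S$ onto its span) are notation introduced here for illustration, and $r_j \neq 0$ is assumed.

$r_j = (I - \Pi_S)\, a_j, \qquad \Pi_{S \cup \{j\}} = \Pi_S + \frac{r_j r_j^\top}{\|r_j\|_2^2}$

$\|\Pi_{S \cup \{j\}} A\|_F^2 - \|\Pi_S A\|_F^2 = \frac{\|A^\top r_j\|_2^2}{\|r_j\|_2^2}$

The second identity follows because $r_j$ is orthogonal to the span of $S$, so the two projectors contribute independently to the Frobenius norm. A greedy step therefore only needs the residuals of the candidate columns, not a fresh projection per candidate.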

Composable randomized greedy core-sets for CSS operate by partitioning the data, running local greedy algorithms with “overshoot” (selecting more than $k$ columns per partition), merging the summaries, and then performing a final greedy pass to extract the output columns. The guarantee is that, in expectation, the output achieves at least a fixed fraction of the optimum, and this fraction can be improved with additional passes; the precise overshoot, fraction, and pass count are given in Altschuler et al. (2016).
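The partition, overshoot, merge, and final-pass pipeline can be sketched as follows. This is a simplified illustration under stated assumptions: each site scores candidates against its own block of columns, and the names `greedy_columns`, `distributed_css`, and the `overshoot` multiplier are hypothetical choices for the example, not the parameters analyzed in the cited paper.

```python
import numpy as np

def greedy_columns(candidates, target, r):
    """Greedily choose up to r columns of `candidates` that best explain `target`."""
    R = np.array(target, dtype=float)       # part of target not yet explained
    C = np.array(candidates, dtype=float)   # candidate residuals, deflated in step
    chosen = []
    for _ in range(min(r, C.shape[1])):
        norms2 = np.sum(C * C, axis=0)
        gains = np.sum((C.T @ R) ** 2, axis=1) / np.maximum(norms2, 1e-12)
        gains[norms2 < 1e-10] = 0.0          # columns already in the span add nothing
        j = int(np.argmax(gains))
        if gains[j] <= 0:
            break
        q = C[:, j] / np.sqrt(norms2[j])
        R -= np.outer(q, q @ R)              # remove the newly explained direction
        C -= np.outer(q, q @ C)
        chosen.append(j)
    return chosen

def explained_variance(A, cols):
    """Squared Frobenius norm of A projected onto the span of the chosen columns."""
    Q, _ = np.linalg.qr(A[:, cols])
    return float(np.sum((Q.T @ A) ** 2))

def distributed_css(A, k, n_parts, overshoot=2, seed=0):
    """Random column partition, local greedy with overshoot, merge, final greedy."""
    rng = np.random.default_rng(seed)
    parts = np.array_split(rng.permutation(A.shape[1]), n_parts)
    merged = []
    for idx in parts:
        block = A[:, idx]
        local = greedy_columns(block, block, overshoot * k)   # overshoot: more than k
        merged.extend(idx[local].tolist())
    merged = np.array(merged)
    final = greedy_columns(A[:, merged], A, k)                # final pass on merged summary
    return merged[final].tolist()
```

The final pass scores only the merged candidates but measures explained variance against the full matrix, which is why the local overshoot matters: it lowers the chance that a useful column is discarded before the merge.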

Experimentally, distributed greedy (DistGreedy) methods match or outperform classic baselines (e.g., SVD-based 2-Phase, PCA) in explained variance and downstream classification, particularly in high-dimensional or sparse settings, with 10–70$\times$ speedups on large datasets (Altschuler et al., 2016).

4. Core-based Greedy Algorithms for $k$-Center Clustering with Outliers

For the $k$-center with $z$ outliers problem, the objective is to select $k$ centers $C$ and remove $z$ points so as to minimize the maximum cluster radius over the non-outliers. Greedy core-set-based algorithms here are variants of Gonzalez’s algorithm, adapted for outlier robustness.

The bi-criteria greedy approach constructs a candidate center set $E$ by:

Randomly sampling initial points,
Iteratively identifying the points furthest from the current set (slightly more than $z$ of them, so that not all can be outliers),
Sampling a small batch from these and adding them to $E$.

After a bounded number of rounds, $E$ provides a constant-factor approximation in the relaxed $k$-center sense, where more than $k$ centers may be returned, with high probability (Ding et al., 2019, Ding et al., 2023). Single-criterion versions, which return exactly $k$ centers, exist under additional conditions with a similar approximation.
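The three steps above can be sketched as follows. This is a minimal illustration, not the analyzed algorithm: the candidate pool size `(1 + eps) * z`, the batch size, and the number of rounds are hypothetical parameters chosen for the example.

```python
import numpy as np

def bicriteria_kcenter_outliers(points, z, rounds, eps=0.5, batch=3, seed=0):
    """Randomized greedy center selection that tolerates z outliers."""
    rng = np.random.default_rng(seed)
    n = len(points)
    init = rng.choice(n, size=min(batch, n), replace=False)   # random initial points
    centers = [int(i) for i in init]
    dist = np.min(
        np.linalg.norm(points[:, None, :] - points[init][None, :, :], axis=2), axis=1
    )
    m = max(1, min(n, int(np.ceil((1 + eps) * z))))   # assumed pool: a bit more than z
    for _ in range(rounds):
        far = np.argsort(dist)[-m:]                    # the m points farthest from centers
        picks = rng.choice(far, size=min(batch, m), replace=False)
        for p in picks:
            centers.append(int(p))
            dist = np.minimum(dist, np.linalg.norm(points - points[p], axis=1))
    radius = float(np.sort(dist)[n - z - 1])           # radius after dropping z farthest
    return centers, radius
```

Because the pool holds more than `z` points, at least one of them is an inlier whenever some cluster is still uncovered, so random sampling from the pool makes progress without ever identifying the outliers explicitly.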

Coreset construction in doubling metrics leverages the covering property: when the inliers have doubling dimension $\rho$, each optimal cluster can be covered by $(1/\epsilon)^{O(\rho)}$ balls whose radius is an $\epsilon$ fraction of the cluster radius. The core-set comprises a weighted summary of representatives plus the furthest points, which are kept individually, and its error for any solution is additive and controlled by $\epsilon$.
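The covering count behind this construction follows from applying the doubling property repeatedly. The sketch below uses the standard definition (every ball of radius $r$ is covered by at most $2^{\rho}$ balls of radius $r/2$); the symbols $\rho$ (doubling dimension), $\epsilon$ (target radius fraction), $N(\epsilon)$, and $t$ are notation used here for illustration, and constants in the cited papers may differ.

$t = \lceil \log_2 (1/\epsilon) \rceil \;\Rightarrow\; r / 2^{t} \le \epsilon r$

$N(\epsilon) \le (2^{\rho})^{t} \le 2^{\rho (\log_2(1/\epsilon) + 1)} = (2/\epsilon)^{\rho}$

So a single cluster of radius $r$ needs at most $(2/\epsilon)^{\rho}$ balls of radius $\epsilon r$, and picking one weighted representative per ball gives a summary whose size does not grow with the number of points in the cluster.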

The resulting core-set size is bounded in terms of $k$, $z$, $\epsilon$, and $\rho$, enabling large reductions in downstream optimization time (5–30$\times$ faster, with only a small loss in radius). Distributed compositions use two communication rounds with a small number of points sent per site (Ding et al., 2023).
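A weighted summary of this kind can be sketched as below. This is an illustrative simplification, not the construction from the cited papers: it assumes a preliminary set of `centers` is already available, and the names `build_coreset`, `keep_far`, and `cell_radius` are hypothetical.

```python
import numpy as np

def build_coreset(points, centers, keep_far, cell_radius):
    """Weighted representatives for near points, plus the farthest points kept as-is."""
    n = len(points)
    d = np.min(
        np.linalg.norm(points[:, None, :] - centers[None, :, :], axis=2), axis=1
    )
    order = np.argsort(d)
    near, far = order[: n - keep_far], order[n - keep_far:]
    reps, weights = [], []
    for i in near:
        if reps:
            gaps = np.linalg.norm(points[reps] - points[i], axis=1)
            j = int(np.argmin(gaps))
            if gaps[j] <= cell_radius:      # snap the point to an existing representative
                weights[j] += 1
                continue
        reps.append(int(i))                 # otherwise it starts a new cell
        weights.append(1)
    idx = np.array(reps + far.tolist())
    w = np.array(weights + [1] * len(far))  # far points keep unit weight
    return points[idx], w
```

Every near point lies within `cell_radius` of its representative, so replacing it by that representative changes any clustering radius by at most `cell_radius`, while the far points are preserved exactly because they are the outlier candidates.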

Variant	Core-set Size	Approx. Factor	Metric Requirements
Bi-criteria	More than $k$ centers	Constant factor (relaxed)	General metrics
Doubling-core	Bounded in $k$, $z$, $\epsilon$, $\rho$	Additive, controlled by $\epsilon$	Doubling dim. $\rho$

5. Computational Complexity and Scalability

Core-based greedy algorithms are designed for strong computational efficiency, particularly as data cardinality or dimension grows.

Determinant maximization: Greedy has low running time and storage requirements, while Local Search incurs a higher worst-case cost because of its repeated swap evaluations; kernelized versions require only inner-product queries (Indyk et al., 2019).
CSS: Standard greedy and its distributed variants exploit partitioned data and parallelism, with communication proportional to total core-set size, and critical dependence on the minimum singular value of the optimal subset (Altschuler et al., 2016).
$k$-center with outliers: Each round of the greedy clustering requires a pass over the data to update distances to the current center set, with a per-distance cost proportional to the dimension $d$ in Euclidean $\mathbb{R}^d$. Coreset construction in doubling metrics is analyzed separately, and sublinear variants avoid dependency on the input size $n$ (Ding et al., 2023, Ding et al., 2019).

Empirically, all surveyed algorithms demonstrate speed and memory advantages over global optimization or LP-based regimes, with the core-set approach yielding particular gains in downstream problem-solving stages.

6. Theoretical Guarantees and Proof Techniques

Core-based greedy algorithms leverage a range of theoretical foundations:

Directional height preservation enables reductions from geometric coverage objectives to volume/determinant maximization, with tight approximation factors analyzable via inductive and projection arguments (Indyk et al., 2019).
Martingale concentration and Azuma–Hoeffding inequalities are used in $k$-center analyses to show that all optimal clusters are covered rapidly with high probability (Ding et al., 2023, Ding et al., 2019).
Random partitioning analysis ensures that composable core-sets accumulate global coverage in distributed or streaming models; oversampling via “overshoot” parameters guarantees that optimal solutions are not lost during partition-local selection (Altschuler et al., 2016).
Core-set size and approximation guarantees for doubling metrics are obtained via successive covers: each optimal cluster is covered by bounded-size sets whose union forms the compressive summary, with error controlled additively or multiplicatively in the doubling dimension (Ding et al., 2023, Ding et al., 2019).

A plausible implication is that such structural proof techniques could be generalized to other non-submodular or high-dimensional objectives, leveraging metric or spectral properties unique to problem classes.

7. Experimental Evaluation and Applications

Extensive empirical studies across determinant maximization (Indyk et al., 2019), column subset selection (Altschuler et al., 2016), and $k$-center with outliers (Ding et al., 2023, Ding et al., 2019) demonstrate:

Consistent improvements in solution quality (e.g., determinant value, $k$-center radius) versus baseline greedy or LP methods at competitive or lower runtime.
Strong scaling in high-dimensional and sparse regimes, with distributed core-set aggregation matching or exceeding centralized solution quality with near-linear cost.
Robustness to outliers and flexibility in real-world data, with summary sizes as low as 10–25% of dataset size preserving the objective value within a small relative error.
Practicability in both synthetic and real datasets, including MNIST, GENES, news20.binary, Shuttle, Covertype, KDD-99, and Poker.

These findings indicate that core-based greedy algorithms are well-suited for large-scale, distributed, and high-dimensional data analysis tasks wherever composable summaries with strong theoretical guarantees are required.


References

Altschuler et al. (2016). Greedy Column Subset Selection: New Bounds and Distributed Algorithms.

Indyk et al. (2019). Composable Core-sets for Determinant Maximization: A Simple Near-Optimal Algorithm.

Ding et al. (2019). Greedy Strategy Works for $k$-Center Clustering with Outliers and Coreset Construction.

Ding et al. (2023). Randomized Greedy Algorithms and Composable Coreset for $k$-Center Clustering with Outliers.
