Core-Based Greedy Algorithms
- Core-based greedy algorithms are methods that construct composable core-sets from large datasets to preserve optimization objectives with strong theoretical guarantees.
- They apply greedy selection strategies in tasks like determinant maximization, column subset selection, and k-center clustering with outliers, ensuring scalability in distributed environments.
- Empirical evaluations reveal that these methods offer improved solution quality and computational efficiency compared to LP-based or SVD approaches in high-dimensional settings.
Core-based greedy algorithms refer to a class of methods that construct small, representative summaries (“core-sets”) of large datasets using greedy selection rules, with the aim of preserving optimization objectives of interest under composable or distributed frameworks. These algorithms have gained prominence in settings such as determinant maximization, column subset selection, and clustering with outliers, especially where parallelization, streaming, and scalability are required. They enable effective approximation guarantees by combining the structure of greedy algorithms with provably composable summaries, often exhibiting strong empirical performance and theoretical bounds in both centralized and distributed regimes.
1. Composable Core-sets: Foundations and Definitions
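A composable core-set is a mapping $c$ assigning to any dataset $P$ a small subset $c(P) \subseteq P$, such that for any collection $P_1, \dots, P_m$, evaluating the global objective on the merged core-sets $c(P_1) \cup \dots \cup c(P_m)$ achieves a constant-factor approximation (or better) to the same objective on the full union $P_1 \cup \dots \cup P_m$. Formally, for an $\alpha$-composable core-set (stated here for a maximization objective), it holds that for all $P_1, \dots, P_m$,
$$f\big(c(P_1) \cup \dots \cup c(P_m)\big) \;\ge\; \frac{1}{\alpha}\, f\big(P_1 \cup \dots \cup P_m\big),$$
where $f$ is the optimization objective, and $|c(P)| \ll |P|$ is desirable for computational and communication efficiency.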
Randomized composable core-sets extend this to randomized constructions and expectations, particularly in distributed environments where data is partitioned and local summaries are aggregated. The key property is that good local coverage accumulates globally after union and an optional further greedy or local search refinement, enabling near-linear distributed computation for otherwise intractable objectives (Altschuler et al., 2016).
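The composition pattern can be illustrated with a deliberately simple toy objective. The sketch below is an assumption-laden illustration, not a method from the cited papers: the objective is the sum of the `k` largest values, for which keeping the `k` largest values of each partition is an exact ($\alpha = 1$) composable core-set. The names `compose`, `top_k_coreset`, and `top_k_sum` are hypothetical.

```python
from heapq import nlargest


def top_k_coreset(part, k):
    """Local summary: the k largest values of one partition."""
    return nlargest(k, part)


def top_k_sum(values, k):
    """Toy objective: sum of the k largest values."""
    return sum(nlargest(k, values))


def compose(partitions, coreset_fn, solve_fn):
    """Summarize each partition locally, merge the summaries, solve on the union."""
    merged = []
    for part in partitions:
        merged.extend(coreset_fn(part))
    return solve_fn(merged)
```

Because every globally top-`k` value is also among the top `k` of its own partition, solving on the merged summaries loses nothing here. The objectives discussed in the following sections only admit approximate guarantees, but the summarize, merge, and solve structure is the same.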
2. Greedy Algorithms for Determinant Maximization and DPP MAP Inference
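In determinant maximization (MAXDET), given a set $P$ of $n$ vectors in $\mathbb{R}^d$ and an integer $k$, the goal is to select $S \subseteq P$, $|S| = k$, maximizing $\det(V_S^\top V_S)$, where $V_S$ is the matrix whose columns are the vectors of $S$; this determinant encodes the squared volume of the parallelepiped spanned by $S$. This problem coincides with MAP inference for $k$-DPPs when the kernel is the Gram matrix $V^\top V$ of the data.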
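A greedy approach initializes $C = \emptyset$ and iteratively appends the point $p \in P$ maximizing its distance from $\mathrm{span}(C)$, repeating $k$ times. This strategy preserves a geometric “directional height” guarantee: for any $(k-1)$-dimensional subspace $H$, the farthest point of $C$ from $H$ is within a bounded factor of the farthest point of $P$ from $H$. Via a reduction (Corollary 3.2), any approximate coreset for $k$-directional height yields a composable core-set for MAXDET, with an approximation factor determined by the directional-height factor and $k$.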
The greedy core-set thus provides an $O(C^{k^2})$ approximation, where $C$ is a constant. A subsequent local search algorithm, starting from the greedy solution and iteratively performing improving swaps, achieves a tighter $O(k)^{2k}$ bound, with correspondingly improved directional height preservation. These methods are practical and more memory- and computation-efficient than prior LP-based constructions, while achieving strong empirical and theoretical guarantees on standard datasets (Indyk et al., 2019).
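The two procedures described above can be sketched in NumPy as follows. This is a minimal illustration under stated assumptions, not the authors' implementation: points are rows of an array, the function names are hypothetical, and the swap threshold `eps` (accept a swap only if the squared volume grows by a factor of at least `1 + eps`) and the cap `max_iters` are illustrative choices.

```python
import numpy as np


def greedy_maxdet(points, k):
    """Repeatedly take the point farthest from the span of the points chosen so far."""
    residual = np.array(points, dtype=float)  # working copy; rows are points
    chosen = []
    for _ in range(k):
        dists = np.linalg.norm(residual, axis=1)  # distance of each point to the current span
        idx = int(np.argmax(dists))
        if dists[idx] <= 1e-12:  # all remaining points already lie in the span
            break
        chosen.append(idx)
        direction = residual[idx] / dists[idx]
        # One Gram-Schmidt step: remove the new direction from every point.
        residual -= np.outer(residual @ direction, direction)
    return chosen


def log_sq_volume(points, idx):
    """Log of det(V V^T) for the selected rows; -inf if they are linearly dependent."""
    V = np.asarray(points, dtype=float)[list(idx)]
    sign, logdet = np.linalg.slogdet(V @ V.T)
    return logdet if sign > 0 else -np.inf


def local_search_maxdet(points, start, eps=1e-3, max_iters=1000):
    """Swap one selected point for an unselected one while the volume improves enough."""
    current = list(start)
    best = log_sq_volume(points, current)
    for _ in range(max_iters):
        improved = False
        for pos in range(len(current)):
            for cand in range(len(points)):
                if cand in current:
                    continue
                trial = current.copy()
                trial[pos] = cand
                value = log_sq_volume(points, trial)
                if value > best + np.log1p(eps):  # improvement by a (1 + eps) factor
                    current, best, improved = trial, value, True
                    break
            if improved:
                break
        if not improved:
            break
    return current
```

A typical use would be `local_search_maxdet(points, greedy_maxdet(points, k))`, so that local search starts from the greedy solution. Working with log-determinants avoids overflow when the volumes are very large or very small.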
| Algorithm | Core-set Size | Approx. Factor | Principle |
|---|---|---|---|
| Greedy | $k$ | $O(C^{k^2})$ | Span-maximization |
| Local Search | $k$ | $O(k)^{2k}$ | Swaps for det. gain |
| LP-based (ref) | $\tilde{O}(k)$ | Near-optimal | Linear programming |
The effectiveness of these approaches is validated by experimental results where Local Search consistently improves determinant values compared to Greedy (offline LS vs. GD, 5–13% higher; as core-sets, 1.9–9.6% improvement), at moderate additional runtime (Indyk et al., 2019).
3. Greedy Core-Set Methods in Column Subset Selection
Column Subset Selection (CSS) involves selecting a set $S$ of $k$ columns from a matrix $A$ maximizing the explained variance $\|\Pi_S A\|_F^2$, with $\Pi_S$ the projector onto $\mathrm{span}(S)$. The standard greedy algorithm (GCSS) iteratively picks the column offering the greatest incremental explained variance. Improved analysis yields a guarantee dependent only on the condition number of the optimal $k$-subset, as opposed to the worst case over all $k$-subsets.
Composable randomized greedy core-sets for CSS operate by randomly partitioning the data, running local greedy algorithms with “overshoot” (selecting more columns per partition than the final budget), merging the summaries, then performing a final greedy pass on the merged set to extract the output columns. The guarantee holds in expectation relative to the optimum and can be strengthened with additional passes (Altschuler et al., 2016).
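The partition, local greedy, merge, and final greedy pipeline can be sketched as below. This is an illustrative sketch under assumptions rather than the algorithm of Altschuler et al.: each partition uses its own columns as the local target, the final pass is scored against the full matrix, and the names `greedy_css`, `distributed_greedy_css`, `explained_variance`, and the `overshoot` multiplier are hypothetical.

```python
import numpy as np


def greedy_css(candidates, target, k):
    """Greedily pick up to k columns of `candidates` that best explain `target`."""
    C = np.array(candidates, dtype=float)  # residual candidate columns (copy)
    T = np.array(target, dtype=float)      # residual target columns (copy)
    chosen = []
    for _ in range(min(k, C.shape[1])):
        norms_sq = np.sum(C * C, axis=0)
        gains = np.sum((C.T @ T) ** 2, axis=1)  # ||T^T c_j||^2 for each candidate j
        valid = norms_sq > 1e-12
        scores = np.where(valid, gains / np.where(valid, norms_sq, 1.0), -1.0)
        j = int(np.argmax(scores))
        if scores[j] <= 1e-12:  # no remaining column adds explained variance
            break
        chosen.append(j)
        u = C[:, j] / np.sqrt(norms_sq[j])
        # Deflate: project the chosen direction out of candidates and target.
        C -= np.outer(u, u @ C)
        T -= np.outer(u, u @ T)
    return chosen


def distributed_greedy_css(A, k, num_parts, overshoot=2, seed=0):
    """Random partition -> local greedy with overshoot -> merge -> final greedy."""
    A = np.asarray(A, dtype=float)
    rng = np.random.default_rng(seed)
    parts = np.array_split(rng.permutation(A.shape[1]), num_parts)
    merged = []
    for part in parts:
        if len(part) == 0:
            continue
        # Assumption: each site scores columns against its own partition only.
        local = greedy_css(A[:, part], A[:, part], overshoot * k)
        merged.extend(part[np.asarray(local, dtype=int)].tolist())
    merged = np.asarray(merged, dtype=int)
    final = greedy_css(A[:, merged], A, k)
    return merged[np.asarray(final, dtype=int)].tolist()


def explained_variance(A, cols):
    """Squared Frobenius norm of A projected onto the span of the chosen columns."""
    A = np.asarray(A, dtype=float)
    S = A[:, cols]
    return float(np.linalg.norm(S @ (np.linalg.pinv(S) @ A)) ** 2)
```

Only the merged core-set columns travel to the coordinator, so communication scales with `num_parts * overshoot * k` columns rather than with the full matrix.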
Experimentally, distributed greedy (DistGreedy) methods match or outperform classic baselines (e.g., SVD-based 2-Phase, PCA) in explained variance and downstream classification, particularly in high-dimensional or sparse settings, with 10–70× speedups on large datasets (Altschuler et al., 2016).
4. Core-based Greedy Algorithms for $k$-Center Clustering with Outliers
For the $k$-center with $z$ outliers problem, the objective is to select $k$ centers and remove $z$ points to minimize the maximum cluster radius over the non-outliers. Greedy core-set-based algorithms here are variants of Gonzalez’s algorithm, adapted for outlier robustness.
The bi-criteria greedy approach constructs a candidate center set by:
- Randomly sampling initial points,
- Iteratively identifying the points furthest from the current set (a group larger than the $z$ permitted outliers, so that it must contain inliers),
- Sampling a small batch from these and adding it to the set.
After 4 rounds, 5 (of size 6) provides a 7-approximation in the relaxed 8-center sense, w.h.p. (Ding et al., 2019, Ding et al., 2023). Single-criterion versions exist for 9 with similar approximation.
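The sampling loop behind the bi-criteria approach can be sketched as follows. This is a hedged illustration, not the procedure of Ding et al. with its exact constants: `far_count` (how many furthest points to consider), `batch_size`, `rounds`, and `init_size` are hypothetical parameters left to the caller, and the function names are assumptions.

```python
import numpy as np


def _dist_to_set(X, idx):
    """Distance from every point in X to its nearest point among X[idx]."""
    diffs = X[:, None, :] - X[idx][None, :, :]
    return np.min(np.linalg.norm(diffs, axis=2), axis=1)


def bicriteria_greedy(X, far_count, batch_size, rounds, init_size=1, seed=0):
    """Grow a center set by sampling from the currently furthest points."""
    X = np.asarray(X, dtype=float)
    rng = np.random.default_rng(seed)
    centers = rng.choice(len(X), size=init_size, replace=False).tolist()
    dist = _dist_to_set(X, centers)
    for _ in range(rounds):
        # The far_count points furthest from the current centers.
        farthest = np.argsort(dist)[-far_count:]
        batch = rng.choice(farthest, size=min(batch_size, len(farthest)), replace=False)
        centers.extend(batch.tolist())
        dist = np.minimum(dist, _dist_to_set(X, batch))
    return centers, dist


def radius_with_outliers(dist, z):
    """Largest distance to a center after discarding the z furthest points."""
    return float(np.sort(dist)[len(dist) - z - 1])
```

Choosing `far_count` larger than `z` means the furthest group cannot consist of outliers alone, so a sufficiently large random batch is likely to include a point from a cluster that is not yet covered. The returned set generally holds more than `k` centers, which is what makes the guarantee bi-criteria.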
Coreset construction in doubling metrics leverages the covering property: the inlier subset with doubling dimension 0 can be covered by 1 balls of radius 2. The core-set comprises a weighted summary of representatives plus the furthest 3 points, achieving 4 additive error for any solution.
The resulting core-set size is 5, enabling orders-of-magnitude reductions in downstream optimization time (5–30× faster, preserving 7 loss in radius at 8). Distributed compositions use two communication rounds with 9 points per site (Ding et al., 2023).
| Variant | Core-set Size | Approx. Factor | Metric Requirements |
|---|---|---|---|
| Bi-criteria | 0 | 1 | General metrics |
| Doubling-core | 2 | Additive 3 | Doubling dim. 4 |
5. Computational Complexity and Scalability
Core-based greedy algorithms are designed for strong computational efficiency, particularly as data cardinality or dimension grows.
- Determinant maximization: Greedy runs in 5 total time, with 6 storage; Local Search incurs higher worst-case cost 7; kernelized versions require only inner-product queries (Indyk et al., 2019).
- CSS: Standard greedy and its distributed variants exploit partitioned data and parallelism, with communication proportional to total core-set size, and critical dependence on the minimum singular value of the optimal subset (Altschuler et al., 2016).
- 8-center with outliers: Each round of the greedy clustering costs 9 or 0 in Euclidean 1, total 2. Coreset construction in doubling metrics is 3; sublinear variants avoid dependency on 4 (Ding et al., 2023, Ding et al., 2019).
Empirically, all surveyed algorithms demonstrate speed and memory advantages over global optimization or LP-based regimes, with the core-set approach yielding particular gains in downstream problem-solving stages.
6. Theoretical Guarantees and Proof Techniques
Core-based greedy algorithms leverage a range of theoretical foundations:
- Directional height preservation enables reductions from geometric coverage objectives to volume/determinant maximization, with tight approximation factors analyzable via inductive and projection arguments (Indyk et al., 2019).
- Martingale concentration and Azuma–Hoeffding inequalities are used in $k$-center analyses to show that the sampled set rapidly covers all optimal clusters with high probability (Ding et al., 2023, Ding et al., 2019).
- Random partitioning analysis ensures that composable core-sets accumulate global coverage in distributed or streaming models; oversampling via “overshoot” parameters guarantees that optimal solutions are not lost during partition-local selection (Altschuler et al., 2016).
- Core-set size and approximation guarantees for doubling metrics are obtained via successive covers: each optimal cluster is covered by bounded-size sets whose union forms the compressive summary, with error controlled additively or multiplicatively in the doubling dimension (Ding et al., 2023, Ding et al., 2019).
A plausible implication is that such structural proof techniques could be generalized to other non-submodular or high-dimensional objectives, leveraging metric or spectral properties unique to problem classes.
7. Experimental Evaluation and Applications
Extensive empirical studies across determinant maximization (Indyk et al., 2019), column subset selection (Altschuler et al., 2016), and $k$-center with outliers (Ding et al., 2023, Ding et al., 2019) demonstrate:
- Consistent improvements in solution quality (e.g., determinant maximization, $k$-center radius) versus baseline greedy or LP methods at competitive or lower runtime.
- Strong scaling in high-dimensional and sparse regimes, with distributed core-set aggregation matching or exceeding centralized solution quality with near-linear cost.
- Robustness to outliers and flexibility in real-world data, with summary sizes as low as 10–25% of dataset size preserving objective value within 8.
- Practicability in both synthetic and real datasets, including MNIST, GENES, news20.binary, Shuttle, Covertype, KDD-99, and Poker.
These findings indicate that core-based greedy algorithms are well-suited for large-scale, distributed, and high-dimensional data analysis tasks wherever composable summaries with strong theoretical guarantees are required.