Robust Core-set Computing (RCC)
- Robust Core-set Computing (RCC) is a framework for constructing weighted data summaries (coresets) that reliably approximate query costs even with data heterogeneity and outliers.
- Methodologies in RCC include similarity-based selection, robust element-wise sampling, and group partitioning to ensure strong approximation guarantees under adversarial conditions.
- Empirical validations demonstrate RCC's effectiveness in federated learning, high-dimensional regression, and robust clustering, achieving significant speedups and robust error bounds in streaming and distributed contexts.
Robust Core-set Computing (RCC) refers to algorithmic and mathematical frameworks for constructing representative data summaries—called coresets—that maintain strong approximation guarantees for central learning or optimization problems, even in the presence of adversarial perturbations, data heterogeneity, or outliers. RCC unifies advances in statistical coresets, robust clustering, distributed learning, and adversarial resilience under a rigorous, problem-agnostic theoretical umbrella, with recent literature spanning federated learning, high-dimensional regression, robust clustering, and large-scale streaming contexts.
1. Core Definitions and Problem Setting
A coreset is a weighted subset $S$ of an original dataset $X$ such that, for a target family of queries or models (e.g., cluster center sets, regression coefficients, classification boundaries), the cost or loss measured on $S$ approximates the cost on the full dataset $X$ to within a given relative (and sometimes additive) error. In RCC, the coreset must remain accurate:
- For multiple models or queries of the relevant family simultaneously (not just for one).
- Under the removal (or arbitrary perturbation) of up to $m$ of the samples (the "robust" or outlier-resistant property).
The robust $\varepsilon$-coreset property for $(k, z)$-clustering with $m$ outliers, for example, is

$$
\mathrm{cost}_z^{(m)}(S, C) \;\in\; (1 \pm \varepsilon)\,\mathrm{cost}_z^{(m)}(X, C)
$$

for all center sets $C$ with $|C| = k$, where

$$
\mathrm{cost}_z^{(m)}(X, C) \;=\; \min_{L \subseteq X,\; |L| = m} \;\sum_{x \in X \setminus L} w(x)\,\mathrm{dist}(x, C)^{z}
$$

is the clustering cost after discarding the $m$ points designated as outliers (with unit weights for the raw data and the learned weights for the coreset).
This robust criterion underpins nearly all subsequent guarantees presented across settings, including regression, SVM, and distributed computation. The dependence of coreset size on the number of outliers $m$, clustering complexity (e.g., $k$, $z$), accuracy $\varepsilon$, and intrinsic data dimension is a central focus of RCC analyses (Huang et al., 2022; Huang et al., 15 Jul 2025; Fang et al., 28 Oct 2025; Jiang et al., 11 Feb 2025; Lu et al., 2019).
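To make the criterion concrete, the following minimal Python sketch (an illustration, not code from the cited works) evaluates the outlier-discarding cost $\mathrm{cost}_z^{(m)}$ for an unweighted point set and compares the full dataset against a naive uniform-sample summary. The synthetic data, the integer-weight repetition trick, and the function name `robust_cost` are assumptions made for this example.

```python
import numpy as np

def robust_cost(X, centers, m, z=2):
    """Outlier-discarding (k, z)-clustering cost: assign each point to its
    nearest center and drop the m largest per-point costs, which is the
    optimal choice of the outlier set L for an unweighted point set."""
    d = np.linalg.norm(X[:, None, :] - centers[None, :, :], axis=2).min(axis=1)
    costs = np.sort(d ** z)
    return costs[: len(costs) - m].sum()

# Synthetic data: two Gaussian clusters plus 5 far-away outliers.
rng = np.random.default_rng(0)
X = np.vstack([rng.normal(0, 1, (500, 2)),
               rng.normal(8, 1, (500, 2)),
               rng.normal(50, 1, (5, 2))])

# A naive uniform-sample summary with integer weight 5 per sampled point,
# materialized by repetition so that outlier removal stays point-wise.
idx = rng.choice(len(X), size=201, replace=False)
S_rep = np.repeat(X[idx], 5, axis=0)

centers = np.array([[0.0, 0.0], [8.0, 8.0]])
full = robust_cost(X, centers, m=5)
summary = robust_cost(S_rep, centers, m=5)
print(f"relative error of the summary: {abs(summary - full) / full:.3f}")
```

Uniform sampling is used here only to exercise the cost evaluation; as the robust-coreset literature emphasizes, it can fail the guarantee because sampled outliers carry inflated weight, which is precisely what the ring/group and consensus constructions below are designed to avoid.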
2. Methodological Principles and Algorithms
RCC instantiates through multiple constructions, each tuned to the structure of the learning problem and robustness required:
- Core-set selection via similarity and consensus: In federated learning settings, as in the FAROS framework, RCC discards the fragile practice of seeding outlier detection from a single client. Instead, server-side algorithms compute pairwise cosine similarities of pre-processed client gradient updates and select a fixed-size core-set with maximum total mutual similarity ("confidence"). This consensus set then defines a robust centroid, the "core" descriptor for detecting and aggregating benign updates (Hu et al., 5 Jan 2026); a minimal sketch of this selection step appears after this list.
- Robust element-wise and block-based selection: For least squares regression, the "Core-Elements" algorithm selects the largest-magnitude entries per column of the design matrix to form a coreset and uses median-of-means (MOM) blockwise aggregation to withstand heavy-tailed or corrupted observations. This yields provable unbiasedness and error bounds under adversarial contamination (Li et al., 2022).
- Ring/group decompositions and partitioning: Modern RCC for robust clustering leverages geometric decompositions: for each center, points are grouped into "rings" (distance annuli) and small "groups," applying uniform or two-point sampling per ring/group. This decouples the error due to outlier exclusion from the inlier structure, yielding near-optimal coreset sizes linear in the number of outliers $m$ plus low-degree polynomials in $k$ and $1/\varepsilon$ (Huang et al., 2022, Huang et al., 15 Jul 2025).
- Black-box reduction to vanilla coresets: The black-box reduction approach formally justifies and quantifies how to convert standard (vanilla) coresets for the outlier-free problem into robust ones, provided the data admits a bounded-diameter decomposition or the vanilla coreset preserves subset sizes. The resulting extra cost of robustness is an additive term governed by the outlier budget $m$ over the vanilla size, matching lower bounds in many regimes (Jiang et al., 11 Feb 2025).
- Streaming/distributed robust coresets: In high-throughput or federated contexts, RCC frameworks adapt blocking, partitioning, coresets via sensitivity or doubling dimension, and compositional operators to build scalable summaries with provable approximation and memory/resource guarantees (Lu et al., 2019, Pietracaprina et al., 2020).
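A minimal sketch of the consensus core-set selection described in the first bullet above, under simplifying assumptions: client updates are given as flattened vectors, and the maximum-total-similarity subset is approximated by a simple top-scoring heuristic rather than exact combinatorial search. The function name `consensus_core_set` and the scoring rule are illustrative, not the exact FAROS procedure.

```python
import numpy as np

def consensus_core_set(updates, core_size):
    """Select a core-set of client updates by mutual cosine similarity.

    updates:   (n_clients, d) array of flattened, pre-processed gradient updates
    core_size: number of clients to keep in the core-set

    Heuristic stand-in for a maximum-total-similarity subset: score each client
    by the sum of its cosine similarities to all other clients and keep the
    top-scoring `core_size` clients.
    """
    normed = updates / (np.linalg.norm(updates, axis=1, keepdims=True) + 1e-12)
    sim = normed @ normed.T                      # pairwise cosine similarities
    np.fill_diagonal(sim, 0.0)                   # ignore self-similarity
    scores = sim.sum(axis=1)                     # total mutual similarity ("confidence")
    core_idx = np.argsort(scores)[-core_size:]   # most mutually similar clients
    centroid = updates[core_idx].mean(axis=0)    # robust "core" descriptor
    return core_idx, centroid

# Usage: 10 benign clients near a common direction plus 2 divergent ones.
rng = np.random.default_rng(1)
benign = rng.normal(1.0, 0.1, (10, 50))
malicious = rng.normal(-5.0, 0.1, (2, 50))
idx, centroid = consensus_core_set(np.vstack([benign, malicious]), core_size=8)
print(sorted(idx.tolist()))   # the two divergent clients (indices 10, 11) are excluded
```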
3. Theoretical Guarantees and Complexity
The advances in RCC are characterized by sharp bounds on coreset size and approximation, often improving on previously exponential or suboptimal polynomial dependencies. Key results include:
| Reference | Robust Problem | Coreset Size Bound |
|---|---|---|
| (Huang et al., 2022) | $(k, z)$-clustering with outliers | Near-linear: linear in $m$ plus low-degree polynomials in $k$ and $1/\varepsilon$ |
| (Huang et al., 15 Jul 2025) | $k$-medians in VC / doubling dimension | Optimal up to logarithmic factors |
| (Fang et al., 28 Oct 2025) | Robust geometric median | Independent of the number of outliers $m$ |
| (Jiang et al., 11 Feb 2025) | Robust clustering in general metrics | Vanilla coreset size plus an additive term in $m$; streaming construction possible |
All recent constructions eliminate or separate the "outlier term" $m$, achieving size-optimal coresets as soon as the inlier population exceeds a constant fraction of the dataset. Where this is not possible, dependence on $m$ is proven necessary by lower bounds (Huang et al., 2022, Fang et al., 28 Oct 2025, Huang et al., 15 Jul 2025). Construction time is typically near-linear in the data size $n$, and in the streaming or distributed setting, memory and communication can be made sublinear.
Formal error guarantees are typically multiplicative—with no additive error in the "small" cost regime—ensuring reliable model recovery from the coreset for all feasible combinations of outlier set and model.
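Restated in the notation of Section 1 (a paraphrase of the guarantee form, not a quoted theorem), the purely multiplicative requirement reads

$$
\bigl|\mathrm{cost}_z^{(m)}(S, C) - \mathrm{cost}_z^{(m)}(X, C)\bigr| \;\le\; \varepsilon\,\mathrm{cost}_z^{(m)}(X, C) \quad \text{for all center sets } C,
$$

whereas weaker formulations permit an extra additive slack on the right-hand side; such slack is most damaging precisely when the optimal robust cost is small, which is why the multiplicative form is preferred.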
4. Empirical Validation and Applications
Empirical results underscore the practical impact of RCC:
- Federated learning robustness: In FAROS, the RCC method for gradient aggregation, combined with adaptive scaling, reduces backdoor attack success rates (e.g., down to ASR 0.65 for model replacement, 4.42 for edge-case PGD) without sacrificing accuracy, outperforming baselines that rely on single-point seeding (Hu et al., 5 Jan 2026).
- Regression and high-dimensional statistics: Core-Elements estimators outperform or match conventional subsampling/coreset methods in prediction accuracy (PMSE, MSE) with dramatically reduced runtime (substantial speedups over IBOSS/OSS/DOPT), and MOM-robustification confers near-optimality under adversarial contamination (Li et al., 2022); a minimal MOM sketch appears after this list.
- Clustering and geometric medians: New RCC constructions enable robust $k$-means/$k$-medians clustering and geometric median computation with memory and time scaling with the intrinsic dimension and the desired robustness level, often yielding substantial end-to-end clustering speedups at fixed error (Huang et al., 2022, Fang et al., 28 Oct 2025, Huang et al., 15 Jul 2025).
- Streaming, distributed, MapReduce: RCC-based coreset construction supports single-pass or distributed aggregation for robust clustering and center problems (including matroid/knapsack variants), adapting automatically to doubling dimension and scaling globally (Pietracaprina et al., 2020, Lu et al., 2019).
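As a concrete illustration of the median-of-means idea referenced above, here is a minimal Python sketch of a MOM-robustified least squares estimate: the data are split into random blocks, an ordinary least squares fit is computed per block, and the coordinate-wise median of the block estimates is returned. The blocking scheme, the coordinate-wise aggregation, and the function name `mom_least_squares` are illustrative simplifications, not the exact Core-Elements procedure of Li et al. (2022).

```python
import numpy as np

def mom_least_squares(X, y, n_blocks=10, rng=None):
    """Median-of-means least squares: fit OLS on random disjoint blocks and
    take the coordinate-wise median of the block estimates, limiting the
    influence of heavy-tailed noise or a few corrupted observations."""
    rng = np.random.default_rng(rng)
    idx = rng.permutation(len(y))
    block_coefs = []
    for block in np.array_split(idx, n_blocks):
        coef, *_ = np.linalg.lstsq(X[block], y[block], rcond=None)
        block_coefs.append(coef)
    return np.median(np.stack(block_coefs), axis=0)

# Usage: linear model with a handful of grossly corrupted responses.
rng = np.random.default_rng(2)
n, p = 2000, 5
X = rng.normal(size=(n, p))
beta = np.arange(1.0, p + 1.0)
y = X @ beta + rng.normal(scale=0.5, size=n)
y[:5] += 500.0                         # gross corruption of a few responses

coef_ols, *_ = np.linalg.lstsq(X, y, rcond=None)
coef_mom = mom_least_squares(X, y, n_blocks=20, rng=3)
print("plain OLS:", np.round(coef_ols, 2))   # noticeably biased by the corruption
print("MOM OLS:  ", np.round(coef_mom, 2))   # roughly recovers beta when most blocks are clean
```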
5. Extensions and Open Problems
RCC frameworks generalize to a variety of learning and optimization tasks, including:
- Regression/classification under distributional robustness: Via core-set constraints within ambiguity sets, RCC extends to robustification for least squares, chance-constrained SVMs, and high-dimensional classification, leveraging PCA for scalable SDP reformulations (Li et al., 2022, Li et al., 15 May 2025).
- Beyond $(k, z)$-clustering: Component-wise error decomposition is replaced by global (integral/derivative) analysis, enabling further applications in robust PCA, spectral methods, and facility-location-type optimization (Fang et al., 28 Oct 2025, Huang et al., 2022).
- Statistical and information-theoretic tightness: Recent results close the gap between the optimal robust and vanilla (no-outlier) coreset sizes in VC- and doubling-dimension metric spaces, with only an additive linear dependence on the number of outliers $m$, and, in some cases, even weaker $m$-dependence via non-component-wise analyses (Huang et al., 15 Jul 2025, Fang et al., 28 Oct 2025).
- Federated and private learning: Robust aggregation and core-set building in adversarial or heterogeneous environments are integral to scalable private and federated learning (Hu et al., 5 Jan 2026, Lu et al., 2019).
Open questions persist, notably:
- Whether the current coreset-size bound for robust $k$-medians in Euclidean space can be improved further toward the vanilla (outlier-free) bound (Huang et al., 15 Jul 2025).
- How to extend non-component-wise error analysis to robust regression, PCA, and other non-clustering objectives (Fang et al., 28 Oct 2025).
- Investigations into local versus global Lipschitz behavior of more general learning losses.
- Broadening RCC to new objective families, such as deep network loss landscapes, spectral models, or other function classes (Lu et al., 2019).
6. Limitations and Future Directions
While RCC has unified and sharpened the theory of robust summarization, several limitations are documented:
- Assumptions on data and loss: Key guarantees require Lipschitz continuity of the cost with respect to data points, and some constructions require mild assumptions on data geometry (e.g., minimum per-cluster size, bounded aspect ratios).
- Dependence on approximation risk: Relative error tolerances $\varepsilon$ interact nontrivially with $k$, $m$, and the data dimension; in some applications, stringent accuracy demands drive up coreset size.
- Complexity under adversarial partitioning: The efficacy of distributed core-set selection can degrade if the data is partitioned adversarially by heterogeneous nodes or by feature/label skew.
- Streaming/online extension: Full derandomization and matching optimal-size streaming/distributed coresets for robust objectives, especially in high-dimensional or graph-structured data, remain open technical challenges (Jiang et al., 11 Feb 2025, Pietracaprina et al., 2020, Fang et al., 28 Oct 2025).
Research continues toward tighter error analysis, adaptive parameter selection, and generalizing robust coreset methods to broader classes of models, networks, and loss functions.
References:
- FAROS: Robust Federated Learning with Adaptive Scaling against Backdoor Attacks (Hu et al., 5 Jan 2026)
- Core-Elements for Large-Scale Least Squares Estimation (Li et al., 2022)
- Near-optimal Coresets for Robust Clustering (Huang et al., 2022)
- Coresets for Robust Clustering via Black-box Reductions to Vanilla Case (Jiang et al., 11 Feb 2025)
- Globalized distributionally robust chance-constrained support vector machine based on core sets (Li et al., 15 May 2025)
- Robust Coreset Construction for Distributed Machine Learning (Lu et al., 2019)
- Coreset for Robust Geometric Median: Eliminating Size Dependency on Outliers (Fang et al., 28 Oct 2025)
- Coreset-based Strategies for Robust Center-type Problems (Pietracaprina et al., 2020)
- On Tight Robust Coresets for k-Medians Clustering (Huang et al., 15 Jul 2025)
- SubZeroCore: A Submodular Approach with Zero Training for Coreset Selection (Moser et al., 26 Sep 2025)