
Robust Core-set Computing (RCC)

Updated 12 January 2026
  • Robust Core-set Computing (RCC) is a framework for constructing weighted data summaries (coresets) that reliably approximate query costs even with data heterogeneity and outliers.
  • Methodologies in RCC include similarity-based selection, robust element-wise sampling, and group partitioning to ensure strong approximation guarantees under adversarial conditions.
  • Empirical validations demonstrate RCC's effectiveness in federated learning, high-dimensional regression, and robust clustering, achieving significant speedups and robust error bounds in streaming and distributed contexts.

Robust Core-set Computing (RCC) refers to algorithmic and mathematical frameworks for constructing representative data summaries—called coresets—that maintain strong approximation guarantees for central learning or optimization problems, even in the presence of adversarial perturbations, data heterogeneity, or outliers. RCC unifies advances in statistical coresets, robust clustering, distributed learning, and adversarial resilience under a rigorous, problem-agnostic theoretical umbrella, with recent literature spanning federated learning, high-dimensional regression, robust clustering, and large-scale streaming contexts.

1. Core Definitions and Problem Setting

A coreset is a weighted subset $(S, w)$ of an original dataset $X$ such that, for a target family of queries or models (e.g., cluster center sets, regression coefficients, classification boundaries), the cost or loss measured on $S$ approximates that on $X$ to within a given relative (and sometimes additive) error. In RCC, the coreset must remain accurate:

  • For multiple models or queries of the relevant family simultaneously (not just for one).
  • Under the removal (or arbitrary perturbation) of up to $m$ samples (the "robust" or outlier-resistant property).

The robust $\varepsilon$-coreset property for $k$-clustering with $m$ outliers, for example, is

$$\big| \mathrm{cost}_z^{(m)}(X, C) - \mathrm{cost}_z^{(m)}(S, C) \big| \leq \varepsilon\,\mathrm{cost}_z^{(m)}(X, C)$$

for all center sets $C$, where

$$\mathrm{cost}_z^{(m)}(X, C) = \min_{L \subset X,\ |L| = m} \sum_{x \in X \setminus L} \mathrm{dist}(x, C)^z.$$

This robust criterion underpins nearly all subsequent guarantees presented across settings, including regression, SVM, and distributed computation. The dependence of coreset size on the number of outliers $m$, clustering complexity (e.g., $k$, $z$), accuracy $\varepsilon$, and intrinsic data dimension is a central focus of RCC analyses (Huang et al., 2022, Huang et al., 15 Jul 2025, Fang et al., 28 Oct 2025, Jiang et al., 11 Feb 2025, Lu et al., 2019).
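For concreteness, the brute-force sketch below evaluates the robust cost by discarding (possibly fractionally) up to $m$ units of weight from the points farthest from the candidate centers, and checks the relative-error inequality over a finite list of center sets. The function names, the fractional-removal convention for weighted summaries, and the finite query list (the definition quantifies over all center sets) are simplifying assumptions for illustration only.

```python
import numpy as np

def robust_cost(points, weights, centers, z=2, m=0):
    """cost_z^(m): clustering cost after discarding up to m units of weight
    from the points farthest from their nearest center (fractional removal
    of the boundary point; reduces to the unweighted definition when all
    weights equal 1 and m is an integer)."""
    P = np.asarray(points, float)
    C = np.asarray(centers, float)
    d = np.min(np.linalg.norm(P[:, None, :] - C[None, :, :], axis=2), axis=1)
    per_point = d ** z
    order = np.argsort(-d)                 # farthest points first
    w = np.asarray(weights, float)[order]
    c = per_point[order]
    budget, total = float(m), 0.0
    for wi, ci in zip(w, c):
        drop = min(wi, budget)             # spend outlier budget on the farthest mass
        budget -= drop
        total += (wi - drop) * ci          # surviving weight contributes to the cost
    return total

def is_robust_coreset(X, S, w_S, center_sets, eps, z=2, m=0):
    """Check |cost(X, C) - cost(S, C)| <= eps * cost(X, C) for the given center sets."""
    for C in center_sets:
        full = robust_cost(X, np.ones(len(X)), C, z=z, m=m)
        summary = robust_cost(S, w_S, C, z=z, m=m)
        if abs(full - summary) > eps * full:
            return False
    return True

# Toy usage: a uniformly sampled, reweighted subset (no guarantee attached).
rng = np.random.default_rng(0)
X = rng.normal(size=(500, 2))
idx = rng.choice(len(X), size=50, replace=False)
S, w_S = X[idx], np.full(50, len(X) / 50.0)
print(is_robust_coreset(X, S, w_S, [X[:3], X[100:103]], eps=0.2, z=2, m=5))
```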

2. Methodological Principles and Algorithms

RCC is instantiated through multiple constructions, each tuned to the structure of the learning problem and the robustness required:

  • Core-set selection via similarity and consensus: In federated learning settings, as in the FAROS framework, RCC discards the fragile practice of seeding outlier detection from a single client. Instead, server-side algorithms compute pairwise cosine similarities of pre-processed client gradient updates and select a core-set $S^*$ of size $\ell$ with maximum total mutual similarity ("confidence"). This consensus set then defines a robust centroid, the "core" descriptor for detecting and aggregating benign updates (Hu et al., 5 Jan 2026); a simplified sketch appears after this list.
  • Robust element-wise and block-based selection: For least squares regression, the "Core-Elements" algorithm selects the $r$ largest-magnitude entries per column to form a coreset $X^*$ and uses median-of-means (MOM) blockwise aggregation to withstand heavy-tailed or corrupted observations. This yields provable unbiasedness and error bounds under adversarial contamination (Li et al., 2022).
  • Ring/group decompositions and partitioning: Modern RCC for robust clustering leverages geometric decompositions: for each center, points are grouped into "rings" (distance annuli) and small "groups," applying uniform or two-point sampling per ring/group. This decouples the error due to outlier exclusion from the inlier structure, yielding near-optimal coreset sizes linear in $m$ plus low-degree polynomials in $k$ and $1/\varepsilon$ (Huang et al., 2022, Huang et al., 15 Jul 2025).
  • Black-box reduction to vanilla coresets: The black-box reduction approach formally justifies and quantifies how to convert standard (vanilla) coresets for $m = 0$ into robust ones, provided the data admits a bounded-diameter decomposition or the vanilla coreset preserves subset sizes. The resulting extra cost of robustness is an additive term of up to $O(m)$ or $O(m\varepsilon^{-2z}\log^z(km/\varepsilon))$ over the vanilla size, matching lower bounds in many regimes (Jiang et al., 11 Feb 2025).
  • Streaming/distributed robust coresets: In high-throughput or federated contexts, RCC frameworks combine blocking, partitioning, sensitivity- or doubling-dimension-based coreset constructions, and composition operators to build scalable summaries with provable approximation and memory/resource guarantees (Lu et al., 2019, Pietracaprina et al., 2020).
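The following sketch illustrates the consensus-based selection described in the first bullet: each client update is scored by its total cosine similarity to all other updates, the $\ell$ highest-scoring updates form the core-set, and their mean serves as the robust centroid. The greedy scoring rule, the plain averaging step, and the function names are illustrative assumptions rather than the exact FAROS procedure (Hu et al., 5 Jan 2026), which selects the size-$\ell$ subset maximizing total mutual similarity.

```python
import numpy as np

def consensus_coreset(updates, ell):
    """Select a core-set of client updates by mutual cosine similarity.

    updates : (n_clients, dim) array of pre-processed gradient updates
    ell     : number of updates to keep
    Returns the selected indices and the mean of the selected updates
    (the "core" descriptor used to judge the remaining updates).
    """
    U = np.asarray(updates, float)
    Un = U / np.clip(np.linalg.norm(U, axis=1, keepdims=True), 1e-12, None)
    sim = Un @ Un.T                           # pairwise cosine similarities
    np.fill_diagonal(sim, 0.0)                # ignore self-similarity
    confidence = sim.sum(axis=1)              # total similarity to every other client
    core_idx = np.argsort(-confidence)[:ell]  # greedily keep the most "central" updates
    return core_idx, U[core_idx].mean(axis=0)

# Toy usage: 8 benign clients share a direction, 2 malicious ones oppose it.
rng = np.random.default_rng(1)
benign = rng.normal(loc=1.0, scale=0.1, size=(8, 16))
malicious = rng.normal(loc=-5.0, scale=0.1, size=(2, 16))
core_idx, centroid = consensus_coreset(np.vstack([benign, malicious]), ell=5)
print(sorted(core_idx))                       # expected: indices 0-7 only (benign clients)
```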

3. Theoretical Guarantees and Complexity

The advances in RCC are characterized by sharp bounds on coreset size and approximation, often improving on previously exponential or suboptimal polynomial dependencies. Key results include:

| Reference | Robust Problem | Coreset Size Bound | Construction Complexity |
|---|---|---|---|
| (Huang et al., 2022) | $(k, z)$-clustering with $m$ outliers | $O(m) + \tilde{O}(k^3\varepsilon^{-3z-2})$ | Near-linear, $O(nkd)$ |
| (Huang et al., 15 Jul 2025) | $k$-medians in VC / doubling dimension $d$ | $O(m) + \tilde{O}(kd\varepsilon^{-2})$ | $O(nk)$, optimal up to log factors |
| (Fang et al., 28 Oct 2025) | Geometric median ($d \geq 1$) | $\tilde{O}(\varepsilon^{-2}\min\{\varepsilon^{-2}, d\})$, for $n \geq 4m$ | $O(nd)$ |
| (Jiang et al., 11 Feb 2025) | $(k, z)$ robust clustering, general metric | $N \cdot \mathrm{polylog}(km/\varepsilon) + O(m\varepsilon^{-2z}\log^z(km/\varepsilon))$ | $O(nk)$, streaming possible |

All recent constructions eliminate or separate the "outlier term" $O(m)$, achieving size-optimal coresets as soon as the inlier population exceeds a constant fraction of the dataset. Where this is not possible, dependence on $m$ is proven necessary by lower bounds (Huang et al., 2022, Fang et al., 28 Oct 2025, Huang et al., 15 Jul 2025). Construction complexity is typically near-linear in the data size $n$, and in streaming or distributed settings, memory and communication can be made sublinear.

Formal error guarantees are typically multiplicative—with no additive error in the "small" cost regime—ensuring reliable model recovery from the coreset for all feasible combinations of outlier set and model.
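The reasoning behind the model-recovery claim is the standard coreset argument, sketched here for the robust clustering cost: any minimizer computed on the summary is near-optimal on the full data, with the familiar $\tfrac{1+\varepsilon}{1-\varepsilon} = 1 + O(\varepsilon)$ loss.

```latex
% C_S: minimizer of cost_z^(m)(S, .),  C_X: minimizer of cost_z^(m)(X, .).
% Each step applies the robust eps-coreset inequality or the optimality of C_S on S:
\mathrm{cost}_z^{(m)}(X, C_S)
  \;\le\; \tfrac{1}{1-\varepsilon}\,\mathrm{cost}_z^{(m)}(S, C_S)
  \;\le\; \tfrac{1}{1-\varepsilon}\,\mathrm{cost}_z^{(m)}(S, C_X)
  \;\le\; \tfrac{1+\varepsilon}{1-\varepsilon}\,\mathrm{cost}_z^{(m)}(X, C_X).
```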

4. Empirical Validation and Applications

Empirical results underscore the practical impact of RCC:

  • Federated learning robustness: In FAROS, the RCC method for gradient aggregation, combined with adaptive scaling, reduces backdoor attack success rates (e.g., down to ASR 0.65 for model replacement, 4.42 for edge-case PGD) without sacrificing accuracy, outperforming baselines that rely on single-point seeding (Hu et al., 5 Jan 2026).
  • Regression and high-dimensional statistics: Core-Elements estimators match or outperform conventional subsampling/coreset methods in prediction accuracy (PMSE, MSE) at dramatically reduced runtime (e.g., 10×–50× speedups over IBOSS/OSS/DOPT), and MOM-robustification confers near-optimality under adversarial contamination (Li et al., 2022); a simplified sketch of these ingredients follows this list.
  • Clustering and geometric medians: New RCC constructions enable robust $k$-means/$k$-medians clustering and geometric median computation with memory and time proportional to the intrinsic dimension and desired robustness, often yielding 2×–140× end-to-end clustering speedups at fixed error (Huang et al., 2022, Fang et al., 28 Oct 2025, Huang et al., 15 Jul 2025).
  • Streaming, distributed, MapReduce: RCC-based coreset construction supports single-pass or distributed aggregation for robust clustering and center problems (including matroid/knapsack variants), adapting automatically to doubling dimension and scaling globally (Pietracaprina et al., 2020, Lu et al., 2019).
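As a minimal illustration of the two ingredients attributed to Core-Elements above, the sketch below keeps the $r$ largest-magnitude entries per column of the design matrix and, separately, aggregates blockwise least-squares fits by a coordinate-wise median (median-of-means). The function names, the blockwise OLS estimator, and the way the two pieces are demonstrated are simplifying assumptions, not the exact construction of (Li et al., 2022).

```python
import numpy as np

def core_elements_mask(X, r):
    """Keep, per column, only the r entries of largest magnitude (zero the rest)."""
    X = np.asarray(X, float)
    keep = np.zeros_like(X, dtype=bool)
    for j in range(X.shape[1]):
        keep[np.argsort(-np.abs(X[:, j]))[:r], j] = True
    return X * keep

def mom_least_squares(X, y, n_blocks, rng):
    """Median-of-means OLS: split rows into blocks, fit OLS per block,
    return the coordinate-wise median of the block estimates."""
    X, y = np.asarray(X, float), np.asarray(y, float)
    blocks = np.array_split(rng.permutation(len(y)), n_blocks)
    betas = [np.linalg.lstsq(X[b], y[b], rcond=None)[0] for b in blocks]
    return np.median(np.stack(betas), axis=0)

# Toy usage: a handful of grossly corrupted responses.
rng = np.random.default_rng(2)
X = rng.normal(size=(600, 5))
beta = np.arange(1.0, 6.0)
y = X @ beta + 0.1 * rng.normal(size=600)
y[:10] += 100.0                                            # adversarial contamination
print((core_elements_mask(X, r=150) != 0).sum(axis=0))     # 150 retained entries per column
print(np.round(mom_least_squares(X, y, n_blocks=25, rng=rng), 2))  # stays close to beta
print(np.round(np.linalg.lstsq(X, y, rcond=None)[0], 2))   # plain OLS on the same data, for comparison
```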

5. Extensions and Open Problems

RCC frameworks generalize to a variety of learning and optimization tasks, including:

  • Regression/classification under distributional robustness: Via core-set constraints within ambiguity sets, RCC extends to robustification for least squares, chance-constrained SVMs, and high-dimensional classification, leveraging PCA for scalable SDP reformulations (Li et al., 2022, Li et al., 15 May 2025).
  • Beyond $k$-clustering: Component-wise error decomposition is replaced by global (integral/derivative) analysis, enabling further applications in robust PCA, spectral methods, and facility-location-type optimization (Fang et al., 28 Oct 2025, Huang et al., 2022).
  • Statistical and information-theoretic tightness: Recent results close the gap between the optimal robust and vanilla (no-outlier) coreset sizes in VC- and doubling-dimension metric spaces, with only an additive linear dependence on $m$ and, in some cases, even weaker $m$-dependence via non-component-wise analyses (Huang et al., 15 Jul 2025, Fang et al., 28 Oct 2025).
  • Federated and private learning: Robust aggregation and core-set building in adversarial or heterogeneous environments are integral to scalable private and federated learning (Hu et al., 5 Jan 2026, Lu et al., 2019).

Open questions persist, notably:

  • Whether the $O(m\varepsilon^{-1})$ term for robust $k$-medians in Euclidean space can be improved to $O(m)$ (Huang et al., 15 Jul 2025).
  • How to extend non-component-wise error analysis to robust regression, PCA, and other non-clustering objectives (Fang et al., 28 Oct 2025).
  • Investigations into local versus global Lipschitz behavior of more general learning losses.
  • Broadening RCC to new objective families, such as deep network loss landscapes, spectral models, or other function classes (Lu et al., 2019).

6. Limitations and Future Directions

While RCC has unified and sharpened the theory of robust summarization, several limitations are documented:

  • Assumptions on data and loss: Key guarantees require Lipschitz continuity of the cost with respect to data points, and some constructions require mild assumptions on data geometry (e.g., minimum per-cluster size, bounded aspect ratios).
  • Dependence on approximation risk: Relative error tolerances ε\varepsilon interact nontrivially with kk, mm, and zz; in some applications, stringent accuracy demands drive up coreset size.
  • Complexity under adversarial partitioning: The efficacy of distributed core-set selection can degrade if the data is partitioned adversarially by heterogeneous nodes or by feature/label skew.
  • Streaming/online extension: Full derandomization and matching optimal-size streaming/distributed coresets for robust objectives, especially in high-dimensional or graph-structured data, remain open technical challenges (Jiang et al., 11 Feb 2025, Pietracaprina et al., 2020, Fang et al., 28 Oct 2025).

Research continues toward tighter error analysis, adaptive parameter selection, and generalizing robust coreset methods to broader classes of models, networks, and loss functions.


