
Robust Core-set Computing (RCC)

Updated 12 January 2026
  • Robust Core-set Computing (RCC) is a framework for constructing weighted data summaries (coresets) that reliably approximate query costs even with data heterogeneity and outliers.
  • Methodologies in RCC include similarity-based selection, robust element-wise sampling, and group partitioning to ensure strong approximation guarantees under adversarial conditions.
  • Empirical validations demonstrate RCC's effectiveness in federated learning, high-dimensional regression, and robust clustering, achieving significant speedups and robust error bounds in streaming and distributed contexts.

Robust Core-set Computing (RCC) refers to algorithmic and mathematical frameworks for constructing representative data summaries—called coresets—that maintain strong approximation guarantees for central learning or optimization problems, even in the presence of adversarial perturbations, data heterogeneity, or outliers. RCC unifies advances in statistical coresets, robust clustering, distributed learning, and adversarial resilience under a rigorous, problem-agnostic theoretical umbrella, with recent literature spanning federated learning, high-dimensional regression, robust clustering, and large-scale streaming contexts.

1. Core Definitions and Problem Setting

A coreset is a weighted subset $(S, w)$ of an original dataset $X$ such that, for a target family of queries or models (e.g., cluster center sets, regression coefficients, classification boundaries), the cost or loss measured on $S$ approximates that on $X$ to within a given relative (and sometimes additive) error. In RCC, the coreset must remain accurate:

  • For multiple models or queries of the relevant family simultaneously (not just for one).
  • Under the removal (or arbitrary perturbation) of up to $m$ samples (the "robust" or outlier-resistant property).

The robust $\varepsilon$-coreset property for $k$-clustering with $m$ outliers, for example, is

$$\big| \mathrm{cost}_z^{(m)}(X, C) - \mathrm{cost}_z^{(m)}(S, C) \big| \leq \varepsilon\,\mathrm{cost}_z^{(m)}(X, C)$$

for all center sets $C$, where

$$\mathrm{cost}_z^{(m)}(X, C) = \min_{L \subset X,\ |L| = m} \sum_{x \in X \setminus L} \mathrm{dist}(x, C)^z.$$

This robust criterion underpins nearly all subsequent guarantees presented across settings, including regression, SVM, and distributed computation. The dependence of coreset size on the number of outliers $m$, clustering complexity (e.g., $k$, $z$), accuracy $\varepsilon$, and intrinsic data dimension is a central focus of RCC analyses (Huang et al., 2022, Huang et al., 15 Jul 2025, Fang et al., 28 Oct 2025, Jiang et al., 11 Feb 2025, Lu et al., 2019).
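For concreteness, the brute-force sketch below evaluates the robust cost by discarding (possibly fractionally) up to $m$ units of weight from the points farthest from the candidate centers, and checks the relative-error inequality over a finite list of center sets. The function names, the fractional-removal convention for weighted summaries, and the finite query list (the definition quantifies over all center sets) are simplifying assumptions for illustration only.

```python
import numpy as np

def robust_cost(points, weights, centers, z=2, m=0):
    """cost_z^(m): clustering cost after discarding up to m units of weight
    from the points farthest from their nearest center (fractional removal
    of the boundary point; reduces to the unweighted definition when all
    weights equal 1 and m is an integer)."""
    P = np.asarray(points, float)
    C = np.asarray(centers, float)
    d = np.min(np.linalg.norm(P[:, None, :] - C[None, :, :], axis=2), axis=1)
    per_point = d ** z
    order = np.argsort(-d)                 # farthest points first
    w = np.asarray(weights, float)[order]
    c = per_point[order]
    budget, total = float(m), 0.0
    for wi, ci in zip(w, c):
        drop = min(wi, budget)             # spend outlier budget on the farthest mass
        budget -= drop
        total += (wi - drop) * ci          # surviving weight contributes to the cost
    return total

def is_robust_coreset(X, S, w_S, center_sets, eps, z=2, m=0):
    """Check |cost(X, C) - cost(S, C)| <= eps * cost(X, C) for the given center sets."""
    for C in center_sets:
        full = robust_cost(X, np.ones(len(X)), C, z=z, m=m)
        summary = robust_cost(S, w_S, C, z=z, m=m)
        if abs(full - summary) > eps * full:
            return False
    return True

# Toy usage: a uniformly sampled, reweighted subset (no guarantee attached).
rng = np.random.default_rng(0)
X = rng.normal(size=(500, 2))
idx = rng.choice(len(X), size=50, replace=False)
S, w_S = X[idx], np.full(50, len(X) / 50.0)
print(is_robust_coreset(X, S, w_S, [X[:3], X[100:103]], eps=0.2, z=2, m=5))
```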

2. Methodological Principles and Algorithms

RCC is instantiated through multiple constructions, each tuned to the structure of the learning problem and the robustness required:

  • Core-set selection via similarity and consensus: In federated learning settings, as in the FAROS framework, RCC discards the fragile practice of seeding outlier detection from a single client. Instead, server-side algorithms compute pairwise cosine similarities of pre-processed client gradient updates and select a core-set $S^*$ of size $\ell$ with maximum total mutual similarity ("confidence"). This consensus set then defines a robust centroid, the "core" descriptor for detecting and aggregating benign updates (Hu et al., 5 Jan 2026); a simplified sketch appears after this list.
  • Robust element-wise and block-based selection: For least squares regression, the "Core-Elements" algorithm selects the $r$ largest-magnitude entries per column to form a coreset $X^*$ and uses median-of-means (MOM) blockwise aggregation to withstand heavy-tailed or corrupted observations. This yields provable unbiasedness and error bounds under adversarial contamination (Li et al., 2022).
  • Ring/group decompositions and partitioning: Modern RCC for robust clustering leverages geometric decompositions: for each center, points are grouped into "rings" (distance annuli) and small "groups," applying uniform or two-point sampling per ring/group. This decouples the error due to outlier exclusion from the inlier structure, yielding near-optimal coreset sizes linear in $m$ plus low-degree polynomials in $k$ and $1/\varepsilon$ (Huang et al., 2022, Huang et al., 15 Jul 2025).
  • Black-box reduction to vanilla coresets: The black-box reduction approach formally justifies and quantifies how to convert standard (vanilla) coresets for $m = 0$ into robust ones, provided the data admits a bounded-diameter decomposition or the vanilla coreset preserves subset sizes. The resulting extra cost of robustness is an additive term of up to $O(m)$ or $O(m\varepsilon^{-2z}\log^z(km/\varepsilon))$ over the vanilla size, matching lower bounds in many regimes (Jiang et al., 11 Feb 2025).
  • Streaming/distributed robust coresets: In high-throughput or federated contexts, RCC frameworks combine blocking, partitioning, sensitivity- or doubling-dimension-based coreset constructions, and composition operators to build scalable summaries with provable approximation and memory/resource guarantees (Lu et al., 2019, Pietracaprina et al., 2020).
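The following sketch illustrates the consensus-based selection described in the first bullet: each client update is scored by its total cosine similarity to all other updates, the $\ell$ highest-scoring updates form the core-set, and their mean serves as the robust centroid. The greedy scoring rule, the plain averaging step, and the function names are illustrative assumptions rather than the exact FAROS procedure (Hu et al., 5 Jan 2026), which selects the size-$\ell$ subset maximizing total mutual similarity.

```python
import numpy as np

def consensus_coreset(updates, ell):
    """Select a core-set of client updates by mutual cosine similarity.

    updates : (n_clients, dim) array of pre-processed gradient updates
    ell     : number of updates to keep
    Returns the selected indices and the mean of the selected updates
    (the "core" descriptor used to judge the remaining updates).
    """
    U = np.asarray(updates, float)
    Un = U / np.clip(np.linalg.norm(U, axis=1, keepdims=True), 1e-12, None)
    sim = Un @ Un.T                           # pairwise cosine similarities
    np.fill_diagonal(sim, 0.0)                # ignore self-similarity
    confidence = sim.sum(axis=1)              # total similarity to every other client
    core_idx = np.argsort(-confidence)[:ell]  # greedily keep the most "central" updates
    return core_idx, U[core_idx].mean(axis=0)

# Toy usage: 8 benign clients share a direction, 2 malicious ones oppose it.
rng = np.random.default_rng(1)
benign = rng.normal(loc=1.0, scale=0.1, size=(8, 16))
malicious = rng.normal(loc=-5.0, scale=0.1, size=(2, 16))
core_idx, centroid = consensus_coreset(np.vstack([benign, malicious]), ell=5)
print(sorted(core_idx))                       # expected: indices 0-7 only (benign clients)
```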

3. Theoretical Guarantees and Complexity

The advances in RCC are characterized by sharp bounds on coreset size and approximation, often improving on previously exponential or suboptimal polynomial dependencies. Key results include:

| Reference | Robust Problem | Coreset Size Bound | Construction Complexity |
|---|---|---|---|
| (Huang et al., 2022) | $(k, z)$-clustering with $m$ outliers | $O(m) + \tilde{O}(k^3\varepsilon^{-3z-2})$ | Near-linear, $O(nkd)$ |
| (Huang et al., 15 Jul 2025) | $k$-medians in VC / doubling dimension $d$ | $O(m) + \tilde{O}(kd\varepsilon^{-2})$ | $O(nk)$, optimal up to log factors |
| (Fang et al., 28 Oct 2025) | Geometric median ($d \geq 1$) | $\tilde{O}(\varepsilon^{-2}\min\{\varepsilon^{-2}, d\})$, for $n \geq 4m$ | $O(nd)$ |
| (Jiang et al., 11 Feb 2025) | $(k, z)$ robust clustering, general metric | $N \cdot \mathrm{polylog}(km/\varepsilon) + O(m\varepsilon^{-2z}\log^z(km/\varepsilon))$ | $O(nk)$, streaming possible |

All recent constructions eliminate or separate the "outlier term" $O(m)$, achieving size-optimal coresets as soon as the inlier population exceeds a constant fraction of the dataset. Where this is not possible, dependence on $m$ is proven necessary by lower bounds (Huang et al., 2022, Fang et al., 28 Oct 2025, Huang et al., 15 Jul 2025). Construction complexity is typically near-linear in the data size $n$, and in streaming or distributed settings, memory and communication can be made sublinear.

Formal error guarantees are typically multiplicative—with no additive error in the "small" cost regime—ensuring reliable model recovery from the coreset for all feasible combinations of outlier set and model.
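The reasoning behind the model-recovery claim is the standard coreset argument, sketched here for the robust clustering cost: any minimizer computed on the summary is near-optimal on the full data, with the familiar $\tfrac{1+\varepsilon}{1-\varepsilon} = 1 + O(\varepsilon)$ loss.

```latex
% C_S: minimizer of cost_z^(m)(S, .),  C_X: minimizer of cost_z^(m)(X, .).
% Each step applies the robust eps-coreset inequality or the optimality of C_S on S:
\mathrm{cost}_z^{(m)}(X, C_S)
  \;\le\; \tfrac{1}{1-\varepsilon}\,\mathrm{cost}_z^{(m)}(S, C_S)
  \;\le\; \tfrac{1}{1-\varepsilon}\,\mathrm{cost}_z^{(m)}(S, C_X)
  \;\le\; \tfrac{1+\varepsilon}{1-\varepsilon}\,\mathrm{cost}_z^{(m)}(X, C_X).
```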

4. Empirical Validation and Applications

Empirical results underscore the practical impact of RCC:

  • Federated learning robustness: In FAROS, the RCC method for gradient aggregation, combined with adaptive scaling, reduces backdoor attack success rates (e.g., down to ASR 0.65 for model replacement, 4.42 for edge-case PGD) without sacrificing accuracy, outperforming baselines that rely on single-point seeding (Hu et al., 5 Jan 2026).
  • Regression and high-dimensional statistics: Core-Elements estimators match or outperform conventional subsampling/coreset methods in prediction accuracy (PMSE, MSE) at dramatically reduced runtime (e.g., 10×–50× speedups over IBOSS/OSS/DOPT), and MOM-robustification confers near-optimality under adversarial contamination (Li et al., 2022); a simplified sketch of these ingredients follows this list.
  • Clustering and geometric medians: New RCC constructions enable robust $k$-means/$k$-medians clustering and geometric median computation with memory and time proportional to the intrinsic dimension and desired robustness, often yielding 2×–140× end-to-end clustering speedups at fixed error (Huang et al., 2022, Fang et al., 28 Oct 2025, Huang et al., 15 Jul 2025).
  • Streaming, distributed, MapReduce: RCC-based coreset construction supports single-pass or distributed aggregation for robust clustering and center problems (including matroid/knapsack variants), adapting automatically to doubling dimension and scaling globally (Pietracaprina et al., 2020, Lu et al., 2019).
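As a minimal illustration of the two ingredients attributed to Core-Elements above, the sketch below keeps the $r$ largest-magnitude entries per column of the design matrix and, separately, aggregates blockwise least-squares fits by a coordinate-wise median (median-of-means). The function names, the blockwise OLS estimator, and the way the two pieces are demonstrated are simplifying assumptions, not the exact construction of (Li et al., 2022).

```python
import numpy as np

def core_elements_mask(X, r):
    """Keep, per column, only the r entries of largest magnitude (zero the rest)."""
    X = np.asarray(X, float)
    keep = np.zeros_like(X, dtype=bool)
    for j in range(X.shape[1]):
        keep[np.argsort(-np.abs(X[:, j]))[:r], j] = True
    return X * keep

def mom_least_squares(X, y, n_blocks, rng):
    """Median-of-means OLS: split rows into blocks, fit OLS per block,
    return the coordinate-wise median of the block estimates."""
    X, y = np.asarray(X, float), np.asarray(y, float)
    blocks = np.array_split(rng.permutation(len(y)), n_blocks)
    betas = [np.linalg.lstsq(X[b], y[b], rcond=None)[0] for b in blocks]
    return np.median(np.stack(betas), axis=0)

# Toy usage: a handful of grossly corrupted responses.
rng = np.random.default_rng(2)
X = rng.normal(size=(600, 5))
beta = np.arange(1.0, 6.0)
y = X @ beta + 0.1 * rng.normal(size=600)
y[:10] += 100.0                                            # adversarial contamination
print((core_elements_mask(X, r=150) != 0).sum(axis=0))     # 150 retained entries per column
print(np.round(mom_least_squares(X, y, n_blocks=25, rng=rng), 2))  # stays close to beta
print(np.round(np.linalg.lstsq(X, y, rcond=None)[0], 2))   # plain OLS on the same data, for comparison
```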

5. Extensions and Open Problems

RCC frameworks generalize to a variety of learning and optimization tasks, including:

  • Regression/classification under distributional robustness: Via core-set constraints within ambiguity sets, RCC extends to robustification for least squares, chance-constrained SVMs, and high-dimensional classification, leveraging PCA for scalable SDP reformulations (Li et al., 2022, Li et al., 15 May 2025).
  • Beyond $k$-clustering: Component-wise error decomposition is replaced by global (integral/derivative) analysis, enabling further applications in robust PCA, spectral methods, and facility-location-type optimization (Fang et al., 28 Oct 2025, Huang et al., 2022).
  • Statistical and information-theoretic tightness: Recent results close the gap between the optimal robust and vanilla (no-outlier) coreset sizes in VC- and doubling-dimension metric spaces, with only an additive linear dependence on $m$ and, in some cases, even weaker $m$-dependence via non-component-wise analyses (Huang et al., 15 Jul 2025, Fang et al., 28 Oct 2025).
  • Federated and private learning: Robust aggregation and core-set building in adversarial or heterogeneous environments are integral to scalable private and federated learning (Hu et al., 5 Jan 2026, Lu et al., 2019).

Open questions persist, notably:

  • Whether the $O(m\varepsilon^{-1})$ term for robust $k$-medians in Euclidean space can be improved to $O(m)$ (Huang et al., 15 Jul 2025).
  • How to extend non-component-wise error analysis to robust regression, PCA, and other non-clustering objectives (Fang et al., 28 Oct 2025).
  • Investigations into local versus global Lipschitz behavior of more general learning losses.
  • Broadening RCC to new objective families, such as deep network loss landscapes, spectral models, or other function classes (Lu et al., 2019).

6. Limitations and Future Directions

While RCC has unified and sharpened the theory of robust summarization, several limitations are documented:

  • Assumptions on data and loss: Key guarantees require Lipschitz continuity of the cost with respect to data points, and some constructions require mild assumptions on data geometry (e.g., minimum per-cluster size, bounded aspect ratios).
  • Dependence on approximation risk: Relative error tolerances ε\varepsilon interact nontrivially with kk, mm, and zz; in some applications, stringent accuracy demands drive up coreset size.
  • Complexity under adversarial partitioning: The efficacy of distributed core-set selection can degrade if the data is partitioned adversarially by heterogeneous nodes or by feature/label skew.
  • Streaming/online extension: Full derandomization and matching optimal-size streaming/distributed coresets for robust objectives, especially in high-dimensional or graph-structured data, remain open technical challenges (Jiang et al., 11 Feb 2025, Pietracaprina et al., 2020, Fang et al., 28 Oct 2025).

Research continues toward tighter error analysis, adaptive parameter selection, and generalizing robust coreset methods to broader classes of models, networks, and loss functions.


