Federated Clustering: Privacy-Preserving Learning

Updated 17 January 2026
  • Federated clustering is a decentralized, privacy-preserving unsupervised learning approach that aggregates locally computed models to infer global data partitions.
  • It employs methods like federated k-means, fuzzy c-means, and deep clustering techniques enhanced with secure aggregation and differential privacy.
  • Recent advances focus on robustness to non-IID data, asynchronous updates, and efficient machine unlearning for fast, privacy-compliant cluster updates.

Federated clustering (FC) is a class of decentralized unsupervised learning protocols that enable multiple clients—each holding private, typically heterogeneous, unlabeled data—to jointly infer global data groupings or cluster structures while preventing direct exchange of raw data. Motivated by privacy regulations and the prevalence of distributed data (e.g., in healthcare, banking, IoT), FC generalizes classic clustering objectives (e.g., k-means, spectral, deep cluster analysis) to architectures that restrict information sharing to models or privatized summaries. The resulting landscape features algorithmic innovations, formal privacy guarantees, and substantial advances in robust unsupervised inference under non-IID conditions.

1. Core Principles and Problem Formulation

In FC, $M$ clients possess local datasets $\{D_i\}_{i=1}^{M}$ (each $D_i$ drawn from a client-specific distribution $P_i$), and the collective goal is to recover a global partition, typically minimizing a centralized cost such as

$$\min_{C,A} \sum_{i=1}^{M} \sum_{x \in D_i} \ell\big(x;\, C_{a(x)}\big)$$

where $C$ denotes the cluster centroids (or prototypes), $a(x)$ is the cluster assignment of $x$, and $\ell$ is an affinity or distortion loss (Euclidean distance, negative log-likelihood, or, more generally, a deep embedding loss as in representation clustering).

A principal challenge is the inability to compute affinities or means across client boundaries for arbitrary pairs $x, x'$ without explicit data sharing. Data heterogeneity (the non-IID condition) further complicates inference: client $i$ may observe only a subspace or modal subset, so local optima diverge from global solutions. Privacy constraints demand protocols in which exchanged information (model weights, cluster statistics, synthetic data, summary graphs) does not expose individual data points or sensitive attributes (Li et al., 2022, Yan et al., 19 May 2025, He et al., 14 Nov 2025).
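The global objective above can be evaluated client by client without pooling raw data. The following is a minimal sketch on toy 2-D data, assuming squared Euclidean distortion; all names here are illustrative, not from any specific protocol.

```python
# Hypothetical sketch: evaluating the global k-means objective over
# per-client datasets without pooling them on one machine.

def sq_dist(x, c):
    return sum((xi - ci) ** 2 for xi, ci in zip(x, c))

def global_cost(client_datasets, centroids):
    """Sum of squared distances from each point to its nearest centroid,
    accumulated client by client -- the objective FC protocols minimize."""
    total = 0.0
    for D_i in client_datasets:
        for x in D_i:
            total += min(sq_dist(x, c) for c in centroids)
    return total

# Two clients, two well-separated clusters.
clients = [[(0.0, 0.0), (0.2, 0.0)], [(5.0, 5.0), (5.2, 5.0)]]
centroids = [(0.1, 0.0), (5.1, 5.0)]
print(global_cost(clients, centroids))  # 0.04 (each point is 0.1 from its centroid)
```

In a real protocol only the per-client partial sums (or privatized versions of them) would ever leave a client.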

2. Classical and Model-Driven Federated Clustering Algorithms

Early federated clustering approaches adapt centralized algorithms to the federated regime by leveraging communication-efficient proxies.

Federated k-Means (k-FED): Each client runs Lloyd's k-means locally and sends its centroids to the server, which aggregates them (by averaging or a further round of k-means) and broadcasts updated centroids. This process repeats for several rounds. The global centroid update is typically

$$\mu_j = \frac{\sum_i \omega_{i,j}\, c_{i,j}}{\sum_i \omega_{i,j}}$$

where $c_{i,j}$ is the $j$th local centroid on client $i$ and $\omega_{i,j}$ is the number of local assignments to it. To address local/global mismatch in assignments, weighted updates and robust centroid matching are employed (Holzer et al., 2023, Xu et al., 2024).
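The server-side update above reduces to a count-weighted average. A minimal sketch, assuming local centroid $j$ on every client has already been matched to global cluster $j$ (centroid matching is a separate step, and the function names are illustrative):

```python
# Illustrative k-FED server-side aggregation: centroids averaged with
# weights equal to each client's local assignment counts.

def aggregate_centroids(local_centroids, local_counts):
    """local_centroids[i][j]: j-th centroid from client i (a tuple);
    local_counts[i][j]: number of points client i assigned to it.
    Returns mu_j = sum_i w_ij * c_ij / sum_i w_ij for each cluster j."""
    k = len(local_centroids[0])
    dim = len(local_centroids[0][0])
    mus = []
    for j in range(k):
        w = sum(counts[j] for counts in local_counts)
        mu = tuple(
            sum(cs[j][d] * ws[j] for cs, ws in zip(local_centroids, local_counts)) / w
            for d in range(dim)
        )
        mus.append(mu)
    return mus

# Client 0 saw 3 points near (0,0); client 1 saw 1 point near (1,0).
cents = [[(0.0, 0.0)], [(1.0, 0.0)]]
counts = [[3], [1]]
print(aggregate_centroids(cents, counts))  # [(0.25, 0.0)]
```

The count weighting is what keeps a client with few points from dragging the global centroid as hard as a client with many.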

Federated Fuzzy c-Means (FFCM): Each client computes fuzzy (soft) assignments, returning weighted cluster memberships (membership matrices), which the server aggregates using variants of the fuzzy centroid update. FFCM improves cluster flexibility but remains sensitive to severe heterogeneity (Stallmann et al., 2022, Yan et al., 2022).
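One way the fuzzy centroid update can be federated is for each client to report only membership-weighted sufficient statistics rather than points. This is a hedged sketch of that idea, not the exact FFCM protocol; the statistic names are assumptions for illustration.

```python
# Sketch: server-side fuzzy-centroid aggregation from per-client
# sufficient statistics, so the server never sees raw points.

def ffcm_aggregate(weighted_sums, masses):
    """weighted_sums[i][j]: client i's sum over x of u_{xj}^m * x for
    cluster j (a tuple); masses[i][j]: client i's sum over x of u_{xj}^m.
    Returns the global fuzzy centroids."""
    k = len(weighted_sums[0])
    dim = len(weighted_sums[0][0])
    out = []
    for j in range(k):
        mass = sum(m[j] for m in masses)
        out.append(tuple(sum(s[j][d] for s in weighted_sums) / mass
                         for d in range(dim)))
    return out

# Two clients, one cluster: centroid = (2 + 6) / (2 + 2) = 2.0.
sums = [[(2.0, 0.0)], [(6.0, 0.0)]]
mass = [[2.0], [2.0]]
print(ffcm_aggregate(sums, mass))  # [(2.0, 0.0)]
```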

Secure Federated Clustering (SecFC, OmniFC): To achieve central-optimal performance with strong privacy, SecFC (Li et al., 2022) and the more general OmniFC (Yan et al., 19 May 2025) leverage Lagrange coded computing or secret sharing. Clients encode their quantized data as evaluations of secret polynomials, transmitting only shares to the server (and potentially peers). The server reconstructs exact global distance matrices or Lloyd's k-means updates through multi-party secure computation, ensuring information-theoretic privacy against server or client collusion.
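The cryptographic primitive underneath is polynomial secret sharing: a value is hidden as the constant term of a random polynomial over a finite field, and any sufficiently large subset of evaluations recovers it by Lagrange interpolation. A toy Shamir-style sketch (the actual protocols apply Lagrange coded computing to quantized data vectors; this shows only the primitive):

```python
# Toy Shamir secret sharing over a prime field: any t+1 of the n shares
# reconstruct the secret; t or fewer reveal nothing about it.
import random

P = 2**31 - 1  # a Mersenne prime field modulus

def share(secret, t, n):
    """Split `secret` into n shares via a random degree-t polynomial."""
    coeffs = [secret] + [random.randrange(P) for _ in range(t)]
    return [(x, sum(c * pow(x, e, P) for e, c in enumerate(coeffs)) % P)
            for x in range(1, n + 1)]

def reconstruct(shares):
    """Lagrange interpolation of the polynomial at 0 (the secret)."""
    total = 0
    for i, (xi, yi) in enumerate(shares):
        num, den = 1, 1
        for j, (xj, _) in enumerate(shares):
            if i != j:
                num = num * (-xj) % P
                den = den * (xi - xj) % P
        total = (total + yi * num * pow(den, P - 2, P)) % P  # den^-1 via Fermat
    return total

s = share(42, t=1, n=3)
print(reconstruct(s[:2]))  # 42 -- any 2 of the 3 shares suffice
```

Because sharing is linear, the server can aggregate shares of distances or centroid sums without ever reconstructing an individual client's inputs.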

Dynamic and Asynchronous Protocols: Solutions such as Dynamically Weighted Federated kk-Means (Holzer et al., 2023) and Asynchronous Federated Cluster Learning (AFCL) (Zhang et al., 2024) introduce adaptive weighting/momentum schemes, robust aggregation steps, and asynchronous updates to improve convergence and cope with varying participation and unknown cluster numbers.

3. Representation Learning and Deep Federated Clustering

High-dimensional, non-vectorial, or multimodal data require distributed representation learning for effective FC.

Cluster-Contrastive Federated Clustering (CCFC, CCFC++): CCFC (Liu et al., 2024) operates by sharing cluster-friendly encoders and predictors (often deep nets) across clients, with each client optimizing a cluster-contrastive loss based on global centroids. The protocol alternates server-side aggregation of model parameters and centroids with local contrastive learning. CCFC++ (Yan et al., 2024) introduces a decorrelation regularizer penalizing covariance off-diagonals, mitigating “dimensional collapse” under non-IID splits, and empirically boosting NMI by up to 0.34.
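The decorrelation regularizer can be illustrated concretely: penalize the off-diagonal entries of the embedding covariance matrix so that representations spread across all dimensions instead of collapsing onto a few. A minimal pure-Python sketch of that penalty (CCFC++'s actual loss is defined on deep-network embeddings; the function here is an assumption for illustration):

```python
# Sketch of a decorrelation penalty: sum of squared off-diagonal
# covariance entries of a batch of embeddings (rows = samples).

def decorrelation_penalty(embeddings):
    n = len(embeddings)
    d = len(embeddings[0])
    means = [sum(row[j] for row in embeddings) / n for j in range(d)]
    centered = [[row[j] - means[j] for j in range(d)] for row in embeddings]
    penalty = 0.0
    for a in range(d):
        for b in range(d):
            if a != b:
                cov = sum(r[a] * r[b] for r in centered) / n
                penalty += cov ** 2
    return penalty

# Perfectly correlated dimensions (collapsed) -> large penalty;
# independent dimensions -> zero penalty.
print(decorrelation_penalty([[1.0, 1.0], [-1.0, -1.0]]))                        # 2.0
print(decorrelation_penalty([[1.0, 1.0], [-1.0, 1.0], [1.0, -1.0], [-1.0, -1.0]]))  # 0.0
```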

Federated Deep Subspace Clustering (FDSC): FDSC (Zhang et al., 2024) introduces a federated deep subspace clustering network with an encoder (shared, communicated), self-expressive layer (private, modeling intra-client affinities), and decoder (private). Local neighborhood-preserving regularization enhances the self-expressiveness property, and global encoder aggregation occurs via FedAvg. FDSC empirically outperforms centralized deep subspace clustering on various image sets.

Privacy-Preserving Deep Clustering with Synthetic Data: Multiple works (Yan et al., 2022, Yan et al., 2022) propose protocols where clients train generative models (GANs), transmit only synthetic samples to the server, which then performs deep (e.g., autoencoder-based) clustering. This synthetic data proxy scheme improves privacy—since server and peers never see original samples—and is robust to non-IID client distributions. Deep clustering (e.g., via deep clustering networks/DCN) further refines the pseudolabels sent back to clients. Federated cluster-wise refinement frameworks (Nardi et al., 2024) combine autoencoders, cluster-based FL, and cluster association graphs for highly heterogeneous, overlapping global/local cluster sets.

4. Graph- and Structure-Based Federated Clustering

Recent advances leverage structural data representations and private aggregation of graph structures or prototype hierarchies.

Private Federated Graph Clustering (SPP-FGC): Clients encode local data relationships as private structural graphs (e.g., via GMMs, sparse graphical models) and transmit these to the server. The server aggregates block-wise local graphs, aligning cluster-structures via KL divergence and constructing an integrated global graph on which block-diagonalization and spectral embedding yield the global clustering (He et al., 14 Nov 2025). SPP-FGC guarantees differential privacy on model parameters (via Laplace mechanism) and restricts shared information to low-entropy structure.

One-Shot and Hierarchical Federated Clustering: Fed-HIRE (Cai et al., 10 Jan 2026) adopts a client prototype-level communication model: each client discovers fine-grained “clusterlets” using competition-based partitioning, communicates them in a single round, and the server recursively fuses these into a hierarchy of cluster representations (multi-granular clustering). This modular paradigm achieves SOTA results across a wide range of tabular benchmarks.

Collaborative (Vertical/Horizontal) Representations: DC-Clustering (Kawamata et al., 11 Jun 2025) addresses complex, mixed vertical/horizontal data splits by sharing only dimensionality-reduced intermediate representations (constructed using local PCA or learned mappings) and collaboratively constructing a common embedding space via a shared anchor set and affine transformation. Subsequent clustering proceeds centrally on these collaborative representations, matching centralized performance across various real-world scenarios.
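The anchor-alignment step can be sketched in one dimension: each party fits an affine map that sends its local embedding of the shared anchor set onto the common embedding, then applies that map to its own data. This is an illustrative least-squares sketch, not DC-Clustering's actual construction (which uses dimensionality-reduced representations and a learned common space).

```python
# Sketch: align a party's local 1-D embedding to a common space using a
# shared anchor set and a closed-form least-squares affine fit.

def fit_affine(src, dst):
    """Least-squares a, b such that a*src + b approximates dst."""
    n = len(src)
    mx = sum(src) / n
    my = sum(dst) / n
    sxx = sum((x - mx) ** 2 for x in src)
    sxy = sum((x - mx) * (y - my) for x, y in zip(src, dst))
    a = sxy / sxx
    return a, my - a * mx

# The shared anchors as embedded locally vs. in the common space.
local_anchors = [0.0, 1.0, 2.0, 3.0]
common_anchors = [1.0, 3.0, 5.0, 7.0]   # exactly 2*x + 1
a, b = fit_affine(local_anchors, common_anchors)
print(a, b)            # 2.0 1.0
print(a * 10.0 + b)    # a new local point mapped into the common space: 21.0
```

Once every party's data lives in the common space, clustering can proceed centrally on the aligned representations.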

5. Privacy and Security Guarantees

A core constraint in FC is the rigorous protection of local data privacy.

  • Information-Theoretic Security: Protocols such as SecFC (Li et al., 2022) and OmniFC (Yan et al., 19 May 2025) use polynomial secret sharing, achieving information-theoretic security against up to $T$ colluding parties, with server and peers never recovering raw data or cluster assignments beyond the revealed clustering result.
  • Differential Privacy (DP): DP-FedC (Li et al., 2023) and SPP-FGC (He et al., 14 Nov 2025) inject calibrated noise (e.g., Gaussian or Laplace) into all local updates or shared parameters, providing $(\epsilon,\delta)$-DP guarantees. Direct privacy analysis is also provided for GAN-based synthetic data sharing, where the probability of recovering any individual sample is $O(n_{\text{out}}/n)$, with $n_{\text{out}}$ the synthetic dataset size and $n$ the original client dataset size (Yan et al., 2022).
  • Local Differential Privacy: Several protocols (Masuyama et al., 2023) use Laplace perturbation at the client on feature values or representative nodes, enabling instance-level privacy in continual learning scenarios.
  • Compressed Secure Aggregation: SCMA (Pan et al., 2022) applies Reed–Solomon–encoded mask sharing for secure and communication-efficient aggregation of sparse cluster prototypes/counts, supporting both federated learning and machine unlearning.
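The Laplace perturbation used in several of the protocols above follows the standard mechanism: noise with scale sensitivity/$\epsilon$ is added to a statistic before it leaves the client. A minimal sketch (parameter names are illustrative, not from any specific paper):

```python
# Minimal Laplace-mechanism sketch: epsilon-DP release of a scalar with
# known L1 sensitivity, sampled as a difference of two exponentials.
import random

def laplace_noise(scale):
    # Exp(1/scale) - Exp(1/scale) is distributed Laplace(0, scale).
    return random.expovariate(1 / scale) - random.expovariate(1 / scale)

def privatize(value, sensitivity, epsilon):
    """epsilon-DP release of `value` under the given L1 sensitivity."""
    return value + laplace_noise(sensitivity / epsilon)

random.seed(0)
released = [privatize(10.0, sensitivity=1.0, epsilon=1.0) for _ in range(10000)]
mean = sum(released) / len(released)
print(round(mean, 2))  # unbiased: the empirical mean is close to 10.0
```

Smaller $\epsilon$ means larger noise scale and stronger privacy; the released values remain unbiased estimates of the true statistic.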

6. Robustness, Unlearning, and Adaptivity

Recent expansions address robustness to extreme heterogeneity, device drop-out, asynchronous settings, and user-driven unlearning.

  • Robustness to Non-IID Data: Protocols such as SDA-FC (Yan et al., 2022) and AFCL (Zhang et al., 2024) sustain high clustering accuracy and stability even as clients become highly non-IID or only cover partial modality subsets. Empirically, traditional centroid-aggregation methods (k-FED, FFCM) collapse under such scenarios.
  • Fault and Drop-out Tolerance: Many one-shot and graph-based protocols (e.g., SDA-FC, SPP-FGC) tolerate up to 50% client failures in practice, maintaining clustering performance by leveraging invariance of global synthetic or graph-based representations.
  • Machine Unlearning: MUFC (Pan et al., 2022) introduces a federated protocol for exact machine unlearning—efficiently removing data points or clients and recomputing clusterings consistent with the reduced dataset, leveraging K-means++ reseeding and SCMA for fast, privacy-preserving updates with up to 84× speed-up over retraining.
  • Asynchronous Convergence and Unknown Cluster Number: AFCL (Zhang et al., 2024) and ART-based continual clustering (Masuyama et al., 2023) address practical requirements—handling unknown $k^*$, unbalanced client participation, and continually evolving or streaming data.
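The efficiency argument behind exact unlearning is that deleting points only invalidates the clusters that contained them, so only those need recomputation rather than the whole model. An illustrative sketch of that idea on 1-D data (MUFC's actual protocol adds k-means++ reseeding and secure compressed aggregation on top):

```python
# Sketch: exact unlearning by recomputing only the centroids of clusters
# that actually lost points, instead of reclustering from scratch.

def unlearn(clusters, to_forget):
    """clusters: dict cluster_id -> list of 1-D points (mutated in place).
    Removes `to_forget` points and returns the recomputed centroids of
    only the touched clusters."""
    touched = set()
    for cid, pts in clusters.items():
        kept = [p for p in pts if p not in to_forget]
        if len(kept) != len(pts):
            clusters[cid] = kept
            touched.add(cid)
    return {cid: sum(clusters[cid]) / len(clusters[cid]) for cid in touched}

clusters = {0: [1.0, 2.0, 3.0], 1: [10.0, 11.0]}
result = unlearn(clusters, {3.0})
print(result)  # {0: 1.5} -- cluster 1 is untouched and needs no work
```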

7. Empirical Benchmarks and Future Directions

Empirical evaluations of FC algorithms benchmark clustering performance using metrics including Normalized Mutual Information (NMI), Adjusted Rand Index (ARI), clustering accuracy, and the Kappa statistic. SOTA frameworks (Fed-HIRE (Cai et al., 10 Jan 2026), SPP-FGC (He et al., 14 Nov 2025), OmniFC (Yan et al., 19 May 2025), FDSC (Zhang et al., 2024), CCFC++ (Yan et al., 2024)) achieve near-centralized accuracy and consistently outperform classic federated baselines (k-FED, FFCM) under both IID and non-IID benchmarks, for image, tabular, time-series, and genomics data.

Key empirical findings:

| Approach | NMI improvement (vs. baseline) | Non-IID robustness | Provable privacy |
|---|---|---|---|
| CCFC++ | up to 0.34 | Yes | Standard FL |
| SPP-FGC | up to 10% | Yes | $\epsilon$-DP |
| Fed-HIRE | 3–15% | Yes | Yes |
| OmniFC | 0.2–0.4 (Kappa) | Yes | Info-theoretic |
| DP-FedC | up to 10% | Yes | $(\epsilon,\delta)$-DP |
| SDA-FC/PPFC-GAN | stable/robust | Yes | $\delta = O(s/n)$ |

Future research directions include scaling FC to streaming clients and dynamic data, algorithm-agnostic deep clustering, secure aggregation layered atop DP, and federated clustering under horizontal, vertical, or hybrid data splits. Theoretical challenges include characterizing tight privacy–utility trade-offs, convergence under strong adversaries, and robust model selection in heterogeneous federated ecosystems.

