Federated Learning & Privacy-Preserving ML
- Federated learning and privacy-preserving ML are decentralized approaches that enable collaborative model training without sharing raw data.
- Differential privacy, SMPC, and homomorphic encryption are key techniques that provide robust privacy guarantees while balancing utility, fairness, and scalability.
- Empirical studies demonstrate that careful noise calibration and hybrid cryptographic methods maintain near-centralized model performance while mitigating risks like gradient inversion and membership inference.
Federated learning (FL) is a distributed machine learning paradigm enabling multiple entities to collaboratively train a global model without centralizing raw data. Privacy-preserving machine learning (PPML) within FL seeks to mitigate residual privacy risks arising from model update exchange, information leakage via gradients, and regulatory compliance constraints. Advanced cryptographic and statistical mechanisms—including differential privacy (DP), secure multiparty computation (SMC/SMPC), homomorphic encryption (HE), and structural masking—constitute the technical foundation for privacy guarantees in FL. The interplay between privacy, utility, fairness, computational cost, and scalability defines the evolving landscape of privacy-preserving federated learning.
1. Federated Learning Foundations and Privacy Risks
FL formalizes collaborative optimization as
$$\min_{w} F(w) = \sum_{k=1}^{K} \frac{|D_k|}{|D|} F_k(w), \qquad F_k(w) = \frac{1}{|D_k|} \sum_{(x, y) \in D_k} \ell(w; x, y),$$
where $D_k$ is client $k$'s local dataset, $|D| = \sum_k |D_k|$, and $w$ are shared global parameters (Zhao et al., 2024). Canonical protocols such as FedAvg orchestrate repeated rounds of model broadcast, local SGD, and model update aggregation. On-device computation and the restriction of exchanged information to model deltas offer intrinsic privacy relative to centralized paradigms.
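The broadcast, local SGD, and aggregation loop can be sketched in a few lines. This is a minimal illustration, not a reference implementation: the one-parameter linear model, the learning rate, the synthetic data, and the function names (`local_sgd`, `fedavg_round`, `run_fedavg`) are all assumptions made for the example.

```python
import random

def local_sgd(w, data, lr=0.1, epochs=5):
    """Run a few epochs of SGD on a 1-D linear model y ~ w * x (squared loss)."""
    for _ in range(epochs):
        for x, y in data:
            grad = 2.0 * (w * x - y) * x
            w -= lr * grad
    return w

def fedavg_round(w_global, client_datasets):
    """One FedAvg round: broadcast, local training, size-weighted averaging."""
    total = sum(len(d) for d in client_datasets)
    w_new = 0.0
    for data in client_datasets:
        w_local = local_sgd(w_global, data)       # computed on the client
        w_new += (len(data) / total) * w_local    # only the model is shared
    return w_new

def run_fedavg(rounds=20, seed=0):
    """Train on three synthetic clients whose data follow y = 3 * x."""
    rng = random.Random(seed)
    true_w = 3.0
    clients = [[(x, true_w * x) for x in (rng.uniform(-1, 1) for _ in range(n))]
               for n in (5, 10, 20)]
    w = 0.0
    for _ in range(rounds):
        w = fedavg_round(w, clients)
    return w
```

Calling `run_fedavg()` should return a value close to the generating parameter 3.0, with each client weighted by its share of the total data.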
However, FL does not guarantee privacy-by-design. Notable threats include:
- Gradient inversion and data reconstruction: Attackers optimize synthetic data to match observed gradients, revealing raw training samples (Sen et al., 2024, Zhao et al., 2024).
- Membership inference: Adversaries distinguish training vs. non-training points based on model dynamics or outputs.
- Property and attribute inference: Indirect inference of sensitive attributes via black-box or white-box probing.
These findings have been systematically validated in both simulated and real FL deployments. Membership inference can succeed with advantage rates as high as 90% on some datasets if no additional protections are in place (Truong et al., 2020).
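The gradient-leakage risk has a simple analytic special case that makes the threat concrete. The sketch below is not the iterative gradient-matching attack described above; it assumes a single training sample, a linear model with a bias term, and squared loss, in which case the shared gradients reveal the input exactly. The function names are hypothetical.

```python
def linear_grads(w, b, x, y):
    """Gradients of squared loss for one sample under y_hat = w.x + b."""
    y_hat = sum(wi * xi for wi, xi in zip(w, x)) + b
    residual = 2.0 * (y_hat - y)           # dL/dy_hat
    grad_w = [residual * xi for xi in x]   # dL/dw = residual * x
    grad_b = residual                      # dL/db = residual
    return grad_w, grad_b

def reconstruct_input(grad_w, grad_b):
    """Recover x from the shared gradients: x = grad_w / grad_b."""
    if grad_b == 0:
        raise ValueError("zero bias gradient: input not identifiable this way")
    return [g / grad_b for g in grad_w]
```

An observer who sees only `grad_w` and `grad_b` recovers the private input without ever seeing `x`, which is why exchanging raw updates does not by itself protect training data.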
2. Differential Privacy in Federated Learning
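Differential privacy imparts a quantifiable, mathematically rigorous notion of privacy to stochastic mechanisms. An (ε, δ)-differentially private (DP) mechanism $\mathcal{M}$ satisfies:
$$\Pr[\mathcal{M}(D) \in S] \le e^{\varepsilon} \, \Pr[\mathcal{M}(D') \in S] + \delta$$
for all adjacent datasets $D$, $D'$ and measurable sets $S$ (Rafi et al., 2023).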
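FL with (ε,δ)-DP: The standard construction applies Gaussian noise to clipped per-client updates (Sen et al., 2024, Ganadily et al., 2024):
$$\tilde{\Delta}_k = \frac{\Delta_k}{\max\left(1, \|\Delta_k\|_2 / C\right)} + \mathcal{N}(0, \sigma^2 I)$$
where $C$ is a clipping bound and $\sigma$ is calibrated to the desired privacy budget. Under the classical Gaussian mechanism (valid for $0 < \varepsilon < 1$):
$$\sigma \ge \frac{C \sqrt{2 \ln(1.25/\delta)}}{\varepsilon}$$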
Each communication round compounds privacy loss; advanced composition and the moments accountant track the cumulative $\varepsilon$. PrivacyFL, PrivFL, and numerous production platforms use such mechanisms (Mugunthan et al., 2020, Mandal et al., 2020, Ganadily et al., 2024).
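A minimal sketch of the clip-then-noise step follows. It assumes the classical Gaussian-mechanism calibration $\sigma = C \sqrt{2 \ln(1.25/\delta)} / \varepsilon$, which holds for $0 < \varepsilon < 1$ and covers a single release only; the function names are hypothetical, and production systems typically rely on a tighter accountant across rounds.

```python
import math
import random

def gaussian_sigma(clip_norm, epsilon, delta):
    """Classical Gaussian-mechanism noise scale (assumes 0 < epsilon < 1)."""
    return clip_norm * math.sqrt(2.0 * math.log(1.25 / delta)) / epsilon

def clip_update(update, clip_norm):
    """Scale the update so that its L2 norm is at most clip_norm."""
    norm = math.sqrt(sum(v * v for v in update))
    scale = 1.0 / max(1.0, norm / clip_norm)
    return [v * scale for v in update]

def privatize_update(update, clip_norm, epsilon, delta, rng=random):
    """Clip the update, then add i.i.d. Gaussian noise to every coordinate."""
    sigma = gaussian_sigma(clip_norm, epsilon, delta)
    clipped = clip_update(update, clip_norm)
    return [v + rng.gauss(0.0, sigma) for v in clipped]
```

Clipping bounds each client's influence on the aggregate to `clip_norm`, which is what makes the noise scale meaningful: without it the sensitivity of the sum would be unbounded.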
Local vs. Centralized DP: Local DP (client-side) offers stronger privacy, especially in cross-device FL, at higher utility loss. Centralized DP assumes a trusted aggregator and requires less noise for equivalent privacy (Rafi et al., 2023).
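Randomized response is the textbook local-DP mechanism and illustrates the utility cost of client-side noise. The sketch below is an illustration under stated assumptions (binary client values, hypothetical function names), not a mechanism taken from the cited works: each client reports its true bit with probability $e^{\varepsilon} / (1 + e^{\varepsilon})$, and the server debiases the aggregate.

```python
import math
import random

def randomized_response(bit, epsilon, rng=random):
    """Report the true bit with probability e^eps / (1 + e^eps), else flip it."""
    p_truth = math.exp(epsilon) / (1.0 + math.exp(epsilon))
    return bit if rng.random() < p_truth else 1 - bit

def debias_mean(reports, epsilon):
    """Unbiased estimate of the true mean of the bits from the noisy reports."""
    p = math.exp(epsilon) / (1.0 + math.exp(epsilon))
    observed = sum(reports) / len(reports)
    # E[report] = (1 - p) + mean * (2p - 1), solved for mean.
    return (observed - (1.0 - p)) / (2.0 * p - 1.0)
```

The server never sees a trustworthy individual value, yet the population mean remains estimable; the estimator's variance grows as $\varepsilon$ shrinks, which is the higher utility loss noted above.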
Adaptive mechanisms and trade-offs: Over-noising degrades utility, especially in long-lived or dynamic settings. Adaptive per-round calibration (as in FedHDPrivacy) and distributed cryptographic noise (as in PrivacyFL's noise splitting) reduce unnecessary noise (Piran et al., 2024, Mugunthan et al., 2020).
3. Cryptographic Aggregation: SMPC, Homomorphic Encryption, and Trusted Execution
SMPC-based Secure Aggregation: Protocols (e.g., Bonawitz et al., SPDZ, Shamir's secret sharing) enable the server to recover the sum of model updates while learning nothing about individual contributions. PrivacyFL, PrivFairFL, and many cross-device production FL systems implement such protocols. Properties (Truong et al., 2020, Pentyala et al., 2022, Mugunthan et al., 2020):
- Information-theoretic privacy against an honest-but-curious server, up to a threshold of $t$ malicious colluders.
- Dropout tolerance, robust to client departure.
- Communication cost $O(n^2)$ in the number of clients $n$ (Bonawitz protocol), but streamlined in more recent schemes.
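The core idea behind mask-based secure aggregation can be sketched as follows: every pair of clients shares a random mask that one adds and the other subtracts, so the masks cancel in the server's sum. This is a simplified illustration, not the full Bonawitz protocol. It assumes integer (fixed-point) updates, generates masks from one shared seed instead of a pairwise key agreement, and omits the secret sharing that provides dropout tolerance.

```python
import random

MODULUS = 2**32  # public modulus; updates are assumed to be fixed-point integers

def pairwise_masks(num_clients, dim, seed=0):
    """masks[(i, j)] is the vector shared by clients i < j (stand-in for key agreement)."""
    rng = random.Random(seed)
    return {(i, j): [rng.randrange(MODULUS) for _ in range(dim)]
            for i in range(num_clients) for j in range(i + 1, num_clients)}

def mask_update(client, update, masks, num_clients):
    """Add masks shared with higher-indexed peers, subtract those with lower ones."""
    masked = list(update)
    for peer in range(num_clients):
        if peer == client:
            continue
        pair = (min(client, peer), max(client, peer))
        sign = 1 if client < peer else -1
        masked = [(m + sign * s) % MODULUS for m, s in zip(masked, masks[pair])]
    return masked

def aggregate(masked_updates):
    """Server-side sum; pairwise masks cancel, leaving the sum of true updates."""
    dim = len(masked_updates[0])
    return [sum(u[d] for u in masked_updates) % MODULUS for d in range(dim)]
```

Each masked vector looks uniformly random to the server on its own, while the aggregate is exact. The number of pairwise masks grows quadratically with the number of clients.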
Homomorphic Encryption (HE): Additive (Paillier) and leveled (CKKS) HE schemes allow FL servers to aggregate encrypted model updates. State-of-the-art frameworks such as FedML-HE reduce computation and bandwidth through selective parameter encryption, which focuses on the most privacy-sensitive weights (Jin et al., 2023, Dutta et al., 2024). Recent advances achieve:
- Selective-HE overheads well below those of full HE (e.g., for foundation models).
- CKKS supports approximate arithmetic for high-dimensional real vector updates.
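Additively homomorphic aggregation can be illustrated with a toy Paillier implementation. The sketch below uses tiny fixed primes and is insecure by construction; it only demonstrates that multiplying ciphertexts adds the underlying plaintexts, so a server can sum encrypted (integer-encoded) updates without decrypting them. The function names and parameters are assumptions for illustration and are unrelated to FedML-HE's CKKS-based design.

```python
import math
import random

def keygen(p=1009, q=1013):
    """Toy Paillier keys from two small primes (insecure; illustration only)."""
    n = p * q
    lam = (p - 1) * (q - 1) // math.gcd(p - 1, q - 1)  # lcm(p - 1, q - 1)
    # With generator g = n + 1, L(g^lam mod n^2) = lam mod n, so mu = lam^-1 mod n.
    mu = pow(lam, -1, n)  # modular inverse; requires Python 3.8+
    return n, (lam, mu)

def encrypt(n, m, rng=random):
    """Enc(m) = (1 + n)^m * r^n mod n^2 for a random r coprime to n."""
    n2 = n * n
    while True:
        r = rng.randrange(1, n)
        if math.gcd(r, n) == 1:
            break
    return (pow(1 + n, m, n2) * pow(r, n, n2)) % n2

def add_encrypted(n, c1, c2):
    """Multiplying ciphertexts adds the underlying plaintexts modulo n."""
    return (c1 * c2) % (n * n)

def decrypt(n, private_key, c):
    """Dec(c) = L(c^lam mod n^2) * mu mod n, where L(u) = (u - 1) / n."""
    lam, mu = private_key
    u = pow(c, lam, n * n)
    return ((u - 1) // n * mu) % n
```

In an FL setting the clients would hold the decryption key (or shares of it) and the server would only call `add_encrypted`; real deployments use vetted libraries and key sizes of 2048 bits or more.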
Trusted Execution Environments (TEEs): TEEs such as Intel SGX perform privacy-critical aggregation in hardware-isolated enclaves. They are used in moderate-scale settings, but they are vulnerable to side channels and limited by enclave memory (Truong et al., 2020).
4. Architectures, Mechanisms, and Frameworks
Table: Core Mechanisms in Privacy-Preserving FL
| Mechanism | Privacy Guarantee | Cost/Overhead |
|---|---|---|
| (ε, δ)-DP | Statistical, (ε, δ)-DP | Utility loss ↑ as ε↓ |
| SMPC/SecAgg | Information-theoretic | Communication $O(n^2)$ in clients $n$ |
| HE | IND-CPA (strong) | High compute and bandwidth overhead |
| Group Signature | Identity and unlinkability | $O(1)$ per round |
Architectures and frameworks:
- PrivacyFL: Modular simulator integrating DP, SMPC, centralized/decentralized FL, client dropout, and latency simulation (Mugunthan et al., 2020).
- AMI-FML: Privacy-preserving FL for Advanced Metering Infrastructure; combines intensive communication compression (quantization, random masking) and inherent privacy from never transferring raw time series (Biswal et al., 2021).
- FedML-HE: Parameter-selective encryption with CKKS; reports a small maximum accuracy loss on ResNet/BERT together with a large bandwidth reduction (Jin et al., 2023).
- FedXGBoost: Secure matrix multiplication or local DP for federated tree boosting; information-theoretic or $\varepsilon$-LDP guarantees depending on variant (Le et al., 2021).
- GSFL: Group signatures for client/identity privacy; provides constant $O(1)$ signing/verification overhead and resistance to linkage/attribute inference (Kanchan et al., 2022).
Personalized FL: Recent work extends privacy-preserving FL to personalized model variants, in which each client optimizes a client-specific objective over its own personalized model rather than a single shared one, with privacy protection applied to personalized deltas (e.g., APPLE+DP, APPLE+HE) (Hosain et al., 2025).
5. Empirical Results and Privacy–Utility Trade-offs
Robust empirical findings across frameworks and application domains:
- DP-utility trade-off: As privacy tightens (smaller $\varepsilon$), utility drops. For moderate $\varepsilon$, DP-FL models can achieve accuracy close to that of non-private models on MNIST, CIFAR-10, VirusMNIST, and EHR regression (Ganadily et al., 2024, Rafi et al., 2023, Hosain et al., 2025). Excessive noise destroys accuracy, especially in local DP.
- HE and SMPC overheads: Full-model HE adds substantial latency for large models, but selective masking restores near-plaintext runtime (Jin et al., 2023). SMPC with optimized aggregation scales only up to a limited number of clients; larger deployments are communication-bound.
- Structured masking/compression: AMI-FML demonstrates that random masking (at a fixed zero density) and quantization substantially reduce communication with negligible accuracy loss in time-series FL (Biswal et al., 2021).
- Hybrid (DP+SMPC): Combining DP with cryptographically protected aggregation (threshold HE, SMC) requires less noise per client and achieves better accuracy scaling with the number of parties, particularly in high-trust cross-silo settings (Truex et al., 2018).
- Privacy–fairness interplay: DP noise can worsen group fairness, but combined schemes (PrivFairFL) achieve formal DP guarantees with small accuracy loss while bounding group disparities such as statistical parity difference (Pentyala et al., 2022, Rafi et al., 2023).
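The masking and quantization step mentioned for AMI-FML can be sketched generically. The code below is an assumption-laden illustration, not AMI-FML's actual scheme: it uses independent Bernoulli masking with rescaling and uniform min-max quantization, and the function names, keep probability, and bit width are all hypothetical.

```python
import random

def random_mask(update, keep_prob, rng):
    """Zero each coordinate independently; rescale survivors to stay unbiased."""
    return [v / keep_prob if rng.random() < keep_prob else 0.0 for v in update]

def quantize(update, bits):
    """Uniform quantization to 2**bits levels over the vector's [min, max] range."""
    lo, hi = min(update), max(update)
    levels = (1 << bits) - 1
    if hi == lo:
        return [0] * len(update), lo, 0.0
    step = (hi - lo) / levels
    codes = [round((v - lo) / step) for v in update]
    return codes, lo, step

def dequantize(codes, lo, step):
    """Map integer codes back to approximate real values."""
    return [lo + c * step for c in codes]
```

A client would transmit only the nonzero positions and their integer codes plus the two scalars `lo` and `step`. The reconstruction error per coordinate is at most half a quantization step.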
6. Special Topics: Resource-Constrained, Uncertainty, and Explainability
- Edge FL and IoT: Lightweight head-only retraining with secure aggregation, as in mHealth FL, retains most of the full model's accuracy at marginal computational and communication cost, which is critical in wearables (Aminifar et al., 2024).
- Explainable and continual-learning FL: Hyperdimensional computing frameworks (FedHDPrivacy) support both DP and auditability of cumulative noise, enabling interpretable and lifelong learning on IoT streams (Piran et al., 2024).
- Uncertainty estimation: PPFL in medical imaging now integrates Bayesian, conformal, and calibration methods with privacy guarantees, including DP-protected conformal predictions and ensemble methods that avoid raw data or parameter leakage (Koutsoubis et al., 2024).
7. Regulatory, Fairness, and Open Challenges
- GDPR and Regulatory Alignment: FL is well-positioned for GDPR Article 5 compliance (data minimization, purpose limitation). Additional technical measures—DP, SMPC, HE—fulfill “privacy by design” (Art. 25). FL provides strong default erasure and pseudonymization capabilities (Truong et al., 2020, Sen et al., 2024).
- Fairness under privacy: DP noise can compromise fairness, especially for minorities. Joint DP+fairness interventions (PrivFairFL, FairFed) use SMPC and DP to publish group statistics without leaking sensitive attributes; practical systems can bound fairness metrics with little utility loss (Pentyala et al., 2022, Rafi et al., 2023).
- Scalability and System Gaps: Practical SMPC and HE still face scalability limits for millions of clients. Efficient aggregation (e.g., LightSecAgg) and hybrid designs (TEE+DP, DP+HE) are under active exploration (Zhao et al., 2024).
- Dynamic and Non-IID Environments: Adaptive DP budgeting, communication-efficient privacy, and robust aggregation under client/workload churn represent significant engineering and theoretical frontiers (Piran et al., 2024).
- Certifiable Privacy and Auditability: Formal certification, attestation, and transparent audit of privacy budgets are prerequisites for regulatory and industry deployment (Zhao et al., 2024, Truong et al., 2020).
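An auditable privacy budget can be as simple as an append-only ledger. The sketch below is hypothetical and deliberately conservative: it assumes basic sequential composition, under which per-round (ε, δ) costs add up, and refuses any round that would exceed the configured budget. The class and method names are illustrative; tighter accountants would admit more rounds for the same budget.

```python
class PrivacyLedger:
    """Tracks cumulative (epsilon, delta) under basic sequential composition."""

    def __init__(self, epsilon_budget, delta_budget):
        self.epsilon_budget = epsilon_budget
        self.delta_budget = delta_budget
        self.entries = []  # audit trail of (round_id, epsilon, delta)

    def spent(self):
        """Total epsilon and delta consumed by the recorded rounds."""
        eps = sum(e for _, e, _ in self.entries)
        delta = sum(d for _, _, d in self.entries)
        return eps, delta

    def charge(self, round_id, epsilon, delta):
        """Record a round's cost, or refuse it if the budget would be exceeded."""
        eps, dlt = self.spent()
        if eps + epsilon > self.epsilon_budget or dlt + delta > self.delta_budget:
            return False
        self.entries.append((round_id, epsilon, delta))
        return True
```

Because `entries` records every accepted round, an auditor can recompute the spent budget independently of the training code.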
Taken together, federated learning and privacy-preserving ML research demonstrate that strong privacy can be achieved through a judicious combination of differential privacy, cryptographic aggregation, adaptive protocol design, and regulatory alignment. While current methods can approach or match centralized utility under moderate privacy budgets, challenges of scalability, fairness, and robust dynamic operation remain active areas for innovation (Koutsoubis et al., 2024, Rafi et al., 2023, Piran et al., 2024).