Federated Learning Strategies
- Federated learning strategies are methods for collaborative, privacy-preserving model training across distributed clients using techniques like FedAvg and FedProx.
- They employ communication-efficient approaches, such as structured and sketched updates, to minimize data exchange while maintaining convergence.
- Robust client selection, personalized model adaptation, and strong privacy mechanisms support fairness, adaptability, and confidentiality in heterogeneous environments.
Federated learning (FL) strategies refer to the algorithmic, architectural, and procedural methods enabling the collaborative training of machine learning models across multiple, distributed, and data-private clients under central or decentralized orchestration. The primary motivation is to harness heterogeneous, private data silos—such as mobile devices, hospitals, or enterprises—by exchanging only model parameters, gradients, or loss statistics, rather than raw data. These strategies must balance statistical and system heterogeneity, privacy guarantees, scalability, communication constraints, and adaptive responsiveness to drifting environments.
1. Foundational Algorithmic Strategies
The canonical FL optimization setup seeks to minimize a global risk $F(w) = \sum_{k=1}^{K} p_k F_k(w)$, where $w$ is the global model, $F_k$ is the local risk on client $k$, and the weights $p_k$ are typically proportional to data shares. The central strategy is Federated Averaging (FedAvg): a server at round $t$ broadcasts $w^t$, selected clients run local SGD (often for $E$ epochs) to produce local models $w_k^{t+1}$, and the server aggregates via $w^{t+1} = \sum_k p_k\, w_k^{t+1}$ (Fernandez et al., 2023). FedProx adds a proximal term to the local objective, regularizing client drift under non-IID data (Fernandez et al., 2023, Tertulino, 3 Sep 2025). Advanced variants, such as SCAFFOLD (control variate reduction) (Gafni et al., 2021), server-side adaptive aggregation (FedAdam, FedAdagrad), and clustering-based methods (FedCluster), target convergence or stability under strong heterogeneity (Tertulino, 3 Sep 2025).
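The sketch below illustrates one FedAvg round with an optional FedProx-style proximal term on a synthetic least-squares problem; the client count, data, step size, and proximal coefficient are illustrative assumptions, not values from the cited works.

```python
import numpy as np

def local_update(w_global, X, y, lr=0.1, epochs=5, mu_prox=0.0):
    """Local SGD on a least-squares loss; mu_prox > 0 adds a FedProx-style proximal term."""
    w = w_global.copy()
    for _ in range(epochs):
        grad = X.T @ (X @ w - y) / len(y)      # gradient of the local risk F_k(w)
        grad += mu_prox * (w - w_global)       # proximal pull toward the broadcast global model
        w -= lr * grad
    return w

def fedavg_round(w_global, clients, mu_prox=0.0):
    """One communication round: broadcast, local training, weighted averaging."""
    sizes = np.array([len(y) for _, y in clients], dtype=float)
    p = sizes / sizes.sum()                    # weights p_k proportional to data shares
    local_models = [local_update(w_global, X, y, mu_prox=mu_prox) for X, y in clients]
    return sum(pk * wk for pk, wk in zip(p, local_models))

# Toy federation: 4 clients with heterogeneous linear data (illustrative only).
rng = np.random.default_rng(0)
w_true = rng.normal(size=5)
clients = []
for _ in range(4):
    X = rng.normal(size=(50, 5))
    clients.append((X, X @ w_true + 0.1 * rng.normal(size=50)))

w = np.zeros(5)
for t in range(20):
    w = fedavg_round(w, clients, mu_prox=0.01)  # set mu_prox=0.0 for plain FedAvg
print("distance to w_true:", np.linalg.norm(w - w_true))
```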
In dynamic, nonstationary settings, Dynamic Federated Learning models the optimizer’s target as a random walk, yielding steady-state performance bounds of the form
$$\limsup_{t\to\infty}\, \mathbb{E}\,\|w_t^\star - w_t\|^2 \;=\; O\!\left(\mu\,(\sigma_s^2 + \sigma_q^2)\right) \;+\; O\!\left(\frac{\sigma_w^2}{\mu}\right),$$
where $\mu$ is the learning rate, $\sigma_s^2$ and $\sigma_q^2$ quantify data and model noise, and $\sigma_w^2$ the environment drift (Rizk et al., 2020). This elucidates the trade-off between steady-state accuracy and adaptive tracking.
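Written schematically as $\mathrm{MSE}(\mu) \approx c_1\,\mu\,(\sigma_s^2+\sigma_q^2) + c_2\,\sigma_w^2/\mu$, with $c_1, c_2$ placeholder constants rather than values from Rizk et al. (2020), the trade-off can be made explicit by balancing the two terms:
$$\frac{d}{d\mu}\left(c_1\,\mu\,(\sigma_s^2+\sigma_q^2) + \frac{c_2\,\sigma_w^2}{\mu}\right) = 0
\;\;\Longrightarrow\;\;
\mu^\star = \sqrt{\frac{c_2\,\sigma_w^2}{c_1\,(\sigma_s^2+\sigma_q^2)}},
\qquad
\mathrm{MSE}(\mu^\star) = 2\sqrt{c_1 c_2\,(\sigma_s^2+\sigma_q^2)\,\sigma_w^2}.$$
Small step sizes suppress gradient noise but track the drifting optimum slowly; large step sizes track quickly but inflate steady-state error.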
Federated Daisy-Chaining (FedDC) interleaves rounds of model averaging with daisy-chain permutations, allowing models to be trained on a chain of different small-sample datasets—crucial for FL in extreme data-sparse regimes (Kamp et al., 2021).
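A minimal sketch of the daisy-chaining idea: models take a local step each round, are passed between clients by a random permutation, and are only averaged every few rounds; the permutation schedule, averaging period, and least-squares objective are illustrative assumptions, not the exact FedDC procedure.

```python
import numpy as np

rng = np.random.default_rng(1)

def local_step(w, X, y, lr=0.1):
    """One gradient step on a client's least-squares loss (illustrative stand-in for local training)."""
    return w - lr * X.T @ (X @ w - y) / len(y)

def feddc_round(models, clients, t, avg_period=5):
    """Daisy-chain permutation each round; plain model averaging every avg_period rounds."""
    models = [local_step(w, X, y) for w, (X, y) in zip(models, clients)]
    if t % avg_period == 0:
        avg = np.mean(models, axis=0)            # aggregation round
        return [avg.copy() for _ in models]
    perm = rng.permutation(len(models))          # daisy-chaining: models move to other clients
    return [models[i] for i in perm]

# Toy data-sparse federation: 6 clients with only 10 samples each (illustrative only).
w_true = rng.normal(size=4)
clients = []
for _ in range(6):
    X = rng.normal(size=(10, 4))
    clients.append((X, X @ w_true + 0.05 * rng.normal(size=10)))

models = [np.zeros(4) for _ in clients]
for t in range(1, 101):
    models = feddc_round(models, clients, t)
print("mean distance to w_true:", np.mean([np.linalg.norm(w - w_true) for w in models]))
```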
2. Communication-Efficient Approaches
Communication bottlenecks prompt strategies that reduce the information transferred per round. Structured updates constrain local updates to low-rank or sparse subspaces, with only subspace coefficients transmitted (Konečný et al., 2016). Sketched updates use random rotation, quantization, and random subsampling to compress full local updates before upload. As these schemes are unbiased, the induced variance is tractable and does not asymptotically harm convergence (Konečný et al., 2016, Gafni et al., 2021).
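A minimal sketch of one sketched-update pipeline (random subsampling followed by unbiased stochastic quantization) in the spirit of Konečný et al. (2016); the keep probability and number of quantization levels are illustrative assumptions.

```python
import numpy as np

def sketch_update(update, keep_prob=0.25, levels=16, rng=None):
    """Compress a local update via random subsampling plus unbiased stochastic quantization."""
    rng = rng or np.random.default_rng()
    mask = rng.random(update.shape) < keep_prob
    sub = np.where(mask, update / keep_prob, 0.0)     # inverse-probability scaling keeps E[sub] = update
    lo, hi = sub.min(), sub.max()
    step = (hi - lo) / (levels - 1) or 1.0
    grid = (sub - lo) / step
    low = np.floor(grid)
    prob_up = grid - low                              # stochastic rounding is unbiased per coordinate
    q = low + (rng.random(update.shape) < prob_up)
    return q * step + lo                              # dequantized value; only codes + (lo, hi) are sent

# Averaging many independent sketches recovers the true update (unbiasedness in action).
rng = np.random.default_rng(0)
true_update = rng.normal(size=10_000)
est = np.mean([sketch_update(true_update, rng=np.random.default_rng(s)) for s in range(200)], axis=0)
print("mean abs error of the averaged sketch:", np.abs(est - true_update).mean())
```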
Further, evolutionary strategies (ES) approaches such as EvoFed and FedES replace parameter exchange with scalar loss or “fitness” statistics evaluated on a synchronized, noise-perturbed population. Each client computes similarity scores between its locally-updated model and each random perturbation of the global model, dramatically reducing per-round communication to $P$ scalars, where $P$ is the (much smaller) population size relative to the model dimension $d$ (Rahimi et al., 2023, Lan, 2023). In FedES, these scalar losses, combined with antithetic perturbations and a pre-shared seed, enable gradient estimation while maintaining privacy (Lan, 2023).
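A sketch of this ES-style exchange: server and clients regenerate an identical perturbation population from a pre-shared seed, each client uploads only $P$ scalar antithetic loss differences, and the server forms a standard ES gradient estimate. The population size, noise scale, learning rate, and least-squares loss are illustrative assumptions, not the exact EvoFed or FedES recipes.

```python
import numpy as np

def perturbations(seed, t, pop_size, dim, sigma=0.05):
    """Server and clients regenerate the same noise population from the pre-shared seed."""
    rng = np.random.default_rng([seed, t])
    return sigma * rng.normal(size=(pop_size, dim))

def client_fitness(w_global, eps, X, y):
    """Upload only P scalars: antithetic loss differences at the perturbed global models."""
    loss = lambda w: float(np.mean((X @ w - y) ** 2))
    return np.array([loss(w_global + e) - loss(w_global - e) for e in eps])

def server_step(w_global, eps, all_fitness, lr=0.2, sigma=0.05):
    """ES gradient estimate from the averaged scalar fitness values (no parameters uploaded)."""
    f = np.mean(all_fitness, axis=0)                  # average antithetic differences over clients
    grad_est = (eps.T @ f) / (2 * sigma**2 * len(f))  # standard antithetic ES estimator
    return w_global - lr * grad_est

# Toy federation: 5 clients with linear data (illustrative only).
rng = np.random.default_rng(0)
w_true = rng.normal(size=8)
clients = []
for _ in range(5):
    X = rng.normal(size=(40, 8))
    clients.append((X, X @ w_true + 0.05 * rng.normal(size=40)))

seed, w = 1234, np.zeros(8)
for t in range(300):
    eps = perturbations(seed, t, pop_size=32, dim=8)
    fitness = [client_fitness(w, eps, X, y) for X, y in clients]   # P scalars per client per round
    w = server_step(w, eps, fitness)
print("distance to w_true:", np.linalg.norm(w - w_true))
```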
Loss-based mutual learning (FL-DML) transmits only predicted losses or soft-labels on a public test set, further reducing communication and limiting attack surface for model inversion (Gupta, 3 Mar 2025).
3. Client Selection and Aggregation Strategies
Client selection governs participation fairness, statistical representativeness, and convergence. Basic random sampling is common but can under-sample minority data modes or rare clients. Loss-based selection prioritizes clients with high current loss to accelerate convergence but may induce instability or over-focus on outliers (Legler et al., 18 Aug 2024, Cho et al., 2020). Cluster-based selection groups clients by “signatures” (e.g., final-layer activations) and ensures coverage of all clusters each round, yielding both faster and more stable convergence, particularly in cross-silo, non-IID federations (Legler et al., 18 Aug 2024).
More refined strategies, such as distribution-controlled client selection (DC), optimize each round’s active client set to align the overall label distribution with a specified target (e.g., balanced or global average), via a greedy alignment of per-client label histograms. The method, when applied on top of FedAvg, FedProx, or attention-based aggregation (FedAtt), improves convergence and F1-score, with choice of target dependent on the imbalance regime—balanced target for local imbalance, real global distribution for global imbalance (Düsing et al., 25 Sep 2025).
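A sketch of the greedy alignment step: clients are added to the round's active set one at a time so that the pooled label histogram moves closest (in L1 distance here, an illustrative choice) to the target distribution; the client count and histograms are toy values, not the exact procedure of Düsing et al. (2025).

```python
import numpy as np

def dc_select(client_histograms, target, k):
    """Greedily pick k clients whose pooled label histogram best aligns with `target`."""
    selected, pooled = [], np.zeros_like(target, dtype=float)
    remaining = set(range(len(client_histograms)))
    for _ in range(k):
        def misalignment(i):
            cand = pooled + client_histograms[i]
            return np.abs(cand / cand.sum() - target).sum()   # L1 distance to the target distribution
        best = min(remaining, key=misalignment)
        selected.append(best)
        pooled += client_histograms[best]
        remaining.remove(best)
    return selected, pooled / pooled.sum()

# Toy federation: 10 clients with skewed 3-class label histograms (illustrative only).
rng = np.random.default_rng(0)
hists = np.array([rng.multinomial(100, rng.dirichlet(np.ones(3) * 0.3)) for _ in range(10)], dtype=float)
target = np.ones(3) / 3          # balanced target, the suggested choice under local imbalance
chosen, achieved = dc_select(hists, target, k=4)
print("selected clients:", chosen, "pooled label distribution:", np.round(achieved, 3))
```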
Peer-to-peer aggregation approaches, particularly relevant for biomedical data, enable a range of weighted averaging strategies (sketched in code after this list):
- Data-size weighting for statistical reliability,
- Accuracy-based weighting (e.g., more weight to peers with lower local accuracy) for improving worst-case clients,
- Contribution-based and cluster-based weighting to adapt to local distribution shift or difficult data (Salmeron et al., 15 Feb 2024).
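A minimal sketch of the first two weighting variants, where peers average one another's parameter vectors using data-size or inverse-accuracy weights; the inverse-accuracy form (more weight to weaker peers) is one illustrative instantiation of accuracy-based weighting.

```python
import numpy as np

def aggregation_weights(data_sizes, accuracies, scheme="data_size"):
    """Compute normalized peer weights for weighted model averaging."""
    if scheme == "data_size":
        raw = np.asarray(data_sizes, dtype=float)          # statistical reliability
    elif scheme == "inverse_accuracy":
        raw = 1.0 - np.asarray(accuracies, dtype=float)    # favor peers with weak local accuracy
        raw = np.clip(raw, 1e-6, None)
    else:
        raise ValueError(f"unknown scheme: {scheme}")
    return raw / raw.sum()

def p2p_aggregate(models, weights):
    """Weighted average of peer model parameter vectors."""
    return sum(w * m for w, m in zip(weights, models))

# Illustrative peers: parameter vectors plus reported data sizes and validation accuracies.
models = [np.array([0.9, 1.1]), np.array([1.2, 0.8]), np.array([0.5, 1.6])]
sizes, accs = [200, 50, 120], [0.91, 0.62, 0.78]
for scheme in ("data_size", "inverse_accuracy"):
    w = aggregation_weights(sizes, accs, scheme)
    print(scheme, np.round(w, 3), np.round(p2p_aggregate(models, w), 3))
```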
4. Personalization, Model Heterogeneity, and Adaptation
Personalization addresses the reality that a global model may not fit all clients well, especially under non-IID data. Strategies include local fine-tuning, multi-headed models with shared feature extractors but client-specific heads, and meta-learning for fast local adaptation (Nasim et al., 7 Feb 2025, Fernandez et al., 2023). Federated Mutual Learning (FML) maintains both a meme model (global knowledge exchange) and a personalized model per client, trained via mutual distillation (bidirectional KL divergence) without parameter sharing across private architectures (Shen et al., 2020).
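A sketch of the mutual-distillation objective in the spirit of FML: the personalized and meme models each add a KL term toward the other's softened predictions on top of their supervised losses. The temperature, mixing weights, and plain softmax classifier are illustrative assumptions.

```python
import numpy as np

def softmax(z, T=1.0):
    z = z / T
    z = z - z.max(axis=1, keepdims=True)
    e = np.exp(z)
    return e / e.sum(axis=1, keepdims=True)

def kl(p, q, eps=1e-12):
    """Mean KL divergence KL(p || q) over a batch of distributions."""
    return float(np.mean(np.sum(p * (np.log(p + eps) - np.log(q + eps)), axis=1)))

def mutual_losses(logits_personal, logits_meme, labels, alpha=0.5, beta=0.5, T=2.0):
    """Each model's loss mixes cross-entropy with a KL pull toward the other model's soft labels."""
    p_pers, p_meme = softmax(logits_personal, T), softmax(logits_meme, T)
    ce = lambda logits: float(np.mean(-np.log(softmax(logits)[np.arange(len(labels)), labels] + 1e-12)))
    loss_personal = alpha * ce(logits_personal) + (1 - alpha) * kl(p_meme, p_pers)
    loss_meme = beta * ce(logits_meme) + (1 - beta) * kl(p_pers, p_meme)
    return loss_personal, loss_meme

# Toy batch: 4 samples, 3 classes (illustrative logits and labels).
rng = np.random.default_rng(0)
lp, lm = rng.normal(size=(4, 3)), rng.normal(size=(4, 3))
labels = np.array([0, 2, 1, 1])
print(mutual_losses(lp, lm, labels))
```

Only the meme model's parameters are exchanged across the federation, so client architectures for the personalized models can differ freely.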
Concurrent vertical and horizontal (“square”) FL frameworks, including those based on fuzzy cognitive maps (FCMs), permit simultaneous sample- and feature-wise heterogeneity, with aggregation performed by merging local FCM adjacency matrices and weighting strategies (constant, accuracy-, precision-, or AUC-based) to balance group versus individual performance (Salmeron et al., 17 Dec 2024).
Model architecture search and selection, optimized for communication or computation given client capacities and data schemas, increasingly leverage federated neural architecture search (FNAS) (Nasim et al., 7 Feb 2025).
5. Privacy and Security Mechanisms
Strong privacy guarantees are central. Differential privacy (DP) is enforced by local or server-side noise addition, with DP-SGD adding per-iteration noise calibrated to a global budget (Daly et al., 11 Oct 2024, Bhaskar et al., 24 Jan 2025). A key result demonstrates that, under a fixed DP budget and total number of update steps, one local epoch per global round (i.e., $E=1$) is optimal: more local epochs per round linearly degrade utility for a given privacy guarantee. Moreover, increasing the number of clients improves aggregate utility under a fixed budget by averaging out the DP-induced noise (Bhaskar et al., 24 Jan 2025).
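A minimal sketch of the per-round Gaussian mechanism used in DP-style FL: each client clips its update to a norm bound and calibrated noise is added to the average. The clip norm and noise multiplier are illustrative and this is not a substitute for a proper privacy accountant; the driver only illustrates how a larger client count averages out the DP noise at fixed noise scale.

```python
import numpy as np

def clip_update(update, clip_norm=1.0):
    """Clip a client update to L2 norm `clip_norm` so each client's contribution is bounded."""
    norm = np.linalg.norm(update)
    return update * min(1.0, clip_norm / (norm + 1e-12))

def dp_aggregate(client_updates, clip_norm=1.0, noise_multiplier=1.0, rng=None):
    """Average clipped updates and add Gaussian noise scaled to the per-client sensitivity."""
    rng = rng or np.random.default_rng()
    n = len(client_updates)
    clipped = [clip_update(u, clip_norm) for u in client_updates]
    mean = np.mean(clipped, axis=0)
    sigma = noise_multiplier * clip_norm / n       # sensitivity of the mean shrinks with n
    return mean + rng.normal(scale=sigma, size=mean.shape)

# Utility improves with more clients under a fixed noise multiplier (illustrative).
rng = np.random.default_rng(0)
true_update = np.full(1000, 0.01)
for n in (10, 100, 1000):
    ups = [true_update + 0.05 * rng.normal(size=1000) for _ in range(n)]
    agg = dp_aggregate(ups, noise_multiplier=1.0, rng=rng)
    print(f"n={n:4d}  error={np.linalg.norm(agg - true_update):.4f}")
```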
Other privacy controls include secure aggregation (e.g., Bonawitz protocol) and homomorphic encryption, protecting individual updates even from an honest-but-curious server (Nasim et al., 7 Feb 2025, Daly et al., 11 Oct 2024). Rogue servers or incomplete DP guarantee enforcement can be mitigated with trusted execution environments (TEEs) and open-source, auditable aggregation binaries (Daly et al., 11 Oct 2024).
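A toy illustration of the pairwise-masking idea behind secure aggregation: each pair of clients derives a shared mask from a common seed, one adds it and the other subtracts it, so individual uploads look random to the server while the masks cancel exactly in the sum. Key agreement, dropout recovery, and the finite-field arithmetic of the full Bonawitz protocol are omitted.

```python
import numpy as np

def pairwise_mask(seed_i, seed_j, dim):
    """Deterministic mask shared by clients i and j (stand-in for a key-agreement-derived PRG seed)."""
    rng = np.random.default_rng([min(seed_i, seed_j), max(seed_i, seed_j)])
    return rng.normal(size=dim)

def masked_upload(i, update, seeds, dim):
    """Client i adds +mask toward higher-indexed peers and -mask toward lower-indexed peers."""
    masked = update.copy()
    for j in range(len(seeds)):
        if j == i:
            continue
        m = pairwise_mask(seeds[i], seeds[j], dim)
        masked += m if i < j else -m
    return masked

# Three clients with illustrative updates; the server only ever sees the masked vectors.
rng = np.random.default_rng(0)
dim, seeds = 5, [101, 202, 303]
updates = [rng.normal(size=dim) for _ in seeds]
uploads = [masked_upload(i, u, seeds, dim) for i, u in enumerate(updates)]
print("masked upload 0:", np.round(uploads[0], 2))
print("sum of uploads :", np.round(np.sum(uploads, axis=0), 4))
print("true sum       :", np.round(np.sum(updates, axis=0), 4))
```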
Privacy-focused communication-efficient strategies, such as EvoFed and FedES, provide information-theoretic leakage reduction, since recovery of gradients or private data is unfeasible without knowing the pre-shared random seed or population (Rahimi et al., 2023, Lan, 2023).
6. Special Architectures, System Heterogeneity, and Real-World Implementation
FL deployments must accommodate clients with widely varying compute, communication, network latency, and data properties. Techniques include:
- Asynchronous aggregation (e.g., CO-OP) with staleness-aware weights (see the sketch after this list) (Nasim et al., 7 Feb 2025)
- Hierarchical aggregation (edge-level synchronization, cloud-level final aggregation) (Nasim et al., 7 Feb 2025)
- Hybrid vertical/horizontal FL frameworks for institutions with non-aligned sample and feature schemas (Jiang et al., 28 Jan 2025, Salmeron et al., 17 Dec 2024)
- Resource-aware client scheduling and transmission allocation, drawing from signal processing (random access, over-the-air functional computation, consensus/diffusion protocols) (Gafni et al., 2021)
- Incentive mechanisms, governance models, and compliance layers in organizational FL frameworks (Fernandez et al., 2023)
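The staleness-aware asynchronous aggregation mentioned above can be sketched as a server that mixes in each arriving update with a weight that decays in the number of versions the client's base model has fallen behind; the polynomial discount used here is one common illustrative choice, not the specific CO-OP rule.

```python
import numpy as np

class AsyncServer:
    """Server that merges asynchronous client updates with staleness-discounted mixing weights."""

    def __init__(self, dim, base_mix=0.5, decay=0.5):
        self.w = np.zeros(dim)
        self.version = 0                 # global model version counter
        self.base_mix, self.decay = base_mix, decay

    def pull(self):
        """Client fetches the current model and remembers its version."""
        return self.w.copy(), self.version

    def push(self, client_model, client_version):
        """Merge an update whose base model may be several versions old."""
        staleness = self.version - client_version
        alpha = self.base_mix / (1.0 + staleness) ** self.decay   # stale updates get less weight
        self.w = (1 - alpha) * self.w + alpha * client_model
        self.version += 1
        return alpha

# Illustrative usage: a stale update (base model 8 versions old) vs. a fresh one.
server = AsyncServer(dim=3)
model, v = server.pull()
server.version = 8                       # pretend other clients advanced the model meanwhile
print("mixing weight for stale update:", round(server.push(model + 1.0, v), 3))
model, v = server.pull()
print("mixing weight for fresh update:", round(server.push(model + 1.0, v), 3))
```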
System-level implementations, such as in industrial and cross-silo settings, benefit from robust API abstractions (as in IBM Federated Learning (Ludwig et al., 2020)), flexible aggregation logic (enabling, e.g., Byzantine tolerance with geometric medians or trimmed mean), and deployment best practices (partial participation, quorum-based rounds, modular cryptographic wrapping).
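A sketch of two robust aggregation rules named above, a coordinate-wise trimmed mean and a geometric median computed via Weiszfeld iterations; the trim fraction, iteration count, and toy Byzantine updates are illustrative.

```python
import numpy as np

def trimmed_mean(updates, trim_frac=0.2):
    """Coordinate-wise mean after discarding the largest and smallest values in each coordinate."""
    U = np.sort(np.stack(updates), axis=0)
    k = int(len(updates) * trim_frac)
    return U[k:len(updates) - k].mean(axis=0)

def geometric_median(updates, iters=50, eps=1e-8):
    """Weiszfeld iterations for the point minimizing the sum of distances to all updates."""
    U = np.stack(updates)
    z = U.mean(axis=0)
    for _ in range(iters):
        d = np.linalg.norm(U - z, axis=1) + eps
        z = (U / d[:, None]).sum(axis=0) / (1.0 / d).sum()
    return z

# Illustrative federation: 8 honest updates near 1.0 plus 2 Byzantine updates at 100.
rng = np.random.default_rng(0)
honest = [np.ones(4) + 0.1 * rng.normal(size=4) for _ in range(8)]
byzantine = [np.full(4, 100.0) for _ in range(2)]
updates = honest + byzantine
print("plain mean      :", np.round(np.mean(updates, axis=0), 2))
print("trimmed mean    :", np.round(trimmed_mean(updates), 2))
print("geometric median:", np.round(geometric_median(updates), 2))
```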
7. Benchmarking, Evaluation, and Hyperparameter Guidelines
Comprehensive benchmarking highlights that the optimal FL strategy is often context-specific. For medical and high-heterogeneity tasks, FedProx with tuned proximal regularization is most robust (Tertulino, 3 Sep 2025). Under cross-silo and severe data imbalance, accuracy-weighted peer-to-peer aggregation and distribution-controlled client selection consistently outperform classical FedAvg (Salmeron et al., 15 Feb 2024, Düsing et al., 25 Sep 2025).
Hyperparameters that play a critical role include:
- Learning rate $\mu$, optimally set to balance gradient noise against tracking loss, as in the dynamic FL bound of Section 1 (Rizk et al., 2020)
- Number of local epochs $E$; $E = 1$ is optimal under DP constraints (Bhaskar et al., 24 Jan 2025)
- Mini-batch size and participating client fraction for communication-accuracy tradeoff (Nasim et al., 7 Feb 2025, Fernandez et al., 2023)
- Aggregation weights, selection set size, and staleness thresholds
Trade-offs universally exist between communication/computation cost, statistical efficiency, privacy guarantees, personalization, and fairness. Systematic tuning, guided by principled performance bounds and scenario-driven empirical evaluation, is essential for effective federated learning deployments.