Personalized Federated Learning

Updated 3 September 2025
  • Personalized Federated Learning (pFL) is a decentralized approach that tailors models to each client’s unique data while preserving privacy.
  • It employs strategies such as meta-learning, modular architectures, and Bayesian methods to balance global generalization with local adaptation and mitigate client drift.
  • Empirical evaluations show that pFL methods enhance accuracy, convergence speed, and communication efficiency in heterogeneous, non-IID environments.

Personalized Federated Learning (pFL) refers to a collection of methods in decentralized machine learning where the goal is to produce models that are tailored to the data distribution of each individual client rather than a single monolithic global model. pFL arises in response to the core challenge of statistical heterogeneity found naturally in federated settings due to non-IID data residing on each client, demographic diversity, or non-uniform environments. Approaches in pFL aim to reconcile the trade-off between leveraging population-wide knowledge for generalization and adapting to local idiosyncrasies for personalization, preserving privacy by keeping all raw data local.

1. Fundamental Motivations and Taxonomy

Classic Federated Learning (FL) protocols such as FedAvg aggregate local updates to create a single global model, which often fails to accommodate the divergent data pathways encountered in, for example, health, mobile, or language modeling applications (Tan et al., 2021). The main motivations for pFL include:

  • Privacy preservation and regulatory compliance without data pooling.
  • Improved adaptability to diverse, non-IID data distributions.
  • Mitigating ‘client drift’ and poor convergence resulting from inconsistent local objectives.

Major classes of pFL techniques can be structured as follows:

| Strategy | Method Classes | Example Techniques |
|---|---|---|
| Personalize a global model | Data-based, Model-based | Data augmentation, meta-learning, regularized objectives, transfer learning |
| Directly learn personalized models | Architecture-based, Similarity-based | Parameter decoupling, knowledge distillation, clustering, MTL, model interpolation |

Model-based approaches may employ proximal terms or meta-learning (e.g., Per-FedAvg), whereas architecture-based approaches decompose networks into shared and personalized modules (e.g., FedPer, pFedMB, pMixFed). Similarity-based methods include task clustering or prototype-based interpolation, often in a multi-task learning framework (Tan et al., 2021).
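
As a concrete illustration, FedPer-style parameter decoupling can be sketched as follows (function and key names are hypothetical): only the shared "base" layers are averaged at the server, while each client's personalized "head" never leaves the client.

```python
import numpy as np

def fedper_round(client_params, base_keys):
    """Average only the base parameters across clients; heads stay local.

    client_params: list of dicts mapping layer name -> np.ndarray
    base_keys: names of the shared (base) layers
    """
    n = len(client_params)
    # Server step: average the shared base layers across all clients.
    new_base = {k: sum(p[k] for p in client_params) / n for k in base_keys}
    # Client step: adopt the averaged base, keep the personal head untouched.
    return [{**p, **new_base} for p in client_params]

clients = [
    {"base": np.array([1.0, 2.0]), "head": np.array([10.0])},
    {"base": np.array([3.0, 4.0]), "head": np.array([-5.0])},
]
updated = fedper_round(clients, base_keys=["base"])
# Both clients now share the averaged base [2.0, 3.0]; heads are unchanged.
```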

2. Methodological Innovations and Representative Algorithms

Modern pFL methods introduce algorithmic structure beyond simple fine-tuning or regularization. Notable frameworks include:

  • Mixture-of-Experts (MoE) Mixing: PFL-MoE and its variants (PFL-MF, PFL-MFE) combine global and personalized model outputs via a gating network. The mixed output for an input x adopts the form:

\hat{y}_{\text{mix}} = g \cdot M_G(\theta; x) + (1-g) \cdot M_G(\theta_i; x)

Here, g is produced by a client-side gating network (using features a or x), balancing generalization and personalization by dynamically weighting the global and local predictions (Guo et al., 2020).
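
The gated mixing can be sketched in a few lines (toy stand-in models and hypothetical gating parameters):

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def mixed_prediction(x, global_model, local_model, gate_w, gate_b):
    # Client-side gating network produces g in (0, 1) from the input features.
    g = sigmoid(gate_w @ x + gate_b)
    # Blend the global and personalized predictions with weight g.
    return g * global_model(x) + (1.0 - g) * local_model(x)

# Toy linear "models" standing in for the global and local predictors.
global_model = lambda x: x.sum()
local_model = lambda x: -x.sum()

x = np.array([1.0, 2.0])
y = mixed_prediction(x, global_model, local_model,
                     gate_w=np.zeros(2), gate_b=0.0)
# Zero gating weights give g = 0.5, so the two predictions cancel: y == 0.0
```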

  • Layer-Wise/Block-Wise Adaptation and Mixup: Methods such as pMixFed and FLAYER employ adaptive, layer-wise schemes, where the transition from fully shared (“base”) layers to fully personalized (“head”) layers is governed by a mixup coefficient or a sparse binary mask, often dynamically updated per client and round. For example, pMixFed mixes parameters:

L_{k,i}'^{(t)} = \lambda_{k,i}^{(t)}\, G_i^{(t)} + (1 - \lambda_{k,i}^{(t)})\, L_{k,i}^{(t)}

with aggregation itself leveraging mixup on global/local parameters to mitigate catastrophic forgetting (Saadati et al., 19 Jan 2025, Chen et al., 10 Dec 2024).
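
A minimal sketch of the layer-wise mixing step, with per-layer coefficients chosen by hand for illustration: layers with λ near 1 follow the global model, while layers with λ near 0 remain personalized.

```python
import numpy as np

def layerwise_mix(global_layers, local_layers, lambdas):
    """Interpolate each layer k between global and local parameters."""
    return [
        lam * g + (1.0 - lam) * l
        for g, l, lam in zip(global_layers, local_layers, lambdas)
    ]

G = [np.full(2, 1.0), np.full(2, 1.0)]   # global parameters per layer
L = [np.full(2, 3.0), np.full(2, 3.0)]   # local parameters per layer
mixed = layerwise_mix(G, L, lambdas=[1.0, 0.25])
# Layer 0 is fully global ([1, 1]); layer 1 is mostly local ([2.5, 2.5]).
```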

  • Modular/Branching Architectures: FedMN and pFedMB allow clients to personalize by selecting or weighting modules or branches from a shared pool, with routing or client-specific weighting mechanisms. In pFedMB, a layer is realized as

W_\ell = \sum_{b=1}^B \alpha_{b,\ell} W_{b,\ell}

and branch aggregation at the server is α-weighted across clients, accelerating convergence and aligning clients with similar data (Wang et al., 2022, Mori et al., 2022).
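
A toy sketch of such a branch-weighted layer (names and shapes hypothetical): the client's effective layer is the softmax-normalized α-weighted sum of B shared branches.

```python
import numpy as np

def softmax(z):
    e = np.exp(z - z.max())
    return e / e.sum()

def effective_layer(branches, logits):
    """Realize a layer as the alpha-weighted sum of shared branches."""
    alpha = softmax(logits)  # client-specific branch weights
    return sum(a * W for a, W in zip(alpha, branches)), alpha

branches = [np.eye(2), 2.0 * np.eye(2)]  # B = 2 shared branch weight matrices
W_eff, alpha = effective_layer(branches, logits=np.array([0.0, 0.0]))
# Equal logits give alpha = [0.5, 0.5] and W_eff = 1.5 * I.
```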

  • Feature Distribution Adaptation and Bayesian Priors: pFedFDA estimates global (class-conditional Gaussian) feature posteriors and lets clients interpolate between global and noisy local estimators via a validation-optimized β. Bayesian approaches (pFedBreD, pFedGP, PAC-PFL) use global models as hierarchical priors with regularization via (KL or Bregman) divergence, enabling robust adaptation under limited data (Shi et al., 2022, Boroujeni et al., 16 Jan 2024, Achituve et al., 2021, Mclaughlin et al., 1 Nov 2024).

3. Theoretical Underpinnings: Objective Functions and Regularization

Core pFL optimization objectives build upon the foundational FL aggregate empirical risk minimization; however, they introduce client-specific modifications. A representative family of objectives is:

  • Regularized Local Loss (e.g., Ditto, pFedMe):

\min_{\theta, \{\theta_k\}} \sum_{k=1}^K \frac{1}{|\mathcal{D}_k|} \left( F_k(\theta_k) + \alpha_k \|\theta_k - \theta\|^2 \right)

Personalization is achieved by tightly coupling or decoupling θ_k (the local/personalized model) from θ (the global model).
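
A toy sketch of the proximal local update behind this objective (quadratic loss and hyperparameters chosen purely for illustration): the penalty α·‖θ_k − θ‖² pulls the personalized model toward the global model without forcing them to coincide.

```python
import numpy as np

def personalized_update(theta_k, theta_global, grad_fk, alpha, lr, steps=100):
    """Gradient descent on F_k(theta_k) + alpha * ||theta_k - theta||^2."""
    for _ in range(steps):
        g = grad_fk(theta_k) + 2.0 * alpha * (theta_k - theta_global)
        theta_k = theta_k - lr * g
    return theta_k

# Client loss F_k(theta) = ||theta - c_k||^2 with local optimum c_k.
c_k = np.array([4.0])
grad_fk = lambda th: 2.0 * (th - c_k)
theta = personalized_update(np.zeros(1), theta_global=np.zeros(1),
                            grad_fk=grad_fk, alpha=1.0, lr=0.1)
# Fixed point (c_k + alpha * theta_global) / (1 + alpha) = 2.0: the
# personalized model lands between the local optimum and the global model.
```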

  • PAC-Bayesian Framework: In PAC-PFL and pFedGP, the objective is to minimize a PAC-Bayesian upper bound on risk, with terms including empirical loss and a (divergence-based) complexity penalty between local posteriors and the global hyper-posterior:

\mathcal{L}^C(Q_i, D_i) \leq \widehat{\mathcal{L}}^C(Q_i, S_i \cup \tilde{S}_i) + \frac{1}{\beta}\left( \mathrm{KL}(Q_i \| P) + \text{complexity terms} \right)

This tightly controls overfitting in the low-n regime (Boroujeni et al., 16 Jan 2024).

  • Mixture and Interpolation Strategies: In approaches such as pFedFDA, the estimator for, say, the mean vector μ_i is a convex combination of global (μ_g) and local (μ̂_i) statistics, with the coefficient β_i cross-validated to minimize the bias-variance trade-off (Mclaughlin et al., 1 Nov 2024).
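
A simplified sketch of this interpolation, with β selected on a small held-out set (the selection grid and toy data are hypothetical, not pFedFDA's actual procedure):

```python
import numpy as np

def interpolated_mean(mu_global, mu_local, beta):
    """Convex combination of global and local mean estimates."""
    return beta * mu_global + (1.0 - beta) * mu_local

def pick_beta(mu_global, mu_local, val_points, betas):
    # Choose the beta minimizing squared error on a small validation set,
    # trading bias (global estimate) against variance (noisy local estimate).
    def err(b):
        mu = interpolated_mean(mu_global, mu_local, b)
        return np.mean((val_points - mu) ** 2)
    return min(betas, key=err)

mu_g = np.array([0.0])          # global class mean
mu_l = np.array([2.0])          # noisy local estimate
val = np.array([[0.4], [0.6]])  # held-out local samples near the truth 0.5
beta = pick_beta(mu_g, mu_l, val, betas=[0.0, 0.25, 0.5, 0.75, 1.0])
# beta = 0.75 gives an interpolated mean of 0.5, matching the validation mean.
```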

4. Empirical Performance, Resource Efficiency, and Practical Benchmarks

Empirical evaluations consistently show that:

  • Personalized aggregation methods (e.g., FedALA, FedFomo) tend to provide the fastest convergence and highest accuracy on highly heterogeneous data, albeit potentially at higher memory cost due to the need to store and process multiple models per client (Khan et al., 10 Sep 2024).
  • Layer-wise/block-wise strategies (e.g., FLAYER, pMixFed) yield faster convergence, higher final accuracy, and significant reductions in computation and communication rounds compared to rigid fine-tuning or global modeling approaches. For example, FLAYER demonstrates average accuracy improvements of 5.4 percentage points (up to 14.3 on CIFAR-100) over six strong baselines, while also saving up to 80.1% in training time on some tasks (Chen et al., 10 Dec 2024).
  • Bayesian/generalization-based methods maintain robustness when clients have few samples, with calibration and uncertainty metrics (e.g., RSMSE, CE) outperforming standard FL or non-Bayesian pFL methods, especially under covariate shift or data corruption scenarios (Achituve et al., 2021, Boroujeni et al., 16 Jan 2024, Mclaughlin et al., 1 Nov 2024).
  • Modular/branching architectures and client-level model mixing further speed up adaptation and allow clients with similar data to share more features, automatically clustering the clients over training as shown by the evolution of branch weights clustering per client (Mori et al., 2022).
  • Generative replay approaches (e.g., pFedGRP) address time-varying data distributions and catastrophic forgetting when history cannot be stored, offering higher average accuracy and lower forgetting metrics than continual FL baselines (Tan et al., 2 Oct 2024).

5. Communication, Privacy, and Scalability

Communication efficiency is central for pFL at scale. Approaches such as block-wise modular communication (FedMN), parameter masking and selective upload (FLAYER), or model stacking (pFL via stacking) minimize transmission overhead while supporting flexible, personalized aggregation (Wang et al., 2022, Chen et al., 10 Dec 2024, Cantu-Cervini, 16 Apr 2024). Many methods explicitly allow or encourage partial updates (e.g., only active module parameters or most significant layer updates), substantially reducing bandwidth and memory requirements per round.
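
A minimal sketch of a selective-upload scheme of this kind (the magnitude-threshold rule is illustrative, not any particular method's): the client transmits only the parameter deltas whose magnitude exceeds a threshold, as (index, value) pairs instead of a dense update.

```python
import numpy as np

def sparse_upload(delta, threshold):
    """Keep only significant parameter deltas for transmission."""
    idx = np.flatnonzero(np.abs(delta) > threshold)
    return idx, delta[idx]

def server_apply(theta, idx, values):
    """Apply a sparse client update to the server's parameters."""
    theta = theta.copy()
    theta[idx] += values
    return theta

delta = np.array([0.001, 0.5, -0.002, -0.8])
idx, vals = sparse_upload(delta, threshold=0.01)
theta = server_apply(np.zeros(4), idx, vals)
# Only 2 of 4 entries are transmitted; theta == [0, 0.5, 0, -0.8].
```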

Privacy is maintained via classical FL assumptions (no raw data sharing) and reinforced by privacy-preserving model sharing (e.g., using Differential Privacy–perturbed predictors in stacking (Cantu-Cervini, 16 Apr 2024)). PAC–Bayesian and Bayesian frameworks provide explicit uncertainty quantification, which is valuable for safety-critical and regulatory-sensitive applications (Boroujeni et al., 16 Jan 2024, Achituve et al., 2021).

Scalability is enhanced by architecture-level innovations (modular networks, multi-branch, block coordinate optimization with mirror descent) and efficient communal codebases such as PFLlib (Zhang et al., 2023).

6. Interpretability, Trustworthiness, and Future Directions

Recent advances—such as PPFL and personalized stacking—explicitly address model interpretability by representing each client’s preference as a membership vector over canonical models, directly revealing subgroup structure and relationships between clients (Di et al., 2023). Disentanglement approaches (e.g., FedDVA) factorize representations into shared and client-specific components, aiding both explainability and transfer to downstream tasks (Yan et al., 2023).

Open challenges remain, including:

  • Dynamic heterogeneity: evolving client distributions and tasks, addressable via generative replay, adaptive partitioning, or continual learning strategies (Tan et al., 2 Oct 2024, Saadati et al., 19 Jan 2025).
  • Adaptive, on-device selection of personalization degree and aggregation weights, potentially leveraging NAS or meta-learning.
  • Robust, scalable benchmarks and holistic metrics (not just accuracy, but communication cost, privacy risk, fairness).
  • Secure, fair, and incentivized federated frameworks for real-world deployment, possibly integrating blockchain or game theory concepts (Tan et al., 2021).

7. Cross-Disciplinary Applications and Benchmarking Resources

pFL methods have demonstrated practical success in domains such as healthcare (personalized predictive models under strict privacy budgets), mobile keyboard prediction, domain-adaptive visual recognition, and recommendation systems (Tan et al., 2021, Achituve et al., 2021). Tools like PFLlib enable reproducible comparison of over 25 pFL algorithms across modalities with privacy-preserving evaluation, facilitating progress and standardization in the community (Zhang et al., 2023).


In sum, personalized federated learning encompasses a spectrum of algorithmic strategies at the intersection of optimization, architecture, information theory, and privacy. State-of-the-art methods combine modular modeling, adaptive dynamic personalization, Bayesian inference, and communication-efficient distributed optimization to address the inherent tension between generalization, personalization, resource constraints, and privacy present in real-world FL deployments. The discipline continues to advance rapidly, driven by the twin imperatives of rigor and practical utility.