
Exact SGD in Personalized Federated Learning (PFLEGO)

Updated 1 July 2026
  • The paper introduces PFLEGO, an algorithm that performs exact, unbiased SGD using a partial update schedule and layer-specific computation to tackle client heterogeneity.
  • It builds on the FedPer architecture, splitting models into a shared backbone and client-specific head to optimize both computational load and communication costs.
  • Empirical results on datasets like MNIST and CIFAR-10 demonstrate that PFLEGO outperforms traditional federated methods in high-personalization regimes.

Exact Stochastic Gradient Descent in Personalized Federated Learning with Personalization Layers (PFLEGO) is an algorithmic framework that addresses the challenges of client data heterogeneity in federated learning. Its defining property is performing unbiased, exact SGD over both shared and client-personalized parameters using a partial update schedule and layer-specific computation, resulting in strong convergence guarantees and efficient implementation. PFLEGO is especially tailored to regimes with high personalization demand, where classical FL methods are suboptimal.

1. Federated Optimization Objective

PFLEGO considers a federated scenario with $I$ clients, each equipped with a local labeled dataset $\mathcal{D}_i = \{(x_{i,j}, y_{i,j})\}_{j=1}^{N_i}$. The per-client loss is

$$\ell_i(W_i, \theta) = \frac{1}{N_i} \sum_{j=1}^{N_i} \ell(y_{i,j}, x_{i,j}; W_i, \theta),$$

where $\theta \in \mathbb{R}^d$ represents the common (shared) model weights (backbone), and $W_i \in \mathbb{R}^{K_i \times M}$ are the personalized classifier head weights for client $i$ ($M$ is the shared feature dimension and $K_i$ the number of classes for client $i$). Classically, FL attempts to minimize

$$\mathcal{L}(\theta, \{W_i\}_{i=1}^I) = \sum_{i=1}^I \alpha_i\, \ell_i(W_i, \theta),$$

where $\alpha_i$ is the sample-size-based client weight (typically $\alpha_i = N_i / \sum_{k=1}^{I} N_k$). This structure formalizes the combined global objective, aligning with the non-i.i.d., heterogeneously partitioned data typical of real-world FL settings (Nikoloutsopoulos et al., 2022).
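As a small illustration of how the weighted objective combines per-client losses, the following Python sketch assumes the weights are proportional to local dataset size. The function names and all numbers are hypothetical and are not taken from the paper.

```python
def client_weights(sizes):
    """Sample-size-based weights alpha_i = N_i / sum_k N_k (assumed form)."""
    total = sum(sizes)
    return [n / total for n in sizes]


def global_loss(client_losses, sizes):
    """Weighted sum of per-client mean losses, mirroring the global objective."""
    alphas = client_weights(sizes)
    return sum(a * loss for a, loss in zip(alphas, client_losses))


sizes = [100, 300, 600]      # hypothetical local dataset sizes N_i
losses = [0.9, 0.5, 0.2]     # hypothetical per-client mean losses
alphas = client_weights(sizes)
objective = global_loss(losses, sizes)
```

With these made-up inputs the largest client dominates the objective, which is the intended effect of sample-size weighting.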

2. Network Architecture and Personalization Layers

PFLEGO builds on the FedPer neural network architecture, decomposing the model per client into:

  • Shared backbone: A deep feature extractor, parameterized by $\theta$, mapping an input $x$ to a feature vector $\phi(x; \theta) \in \mathbb{R}^M$. For example, a 200-unit MLP backbone for MNIST/Fashion-MNIST/EMNIST, two convolutional and two fully connected layers for CIFAR-10, and four convolutional layers for Omniglot; the output dimension $M$ is dataset-specific.
  • Client-specific head: A single linear layer $W_i \in \mathbb{R}^{K_i \times M}$ mapping backbone features to logits for the $K_i$ client-specific classes.

At client $i$, the model computes the logits $W_i\, \phi(x; \theta)$, ensuring a split between universal representation and per-client adaptation.
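The split between a shared backbone and per-client heads can be sketched in a few lines of plain Python. The single tanh layer below is only a stand-in for the real feature extractors, and the names `backbone` and `head_logits` as well as all weights are hypothetical.

```python
import math


def backbone(x, theta):
    """Shared feature extractor: one tanh layer used purely for illustration."""
    return [math.tanh(sum(w * xj for w, xj in zip(row, x))) for row in theta]


def head_logits(features, W_i):
    """Client-specific linear head: a K_i x M matrix applied to M features."""
    return [sum(w * f for w, f in zip(row, features)) for row in W_i]


theta = [[0.5, -0.5], [1.0, 0.0], [0.0, 1.0]]   # M = 3 features from 2 inputs
W_a = [[1.0, 0.0, 0.0], [0.0, 1.0, 1.0]]        # client a sees K_a = 2 classes
W_b = [[1.0, 1.0, 1.0], [0.0, 0.0, 1.0],
       [1.0, 0.0, 0.0], [0.0, 1.0, 0.0]]        # client b sees K_b = 4 classes

x = [0.2, -0.1]
features = backbone(x, theta)          # identical for every client
logits_a = head_logits(features, W_a)  # length K_a
logits_b = head_logits(features, W_b)  # length K_b
```

Both clients reuse the same features, while the number of logits differs with each client's class count.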

3. Optimization and Communication Protocol

Each federated round $t$ consists of three core phases:

3.1 Client Sampling: The server samples a subset $\mathcal{I}_t$ of the $I$ clients (either binomially, with each client included independently with probability $r/I$, or as a fixed-size set of $r$ clients).
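The two sampling schemes can be sketched as follows. This is a minimal illustration that assumes an inclusion probability of `r / num_clients` in the binomial case; the function names are hypothetical.

```python
import random


def sample_binomial(num_clients, prob, rng):
    """Include each client independently with probability `prob`."""
    return [i for i in range(num_clients) if rng.random() < prob]


def sample_fixed(num_clients, r, rng):
    """Draw exactly r distinct clients uniformly at random."""
    return sorted(rng.sample(range(num_clients), r))


rng = random.Random(0)          # seeded only to make the sketch repeatable
num_clients, r = 20, 5
fixed_set = sample_fixed(num_clients, r, rng)
binomial_set = sample_binomial(num_clients, r / num_clients, rng)
```

The fixed-size scheme always returns `r` clients, whereas the binomial scheme returns `r` clients only on average.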

3.2 Local Personalized Updates (Client Side):

  • The server broadcasts $\theta$ to clients in $\mathcal{I}_t$.
  • Each client keeps its local personalized head $W_i$.
  • For steps $1$ to $\tau - 1$, the client performs gradient descent updates only on $W_i$, with client learning rate $\beta$:

$$W_i \leftarrow W_i - \beta\, \nabla_{W_i} \ell_i(W_i, \theta)$$

  • At step $\tau$, compute joint gradients:

$$\left(\nabla_{W_i} \ell_i(W_i, \theta),\; \nabla_{\theta} \ell_i(W_i, \theta)\right)$$

then update $W_i$ one final time with unbiased scaling, where $\rho_t$ is the round-$t$ step size and $I/r$ is the inverse of the selection probability:

$$W_i \leftarrow W_i - \rho_t\, \frac{I}{r}\, \alpha_i\, \nabla_{W_i} \ell_i(W_i, \theta)$$

and send the backbone gradient $\nabla_{\theta} \ell_i(W_i, \theta)$ computed at step $\tau$ back to the server.
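The client-side procedure above can be sketched on a toy problem. The example assumes a scalar backbone weight `theta`, a scalar head `W`, and a squared-error loss, none of which come from the paper; `tau` is the number of local steps, `beta` the head learning rate, `rho` the round step size, and `scale` a placeholder for the unbiased rescaling factor.

```python
def client_round(W, theta, data, tau, beta, rho, scale):
    """One PFLEGO-style client round on a toy scalar model (illustrative only).

    The prediction is W * theta * x and the loss is 0.5 * (prediction - y)^2
    averaged over the local data.
    """
    n = len(data)
    # One backbone forward pass; features are cached because theta stays fixed.
    feats = [theta * x for x, _ in data]

    # Steps 1 .. tau-1: cheap gradient steps on the head W only.
    for _ in range(tau - 1):
        grad_W = sum((W * f - y) * f for f, (_, y) in zip(feats, data)) / n
        W -= beta * grad_W

    # Step tau: joint gradients with respect to W and theta at the current W.
    resid = [W * f - y for f, (_, y) in zip(feats, data)]
    grad_W = sum(e * f for e, f in zip(resid, feats)) / n
    grad_theta = sum(e * W * x for e, (x, _) in zip(resid, data)) / n

    # Final rescaled head update; grad_theta is what goes back to the server.
    W_new = W - rho * scale * grad_W
    return W_new, grad_theta


data = [(1.0, 2.0), (2.0, 4.0)]  # hypothetical local data with y = 2x
W_one, g_one = client_round(0.0, 1.0, data, tau=1, beta=0.1, rho=0.1, scale=1.0)
W_five, g_five = client_round(0.0, 1.0, data, tau=5, beta=0.1, rho=0.1, scale=1.0)
```

In this toy setting the extra head-only steps (`tau=5`) bring `W` closer to the local optimum of 2 than the single-step round does, without any additional backbone passes.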

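3.3 Exact Server-Side SGD Aggregation:

  • The server aggregates the received gradients $\nabla_{\theta} \ell_i(W_i, \theta)$, $i \in \mathcal{I}_t$, as

$$g_t = \frac{I}{r} \sum_{i \in \mathcal{I}_t} \alpha_i\, \nabla_{\theta} \ell_i(W_i, \theta)$$

and updates, with step size $\rho_t$,

$$\theta \leftarrow \theta - \rho_t\, g_t$$

  • The expected update matches the global gradient, ensuring unbiasedness, because each client is selected with probability $r/I$:

$$\mathbb{E}_{\mathcal{I}_t}[g_t] = \sum_{i=1}^{I} \alpha_i\, \nabla_{\theta} \ell_i(W_i, \theta) = \nabla_{\theta} \mathcal{L}(\theta, \{W_i\}_{i=1}^{I})$$

  • The overall process yields an exact unbiased SGD step over the combined parameter vector

$$\psi = (\theta, W_1, \dots, W_I)$$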
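The unbiasedness claim can be checked numerically on a small example. The sketch below assumes fixed-size sampling of `r` out of `num_clients` clients and a rescaling factor of `num_clients / r`; gradients are scalars and all values are hypothetical. Averaging the rescaled aggregate over every possible subset should recover the full weighted gradient.

```python
from itertools import combinations


def aggregate(grads, alphas, selected, num_clients):
    """Rescaled weighted sum of the backbone gradients of the selected clients."""
    r = len(selected)
    return (num_clients / r) * sum(alphas[i] * grads[i] for i in selected)


def server_step(theta, agg_grad, rho):
    """Plain SGD step on the shared parameters."""
    return theta - rho * agg_grad


grads = [1.0, -2.0, 0.5, 3.0]   # hypothetical scalar backbone gradients
alphas = [0.1, 0.2, 0.3, 0.4]   # hypothetical client weights summing to 1
num_clients, r = 4, 2

full_gradient = sum(a * g for a, g in zip(alphas, grads))

# Average the aggregate over every equally likely size-r subset.
subsets = list(combinations(range(num_clients), r))
expected = sum(aggregate(grads, alphas, s, num_clients) for s in subsets) / len(subsets)

theta_next = server_step(1.0, aggregate(grads, alphas, (0, 3), num_clients), rho=0.1)
```

Each client appears in half of the six subsets, so the factor of 2 from the rescaling exactly cancels the selection probability and `expected` equals `full_gradient` up to floating-point error.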

4. Theoretical Properties and Convergence

PFLEGO operates under the assumptions that the objective $\mathcal{L}$ is $L$-smooth and bounded below, and that the stochastic gradients are unbiased with bounded variance (at most $\sigma^2$).

  • Unbiasedness: The rescaling by $I/r$ neutralizes the selection probability $r/I$, ensuring that the stochastic gradient $\hat{g}_t$ over all parameters $\psi = (\theta, W_1, \dots, W_I)$ satisfies

$$\mathbb{E}[\hat{g}_t] = \nabla \mathcal{L}(\psi_t)$$

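  • Convergence rate: Under standard Robbins-Monro conditions ($\sum_t \rho_t = \infty$, $\sum_t \rho_t^2 < \infty$), the minimum expected squared gradient norm vanishes as the number of rounds $T$ grows:

$$\min_{1 \le t \le T} \mathbb{E}\,\|\nabla \mathcal{L}(\psi_t)\|^2 \;\longrightarrow\; 0 \quad \text{as } T \to \infty$$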

  • Special cases: For $\tau = 1$ (a single local iteration), PFLEGO is mathematically equivalent to classical SGD. For $\tau > 1$, the initial $\tau - 1$ steps often expedite local $W_i$ minimization, contributing to accelerated convergence in practice.
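For orientation, the textbook argument behind this kind of guarantee can be written out as follows. This is the standard nonconvex SGD derivation, not necessarily the exact statement or constants of the paper. It assumes a smoothness constant $L$, a variance bound $\sigma^2$ in the form $\mathbb{E}\|\hat{g}_t\|^2 \le \|\nabla \mathcal{L}(\psi_t)\|^2 + \sigma^2$, step sizes $\rho_t \le 1/L$, and a lower bound $\mathcal{L}^\ast$ on the objective; $\psi_t$ denotes all parameters at round $t$ and $\hat{g}_t$ the unbiased stochastic gradient.

```latex
% One step psi_{t+1} = psi_t - rho_t * g_t, using L-smoothness and unbiasedness:
\mathbb{E}\,\mathcal{L}(\psi_{t+1})
  \le \mathbb{E}\,\mathcal{L}(\psi_t)
  - \rho_t\Bigl(1 - \tfrac{L\rho_t}{2}\Bigr)\,\mathbb{E}\,\|\nabla\mathcal{L}(\psi_t)\|^2
  + \tfrac{L\rho_t^2\sigma^2}{2}

% With rho_t <= 1/L the bracket is at least 1/2; summing over t = 1..T gives:
\tfrac{1}{2}\sum_{t=1}^{T}\rho_t\,\mathbb{E}\,\|\nabla\mathcal{L}(\psi_t)\|^2
  \le \mathcal{L}(\psi_1) - \mathcal{L}^\ast
  + \tfrac{L\sigma^2}{2}\sum_{t=1}^{T}\rho_t^2

% Bounding the weighted average from below by the minimum:
\min_{1\le t\le T}\mathbb{E}\,\|\nabla\mathcal{L}(\psi_t)\|^2
  \le \frac{2\bigl(\mathcal{L}(\psi_1) - \mathcal{L}^\ast\bigr)
            + L\sigma^2\sum_{t=1}^{T}\rho_t^2}
           {\sum_{t=1}^{T}\rho_t}
```

Under the Robbins-Monro conditions the numerator stays bounded while the denominator diverges, so the right-hand side tends to zero.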

5. Computational and Communication Efficiency

Let $C$ denote the cost of a complete forward and backward pass through both backbone and head layers.

  • PFLEGO: Each client performs one full forward pass to cache features, $\tau - 1$ extremely cheap personalization steps (matrix-vector updates on $W_i$), and a single full forward-backward pass for the joint gradients. Cumulative cost per round is $O(C)$, independent of $\tau$, i.e., $O(1)$ model passes per round.
  • FedAvg / FedPer: These baselines require $\tau$ full network passes per round, resulting in $O(\tau C)$ per-round cost.

| Method | Client Compute per Round | Communication (Client→Server) |
|---|---|---|
| PFLEGO | $O(C)$ ($O(1)$ passes) | $O(d)$ (gradients only) |
| FedAvg | $O(\tau C)$ | $O(d)$ plus head (full weights) |
| FedPer | $O(\tau C)$ | $O(d)$ |

Communication in all methods includes broadcasting $\theta$ to clients. PFLEGO thus achieves exact, unbiased SGD with substantially lower per-round compute, and with communication comparable to or better than the alternatives (Nikoloutsopoulos et al., 2022).
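A rough back-of-the-envelope comparison makes the compute gap concrete. The sketch below counts PFLEGO's feature-caching forward pass as a full pass, which is a deliberate overestimate, and treats the relative cost of a head-only step as a free parameter; both function names and all numbers are hypothetical.

```python
def pflego_cost(tau, full_pass, head_step):
    """Rough per-round client cost: at most two full passes plus tau-1 head-only steps."""
    return 2 * full_pass + (tau - 1) * head_step


def baseline_cost(tau, full_pass):
    """Rough per-round client cost when every local step is a full pass."""
    return tau * full_pass


tau = 50  # hypothetical number of local steps
cost_pflego = pflego_cost(tau, full_pass=1.0, head_step=0.001)
cost_baseline = baseline_cost(tau, full_pass=1.0)
```

Under these assumed numbers the PFLEGO cost stays near two full passes while the baseline cost grows linearly with the number of local steps.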

6. Empirical Evaluation

PFLEGO was evaluated across five multi-class benchmarks (MNIST, Fashion-MNIST, EMNIST, CIFAR-10, Omniglot) in three personalization regimes: high-personalization (each client sees only 2 classes), medium-personalization (half the classes), and no-personalization (all classes). The architectures for each dataset use the backbone structures described above.

Baselines for comparison include FedAvg, FedPer, and FedRecon (block-coordinate variant). Metrics include global training loss and final test accuracy averaged over the last 10 rounds.

| Dataset | FedPer | FedAvg | PFLEGO |
|---|---|---|---|
| MNIST | 97.88±0.25% | 97.54±0.25% | 98.43±0.21% |
| CIFAR-10 | 85.15±1.08% | 85.18±0.96% | 87.81±0.94% |
| EMNIST | 97.78±0.51% | 97.29±0.54% | 98.49±0.43% |
| Fashion-MNIST | 96.14±0.35% | 96.35±0.47% | 96.34±0.43% |
| Omniglot | 68.02±1.74% | 49.65±1.74% | 74.56±1.23% |

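In high-personalization, PFLEGO outperforms baselines in accuracy on most datasets: in the table above it attains the highest mean accuracy on MNIST, CIFAR-10, EMNIST, and Omniglot, and is within the reported error of FedAvg on Fashion-MNIST (96.34% vs. 96.35%). Convergence plots show PFLEGO achieves lower training loss in fewer rounds, especially under high heterogeneity. Rate ablations reveal that higher client learning rates ($\beta$) and participation rates ($r$) accelerate convergence.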

In medium-personalization, PFLEGO typically matches or outperforms alternatives. In the no-personalization regime, FedAvg can be optimal, as expected. These outcomes suggest PFLEGO delivers strong empirical performance under non-i.i.d. data splitting and significant class heterogeneity.

7. Summary and Significance

PFLEGO realizes a principled approach to distributed personalization in federated learning, attaining exact, unbiased SGD in a setting with both shared (backbone) and client-specific (head) parameters. The methodology offers:

  • Exact distributed SGD and provable convergence under standard smoothness and bounded-variance assumptions
  • Substantially lower computational demands per round: $O(1)$ network passes compared to $O(\tau)$
  • Efficient communication, with minimal overhead relative to prevailing FL baselines
  • Empirically validated effectiveness in personalized FL, outperforming or matching standard methods in heterogeneity-heavy regimes

The approach delineated in PFLEGO represents a rigorous advance for federated algorithms capable of accommodating personalized model layers and non-i.i.d. data, reinforcing its relevance for large-scale, realistic FL deployments (Nikoloutsopoulos et al., 2022).

References (1)
