Exact SGD in Personalized Federated Learning (PFLEGO)

Updated 1 July 2026

The paper introduces PFLEGO, an algorithm that performs exact, unbiased SGD using a partial update schedule and layer-specific computation to tackle client heterogeneity.
It builds on the FedPer architecture, splitting models into a shared backbone and client-specific head to optimize both computational load and communication costs.
Empirical results on datasets like MNIST and CIFAR-10 demonstrate that PFLEGO outperforms traditional federated methods in high-personalization regimes.

Exact Stochastic Gradient Descent in Personalized Federated Learning with Personalization Layers (PFLEGO) is an algorithmic framework that addresses the challenges of client data heterogeneity in federated learning. Its defining property is performing unbiased, exact SGD over both shared and client-personalized parameters using a partial update schedule and layer-specific computation, resulting in strong convergence guarantees and efficient implementation. PFLEGO is especially tailored to regimes with high personalization demand, where classical FL methods are suboptimal.

1. Federated Optimization Objective

PFLEGO considers a federated scenario with $I$ clients, each equipped with a local labeled dataset $\mathcal{D}_i = \{(x_{i,j}, y_{i,j})\}_{j=1}^{N_i}$. The per-client loss is

$\ell_i(W_i, \theta) = \frac{1}{N_i} \sum_{j=1}^{N_i} \ell(y_{i,j}, x_{i,j}; W_i, \theta),$

where $\theta \in \mathbb{R}^d$ denotes the common (shared) model weights of the backbone, and $W_i \in \mathbb{R}^{K_i \times M}$ are the personalized classifier head weights for client $i$ ($M$ is the shared feature dimension and $K_i$ the number of classes for client $i$). Classically, FL attempts to minimize

$\mathcal{L}(\theta, \{W_i\}_{i=1}^I) = \sum_{i=1}^I \alpha_i\, \ell_i(W_i, \theta),$

where $\alpha_i$ is the sample-size-based client weight. This structure formalizes the combined global objective, aligning with the non-i.i.d., heterogeneously partitioned data typical of real-world FL settings (Nikoloutsopoulos et al., 2022).
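As a minimal sketch of how the weighted objective is assembled, the snippet below assumes the normalization $\alpha_i = N_i / \sum_k N_k$ (a common sample-size-based choice, not confirmed by the text above) and uses made-up per-client losses. The function names are hypothetical.

```python
def client_weights(sample_counts):
    """Sample-size-based weights alpha_i = N_i / sum_k N_k (assumed normalization)."""
    total = sum(sample_counts)
    return [n / total for n in sample_counts]


def global_objective(client_losses, alphas):
    """L(theta, {W_i}) = sum_i alpha_i * l_i(W_i, theta)."""
    return sum(a * loss for a, loss in zip(alphas, client_losses))


# Hypothetical numbers: three clients holding 100, 300 and 600 samples.
alphas = client_weights([100, 300, 600])
losses = [0.9, 0.5, 0.2]  # illustrative per-client average losses l_i
objective = global_objective(losses, alphas)
```

Larger clients contribute proportionally more to the objective, which is why the aggregation step has to carry the same weights.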

2. Network Architecture and Personalization Layers

PFLEGO builds on the FedPer neural network architecture, decomposing the model per client into:

Shared backbone: A deep feature extractor, parameterized by $\theta$, mapping $x \mapsto \phi(x; \theta) \in \mathbb{R}^M$. For example, a 200-unit MLP backbone for MNIST/Fashion-MNIST/EMNIST, two convolutional and two fully connected layers for CIFAR-10, and four convolutional layers for Omniglot.
Client-specific head: A single linear layer $W_i \in \mathbb{R}^{K_i \times M}$ mapping backbone features to logits for $K_i$ client-specific classes.

At client $i$, the model computes $W_i\, \phi(x; \theta)$, which separates the universal representation from the per-client adaptation.
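The split between a shared backbone and per-client heads can be sketched in NumPy as follows. This is an illustration only: the single ReLU layer, the input size `D`, the per-client class counts in `K`, and all function names are assumptions chosen to resemble the 200-unit MNIST-style setup, not the architecture used in the paper.

```python
import numpy as np

rng = np.random.default_rng(0)

D, M = 784, 200          # input size and feature size (MNIST-like, assumed)
K = {0: 2, 1: 2, 2: 3}   # hypothetical number of classes K_i per client

# Shared backbone parameters theta: one hidden layer, for illustration only.
theta = {"V": rng.normal(scale=0.05, size=(M, D)), "b": np.zeros(M)}

# One personalized head W_i of shape (K_i, M) per client; it never leaves the client.
heads = {i: np.zeros((k, M)) for i, k in K.items()}


def backbone(x, theta):
    """phi(x; theta): maps inputs of shape (n, D) to features of shape (n, M)."""
    return np.maximum(x @ theta["V"].T + theta["b"], 0.0)  # ReLU features


def logits(x, W_i, theta):
    """Client-i logits W_i phi(x; theta), shape (n, K_i)."""
    return backbone(x, theta) @ W_i.T


x = rng.normal(size=(5, D))
out = {i: logits(x, W, theta) for i, W in heads.items()}
```

Because each head has its own output size, clients with different label sets can share the same `theta` without agreeing on a common class vocabulary.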

3. Optimization and Communication Protocol

Each federated round $t$ consists of three core phases:

3.1 Client Sampling: The server samples a subset $\mathcal{I}_t$ of the $I$ clients (either binomially, each with probability $r/I$, or as a fixed-size set of $r$ clients).

3.2 Local Personalized Updates (Client Side):

The server broadcasts $\theta_t$ to clients in $\mathcal{I}_t$.
Each client keeps its local personalized head $W_i$.
For steps $k = 1$ to $\tau - 1$, the client performs gradient descent updates only on $W_i$, with client learning rate $\beta$:

$W_i \leftarrow W_i - \beta\, \nabla_{W_i} \ell_i(W_i, \theta_t)$

At step $\tau$, compute joint gradients:

$\left( \nabla_{W_i} \ell_i(W_i, \theta_t),\; \nabla_{\theta} \ell_i(W_i, \theta_t) \right)$

then update $W_i$ one final time with unbiased scaling, using the server step size $\rho_t$:

$W_i \leftarrow W_i - \rho_t\, \frac{I}{r}\, \alpha_i\, \nabla_{W_i} \ell_i(W_i, \theta_t)$

and send $\nabla_{\theta} \ell_i(W_i, \theta_t)$ back to the server.
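The client-side phase can be sketched as below. This is a hedged illustration rather than the authors' implementation: it assumes a one-layer tanh backbone with weights `V` standing in for $\theta$, a softmax cross-entropy loss, full-batch gradients, and a single factor `scale` standing for the rescaling $(I/r)\,\alpha_i$ of the final head update. All function and argument names are hypothetical.

```python
import numpy as np


def softmax(z):
    z = z - z.max(axis=1, keepdims=True)  # stabilise the exponentials
    e = np.exp(z)
    return e / e.sum(axis=1, keepdims=True)


def loss_and_grads(W, V, x, y):
    """Mean cross-entropy of softmax(W tanh(V x)) and its gradients w.r.t. W and V."""
    n = x.shape[0]
    feats = np.tanh(x @ V.T)                  # (n, M) backbone features
    p = softmax(feats @ W.T)                  # (n, K) class probabilities
    loss = -np.log(p[np.arange(n), y]).mean()
    err = p.copy()
    err[np.arange(n), y] -= 1.0
    err /= n                                  # d loss / d logits
    grad_W = err.T @ feats                    # (K, M)
    dfeats = (err @ W) * (1.0 - feats ** 2)   # backpropagate through tanh
    grad_V = dfeats.T @ x                     # (M, D)
    return loss, grad_W, grad_V


def client_round(W, V, x, y, tau, beta, rho, scale):
    """One client round; `scale` stands for (I / r) * alpha_i (assumed form)."""
    n = x.shape[0]
    feats = np.tanh(x @ V.T)                  # cached once: the backbone is frozen here
    for _ in range(tau - 1):                  # head-only gradient steps
        p = softmax(feats @ W.T)
        p[np.arange(n), y] -= 1.0
        W = W - beta * (p.T @ feats) / n
    _, grad_W, grad_V = loss_and_grads(W, V, x, y)  # joint gradients at step tau
    W = W - rho * scale * grad_W              # final, rescaled head update
    return W, grad_V                          # grad_V is what goes to the server


# Toy usage with synthetic data (3 classes, 10 input features, 8 backbone features).
rng = np.random.default_rng(0)
x = rng.normal(size=(32, 10))
y = rng.integers(0, 3, size=32)
V = rng.normal(scale=0.1, size=(8, 10))
W0 = np.zeros((3, 8))
loss_before, _, _ = loss_and_grads(W0, V, x, y)
W1, g_theta = client_round(W0, V, x, y, tau=20, beta=0.5, rho=0.01, scale=1.0)
loss_after, _, _ = loss_and_grads(W1, V, x, y)
```

Caching `feats` is what makes the head-only steps cheap: the backbone is evaluated once for those steps, and only the final joint-gradient computation needs a full forward-backward pass.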

3.3 Exact Server-Side SGD Aggregation:

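The server aggregates the received gradients $\nabla_{\theta} \ell_i(W_i, \theta_t)$, $i \in \mathcal{I}_t$, as

$g_t = \frac{I}{r} \sum_{i \in \mathcal{I}_t} \alpha_i\, \nabla_{\theta} \ell_i(W_i, \theta_t)$

and updates, with step size $\rho_t$,

$\theta_{t+1} = \theta_t - \rho_t\, g_t$

The expected update matches the global gradient, ensuring unbiasedness:

$\mathbb{E}[g_t] = \sum_{i=1}^{I} \alpha_i\, \nabla_{\theta} \ell_i(W_i, \theta_t) = \nabla_{\theta} \mathcal{L}(\theta_t, \{W_i\}_{i=1}^{I})$

The overall process yields an exact unbiased SGD step over the combined parameter vector

$\psi = (\theta, W_1, \dots, W_I)$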
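The unbiasedness claim can be checked exactly on a toy problem. The sketch below assumes fixed-size sampling of `r` out of `I` clients, so each client is selected with probability `r / I`, and uses scalar stand-ins for the client gradients. The numbers and function names are hypothetical.

```python
from itertools import combinations


def aggregate(selected, grads, alphas, num_clients, r):
    """g = (I / r) * sum_{i in selected} alpha_i * g_i, for scalar gradients g_i."""
    return (num_clients / r) * sum(alphas[i] * grads[i] for i in selected)


def server_step(theta, g, rho):
    """Plain SGD step on the shared parameters."""
    return theta - rho * g


# Hypothetical toy problem: I = 4 clients with scalar "gradients" g_i.
alphas = [0.1, 0.2, 0.3, 0.4]
grads = [1.0, -2.0, 0.5, 3.0]
I, r = 4, 2

full_gradient = sum(a * g for a, g in zip(alphas, grads))

# Average the aggregate over every equally likely size-r subset of clients.
subsets = list(combinations(range(I), r))
expected = sum(aggregate(s, grads, alphas, I, r) for s in subsets) / len(subsets)
```

Averaged over all subsets, `expected` coincides with `full_gradient` up to floating-point error, because the factor `I / r` cancels the selection probability `r / I` of each client.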

4. Theoretical Properties and Convergence

PFLEGO operates under the assumptions that $\mathcal{L}$ is $L$-smooth and bounded below, and that the stochastic gradients $\hat{g}_t$ are unbiased with bounded variance ($\mathbb{E}\|\hat{g}_t - \nabla \mathcal{L}(\psi_t)\|^2 \le \sigma^2$).

Unbiasedness: The rescaling by $I/r$ neutralizes the selection probability, ensuring

$\mathbb{E}[\hat{g}_t \mid \psi_t] = \nabla \mathcal{L}(\psi_t)$

Convergence rate: Under standard Robbins-Monro conditions ($\sum_t \rho_t = \infty$, $\sum_t \rho_t^2 < \infty$), the minimum expected squared gradient norm vanishes:

$\min_{0 \le t < T} \mathbb{E}\|\nabla \mathcal{L}(\psi_t)\|^2 \to 0 \quad \text{as } T \to \infty$

Special cases: For $\tau = 1$ (a single local iteration), PFLEGO is mathematically equivalent to classical SGD. For $\tau > 1$, the initial $\tau - 1$ steps often expedite local $W_i$ minimization, contributing to accelerated convergence in practice.
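The convergence statement follows the textbook argument for nonconvex SGD, sketched here under the assumptions above plus the extra assumption $\rho_t \le 1/L$. The notation ($\psi_t$ for the combined parameters, $\hat{g}_t$ for the stochastic gradient, $\mathcal{L}^*$ for a lower bound on $\mathcal{L}$) and the constants are those of the generic derivation and are not taken from the paper.

By $L$-smoothness, the update $\psi_{t+1} = \psi_t - \rho_t \hat{g}_t$, and unbiasedness,

$\mathbb{E}[\mathcal{L}(\psi_{t+1})] \le \mathbb{E}[\mathcal{L}(\psi_t)] - \rho_t\, \mathbb{E}\|\nabla \mathcal{L}(\psi_t)\|^2 + \frac{L \rho_t^2}{2}\, \mathbb{E}\|\hat{g}_t\|^2$

Bounded variance gives $\mathbb{E}\|\hat{g}_t\|^2 \le \mathbb{E}\|\nabla \mathcal{L}(\psi_t)\|^2 + \sigma^2$, so with $\rho_t \le 1/L$,

$\frac{\rho_t}{2}\, \mathbb{E}\|\nabla \mathcal{L}(\psi_t)\|^2 \le \mathbb{E}[\mathcal{L}(\psi_t)] - \mathbb{E}[\mathcal{L}(\psi_{t+1})] + \frac{L \sigma^2}{2}\, \rho_t^2$

Summing over $t = 0, \dots, T-1$ and using $\mathcal{L} \ge \mathcal{L}^*$ yields

$\min_{0 \le t < T} \mathbb{E}\|\nabla \mathcal{L}(\psi_t)\|^2 \le \frac{2\,(\mathcal{L}(\psi_0) - \mathcal{L}^*) + L \sigma^2 \sum_{t=0}^{T-1} \rho_t^2}{\sum_{t=0}^{T-1} \rho_t}$

Under the Robbins-Monro conditions the numerator stays bounded while the denominator diverges, so the right-hand side tends to zero.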

5. Computational and Communication Efficiency

Let $C$ denote the cost of a complete forward and backward pass through both backbone and head layers.

PFLEGO: Each client performs one full forward to cache features, $\tau - 1$ extremely cheap personalization steps (matrix-vector updates on $W_i$), and a single full forward-backward pass for the joint gradients. Cumulative cost per round is approximately $O(C)$, i.e., $O(1)$ model passes per round.
FedAvg / FedPer: These baselines require $\tau$ full network passes per round, resulting in $O(\tau C)$ per-round cost.

Method	Client Compute per Round	Communication (Client→Server)
PFLEGO	$O(C)$ ($O(1)$ passes)	$O(d)$ (gradients only)
FedAvg	$O(\tau C)$	$O(d + K_i M)$ (full weights)
FedPer	$O(\tau C)$	$O(d)$

Communication in all methods includes broadcasting $\theta$ to clients. PFLEGO thus achieves exact, unbiased SGD with substantially lower per-round compute, at a communication cost comparable to or better than the alternatives (Nikoloutsopoulos et al., 2022).

6. Empirical Evaluation

PFLEGO was evaluated across five multi-class benchmarks (MNIST, Fashion-MNIST, EMNIST, CIFAR-10, Omniglot) in three personalization regimes: high-personalization (each client sees only 2 classes), medium-personalization (half the classes), and no-personalization (all classes). The architectures for each dataset use the backbone structures described above.

Baselines for comparison include FedAvg, FedPer, and FedRecon (block-coordinate variant). Metrics include global training loss and final test accuracy averaged over the last 10 rounds.

Dataset	FedPer	FedAvg	PFLEGO
MNIST	97.88±0.25%	97.54±0.25%	98.43±0.21%
CIFAR-10	85.15±1.08%	85.18±0.96%	87.81±0.94%
EMNIST	97.78±0.51%	97.29±0.54%	98.49±0.43%
Fashion-MNIST	96.14±0.35%	96.35±0.47%	96.34±0.43%
Omniglot	68.02±1.74%	49.65±1.74%	74.56±1.23%

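In the high-personalization regime, PFLEGO outperforms or matches the baselines in accuracy across datasets: in the table above it leads on MNIST, CIFAR-10, EMNIST, and Omniglot, and is within 0.01 percentage points of FedAvg on Fashion-MNIST (96.34% vs. 96.35%). Convergence plots show PFLEGO achieves lower training loss in fewer rounds, especially under high heterogeneity. Ablations over the client learning rate $\beta$ and the participation rate $r$ show that higher values of either accelerate convergence.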

In medium-personalization, PFLEGO typically matches or outperforms alternatives. In the no-personalization regime, FedAvg can be optimal, as expected. These outcomes suggest PFLEGO delivers strong empirical performance under non-i.i.d. data splitting and significant class heterogeneity.

7. Summary and Significance

PFLEGO realizes a principled approach to distributed personalization in federated learning, attaining exact, unbiased SGD in a setting with both shared (backbone) and client-specific (head) parameters. The methodology offers:

Exact distributed SGD and provable convergence under standard smoothness and bounded-variance assumptions
Substantially lower computational demands per round: $O(1)$ network passes compared to $O(\tau)$
Efficient communication, with minimal overhead relative to prevailing FL baselines
Empirically validated effectiveness in personalized FL, outperforming or matching standard methods in heterogeneity-heavy regimes

The approach delineated in PFLEGO represents a rigorous advance for federated algorithms capable of accommodating personalized model layers and non-i.i.d. data, reinforcing its relevance for large-scale, realistic FL deployments (Nikoloutsopoulos et al., 2022).

References

Nikoloutsopoulos et al. (2022). Personalized Federated Learning with Exact Stochastic Gradient Descent.
