Exact SGD in Personalized Federated Learning (PFLEGO)
- The paper introduces PFLEGO, an algorithm that performs exact, unbiased SGD using a partial update schedule and layer-specific computation to tackle client heterogeneity.
- It builds on the FedPer architecture, splitting models into a shared backbone and client-specific head to optimize both computational load and communication costs.
- Empirical results on datasets like MNIST and CIFAR-10 demonstrate that PFLEGO outperforms traditional federated methods in high-personalization regimes.
Exact Stochastic Gradient Descent in Personalized Federated Learning with Personalization Layers (PFLEGO) is an algorithmic framework that addresses the challenges of client data heterogeneity in federated learning. Its defining property is performing unbiased, exact SGD over both shared and client-personalized parameters using a partial update schedule and layer-specific computation, resulting in strong convergence guarantees and efficient implementation. PFLEGO is especially tailored to regimes with high personalization demand, where classical FL methods are suboptimal.
1. Federated Optimization Objective
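PFLEGO considers a federated scenario with $I$ clients, each equipped with a local labeled dataset $D_i = \{(x_{ij}, y_{ij})\}_{j=1}^{N_i}$. The per-client loss is
$$\ell_i(\theta, W_i) = \frac{1}{N_i} \sum_{j=1}^{N_i} \ell(y_{ij}, x_{ij}; \theta, W_i),$$
where $\theta$ represents the common (shared) model weights (backbone), and $W_i \in \mathbb{R}^{K_i \times M}$ are the personalized classifier head weights for client $i$ ($M$ is the shared feature dimension and $K_i$ the number of classes for client $i$). Classically, FL attempts to minimize
$$L(\theta, W) = \sum_{i=1}^{I} \alpha_i \, \ell_i(\theta, W_i), \qquad W = (W_1, \dots, W_I),$$
where $\alpha_i = N_i / \sum_{k=1}^{I} N_k$ is the sample-size-based client weight. This structure formalizes the combined global objective, aligning with the non-i.i.d., heterogeneously partitioned data typical of real-world FL settings (Nikoloutsopoulos et al., 2022).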
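The weighted objective can be illustrated with a few lines of Python. This is a minimal sketch, not code from the paper: the helper names `client_weights` and `global_objective`, the dataset sizes, and the per-client loss values are all hypothetical and chosen only to show how sample-size-based weights combine per-client losses.

```python
def client_weights(sizes):
    """Sample-size-based weights: each client's share of the total data."""
    total = sum(sizes)
    return [n / total for n in sizes]


def global_objective(client_losses, sizes):
    """Weighted sum of per-client losses, using the weights above."""
    alphas = client_weights(sizes)
    return sum(a * loss for a, loss in zip(alphas, client_losses))


# Hypothetical example: three clients holding 100, 300 and 600 samples.
sizes = [100, 300, 600]
losses = [0.9, 0.5, 0.2]            # made-up per-client average losses
alphas = client_weights(sizes)      # weights sum to 1
objective = global_objective(losses, sizes)
```

Clients with more data contribute proportionally more to the objective, which is why the largest client dominates the value here.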
2. Network Architecture and Personalization Layers
PFLEGO builds on the FedPer neural network architecture, decomposing the model per client into:
- Shared backbone: A deep feature extractor, parameterized by $\theta$, mapping an input $x$ to a feature vector $\phi(x; \theta) \in \mathbb{R}^M$. For example, a 200-unit MLP backbone for MNIST/Fashion-MNIST/EMNIST, two convolutional and two fully connected layers for CIFAR-10, and four convolutional layers for Omniglot.
- Client-specific head: A single linear layer $W_i \in \mathbb{R}^{K_i \times M}$ mapping backbone features to logits for $K_i$ client-specific classes.
At client $i$, the model computes the logits $W_i \, \phi(x; \theta)$, ensuring a split between universal representation and per-client adaptation.
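The backbone/head split can be sketched in plain Python as follows. This is only an illustration under stated assumptions: the backbone is reduced to a single linear layer with ReLU (the architectures named above are deeper), and the weights, client names, and helper functions (`matvec`, `backbone`, `head`, `softmax`) are hypothetical.

```python
import math


def matvec(matrix, vector):
    """Multiply a list-of-rows matrix by a vector."""
    return [sum(m * v for m, v in zip(row, vector)) for row in matrix]


def backbone(theta, x):
    """Shared feature extractor: one linear layer plus ReLU (an assumption)."""
    return [max(0.0, h) for h in matvec(theta, x)]


def head(W_i, features):
    """Client-specific linear head: maps shared features to that client's logits."""
    return matvec(W_i, features)


def softmax(logits):
    """Numerically stable softmax over a list of logits."""
    shift = max(logits)
    exps = [math.exp(v - shift) for v in logits]
    total = sum(exps)
    return [e / total for e in exps]


# Shared backbone: 3 features from 2 inputs (hypothetical weights).
theta = [[1.0, 0.0], [0.0, 1.0], [1.0, -1.0]]
# Two clients with different numbers of classes (2 and 3).
heads = {
    "client_1": [[1.0, 0.0, 0.0], [0.0, 1.0, 0.0]],
    "client_2": [[1.0, 0.0, 0.0], [0.0, 1.0, 0.0], [0.0, 0.0, 1.0]],
}

x = [2.0, 1.0]
features = backbone(theta, x)       # identical for every client
probs = {name: softmax(head(W, features)) for name, W in heads.items()}
```

The same feature vector feeds every head, while the output dimension differs per client, which is the property that lets clients with different label sets share one backbone.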
3. Optimization and Communication Protocol
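Each federated round $t$ consists of three core phases:
3.1 Client Sampling: The server samples a subset $\mathcal{I}_t$ of $r \le I$ out of the $I$ clients (either binomially, so that each client participates with probability $r/I$ and $r$ is the expected subset size, or as a fixed-size set of exactly $r$ clients).
3.2 Local Personalized Updates (Client Side):
- The server broadcasts $\theta^{(t)}$ to clients in $\mathcal{I}_t$.
- Each client keeps its local personalized head $W_i$.
- For steps $k = 1$ to $\tau - 1$, the client performs gradient descent updates only on $W_i$, with client learning rate $\beta$ and per-client loss $\ell_i$:
$$W_i \leftarrow W_i - \beta \, \nabla_{W_i} \ell_i(\theta^{(t)}, W_i)$$
- At step $\tau$, compute joint gradients:
$$\left( \nabla_{\theta} \ell_i(\theta^{(t)}, W_i), \; \nabla_{W_i} \ell_i(\theta^{(t)}, W_i) \right)$$
then update $W_i$ one final time with unbiased scaling, where $\rho_t$ is the SGD step size of round $t$ and $\alpha_i$ the client weight:
$$W_i \leftarrow W_i - \rho_t \, \frac{I}{r} \, \alpha_i \, \nabla_{W_i} \ell_i(\theta^{(t)}, W_i)$$
and send $\nabla_{\theta} \ell_i(\theta^{(t)}, W_i)$ back to the server.
3.3 Exact Server-Side SGD Aggregation:
- The server aggregates received gradients as
$$g^{(t)} = \frac{I}{r} \sum_{i \in \mathcal{I}_t} \alpha_i \, \nabla_{\theta} \ell_i(\theta^{(t)}, W_i)$$
and updates
$$\theta^{(t+1)} = \theta^{(t)} - \rho_t \, g^{(t)}$$
- The expected update matches the global gradient, ensuring unbiasedness:
$$\mathbb{E}_{\mathcal{I}_t}\left[ g^{(t)} \right] = \sum_{i=1}^{I} \alpha_i \, \nabla_{\theta} \ell_i(\theta^{(t)}, W_i)$$
- The overall process yields an exact unbiased SGD step over the combined parameter vector
$$\psi = (\theta, W_1, \dots, W_I)$$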
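The three phases above can be sketched in plain Python. This is a minimal illustration, not the authors' implementation. It assumes a linear backbone, a scalar regression head with squared-error loss (instead of a classification head), fixed-size client sampling, full-batch local gradients, and equal client weights. All function names (`client_update`, `pflego_round`, `global_loss`) and hyperparameter values are hypothetical.

```python
import random


def dot(a, b):
    return sum(p * q for p, q in zip(a, b))


def features(theta, x):
    """Linear backbone z = theta @ x (theta has M rows of length d)."""
    return [dot(row, x) for row in theta]


def head_gradient(w, zs, ys):
    """Gradient of the mean squared-error loss with respect to the head w."""
    errs = [dot(w, z) - y for z, y in zip(zs, ys)]
    grad = [sum(e * z[m] for e, z in zip(errs, zs)) / len(ys) for m in range(len(w))]
    return grad, errs


def client_update(theta, w, xs, ys, tau, beta):
    """tau - 1 head-only steps on cached features, then one joint gradient."""
    zs = [features(theta, x) for x in xs]            # forward pass, cached once
    for _ in range(tau - 1):
        grad_w, _ = head_gradient(w, zs, ys)
        w = [wm - beta * g for wm, g in zip(w, grad_w)]
    grad_w, errs = head_gradient(w, zs, ys)
    ex = [sum(e * x[k] for e, x in zip(errs, xs)) / len(ys) for k in range(len(xs[0]))]
    grad_theta = [[wm * exk for exk in ex] for wm in w]   # d loss / d theta[m][k]
    return w, grad_w, grad_theta


def pflego_round(theta, heads, data, alphas, r, tau, beta, rho, rng):
    """Sample r clients, update their heads in place, take one SGD step on theta."""
    num_clients = len(heads)
    scale = num_clients / r                           # inverse selection probability
    agg = [[0.0] * len(row) for row in theta]
    for i in rng.sample(range(num_clients), r):
        xs, ys = data[i]
        w, grad_w, grad_theta = client_update(theta, heads[i], xs, ys, tau, beta)
        # Final head step uses the rescaled, weighted gradient.
        heads[i] = [wm - rho * scale * alphas[i] * g for wm, g in zip(w, grad_w)]
        for m, row in enumerate(grad_theta):
            for k, g in enumerate(row):
                agg[m][k] += scale * alphas[i] * g
    return [[t - rho * g for t, g in zip(trow, grow)] for trow, grow in zip(theta, agg)]


def global_loss(theta, heads, data, alphas):
    total = 0.0
    for w, (xs, ys), a in zip(heads, data, alphas):
        errs = [dot(w, features(theta, x)) - y for x, y in zip(xs, ys)]
        total += a * sum(e * e for e in errs) / (2 * len(ys))
    return total


# Synthetic setup (all values hypothetical): 4 clients, 2 inputs, 2 features.
rng = random.Random(0)
true_theta = [[1.0, 0.5], [-0.5, 1.0]]
data = []
for _ in range(4):
    true_w = [rng.uniform(-1, 1), rng.uniform(-1, 1)]
    xs = [[rng.uniform(-1, 1), rng.uniform(-1, 1)] for _ in range(8)]
    ys = [dot(true_w, features(true_theta, x)) for x in xs]
    data.append((xs, ys))
alphas = [0.25] * 4                                   # equal dataset sizes
theta = [[1.0, 0.0], [0.0, 1.0]]
heads = [[0.0, 0.0] for _ in range(4)]

loss_before = global_loss(theta, heads, data, alphas)
for _ in range(200):
    theta = pflego_round(theta, heads, data, alphas, r=2, tau=5, beta=0.1, rho=0.1, rng=rng)
loss_after = global_loss(theta, heads, data, alphas)
```

Caching the backbone features once per round is what makes the head-only steps cheap in this sketch: the backbone is differentiated only once, at the final joint-gradient step.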
4. Theoretical Properties and Convergence
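PFLEGO operates under the assumptions that the objective $L$ is smooth (its gradient is Lipschitz continuous) and bounded below, and that the stochastic gradients $\hat{g}^{(t)}$ with respect to the combined parameters $\psi = (\theta, W_1, \dots, W_I)$ are unbiased with bounded variance.
- Unbiasedness: With $r$ of the $I$ clients sampled per round, the rescaling by $I/r$ neutralizes the selection probability $r/I$, ensuring
$$\mathbb{E}\left[ \hat{g}^{(t)} \right] = \nabla L(\psi^{(t)})$$
- Convergence rate: Under standard Robbins-Monro conditions on the step sizes $\rho_t$ ($\sum_t \rho_t = \infty$, $\sum_t \rho_t^2 < \infty$), the minimum expected squared gradient norm satisfies
$$\min_{1 \le t \le T} \mathbb{E}\left\| \nabla L(\psi^{(t)}) \right\|^2 = O\!\left( \frac{1}{\sum_{t=1}^{T} \rho_t} \right) \to 0 \quad \text{as } T \to \infty$$
- Special cases: For $\tau = 1$ (a single local iteration), PFLEGO is mathematically equivalent to classical SGD. For $\tau > 1$, the initial $\tau - 1$ steps often expedite local $W_i$ minimization, contributing to accelerated convergence in practice.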
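The unbiasedness property can be checked exactly on a toy example by enumerating every possible client subset. The sketch below assumes fixed-size uniform sampling and scalar per-client gradients. The weights, gradient values, and the helper `rescaled_aggregate` are hypothetical and serve only to illustrate the rescaling argument.

```python
from itertools import combinations


def rescaled_aggregate(selected, alphas, grads, num_clients, r):
    """Server estimate: (num_clients / r) * weighted sum over the selected clients."""
    return (num_clients / r) * sum(alphas[i] * grads[i] for i in selected)


alphas = [0.1, 0.2, 0.3, 0.4]        # hypothetical client weights
grads = [2.0, -1.0, 0.5, 3.0]        # hypothetical scalar client gradients
num_clients, r = 4, 2

# Full weighted gradient over all clients.
full_gradient = sum(a * g for a, g in zip(alphas, grads))

# Average the estimate over every equally likely subset of r clients.
subsets = list(combinations(range(num_clients), r))
expected_estimate = sum(
    rescaled_aggregate(s, alphas, grads, num_clients, r) for s in subsets
) / len(subsets)
```

Each client appears in half of the six subsets here, so multiplying by the inverse of that probability makes `expected_estimate` agree with `full_gradient` up to floating-point rounding.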
5. Computational and Communication Efficiency
Let $C$ denote the cost of a complete forward and backward pass through both backbone and head layers, and let $\tau$ be the number of local steps per round.
- PFLEGO: Each client performs one full forward pass to cache features, $\tau - 1$ extremely cheap personalization steps (matrix-vector updates on the head $W_i$), and a single full forward-backward pass for the joint gradients. Cumulative cost per round is therefore bounded by roughly $2C$, i.e., $O(1)$ model passes per round.
- FedAvg / FedPer: These baselines require $\tau$ full network passes per round, resulting in $O(\tau C)$ per-round cost.
| Method | Client Compute per Round | Communication (Client→Server) |
|---|---|---|
| PFLEGO | $\lesssim 2C$ ($O(1)$ full passes) | $\nabla_{\theta} \ell_i$ (gradients only) |
| FedAvg | $\tau C$ | $(\theta, W)$ (full weights) |
| FedPer | $\tau C$ | $\theta$ (backbone weights) |
Communication in all methods includes broadcasting the shared weights $\theta$ to clients. PFLEGO thus achieves exact, unbiased SGD with substantially lower per-round compute and communication comparable or favorable to alternatives (Nikoloutsopoulos et al., 2022).
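The compute comparison can be made concrete with a toy cost model. This is a rough sketch under stated assumptions, not a measurement: costs are in arbitrary units, a forward-only pass is assumed to cost a fixed share of a full forward-backward pass (`forward_share`, a hypothetical parameter), and a head-only step is assumed to cost a small fraction of a full pass.

```python
def pflego_client_cost(tau, full_pass, head_step, forward_share=0.5):
    """One cached forward pass, tau - 1 head-only steps, one forward-backward pass."""
    return forward_share * full_pass + (tau - 1) * head_step + full_pass


def baseline_client_cost(tau, full_pass):
    """FedAvg / FedPer style: tau full forward-backward passes."""
    return tau * full_pass


# Hypothetical costs in arbitrary units.
full_pass, head_step = 1.0, 0.001
for tau in (1, 10, 50):
    print(tau,
          pflego_client_cost(tau, full_pass, head_step),
          baseline_client_cost(tau, full_pass))
```

Under these assumptions the baseline cost grows linearly with the number of local steps, while the PFLEGO-style cost grows only through the cheap head-only term. For a single local step the extra cached forward pass makes the PFLEGO-style cost slightly higher in this model.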
6. Empirical Evaluation
PFLEGO was evaluated across five multi-class benchmarks (MNIST, Fashion-MNIST, EMNIST, CIFAR-10, Omniglot) in three personalization regimes: high-personalization (each client sees only 2 classes), medium-personalization (half the classes), and no-personalization (all classes). The architectures for each dataset use the backbone structures described above.
Baselines for comparison include FedAvg, FedPer, and FedRecon (block-coordinate variant). Metrics include global training loss and final test accuracy averaged over the last 10 rounds.
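The final-accuracy metric described above can be computed as in the following sketch. The helper name `final_accuracy` and the accuracy curve are hypothetical. The paper's evaluation code may differ, for example in how the results are further averaged across clients or runs.

```python
from statistics import mean


def final_accuracy(per_round_accuracy, window=10):
    """Mean test accuracy over the last `window` rounds."""
    if len(per_round_accuracy) < window:
        raise ValueError("need at least `window` rounds of results")
    return mean(per_round_accuracy[-window:])


# Hypothetical accuracy curve over 15 rounds (made-up values).
curve = [0.50, 0.62, 0.70, 0.76, 0.80,
         0.90, 0.91, 0.92, 0.93, 0.94,
         0.95, 0.96, 0.97, 0.98, 0.99]
score = final_accuracy(curve)
```

Averaging over a trailing window smooths out round-to-round fluctuations caused by client sampling.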
| Dataset | FedPer | FedAvg | PFLEGO |
|---|---|---|---|
| MNIST | 97.88±0.25% | 97.54±0.25% | 98.43±0.21% |
| CIFAR-10 | 85.15±1.08% | 85.18±0.96% | 87.81±0.94% |
| EMNIST | 97.78±0.51% | 97.29±0.54% | 98.49±0.43% |
| Fashion-MNIST | 96.14±0.35% | 96.35±0.47% | 96.34±0.43% |
| Omniglot | 68.02±1.74% | 49.65±1.74% | 74.56±1.23% |
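In high-personalization, PFLEGO outperforms the baselines in accuracy on most datasets: in the table above it is highest on MNIST, CIFAR-10, EMNIST, and Omniglot, and within the reported standard deviation of FedAvg on Fashion-MNIST (96.34% vs. 96.35%). Convergence plots show PFLEGO achieves lower training loss in fewer rounds, especially under high heterogeneity. Rate ablations reveal that higher client learning rates (β) and participation rates (r) accelerate convergence.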
In medium-personalization, PFLEGO typically matches or outperforms alternatives. In the no-personalization regime, FedAvg can be optimal, as expected. These outcomes suggest PFLEGO delivers strong empirical performance under non-i.i.d. data splitting and significant class heterogeneity.
7. Summary and Significance
PFLEGO realizes a principled approach to distributed personalization in federated learning, attaining exact, unbiased SGD in a setting with both shared (backbone) and client-specific (head) parameters. The methodology offers:
- Exact distributed SGD and provable convergence under standard smoothness and bounded-variance assumptions
- Substantially lower computational demands per round: $O(1)$ full network passes compared to $O(\tau)$ for baselines that run $\tau$ local steps
- Efficient communication, with minimal overhead relative to prevailing FL baselines
- Empirically validated effectiveness in personalized FL, outperforming or matching standard methods in heterogeneity-heavy regimes
The approach delineated in PFLEGO represents a rigorous advance for federated algorithms capable of accommodating personalized model layers and non-i.i.d. data, reinforcing its relevance for large-scale, realistic FL deployments (Nikoloutsopoulos et al., 2022).