SFL-V2: Split Federated Learning Variant
- SFL-V2 is an architectural paradigm that merges split learning and federated learning by using a common server-side subnetwork while aggregating client-side weights.
- Empirical evaluations show that an early cut layer (L_c = 1) often achieves higher accuracy, especially under non-IID data, outperforming traditional FedAvg.
- The design involves sequential server updates with smashed activations and client-side local SGD, balancing accuracy gains against increased communication and privacy trade-offs.
Split Federated Learning variant 2 (SFL-V2) is an architectural and algorithmic paradigm in distributed machine learning that merges elements of split learning and federated learning. SFL-V2 specifically refers to the variant where the central server holds and updates a single, shared server-side subnetwork for all clients, distinguishing it from SFL-V1 where each client maintains a dedicated server-side model instance. The choice of cut layer—the network depth at which the split between client and server occurs—has a pronounced effect on both model performance and system behavior, especially under data heterogeneity. The following sections elucidate SFL-V2's design, mathematical formulation, empirical characteristics, and trade-offs, referencing the quantitative study and technical analysis in "The Impact of Cut Layer Selection in Split Federated Learning" (Dachille et al., 2024).
1. Architectural Principles of SFL-V2
A global deep network is partitioned at a specific cut layer L_c, yielding:
- Client-side subnetworks W_c^k: layers resident on each client k.
- Server-side shared subnetwork W_s: layers residing on the central Training Server (TS).
SFL-V2 employs two conceptual servers:
- Model Synchronization Server (MSS): Aggregates only the client-side weights via FedAvg-style averaging, with no server-side model aggregation.
- Training Server (TS): Maintains and sequentially updates a single, global W_s based on activations collected from all clients.
The round-based workflow comprises:
- Clients compute activations using W_c^k and transmit the smashed data (activations and targets) to the TS.
- The TS, for a random permutation of clients, conducts forward passes through W_s, computes losses and gradients, updates W_s in sequence, and returns smashed-data gradients to the respective clients.
- Clients backpropagate through W_c^k, perform local SGD over E epochs, and forward updated weights to the MSS for aggregation.
- The aggregated W_c is synchronized across all clients for the next round, while W_s is carried over unaveraged.
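As a concrete illustration, the round-based workflow above can be sketched with toy linear subnetworks and squared-error loss. All shapes, learning rates, and helper names here are hypothetical stand-ins for the paper's actual models, not its implementation:

```python
import numpy as np

rng = np.random.default_rng(0)

# Toy setup: K clients, a linear client-side layer W_c and a single shared
# linear server-side layer W_s; squared-error loss. Illustrative only.
K, d_in, d_mid, d_out, lr = 3, 8, 4, 1, 0.1
W_c = [rng.normal(size=(d_in, d_mid)) for _ in range(K)]  # per-client copies
W_s = rng.normal(size=(d_mid, d_out))                     # one shared server model
data = [(rng.normal(size=(16, d_in)), rng.normal(size=(16, d_out)))
        for _ in range(K)]
p = [len(X) / sum(len(X) for X, _ in data) for X, _ in data]  # data proportions

def sfl_v2_round(W_c, W_s):
    grads_to_clients = {}
    # 1) Each client sends smashed activations; the TS updates W_s sequentially
    #    over a random permutation of clients.
    for k in rng.permutation(K):
        X, Y = data[k]
        A = X @ W_c[k]                     # client forward (smashed data)
        err = A @ W_s - Y                  # server forward; dL/d(logits)
        grads_to_clients[k] = err @ W_s.T  # gradient w.r.t. activations, returned
        W_s -= lr * (A.T @ err) / len(X)   # sequential server-side SGD step
    # 2) Clients backpropagate the returned gradients through W_c^k.
    for k in range(K):
        X, _ = data[k]
        W_c[k] -= lr * (X.T @ grads_to_clients[k]) / len(X)
    # 3) The MSS aggregates ONLY client-side weights (FedAvg-style average).
    W_avg = sum(pk * Wk for pk, Wk in zip(p, W_c))
    return [W_avg.copy() for _ in range(K)], W_s  # W_s carried over unaveraged

for _ in range(5):
    W_c, W_s = sfl_v2_round(W_c, W_s)
print(np.allclose(W_c[0], W_c[1]))  # clients synchronized after aggregation
```

Note how the server-side model is mutated in place across the client permutation and never averaged, while the client-side copies are overwritten with the aggregate each round.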
2. Mathematical and Algorithmic Foundations
The SFL-V2 objective is

min_W F(W) = Σ_{k=1}^{K} p_k F_k(W),  with  F_k(W) = E_{x ~ D_k}[ ℓ(W; x) ],

where p_k = n_k / n expresses client k's data proportion and D_k is its local data distribution.
The iterative protocol is:
- Client forward: A_k = f(W_c^k; X_k); the pair (A_k, Y_k) is sent to the TS.
- Server update (permuted sequence): for each client k in a random permutation, W_s ← W_s − η ∇_{W_s} ℓ(W_s; A_k, Y_k), and the smashed-data gradient ∇_{A_k} ℓ is returned to client k.
- Client backward: W_c^k ← W_c^k − η ∇_{W_c^k} ℓ, repeated for E local epochs.
- Aggregation: W_c ← Σ_k p_k W_c^k, synchronized to all clients k. The server-side W_s is not averaged across clients.
This mechanism ensures that only client-side weights are aggregated, while the shared W_s accumulates updates from the sequence of all client activations each round.
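A small numeric example of the aggregation step, W_c ← Σ_k p_k W_c^k (client counts and weight values are hypothetical):

```python
import numpy as np

# Three clients with 100, 300, and 600 samples -> p = [0.1, 0.3, 0.6];
# each holds a 2x2 client-side weight matrix after its local epochs.
p = np.array([0.1, 0.3, 0.6])
W = np.array([[[1.0, 0.0], [0.0, 1.0]],
              [[2.0, 2.0], [2.0, 2.0]],
              [[0.0, 1.0], [1.0, 0.0]]])
W_agg = np.tensordot(p, W, axes=1)  # sum_k p_k * W_c^k
print(W_agg)  # -> [[0.7 1.2]
              #     [1.2 0.7]]
```

The weighted average is what the MSS broadcasts back to every client; no analogous step is applied to the server-side weights.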
3. Convergence Properties and Theoretical Status
While the paper derives a full convergence bound for SFL-V1, demonstrating that it is invariant to the choice of L_c, there is no explicit convergence-rate lemma or proof for SFL-V2. The authors note:
"A convergence proof for SFL-V2 would require a significantly different approach and is left to future work." (Dachille et al., 2024)
Empirical results indicate that both convergence speed and final test accuracy of SFL-V2 are highly sensitive to the cut layer L_c, in sharp contrast to SFL-V1, but no theoretical characterization of this dependency currently exists.
A plausible implication is the need for deeper theoretical understanding of the optimization landscape induced by the sequential update and persistent server subnetwork in SFL-V2, particularly as L_c approaches the input or output extremes.
4. Experimental Methodology and Performance Analysis
Experiments are conducted on four image classification datasets (HAM10000, CIFAR-10, CIFAR-100, TinyImageNet) and two model architectures (ResNet-18 and ResNet-50), with a fixed client count for most tasks and a fixed communication-round budget. Cut layers are chosen immediately after the L_c-th macro-residual block (L_c ∈ {1, 2, 3, 4}).
Data heterogeneity is evaluated under both IID and non-IID (Dirichlet-partitioned) settings; HAM10000 is inherently imbalanced and is therefore evaluated only in the non-IID setting.
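Dirichlet-based non-IID partitions of this kind are commonly generated by drawing per-client proportions for each class; the sketch below is an illustrative helper (the concentration parameter used in the paper's experiments is not reproduced here):

```python
import numpy as np

rng = np.random.default_rng(0)

def dirichlet_partition(labels, num_clients, alpha):
    """Split sample indices across clients with Dirichlet class proportions.

    Smaller alpha -> more skewed (more non-IID) label distributions.
    Illustrative helper, not the paper's exact partitioning code.
    """
    clients = [[] for _ in range(num_clients)]
    for c in np.unique(labels):
        idx = np.flatnonzero(labels == c)
        rng.shuffle(idx)
        # Per-client share of this class, then slice the indices accordingly.
        props = rng.dirichlet([alpha] * num_clients)
        cuts = (np.cumsum(props)[:-1] * len(idx)).astype(int)
        for client, part in zip(clients, np.split(idx, cuts)):
            client.extend(part.tolist())
    return clients

labels = rng.integers(0, 10, size=1000)  # mock 10-class label vector
parts = dirichlet_partition(labels, num_clients=5, alpha=0.5)
print(sum(len(part) for part in parts))  # -> 1000 (every sample assigned once)
```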
Hyperparameters are carefully controlled: SFL-V2 is trained with Adam (learning rate 0.001, batch size 64), while the FedAvg baseline uses SGD (lr 0.01), to focus on architectural rather than optimizer effects.
The following tables summarize test accuracy means and standard deviations as a function of cut layer:
| Dataset | L_c = 1 | L_c = 2 | L_c = 3 | L_c = 4 |
|---|---|---|---|---|
| CIFAR-10 (IID) | 92.30% ± 0.15 | 89.56% ± 0.15 | 87.57% ± 0.34 | 86.11% ± 0.36 |
| CIFAR-100 (IID) | 56.00% ± 0.69 | 45.93% ± 0.21 | 41.29% ± 0.26 | 45.98% ± 0.24 |
| TinyImageNet (IID) | 54.77% ± 0.22 | 50.86% ± 0.05 | 44.15% ± 0.11 | 43.85% ± 0.25 |
| HAM10000 (non-IID) | 80.58% ± 0.35 | 79.56% ± 0.88 | 79.26% ± 0.54 | 78.27% ± 0.65 |
| CIFAR-10 (non-IID) | 67.58% ± 6.28 | 59.98% ± 11.99 | 61.73% ± 4.04 | 69.45% ± 0.64 |
| CIFAR-100 (non-IID) | 52.38% ± 0.97 | 45.90% ± 1.08 | 42.42% ± 0.99 | 43.31% ± 1.17 |
| TinyImageNet (non-IID) | 30.14% ± 8.58 | 22.55% ± 2.28 | 27.08% ± 1.48 | 26.42% ± 1.52 |
The key empirical findings are:
- For three of the four tasks, the earliest possible split (L_c = 1) achieves the highest accuracy.
- On CIFAR-10 (non-IID), L_c = 4 marginally outperforms L_c = 1; this is ascribed to the low class count and shallow feature complexity.
- As L_c decreases, the solution approaches that of centralized learning (more computation is centralized in W_s), while large L_c trends toward FedAvg's behavior.
5. Comparative Analysis: SFL-V2 Versus FedAvg in Heterogeneous Regimes
Under non-IID conditions, SFL-V2 at its empirically optimal cut layer (L_c = 1) is compared to standard FedAvg:
| Dataset | FedAvg (%) | SFL-V2 (L_c = 1) (%) | Accuracy Gain (pp) |
|---|---|---|---|
| HAM10000 | 77.37 ± 0.35 | 80.58 ± 0.35 | +3.21 |
| CIFAR-10 | 67.59 ± 2.52 | 67.58 ± 6.28 | −0.01 |
| CIFAR-100 | 42.60 ± 1.18 | 52.38 ± 0.97 | +9.78 |
| TinyImageNet | 28.33 ± 0.28 | 30.14 ± 8.58 | +1.81 |
SFL-V2 demonstrates substantial improvements over FedAvg in high-heterogeneity settings (notably, CIFAR-100 gains nearly 10 percentage points). This behavior is attributed to the centralized server-side representation learning, where incorporates activation information from all clients, directly mitigating client drift that arises in Federated Averaging.
A plausible implication is that SFL-V2, with strategic cut layer placement, leverages shared representation power more efficiently than per-client independent learning, especially in highly skewed data regimes.
6. Trade-offs, Practical Considerations, and Open Questions
Early splits (small L_c) concentrate learning in W_s, yielding superior accuracy but imposing larger per-round communication loads (the transmitted smashed-activation tensors are larger) and increasing computation on the TS. This also increases privacy exposure, as more raw feature information is centralized. Conversely, deeper cuts reduce these demands but degrade performance toward that of pure FedAvg.
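To make the communication trade-off concrete, one can estimate per-sample smashed-activation sizes for a CIFAR-style ResNet-18. The stage output shapes below assume the standard CIFAR adaptation (3x3 stem, no max-pool) on 32x32 inputs; treat the figures as illustrative, not the paper's measured traffic:

```python
# Per-sample activation size at each candidate cut, assuming float32 tensors.
# Shapes are the usual CIFAR-ResNet-18 stage outputs for 32x32 inputs.
shapes = {
    1: (64, 32, 32),   # after residual block group 1
    2: (128, 16, 16),  # after group 2
    3: (256, 8, 8),    # after group 3
    4: (512, 4, 4),    # after group 4
}
for cut, (c, h, w) in shapes.items():
    kib = c * h * w * 4 / 1024  # float32 = 4 bytes; convert to KiB
    print(f"L_c = {cut}: {c}x{h}x{w} -> {kib:.0f} KiB/sample")
# L_c = 1 ships 256 KiB/sample versus 32 KiB at L_c = 4: an 8x gap.
```

Under these assumptions, each step deeper in the backbone halves spatial extent faster than it doubles channels, so the earliest cut is the most communication-hungry, which matches the trade-off described above.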
While empirical trends are decisive, the absence of a formal convergence guarantee for SFL-V2 remains unresolved. Practical deployment thus must weigh empirical gains, system scalability, communication-to-computation balance, and privacy policies according to the task's operating environment.
A further implication is that the optimal positioning of L_c in real-world systems may be dataset-dependent and sensitive to the degree of data heterogeneity, client resource profiles, and regulatory or risk constraints regarding feature visibility by the TS.
Empirical and architectural details referenced herein derive from "The Impact of Cut Layer Selection in Split Federated Learning" (Dachille et al., 2024).