SFL-V2: Split Federated Learning Variant

Updated 17 March 2026
  • SFL-V2 is an architectural paradigm that merges split learning and federated learning by using a common server-side subnetwork while aggregating client-side weights.
  • Empirical evaluations show that an early cut layer ($L_c=1$) often achieves higher accuracy, especially under non-IID data, outperforming traditional FedAvg.
  • The design involves sequential server updates with smashed activations and client-side local SGD, balancing accuracy gains against increased communication and privacy trade-offs.

Split Federated Learning variant 2 (SFL-V2) is an architectural and algorithmic paradigm in distributed machine learning that merges elements of split learning and federated learning. SFL-V2 specifically refers to the variant where the central server holds and updates a single, shared server-side subnetwork for all clients, distinguishing it from SFL-V1 where each client maintains a dedicated server-side model instance. The choice of cut layer—the network depth at which the split between client and server occurs—has a pronounced effect on both model performance and system behavior, especially under data heterogeneity. The following sections elucidate SFL-V2's design, mathematical formulation, empirical characteristics, and trade-offs, referencing the quantitative study and technical analysis in "The Impact of Cut Layer Selection in Split Federated Learning" (Dachille et al., 2024).

1. Architectural Principles of SFL-V2

A global deep network is partitioned at a specific cut layer $L_c$, yielding:

  • Client-side subnetworks $f_c^k(\cdot\,; w_c^k)$: layers $1,\ldots,L_c$, resident on each client $k$.
  • Server-side shared subnetwork $f_s(\cdot\,; w_s)$: layers $L_c+1,\ldots,L$, residing on the central Training Server (TS).

SFL-V2 employs two conceptual servers:

  • Model Synchronization Server (MSS): Aggregates only the client-side weights via FedAvg-style averaging, with no server-side model aggregation.
  • Training Server (TS): Maintains and sequentially updates a single, global $w_s$ based on activations collected from all clients.

The round-based workflow comprises:

  1. Clients compute activations using $f_c^k$ and transmit $(a_k^t, y_k^t)$ (activations and targets) to the TS.
  2. The TS, for a random permutation $\pi$ of clients, conducts forward passes through $f_s$, computes losses and gradients, updates $w_s$ in sequence, and returns smashed-data gradients $\delta_k^t$ to the respective clients.
  3. Clients backpropagate $\delta_k^t$ through $f_c^k$, perform local SGD over $E$ epochs, and forward updated weights to the MSS for aggregation.
  4. The aggregated $w_c^{t+1}$ is synchronized across all clients for the next round, while $w_s$ is carried over unaveraged.
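
The round structure above can be sketched as a toy simulation with linear client subnetworks, a single linear shared server head, and squared-error loss. All variable names, dimensions, and the single-local-step simplification are illustrative choices, not details from the paper:

```python
import numpy as np

rng = np.random.default_rng(0)
K, d_in, d_cut, d_out, eta = 3, 8, 4, 1, 0.05

# One client-side weight matrix per client, plus the single shared server head.
W_c = [rng.normal(size=(d_in, d_cut)) for _ in range(K)]
W_s = rng.normal(size=(d_cut, d_out))

# Toy local datasets (hypothetical, for illustration only).
data = [(rng.normal(size=(16, d_in)), rng.normal(size=(16, d_out)))
        for _ in range(K)]

def sflv2_round(W_c, W_s):
    # Step 1: clients compute smashed activations a_k and send (a_k, y_k) to the TS.
    smashed = [(x @ W_c[k], y) for k, (x, y) in enumerate(data)]
    deltas = [None] * K
    # Step 2: the TS visits clients in a random permutation, updating the
    # single shared W_s sequentially and recording smashed-data gradients.
    for k in rng.permutation(K):
        a, y = smashed[k]
        err = (a @ W_s - y) / len(a)       # gradient of 0.5*MSE at the head
        deltas[k] = err @ W_s.T            # delta_k, returned to client k
        W_s = W_s - eta * (a.T @ err)      # sequential server-side SGD step
    # Step 3: clients backpropagate delta_k through their own layer
    # (a single local step stands in for E local epochs).
    for k, (x, _) in enumerate(data):
        W_c[k] = W_c[k] - eta * (x.T @ deltas[k])
    # Step 4: the MSS averages only client-side weights (uniform alpha_k here)
    # and broadcasts the result; W_s is carried over unaveraged.
    w_avg = sum(W_c) / K
    return [w_avg.copy() for _ in range(K)], W_s

W_c, W_s = sflv2_round(W_c, W_s)
```

Note that the server head sees every client's activations within one round, which is the mechanism the paper credits for mitigating client drift.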

2. Mathematical and Algorithmic Foundations

The SFL-V2 objective is

$$\min_{\{w_c^k\},\, w_s}\; f\bigl(\{w_c^k\}, w_s\bigr) = \sum_{k=1}^K \alpha_k\, \mathbb{E}_{(x,y)\sim \mathcal{D}_k}\Bigl[\ell\bigl(f_s\bigl(f_c^k(x; w_c^k); w_s\bigr),\, y\bigr)\Bigr],$$

where $\alpha_k = D_k / \sum_j D_j$ is the data proportion of client $k$ and $\mathcal{D}_k$ is its local distribution.

The iterative protocol is:

  • Client forward: $a_k^t = f_c^k(x_k^t; w_c^{k,t})$, $\hat y_k^t = f_s(a_k^t; w_s^t)$.
  • Server update (permuted sequence):

    $$g_s^k = \nabla_{w_s}\, \ell(\hat y_k^t, y_k^t), \qquad w_s^{t} \leftarrow w_s^{t} - \eta\, g_s^k,$$

    and $\delta_k^t = \nabla_{a_k}\, \ell(\hat y_k^t, y_k^t)$.

  • Client backward:

    $$g_c^k = (\delta_k^t)^\top \nabla_{w_c^k} f_c^k(x_k^t; w_c^{k,t}), \qquad w_c^{k,t} \leftarrow w_c^{k,t} - \eta\, g_c^k$$

    for $E$ epochs.

  • Aggregation:

    $$w_c^{t+1} = \sum_{k=1}^K \alpha_k\, w_c^{k,t}, \qquad w_c^{k,t+1} \leftarrow w_c^{t+1}$$

    for all $k$. The server-side $w_s$ is not averaged across clients.

This mechanism ensures that only client-side weights are aggregated, while the shared $w_s$ accumulates updates from the sequence of all client activations in each round.
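
A minimal numeric check of the aggregation step, with made-up dataset sizes $D_k$ to show how $\alpha_k = D_k / \sum_j D_j$ weights the average before the result is broadcast back to every client:

```python
import numpy as np

D = np.array([100, 300, 600])            # local dataset sizes D_k (assumed)
alpha = D / D.sum()                      # alpha_k = D_k / sum_j D_j

# Per-client client-side weights w_c^{k,t}, filled with distinct constants.
W = [np.full((2, 2), v) for v in (1.0, 2.0, 4.0)]

# w_c^{t+1} = sum_k alpha_k * w_c^{k,t}; every client then receives a copy.
W_next = sum(a * w for a, w in zip(alpha, W))
W_clients = [W_next.copy() for _ in D]

print(W_next[0, 0])   # 0.1*1 + 0.3*2 + 0.6*4 = 3.1
```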

3. Convergence Properties and Theoretical Status

While the paper derives a full convergence bound for SFL-V1, demonstrating that it is invariant to the choice of $L_c$, there is no explicit convergence-rate lemma or proof for SFL-V2. The authors note:

"A convergence proof for SFL-V2 would require a significantly different approach and is left to future work." (Dachille et al., 2024)

Empirical results indicate that both the convergence speed and final test accuracy of SFL-V2 are highly sensitive to the cut layer $L_c$, in sharp contrast to SFL-V1, but no theoretical characterization of this dependency currently exists.

A plausible implication is the need for deeper theoretical understanding of the optimization landscape induced by the sequential update and persistent server subnetwork in SFL-V2, particularly as $L_c$ approaches the input or output extremes.

4. Experimental Methodology and Performance Analysis

Experiments are conducted on four image classification datasets (HAM10000, CIFAR-10, CIFAR-100, TinyImageNet) and two model architectures (ResNet-18 and ResNet-50), with $K=100$ clients for most tasks and up to $T=300$ communication rounds. Cut layers $L_c$ are placed immediately after the $i$-th macro-residual block ($i=1,2,3,4$).

Data heterogeneity is evaluated under both IID ($\mu=\infty$) and non-IID (Dirichlet $\mu=0.1$) settings; HAM10000 is inherently class-imbalanced and is therefore evaluated only in the non-IID setting.

Hyperparameters are carefully controlled: SFL-V2 is trained with Adam (learning rate 0.001, batch size 64, $E=5$), while the FedAvg baseline uses SGD (learning rate 0.01), to isolate architectural rather than optimizer effects.
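
For reference, the reported settings can be collected into a small configuration sketch; the dictionary layout and key names are my own, while the values are those stated above:

```python
# Training configuration from the reported setup; structure is illustrative.
sflv2_config = {
    "optimizer": "Adam",
    "learning_rate": 1e-3,
    "batch_size": 64,
    "local_epochs": 5,    # E, local SGD epochs per round
}

fedavg_baseline = {
    "optimizer": "SGD",
    "learning_rate": 1e-2,
}

rounds = 300              # T, maximum communication rounds
clients = 100             # K, for most tasks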

The following tables summarize test accuracy means and standard deviations as a function of cut layer:

| Dataset | $L_c=1$ | $L_c=2$ | $L_c=3$ | $L_c=4$ |
|---|---|---|---|---|
| CIFAR-10 (IID) | 92.30% ± 0.15 | 89.56% ± 0.15 | 87.57% ± 0.34 | 86.11% ± 0.36 |
| CIFAR-100 (IID) | 56.00% ± 0.69 | 45.93% ± 0.21 | 41.29% ± 0.26 | 45.98% ± 0.24 |
| TinyImageNet (IID) | 54.77% ± 0.22 | 50.86% ± 0.05 | 44.15% ± 0.11 | 43.85% ± 0.25 |
| HAM10000 (non-IID) | 80.58% ± 0.35 | 79.56% ± 0.88 | 79.26% ± 0.54 | 78.27% ± 0.65 |
| CIFAR-10 (non-IID) | 67.58% ± 6.28 | 59.98% ± 11.99 | 61.73% ± 4.04 | 69.45% ± 0.64 |
| CIFAR-100 (non-IID) | 52.38% ± 0.97 | 45.90% ± 1.08 | 42.42% ± 0.99 | 43.31% ± 1.17 |
| TinyImageNet (non-IID) | 30.14% ± 8.58 | 22.55% ± 2.28 | 27.08% ± 1.48 | 26.42% ± 1.52 |

The key empirical findings are:

  • For three of four tasks, the earliest possible split ($L_c=1$) achieves the highest accuracy.
  • On CIFAR-10 (non-IID), $L_c=4$ marginally outperforms $L_c=1$; this is ascribed to the low class count and shallow feature complexity.
  • As $L_c$ decreases, the solution approaches that of centralized learning (more computation is centralized in $w_s$), while large $L_c$ trends toward FedAvg's behavior.

5. Comparative Analysis: SFL-V2 Versus FedAvg in Heterogeneous Regimes

Under non-IID conditions, SFL-V2 at its empirically optimal cut layer ($L_c=1$) is compared to standard FedAvg:

| Dataset | FedAvg (%) | SFL-V2 ($L_c=1$) (%) | Accuracy gain ($\Delta$, pp) |
|---|---|---|---|
| HAM10000 | 77.37 ± 0.35 | 80.58 ± 0.35 | +3.21 |
| CIFAR-10 | 67.59 ± 2.52 | 67.58 ± 6.28 | $\approx 0$ |
| CIFAR-100 | 42.60 ± 1.18 | 52.38 ± 0.97 | +9.78 |
| TinyImageNet | 28.33 ± 0.28 | 30.14 ± 8.58 | +1.81 |

SFL-V2 demonstrates substantial improvements over FedAvg in high-heterogeneity settings (notably, CIFAR-100 gains nearly 10 percentage points). This behavior is attributed to the centralized server-side representation learning, where $w_s$ incorporates activation information from all clients, directly mitigating the client drift that arises in federated averaging.

A plausible implication is that SFL-V2, with strategic cut layer placement, leverages shared representation power more efficiently than per-client independent learning, especially in highly skewed data regimes.

6. Trade-offs, Practical Considerations, and Open Questions

Early splits (small $L_c$) concentrate learning in $w_s$, yielding superior accuracy but imposing larger per-round communication loads (larger "smashed activation" tensors must be transferred) and increasing computation on the TS. They also increase privacy exposure, since more information about raw features is centralized at the server. Conversely, deeper cuts reduce these demands but degrade performance toward that of pure FedAvg.
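
A back-of-envelope way to see this communication asymmetry is to estimate the uplink bytes of smashed activations per round. The activation shapes below are assumed, ResNet-18-style values for 32×32 inputs; they are illustrative, not figures from the paper:

```python
def smashed_bytes(act_shape, batch_size, batches_per_client, clients,
                  bytes_per_elem=4):
    """Total activation bytes uploaded by all clients in one round (float32)."""
    elems = batch_size
    for d in act_shape:
        elems *= d
    return elems * batches_per_client * clients * bytes_per_elem

# Assumed (C, H, W) activation shapes after each macro-residual block.
cut_shapes = {1: (64, 32, 32), 2: (128, 16, 16), 3: (256, 8, 8), 4: (512, 4, 4)}

for lc, shape in cut_shapes.items():
    gb = smashed_bytes(shape, batch_size=64, batches_per_client=8,
                       clients=100) / 1e9
    print(f"L_c={lc}: ~{gb:.2f} GB per round")
```

Under these assumptions, each deeper cut roughly halves the per-round activation traffic, which is the communication side of the accuracy trade-off discussed above.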

While empirical trends are decisive, the absence of a formal convergence guarantee for SFL-V2 remains unresolved. Practical deployment thus must weigh empirical gains, system scalability, communication-to-computation balance, and privacy policies according to the task's operating environment.

A further implication is that the optimal placement of $L_c$ in real-world systems may be dataset-dependent and sensitive to the degree of data heterogeneity, client resource profiles, and regulatory or risk constraints regarding feature visibility at the TS.


Empirical and architectural details referenced herein derive from "The Impact of Cut Layer Selection in Split Federated Learning" (Dachille et al., 2024).
