SFL-V2: Split Federated Learning Variant

Updated 17 March 2026
  • SFL-V2 is an architectural paradigm that merges split learning and federated learning by using a common server-side subnetwork while aggregating client-side weights.
  • Empirical evaluations show that an early cut layer ($L_c=1$) often achieves higher accuracy, especially under non-IID data, outperforming traditional FedAvg.
  • The design involves sequential server updates with smashed activations and client-side local SGD, balancing accuracy gains against increased communication and privacy trade-offs.

Split Federated Learning variant 2 (SFL-V2) is an architectural and algorithmic paradigm in distributed machine learning that merges elements of split learning and federated learning. SFL-V2 specifically refers to the variant where the central server holds and updates a single, shared server-side subnetwork for all clients, distinguishing it from SFL-V1 where each client maintains a dedicated server-side model instance. The choice of cut layer—the network depth at which the split between client and server occurs—has a pronounced effect on both model performance and system behavior, especially under data heterogeneity. The following sections elucidate SFL-V2's design, mathematical formulation, empirical characteristics, and trade-offs, referencing the quantitative study and technical analysis in "The Impact of Cut Layer Selection in Split Federated Learning" (Dachille et al., 2024).

1. Architectural Principles of SFL-V2

A global deep network is partitioned at a specific cut layer $L_c$, yielding:

  • Client-side subnetworks $f_c^k(\cdot\,; w_c^k)$: layers $1,\ldots,L_c$, resident on each client $k$.
  • Server-side shared subnetwork $f_s(\cdot\,; w_s)$: layers $L_c+1,\ldots,L$, residing on the central Training Server (TS).

SFL-V2 employs two conceptual servers:

  • Model Synchronization Server (MSS): Aggregates only the client-side weights via FedAvg-style averaging, with no server-side model aggregation.
  • Training Server (TS): Maintains and sequentially updates a single, global $w_s$ based on activations collected from all clients.

The round-based workflow comprises:

  1. Clients compute activations using $f_c^k$ and transmit $(a_k^t, y_k^t)$ (activations and targets) to the TS.
  2. The TS, for a random permutation $\pi$ of clients, conducts forward passes through $f_s$, computes losses and gradients, updates $w_s$ in sequence, and returns smashed-data gradients $\delta_k^t$ to the respective clients.
  3. Clients backpropagate $\delta_k^t$ through $f_c^k$, perform local SGD over $E$ epochs, and forward updated weights to the MSS for aggregation.
  4. The aggregated $w_c^{t+1}$ is synchronized across all clients for the next round, while $w_s$ is carried over unaveraged.
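
The round structure above can be sketched as a toy simulation with linear client subnetworks, a single linear shared server head, and squared-error loss. All variable names, dimensions, and the single-local-step simplification are illustrative choices, not details from the paper:

```python
import numpy as np

rng = np.random.default_rng(0)
K, d_in, d_cut, d_out, eta = 3, 8, 4, 1, 0.05

# One client-side weight matrix per client, plus the single shared server head.
W_c = [rng.normal(size=(d_in, d_cut)) for _ in range(K)]
W_s = rng.normal(size=(d_cut, d_out))

# Toy local datasets (hypothetical, for illustration only).
data = [(rng.normal(size=(16, d_in)), rng.normal(size=(16, d_out)))
        for _ in range(K)]

def sflv2_round(W_c, W_s):
    # Step 1: clients compute smashed activations a_k and send (a_k, y_k) to the TS.
    smashed = [(x @ W_c[k], y) for k, (x, y) in enumerate(data)]
    deltas = [None] * K
    # Step 2: the TS visits clients in a random permutation, updating the
    # single shared W_s sequentially and recording smashed-data gradients.
    for k in rng.permutation(K):
        a, y = smashed[k]
        err = (a @ W_s - y) / len(a)       # gradient of 0.5*MSE at the head
        deltas[k] = err @ W_s.T            # delta_k, returned to client k
        W_s = W_s - eta * (a.T @ err)      # sequential server-side SGD step
    # Step 3: clients backpropagate delta_k through their own layer
    # (a single local step stands in for E local epochs).
    for k, (x, _) in enumerate(data):
        W_c[k] = W_c[k] - eta * (x.T @ deltas[k])
    # Step 4: the MSS averages only client-side weights (uniform alpha_k here)
    # and broadcasts the result; W_s is carried over unaveraged.
    w_avg = sum(W_c) / K
    return [w_avg.copy() for _ in range(K)], W_s

W_c, W_s = sflv2_round(W_c, W_s)
```

Note that the server head sees every client's activations within one round, which is the mechanism the paper credits for mitigating client drift.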

2. Mathematical and Algorithmic Foundations

The SFL-V2 objective is

$$\min_{\{w_c^k\},\, w_s}\; f\bigl(\{w_c^k\}, w_s\bigr) = \sum_{k=1}^K \alpha_k\, \mathbb{E}_{(x,y)\sim \mathcal{D}_k}\Bigl[\ell\bigl(f_s\bigl(f_c^k(x; w_c^k); w_s\bigr),\, y\bigr)\Bigr],$$

where $\alpha_k = D_k / \sum_j D_j$ is the data proportion of client $k$ and $\mathcal{D}_k$ is its local distribution.

The iterative protocol is:

  • Client forward: $a_k^t = f_c^k(x_k^t; w_c^{k,t})$, $\hat y_k^t = f_s(a_k^t; w_s^t)$.
  • Server update (permuted sequence):

    $$g_s^k = \nabla_{w_s}\, \ell(\hat y_k^t, y_k^t), \qquad w_s^{t} \leftarrow w_s^{t} - \eta\, g_s^k,$$

    and $\delta_k^t = \nabla_{a_k}\, \ell(\hat y_k^t, y_k^t)$.

  • Client backward:

    $$g_c^k = (\delta_k^t)^\top \nabla_{w_c^k} f_c^k(x_k^t; w_c^{k,t}), \qquad w_c^{k,t} \leftarrow w_c^{k,t} - \eta\, g_c^k$$

    for $E$ epochs.

  • Aggregation:

    $$w_c^{t+1} = \sum_{k=1}^K \alpha_k\, w_c^{k,t}, \qquad w_c^{k,t+1} \leftarrow w_c^{t+1}$$

    for all $k$. The server-side $w_s$ is not averaged across clients.

This mechanism ensures that only client-side weights are aggregated, while the shared $w_s$ accumulates updates from the sequence of all client activations in each round.
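
A minimal numeric check of the aggregation step, with made-up dataset sizes $D_k$ to show how $\alpha_k = D_k / \sum_j D_j$ weights the average before the result is broadcast back to every client:

```python
import numpy as np

D = np.array([100, 300, 600])            # local dataset sizes D_k (assumed)
alpha = D / D.sum()                      # alpha_k = D_k / sum_j D_j

# Per-client client-side weights w_c^{k,t}, filled with distinct constants.
W = [np.full((2, 2), v) for v in (1.0, 2.0, 4.0)]

# w_c^{t+1} = sum_k alpha_k * w_c^{k,t}; every client then receives a copy.
W_next = sum(a * w for a, w in zip(alpha, W))
W_clients = [W_next.copy() for _ in D]

print(W_next[0, 0])   # 0.1*1 + 0.3*2 + 0.6*4 = 3.1
```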

3. Convergence Properties and Theoretical Status

While the paper derives a full convergence bound for SFL-V1, demonstrating that it is invariant to the choice of $L_c$, there is no explicit convergence-rate lemma or proof for SFL-V2. The authors note:

"A convergence proof for SFL-V2 would require a significantly different approach and is left to future work." (Dachille et al., 2024)

Empirical results indicate that both the convergence speed and final test accuracy of SFL-V2 are highly sensitive to the cut layer $L_c$, in sharp contrast to SFL-V1, but no theoretical characterization of this dependency currently exists.

A plausible implication is the need for deeper theoretical understanding of the optimization landscape induced by the sequential update and persistent server subnetwork in SFL-V2, particularly as $L_c$ approaches the input or output extremes.

4. Experimental Methodology and Performance Analysis

Experiments are conducted on four image classification datasets (HAM10000, CIFAR-10, CIFAR-100, TinyImageNet) and two model architectures (ResNet-18 and ResNet-50), with $K=100$ clients for most tasks and up to $T=300$ communication rounds. Cut layers $L_c$ are placed immediately after the $i$-th macro-residual block ($i=1,2,3,4$).

Data heterogeneity is evaluated under both IID ($\mu=\infty$) and non-IID (Dirichlet $\mu=0.1$) settings; HAM10000 is inherently class-imbalanced and is therefore evaluated only in the non-IID setting.

Hyperparameters are carefully controlled: SFL-V2 is trained with Adam (learning rate 0.001, batch size 64, $E=5$), while the FedAvg baseline uses SGD (learning rate 0.01), to isolate architectural rather than optimizer effects.
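
For reference, the reported settings can be collected into a small configuration sketch; the dictionary layout and key names are my own, while the values are those stated above:

```python
# Training configuration from the reported setup; structure is illustrative.
sflv2_config = {
    "optimizer": "Adam",
    "learning_rate": 1e-3,
    "batch_size": 64,
    "local_epochs": 5,    # E, local SGD epochs per round
}

fedavg_baseline = {
    "optimizer": "SGD",
    "learning_rate": 1e-2,
}

rounds = 300              # T, maximum communication rounds
clients = 100             # K, for most tasks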

The following tables summarize test accuracy means and standard deviations as a function of cut layer:

| Dataset | $L_c=1$ | $L_c=2$ | $L_c=3$ | $L_c=4$ |
|---|---|---|---|---|
| CIFAR-10 (IID) | 92.30% ± 0.15 | 89.56% ± 0.15 | 87.57% ± 0.34 | 86.11% ± 0.36 |
| CIFAR-100 (IID) | 56.00% ± 0.69 | 45.93% ± 0.21 | 41.29% ± 0.26 | 45.98% ± 0.24 |
| TinyImageNet (IID) | 54.77% ± 0.22 | 50.86% ± 0.05 | 44.15% ± 0.11 | 43.85% ± 0.25 |
| HAM10000 (non-IID) | 80.58% ± 0.35 | 79.56% ± 0.88 | 79.26% ± 0.54 | 78.27% ± 0.65 |
| CIFAR-10 (non-IID) | 67.58% ± 6.28 | 59.98% ± 11.99 | 61.73% ± 4.04 | 69.45% ± 0.64 |
| CIFAR-100 (non-IID) | 52.38% ± 0.97 | 45.90% ± 1.08 | 42.42% ± 0.99 | 43.31% ± 1.17 |
| TinyImageNet (non-IID) | 30.14% ± 8.58 | 22.55% ± 2.28 | 27.08% ± 1.48 | 26.42% ± 1.52 |

The key empirical findings are:

  • For three of four tasks, the earliest possible split ($L_c=1$) achieves the highest accuracy.
  • On CIFAR-10 (non-IID), $L_c=4$ marginally outperforms $L_c=1$; this is ascribed to the low class count and shallow feature complexity.
  • As $L_c$ decreases, the solution approaches that of centralized learning (more computation is centralized in $w_s$), while large $L_c$ trends toward FedAvg's behavior.

5. Comparative Analysis: SFL-V2 Versus FedAvg in Heterogeneous Regimes

Under non-IID conditions, SFL-V2 at its empirically optimal cut layer ($L_c=1$) is compared to standard FedAvg:

| Dataset | FedAvg (%) | SFL-V2 ($L_c=1$) (%) | Accuracy gain ($\Delta$, pp) |
|---|---|---|---|
| HAM10000 | 77.37 ± 0.35 | 80.58 ± 0.35 | +3.21 |
| CIFAR-10 | 67.59 ± 2.52 | 67.58 ± 6.28 | $\approx 0$ |
| CIFAR-100 | 42.60 ± 1.18 | 52.38 ± 0.97 | +9.78 |
| TinyImageNet | 28.33 ± 0.28 | 30.14 ± 8.58 | +1.81 |

SFL-V2 demonstrates substantial improvements over FedAvg in high-heterogeneity settings (notably, CIFAR-100 gains nearly 10 percentage points). This behavior is attributed to the centralized server-side representation learning, where $w_s$ incorporates activation information from all clients, directly mitigating the client drift that arises in federated averaging.

A plausible implication is that SFL-V2, with strategic cut layer placement, leverages shared representation power more efficiently than per-client independent learning, especially in highly skewed data regimes.

6. Trade-offs, Practical Considerations, and Open Questions

Early splits (small $L_c$) concentrate learning in $w_s$, yielding superior accuracy but imposing larger per-round communication loads (larger "smashed activation" tensors must be transferred) and increasing computation on the TS. They also increase privacy exposure, since more information about raw features is centralized at the server. Conversely, deeper cuts reduce these demands but degrade performance toward that of pure FedAvg.
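
A back-of-envelope way to see this communication asymmetry is to estimate the uplink bytes of smashed activations per round. The activation shapes below are assumed, ResNet-18-style values for 32×32 inputs; they are illustrative, not figures from the paper:

```python
def smashed_bytes(act_shape, batch_size, batches_per_client, clients,
                  bytes_per_elem=4):
    """Total activation bytes uploaded by all clients in one round (float32)."""
    elems = batch_size
    for d in act_shape:
        elems *= d
    return elems * batches_per_client * clients * bytes_per_elem

# Assumed (C, H, W) activation shapes after each macro-residual block.
cut_shapes = {1: (64, 32, 32), 2: (128, 16, 16), 3: (256, 8, 8), 4: (512, 4, 4)}

for lc, shape in cut_shapes.items():
    gb = smashed_bytes(shape, batch_size=64, batches_per_client=8,
                       clients=100) / 1e9
    print(f"L_c={lc}: ~{gb:.2f} GB per round")
```

Under these assumptions, each deeper cut roughly halves the per-round activation traffic, which is the communication side of the accuracy trade-off discussed above.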

While empirical trends are decisive, the absence of a formal convergence guarantee for SFL-V2 remains unresolved. Practical deployment thus must weigh empirical gains, system scalability, communication-to-computation balance, and privacy policies according to the task's operating environment.

A further implication is that the optimal placement of $L_c$ in real-world systems may be dataset-dependent and sensitive to the degree of data heterogeneity, client resource profiles, and regulatory or risk constraints regarding feature visibility at the TS.


Empirical and architectural details referenced herein derive from "The Impact of Cut Layer Selection in Split Federated Learning" (Dachille et al., 2024).
