Federated Split Learning: Concepts and Advances
- Federated Split Learning is a distributed machine learning paradigm that divides a global model into client-side and server-side subnetworks to boost privacy and efficiency.
- It unifies federated and split learning methodologies by using a predefined cut layer, enabling dynamic resource allocation and robust differential privacy across edge devices.
- Empirical studies demonstrate that FSL can reduce communication load by up to 47% and significantly improve convergence rates in diverse real-world applications.
Federated Split Learning (FSL) is a distributed machine learning paradigm that unifies the client-parallelism and weight aggregation of Federated Learning (FL) with the privacy-preserving, client-side computation offloading of Split Learning (SL). FSL partitions deep models at a predefined “cut layer” into a client-side subnetwork and a server-side subnetwork, enabling collaborative training over edge networks without sharing raw data. It has been shown to improve round-wise communication efficiency, device scalability, and convergence rates in settings ranging from human activity recognition to large-scale image and time-series tasks. Recent advances further integrate rigorous differential privacy mechanisms, dynamic resource-aware partitioning, auxiliary local losses, model compression, and blockchain-based orchestration.
1. Federated Split Learning Principles and Workflow
In the canonical FSL architecture, the global neural network is separated at a “cut layer” into two functional blocks:
- Client-side subnetwork: the layers up to the cut, executed on each edge device against its local data.
- Server-side subnetwork: the remaining layers, maintained on the central server.
The protocol operates in synchronized rounds:
- Each client samples a mini-batch and computes intermediate activations up to the split point.
- Gaussian noise may be added to the smashed activations before upload to provide local differential privacy.
- The server collects the (optionally noised) activations and corresponding labels from all clients, concatenates them, and completes the forward pass through the server-side subnetwork.
- Backpropagation is split: the server updates the server-side parameters with its own gradients and sends the cut-layer gradients back to each client, which finishes backpropagation locally to update its client-side parameters.
- Federated aggregation (e.g., FedAvg) is applied to client-side weights, which are then broadcast to all clients for the next round.
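The round structure above can be sketched end to end with a toy model. This is a minimal illustration, not any cited system's implementation: the two-client setup, linear client/server blocks, synthetic regression task, and learning rate are all assumptions chosen for brevity.

```python
import numpy as np

rng = np.random.default_rng(0)
D_IN, D_CUT, D_OUT, BATCH = 8, 4, 1, 16  # hypothetical toy dimensions

Wc = [rng.normal(0, 0.1, (D_IN, D_CUT)) for _ in range(2)]  # client-side weights, one per client
Ws = rng.normal(0, 0.1, (D_CUT, D_OUT))                     # server-side weights, shared

def fsl_round(Wc, Ws, lr=0.1):
    new_Wc = []
    for k in range(len(Wc)):
        # 1) client: forward up to the cut layer
        X = rng.normal(size=(BATCH, D_IN))
        y = X.sum(axis=1, keepdims=True)   # synthetic regression target
        z = X @ Wc[k]                      # smashed activations sent to server
        # 2) server: complete the forward pass and backprop to the cut
        pred = z @ Ws
        d_pred = 2 * (pred - y) / BATCH    # dL/dpred for MSE loss
        g_Ws = z.T @ d_pred                # server-side gradient
        g_z = d_pred @ Ws.T                # cut-layer gradient, returned to client
        Ws -= lr * g_Ws
        # 3) client: finish backpropagation locally
        new_Wc.append(Wc[k] - lr * X.T @ g_z)
    # 4) federated aggregation (FedAvg) of client-side weights, then broadcast
    avg = sum(new_Wc) / len(new_Wc)
    return [avg.copy() for _ in new_Wc], Ws

def eval_loss(Wc, Ws):
    X = rng.normal(size=(256, D_IN))
    y = X.sum(axis=1, keepdims=True)
    return float(np.mean((X @ Wc[0] @ Ws - y) ** 2))

before = eval_loss(Wc, Ws)
for _ in range(200):
    Wc, Ws = fsl_round(Wc, Ws)
after = eval_loss(Wc, Ws)
```

Note that raw samples `X` never leave the client; only the activations `z` and the cut-layer gradients `g_z` cross the network, which is the core of the privacy and efficiency argument.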
This split-and-aggregate approach generalizes to settings with group-based sequential splits (Zhang et al., 2023), non-IID/heterogeneous clients (Asif et al., 5 Jan 2026), auxiliary local heads (Mu et al., 2023), or hierarchical device-edge-cloud deployments (Ni et al., 7 Oct 2025).
2. Privacy Mechanisms and Attack Surfaces
FSL enhances data privacy by ensuring that raw samples never leave the device, but cut-layer activations—so-called “smashed data”—could still leak information. To quantify and mitigate privacy risk:
- Differential Privacy (DP): Local Rényi DP is achieved by adding calibrated Gaussian noise to client activations, yielding a per-round Rényi DP guarantee for any two neighboring client datasets (Ndeko et al., 2024).
- Noise Scaling: The noise standard deviation is set relative to the $\ell_2$-sensitivity of the activations and the target privacy budget, with full privacy accounting over training rounds via moments accountants or Rényi DP composition techniques.
- Adversarial Reconstruction: Attack resilience is typically measured by how well raw inputs can be reconstructed from smashed activations via autoencoders or structural similarity (SSIM); this risk falls as the split layer is placed deeper, but at increased client energy cost (Lee et al., 2023).
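The clip-then-noise step can be made concrete with the classical analytic Gaussian mechanism, $\sigma = C\sqrt{2\ln(1.25/\delta)}/\epsilon$ (valid for $\epsilon < 1$). This is a generic sketch of the idea, not the exact calibration used by the cited works; the clip norm, $\epsilon$, and $\delta$ values are illustrative.

```python
import numpy as np

def clip_and_noise(z, clip_norm, eps, delta, rng):
    """Clip per-example activations to bound l2-sensitivity, then add
    Gaussian noise calibrated by sigma = C * sqrt(2 ln(1.25/delta)) / eps."""
    norms = np.linalg.norm(z, axis=1, keepdims=True)
    z_clipped = z * np.minimum(1.0, clip_norm / np.maximum(norms, 1e-12))
    sigma = clip_norm * np.sqrt(2.0 * np.log(1.25 / delta)) / eps
    return z_clipped + rng.normal(0.0, sigma, z.shape), sigma

rng = np.random.default_rng(1)
z = rng.normal(0.0, 5.0, size=(32, 64))  # hypothetical smashed activations
z_priv, sigma = clip_and_noise(z, clip_norm=1.0, eps=0.5, delta=1e-5, rng=rng)
```

Clipping bounds the sensitivity of each example's contribution, so the noise scale depends only on the clip norm and the privacy budget, not on the raw activation magnitudes.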
Notably, FSL can outperform FL in privacy-utility trade-offs: for a fixed DP budget, FSL has been reported to yield 30–40 percentage-point accuracy improvements over federated learning with matching noise levels (Ndeko et al., 2024).
3. Communication and Computation Efficiency
FSL achieves considerable communication savings by restricting round-wise client uploads to low-dimensional smashed activations rather than full model weights. Empirical results for LSTM-HAR models on the UCI HAR dataset demonstrated:
- markedly shorter FSL round times than FL round times, implying a multi-fold speedup and a substantial reduction in per-round communication at scale.
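The source of the savings is easy to quantify: FL uploads the full client model each round, while FSL uploads one batch of cut-layer activations. The model size, batch size, and cut dimension below are hypothetical placeholders, not measurements from the cited study.

```python
# Per-round client upload cost in bytes, assuming float32 payloads.
BYTES = 4

def fl_upload(n_params):
    # FL: the full client-side model is uploaded every round
    return n_params * BYTES

def fsl_upload(batch, cut_dim, label_bytes=BYTES):
    # FSL: one batch of smashed activations plus labels is uploaded
    return batch * (cut_dim * BYTES + label_bytes)

model_params = 1_300_000                 # hypothetical model size
fl = fl_upload(model_params)
fsl = fsl_upload(batch=64, cut_dim=128)
ratio = fl / fsl                          # how many times cheaper FSL's upload is
```

The ratio grows with model size but shrinks with batch size and cut-layer width, which is why cut-layer placement directly shapes the communication budget.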
Group-based FSL variants partition clients into parallel groups, each performing intra-group split learning with local aggregation, further accelerating convergence in resource-limited wireless environments (Zhang et al., 2023). Newer approaches incorporate model compression (structured/unstructured pruning), gradient quantization (e.g., $8$-bit stochastic quantizers), and activation dropout, jointly reducing bandwidth and client computation costs with theoretically bounded impact on convergence and generalization (Zhang et al., 2024, Ni et al., 7 Oct 2025).
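An unbiased $8$-bit stochastic quantizer of the kind mentioned above can be sketched as follows; the uniform-grid design and range handling are generic choices, not the specific quantizer of the cited papers.

```python
import numpy as np

def stochastic_quantize(x, bits=8, rng=None):
    """Unbiased stochastic rounding onto 2**bits uniform levels over [min, max]."""
    rng = rng or np.random.default_rng()
    lo, hi = x.min(), x.max()
    levels = 2 ** bits - 1
    scale = (hi - lo) / levels if hi > lo else 1.0
    t = (x - lo) / scale                   # position in quantization-step units
    floor = np.floor(t)
    # round up with probability equal to the fractional part -> E[q] = t
    q = floor + (rng.random(x.shape) < (t - floor))
    return q.astype(np.uint8), lo, scale

def dequantize(q, lo, scale):
    return lo + q.astype(np.float64) * scale

rng = np.random.default_rng(2)
g = rng.normal(size=(1000,))               # e.g., a flattened gradient
q, lo, scale = stochastic_quantize(g, bits=8, rng=rng)
g_hat = dequantize(q, lo, scale)
```

Stochastic (rather than nearest) rounding keeps the quantizer unbiased in expectation, which is what makes the convergence impact analyzable.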
4. Robustness to Heterogeneity and Data Skew
A key motivation for split architectures is to accommodate device heterogeneity and non-IID label distributions. Advances include:
- Label-Skew Correction: SCALA concatenates all client activations server-side and applies logit-adjusted cross-entropy, balancing class updates even under extreme distribution skew (e.g., each client observes only $2$ of $10$ classes), resulting in $8$–$20$ percentage-point accuracy gains over prior FL/SFL baselines (Yang et al., 2024).
- Resource-Aware Partitioning: Clients choose individual cut layers according to memory, CPU, or link bandwidth constraints; optimal per-device split selection and wireless bandwidth allocation minimize system-wide latency and maximize overall training efficiency (Xu et al., 2023, Asif et al., 5 Jan 2026).
- Personalized and Fair Learning: Multi-block splits and supplementary local heads enable transfer learning and personalized updates, ensuring high accuracy even for “thin” clients, with fairness guarantees on computation/workload allocation (Wadhwa et al., 2023, Yuan et al., 14 Aug 2025).
Token-fusion strategies for multimodal and collaborative robots in factory settings further exploit split points and resource-aware aggregation for robust, scalable, low-latency training in industrial IoT (Ni et al., 7 Oct 2025).
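The SCALA-style label-skew correction boils down to shifting logits by the log of the class priors before the cross-entropy. The sketch below is a generic logit-adjustment implementation under that assumption, not SCALA's exact server-side procedure; the skewed class counts are invented for illustration.

```python
import numpy as np

def logit_adjusted_ce(logits, labels, class_counts, tau=1.0):
    """Cross-entropy with logits shifted by tau * log(prior), so that
    over-represented classes must clear a higher bar to be predicted."""
    priors = class_counts / class_counts.sum()
    adj = logits + tau * np.log(priors + 1e-12)
    adj = adj - adj.max(axis=1, keepdims=True)          # numerical stability
    log_probs = adj - np.log(np.exp(adj).sum(axis=1, keepdims=True))
    return -log_probs[np.arange(len(labels)), labels].mean()

rng = np.random.default_rng(3)
logits = rng.normal(size=(8, 10))
labels = rng.integers(0, 10, size=8)
# hypothetical extreme skew: two dominant classes out of ten
counts = np.array([500, 500, 5, 5, 5, 5, 5, 5, 5, 5], dtype=float)
loss = logit_adjusted_ce(logits, labels, counts)
```

Because the server sees the concatenated activations from all clients, it can estimate these priors globally even when each individual client observes only a couple of classes.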
5. Extensions: Auxiliary Losses, Decentralization, and Beyond
Communication/storage-efficient FSL variants employ auxiliary networks ("heads") at client cut layers to approximate the server loss, allowing for less frequent activation uploads (e.g., every $h$ mini-batches), with a single server-side model to eliminate per-client server memory growth (Mu et al., 2023, Mu et al., 21 Jul 2025). Formal convergence guarantees hold under mild assumptions, including for non-convex losses.
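The auxiliary-head schedule can be sketched as follows: the client trains against a small local head every step and only ships activations to the server every `H` steps. The linear head, toy task, and step counts are illustrative assumptions, not the cited CSE-FSL configuration.

```python
import numpy as np

rng = np.random.default_rng(4)
D_IN, D_CUT, H = 8, 4, 5                       # H = upload interval (hypothetical)

Wc = rng.normal(0, 0.1, (D_IN, D_CUT))         # client subnetwork
Wa = rng.normal(0, 0.1, (D_CUT, 1))            # auxiliary head approximating server loss

uploads = 0
for step in range(100):
    X = rng.normal(size=(16, D_IN))
    y = X.sum(axis=1, keepdims=True)
    z = X @ Wc
    # local training signal from the auxiliary head, available every step
    d = 2 * (z @ Wa - y) / len(X)
    Wa -= 0.05 * z.T @ d
    Wc -= 0.05 * X.T @ (d @ Wa.T)
    if step % H == 0:                          # activations sent only every H steps
        uploads += 1                           # (server-side update would happen here)
```

With `H = 5` the client performs 100 local updates but only 20 uploads, which is the source of the bandwidth and server-storage savings reported for CSE-FSL.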
Super-network strategies (SuperSFL) sample client-specific subnetworks from a global weight-sharing backbone, dynamically fitted to device-specific memory/latency profiles, and fuse local and server gradients with depth- and loss-weighted aggregation, reporting substantial communication reductions and multi-fold round-count acceleration over baseline SFL (Asif et al., 5 Jan 2026).
Decentralized FSL on permissioned blockchains orchestrates split training and FedAvg via transient fields and private data collections, removing the central coordinator entirely while maintaining near-centralized accuracy and scalable throughput (evaluated on CIFAR-10 across varying client counts) (Penedo et al., 10 Jul 2025).
FSL has also been successfully adapted for distributed sequential (RNN) learning over partitioned data, multimodal fusion, and privacy attacks/defenses (gradient inversion mitigation via zeroth-order optimization (Shi et al., 2024), local/cut-layer DP, and PixelDP techniques).
6. Experimental Performance and Practical Guidance
Empirical studies across real-world domains and simulated networks report:
- UCI HAR FSL: higher peak accuracy than FL at matched DP noise, together with a substantial round-time reduction (Ndeko et al., 2024).
- SCALA: Robust to extreme label skew ($2$ classes/client) on CIFAR-10, CINIC-10, and CIFAR-100, outperforming FedAvg/FedProx/FedDyn baselines by up to $20$ percentage points (Yang et al., 2024).
- GSFL: substantial end-to-end latency reduction at matched accuracy versus vanilla FL (Zhang et al., 2023).
- Storage/comms scaling: roughly an order-of-magnitude bandwidth reduction and several-fold server-storage reduction (CSE-FSL, $h=5$–$25$) at the cost of a few percentage points of accuracy (Mu et al., 2023, Mu et al., 21 Jul 2025).
- Decentralized FSL: Near-parity accuracy with centralized FSL, compressed epoch times (e.g., $30$ min vs. $85$ min for Ethereum-based SL on CIFAR-10), and scalable ledger/network performance with stable latency up to $25$ clients (Penedo et al., 10 Jul 2025).
Table: Representative Communication Savings (CSE-FSL, CIFAR-10)

| Method        | Accuracy | Comm (GB) | Server Storage |
|---------------|----------|-----------|----------------|
| FSL_MC        | 80.6%    | 172.5     | 5.3M params    |
| CSE-FSL (h=5) | 76.5%    | 18.1      | 1.6M params    |
7. Open Directions and Limitations
Emerging challenges and future lines of investigation for FSL include:
- Optimal, possibly adaptive, cut-layer selection using online learning or reinforcement learning agents (Lee et al., 2023).
- Secure aggregation and robust DP mechanisms applied to both client activations and server-side gradients.
- Privacy-preserving extensions integrating with blockchain and secure multi-party protocols (Penedo et al., 10 Jul 2025).
- Scaling to arbitrarily heterogeneous and intermittent clients by dynamic, resource-driven supernet partitioning (Asif et al., 5 Jan 2026).
- Automated token fusion and split design in multimodal and hierarchical edge-cloud systems (Ni et al., 7 Oct 2025).
- Trade-offs between accuracy, privacy resilience (activation invertibility), computation/energy, and communication remain an active frontier.
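As a concrete instance of the cut-layer selection problem listed above, one round's latency can be modeled as client forward compute, activation upload, and server-side compute, and minimized over candidate cuts. All costs below (per-layer FLOPs, activation sizes, device speeds, link rate) are hypothetical; real systems would measure or learn them online.

```python
# Choose the cut layer minimizing one round's latency under a simple
# additive cost model: client compute + activation upload + server compute.
def best_cut(layer_cost, act_bytes, client_speed, server_speed, bandwidth):
    def latency(cut):
        client = sum(layer_cost[:cut]) / client_speed   # layers run on device
        upload = act_bytes[cut - 1] / bandwidth          # smashed-data transfer
        server = sum(layer_cost[cut:]) / server_speed    # remaining layers
        return client + upload + server
    cuts = range(1, len(layer_cost))                     # at least one layer each side
    return min(cuts, key=latency)

layer_cost = [10, 20, 40, 80, 80]   # per-layer compute cost (arbitrary units)
act_bytes  = [64, 32, 16, 8, 4]     # activation size after each layer
cut = best_cut(layer_cost, act_bytes,
               client_speed=1.0, server_speed=10.0, bandwidth=1.0)
```

Shallower cuts off-load more compute to the fast server but ship larger activations (and, per Section 2, leak more reconstructable information), so the optimum shifts with bandwidth, device speed, and the privacy constraint.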
FSL represents a convergent paradigm that delivers substantial efficiency, scalability, and privacy gains over both FL and SL, with rich ongoing development spanning architecture, privacy, resource-aware deployment, and theoretical guarantees (Ndeko et al., 2024, Yang et al., 2024, Asif et al., 5 Jan 2026, Penedo et al., 10 Jul 2025).