Federated Split Learning: Concepts and Advances
- Federated Split Learning is a distributed machine learning paradigm that divides a global model into client-side and server-side subnetworks to boost privacy and efficiency.
- It unifies federated and split learning methodologies by using a predefined cut layer, enabling dynamic resource allocation and robust differential privacy across edge devices.
- Empirical studies demonstrate that FSL can reduce communication load by up to 47% and significantly improve convergence rates in diverse real-world applications.
Federated Split Learning (FSL) is a distributed machine learning paradigm that unifies the client-parallelism and weight aggregation of Federated Learning (FL) with the privacy-preserving, client-side computation offloading of Split Learning (SL). FSL partitions deep models at a predefined “cut layer” into a client-side subnetwork and a server-side subnetwork, enabling collaborative training over edge networks without sharing raw data. It has been shown to improve round-wise communication efficiency, device scalability, and convergence rates in settings ranging from human activity recognition to large-scale image and time-series tasks. Recent advances further integrate rigorous differential privacy mechanisms, dynamic resource-aware partitioning, auxiliary local losses, model compression, and blockchain-based orchestration.
1. Federated Split Learning Principles and Workflow
In the canonical FSL architecture, the global neural network is separated at a “cut layer” into two functional blocks:
- Client-side subnetwork: the layers up to the cut, executed on each edge device against its local data.
- Server-side subnetwork: the remaining layers, maintained on the central server.
The protocol operates in synchronized rounds:
- Each client samples a mini-batch and computes intermediate activations up to the split point.
- Gaussian noise may be added to the smashed activations before upload to provide local differential privacy.
- The server collects the (optionally noised) activations and corresponding labels from all clients, concatenates them, and completes the forward pass through the server-side subnetwork.
- Backpropagation is split: the server updates the server-side parameters with its own gradients and sends the cut-layer gradients back to each client, which finishes backpropagation locally to update its client-side parameters.
- Federated aggregation (e.g., FedAvg) is applied to client-side weights, which are then broadcast to all clients for the next round.
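The round structure above can be sketched end to end with a toy model. This is a minimal illustration, not any cited system's implementation: the two-client setup, linear client/server blocks, synthetic regression task, and learning rate are all assumptions chosen for brevity.

```python
import numpy as np

rng = np.random.default_rng(0)
D_IN, D_CUT, D_OUT, BATCH = 8, 4, 1, 16  # hypothetical toy dimensions

Wc = [rng.normal(0, 0.1, (D_IN, D_CUT)) for _ in range(2)]  # client-side weights, one per client
Ws = rng.normal(0, 0.1, (D_CUT, D_OUT))                     # server-side weights, shared

def fsl_round(Wc, Ws, lr=0.1):
    new_Wc = []
    for k in range(len(Wc)):
        # 1) client: forward up to the cut layer
        X = rng.normal(size=(BATCH, D_IN))
        y = X.sum(axis=1, keepdims=True)   # synthetic regression target
        z = X @ Wc[k]                      # smashed activations sent to server
        # 2) server: complete the forward pass and backprop to the cut
        pred = z @ Ws
        d_pred = 2 * (pred - y) / BATCH    # dL/dpred for MSE loss
        g_Ws = z.T @ d_pred                # server-side gradient
        g_z = d_pred @ Ws.T                # cut-layer gradient, returned to client
        Ws -= lr * g_Ws
        # 3) client: finish backpropagation locally
        new_Wc.append(Wc[k] - lr * X.T @ g_z)
    # 4) federated aggregation (FedAvg) of client-side weights, then broadcast
    avg = sum(new_Wc) / len(new_Wc)
    return [avg.copy() for _ in new_Wc], Ws

def eval_loss(Wc, Ws):
    X = rng.normal(size=(256, D_IN))
    y = X.sum(axis=1, keepdims=True)
    return float(np.mean((X @ Wc[0] @ Ws - y) ** 2))

before = eval_loss(Wc, Ws)
for _ in range(200):
    Wc, Ws = fsl_round(Wc, Ws)
after = eval_loss(Wc, Ws)
```

Note that raw samples `X` never leave the client; only the activations `z` and the cut-layer gradients `g_z` cross the network, which is the core of the privacy and efficiency argument.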
This split-and-aggregate approach generalizes to settings with group-based sequential splits (Zhang et al., 2023), non-IID/heterogeneous clients (Asif et al., 5 Jan 2026), auxiliary local heads (Mu et al., 2023), or hierarchical device-edge-cloud deployments (Ni et al., 7 Oct 2025).
2. Privacy Mechanisms and Attack Surfaces
FSL enhances data privacy by ensuring that raw samples never leave the device, but cut-layer activations—so-called “smashed data”—could still leak information. To quantify and mitigate privacy risk:
- Differential Privacy (DP): Local Rényi DP is achieved by adding calibrated Gaussian noise to client activations, yielding a per-round Rényi DP guarantee for any two neighboring client datasets (Ndeko et al., 2024).
- Noise Scaling: The noise standard deviation is set relative to the $\ell_2$-sensitivity of the activations and the target privacy budget, with full privacy accounting over training rounds via moments accountants or Rényi DP composition techniques.
- Adversarial Reconstruction: Attack resilience is typically measured by how well raw inputs can be reconstructed from smashed activations via autoencoders or structural similarity (SSIM); this risk falls as the split layer is placed deeper, but at increased client energy cost (Lee et al., 2023).
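The clip-then-noise step can be made concrete with the classical analytic Gaussian mechanism, $\sigma = C\sqrt{2\ln(1.25/\delta)}/\epsilon$ (valid for $\epsilon < 1$). This is a generic sketch of the idea, not the exact calibration used by the cited works; the clip norm, $\epsilon$, and $\delta$ values are illustrative.

```python
import numpy as np

def clip_and_noise(z, clip_norm, eps, delta, rng):
    """Clip per-example activations to bound l2-sensitivity, then add
    Gaussian noise calibrated by sigma = C * sqrt(2 ln(1.25/delta)) / eps."""
    norms = np.linalg.norm(z, axis=1, keepdims=True)
    z_clipped = z * np.minimum(1.0, clip_norm / np.maximum(norms, 1e-12))
    sigma = clip_norm * np.sqrt(2.0 * np.log(1.25 / delta)) / eps
    return z_clipped + rng.normal(0.0, sigma, z.shape), sigma

rng = np.random.default_rng(1)
z = rng.normal(0.0, 5.0, size=(32, 64))  # hypothetical smashed activations
z_priv, sigma = clip_and_noise(z, clip_norm=1.0, eps=0.5, delta=1e-5, rng=rng)
```

Clipping bounds the sensitivity of each example's contribution, so the noise scale depends only on the clip norm and the privacy budget, not on the raw activation magnitudes.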
Notably, FSL can outperform FL in privacy-utility trade-offs: for a fixed DP budget, FSL has been reported to yield 30–40 percentage-point accuracy improvements over federated learning with matching noise levels (Ndeko et al., 2024).
3. Communication and Computation Efficiency
FSL achieves considerable communication savings by restricting round-wise client uploads to low-dimensional smashed activations rather than full model weights. Empirical results for LSTM-HAR models on the UCI HAR dataset demonstrated:
- markedly shorter FSL round times than FL round times, implying a multi-fold speedup and a substantial reduction in per-round communication at scale.
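The source of the savings is easy to quantify: FL uploads the full client model each round, while FSL uploads one batch of cut-layer activations. The model size, batch size, and cut dimension below are hypothetical placeholders, not measurements from the cited study.

```python
# Per-round client upload cost in bytes, assuming float32 payloads.
BYTES = 4

def fl_upload(n_params):
    # FL: the full client-side model is uploaded every round
    return n_params * BYTES

def fsl_upload(batch, cut_dim, label_bytes=BYTES):
    # FSL: one batch of smashed activations plus labels is uploaded
    return batch * (cut_dim * BYTES + label_bytes)

model_params = 1_300_000                 # hypothetical model size
fl = fl_upload(model_params)
fsl = fsl_upload(batch=64, cut_dim=128)
ratio = fl / fsl                          # how many times cheaper FSL's upload is
```

The ratio grows with model size but shrinks with batch size and cut-layer width, which is why cut-layer placement directly shapes the communication budget.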
Group-based FSL variants partition clients into parallel groups, each performing intra-group split learning with local aggregation, further accelerating convergence in resource-limited wireless environments (Zhang et al., 2023). Newer approaches incorporate model compression (structured/unstructured pruning), gradient quantization (e.g., $8$-bit stochastic quantizers), and activation dropout, jointly reducing bandwidth and client computation costs with theoretically bounded impact on convergence and generalization (Zhang et al., 2024, Ni et al., 7 Oct 2025).
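An unbiased $8$-bit stochastic quantizer of the kind mentioned above can be sketched as follows; the uniform-grid design and range handling are generic choices, not the specific quantizer of the cited papers.

```python
import numpy as np

def stochastic_quantize(x, bits=8, rng=None):
    """Unbiased stochastic rounding onto 2**bits uniform levels over [min, max]."""
    rng = rng or np.random.default_rng()
    lo, hi = x.min(), x.max()
    levels = 2 ** bits - 1
    scale = (hi - lo) / levels if hi > lo else 1.0
    t = (x - lo) / scale                   # position in quantization-step units
    floor = np.floor(t)
    # round up with probability equal to the fractional part -> E[q] = t
    q = floor + (rng.random(x.shape) < (t - floor))
    return q.astype(np.uint8), lo, scale

def dequantize(q, lo, scale):
    return lo + q.astype(np.float64) * scale

rng = np.random.default_rng(2)
g = rng.normal(size=(1000,))               # e.g., a flattened gradient
q, lo, scale = stochastic_quantize(g, bits=8, rng=rng)
g_hat = dequantize(q, lo, scale)
```

Stochastic (rather than nearest) rounding keeps the quantizer unbiased in expectation, which is what makes the convergence impact analyzable.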
4. Robustness to Heterogeneity and Data Skew
A key motivation for split architectures is to accommodate device heterogeneity and non-IID label distributions. Advances include:
- Label-Skew Correction: SCALA concatenates all client activations server-side and applies logit-adjusted cross-entropy, balancing class updates even under extreme distribution skew (e.g., each client observes only $2$ of $10$ classes), resulting in $8$–$20$ percentage-point accuracy gains over prior FL/SFL baselines (Yang et al., 2024).
- Resource-Aware Partitioning: Clients choose individual cut layers according to memory, CPU, or link bandwidth constraints; optimal per-device split selection and wireless bandwidth allocation minimize system-wide latency and maximize overall training efficiency (Xu et al., 2023, Asif et al., 5 Jan 2026).
- Personalized and Fair Learning: Multi-block splits and supplementary local heads enable transfer learning and personalized updates, ensuring high accuracy even for “thin” clients, with fairness guarantees on computation/workload allocation (Wadhwa et al., 2023, Yuan et al., 14 Aug 2025).
Token-fusion strategies for multimodal and collaborative robots in factory settings further exploit split points and resource-aware aggregation for robust, scalable, low-latency training in industrial IoT (Ni et al., 7 Oct 2025).
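The SCALA-style label-skew correction boils down to shifting logits by the log of the class priors before the cross-entropy. The sketch below is a generic logit-adjustment implementation under that assumption, not SCALA's exact server-side procedure; the skewed class counts are invented for illustration.

```python
import numpy as np

def logit_adjusted_ce(logits, labels, class_counts, tau=1.0):
    """Cross-entropy with logits shifted by tau * log(prior), so that
    over-represented classes must clear a higher bar to be predicted."""
    priors = class_counts / class_counts.sum()
    adj = logits + tau * np.log(priors + 1e-12)
    adj = adj - adj.max(axis=1, keepdims=True)          # numerical stability
    log_probs = adj - np.log(np.exp(adj).sum(axis=1, keepdims=True))
    return -log_probs[np.arange(len(labels)), labels].mean()

rng = np.random.default_rng(3)
logits = rng.normal(size=(8, 10))
labels = rng.integers(0, 10, size=8)
# hypothetical extreme skew: two dominant classes out of ten
counts = np.array([500, 500, 5, 5, 5, 5, 5, 5, 5, 5], dtype=float)
loss = logit_adjusted_ce(logits, labels, counts)
```

Because the server sees the concatenated activations from all clients, it can estimate these priors globally even when each individual client observes only a couple of classes.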
5. Extensions: Auxiliary Losses, Decentralization, and Beyond
Communication/storage-efficient FSL variants employ auxiliary networks ("heads") at client cut layers to approximate the server loss, allowing for less frequent activation uploads (e.g., every $h$ mini-batches), with a single server-side model to eliminate per-client server memory growth (Mu et al., 2023, Mu et al., 21 Jul 2025). Formal convergence guarantees hold under mild assumptions, including for non-convex losses.
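The auxiliary-head schedule can be sketched as follows: the client trains against a small local head every step and only ships activations to the server every `H` steps. The linear head, toy task, and step counts are illustrative assumptions, not the cited CSE-FSL configuration.

```python
import numpy as np

rng = np.random.default_rng(4)
D_IN, D_CUT, H = 8, 4, 5                       # H = upload interval (hypothetical)

Wc = rng.normal(0, 0.1, (D_IN, D_CUT))         # client subnetwork
Wa = rng.normal(0, 0.1, (D_CUT, 1))            # auxiliary head approximating server loss

uploads = 0
for step in range(100):
    X = rng.normal(size=(16, D_IN))
    y = X.sum(axis=1, keepdims=True)
    z = X @ Wc
    # local training signal from the auxiliary head, available every step
    d = 2 * (z @ Wa - y) / len(X)
    Wa -= 0.05 * z.T @ d
    Wc -= 0.05 * X.T @ (d @ Wa.T)
    if step % H == 0:                          # activations sent only every H steps
        uploads += 1                           # (server-side update would happen here)
```

With `H = 5` the client performs 100 local updates but only 20 uploads, which is the source of the bandwidth and server-storage savings reported for CSE-FSL.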
Super-network strategies (SuperSFL) sample client-specific subnetworks from a global weight-sharing backbone, dynamically fitted to device-specific memory/latency profiles, and fuse local and server gradients with depth- and loss-weighted aggregation, reporting substantial communication reductions and multi-fold round-count acceleration over baseline SFL (Asif et al., 5 Jan 2026).
Decentralized FSL on permissioned blockchains orchestrates split training and FedAvg via transient fields and private data collections, removing the central coordinator entirely while maintaining near-centralized accuracy and scalable throughput (evaluated on CIFAR-10 across varying client counts) (Penedo et al., 10 Jul 2025).
FSL has also been successfully adapted for distributed sequential (RNN) learning over partitioned data, multimodal fusion, and privacy attacks/defenses (gradient inversion mitigation via zeroth-order optimization (Shi et al., 2024), local/cut-layer DP, and PixelDP techniques).
6. Experimental Performance and Practical Guidance
Empirical studies across real-world domains and simulated networks report:
- UCI HAR FSL: higher peak accuracy than FL at matched DP noise, together with a substantial round-time reduction (Ndeko et al., 2024).
- SCALA: Robust to extreme label skew ($2$ classes/client) on CIFAR-10, CINIC-10, and CIFAR-100, outperforming FedAvg/FedProx/FedDyn baselines by up to $20$ percentage points (Yang et al., 2024).
- GSFL: substantial end-to-end latency reduction at matched accuracy versus vanilla FL (Zhang et al., 2023).
- Storage/comms scaling: roughly an order-of-magnitude bandwidth reduction and several-fold server-storage reduction (CSE-FSL, $h=5$–$25$) at the cost of a few percentage points of accuracy (Mu et al., 2023, Mu et al., 21 Jul 2025).
- Decentralized FSL: Near-parity accuracy with centralized FSL, compressed epoch times (e.g., $30$ min vs. $85$ min for Ethereum-based SL on CIFAR-10), and scalable ledger/network performance with stable latency up to $25$ clients (Penedo et al., 10 Jul 2025).
Table: Representative Communication Savings (CSE-FSL, CIFAR-10)

| Method        | Accuracy | Comm (GB) | Server Storage |
|---------------|----------|-----------|----------------|
| FSL_MC        | 80.6%    | 172.5     | 5.3M params    |
| CSE-FSL (h=5) | 76.5%    | 18.1      | 1.6M params    |
7. Open Directions and Limitations
Emerging challenges and future lines of investigation for FSL include:
- Optimal, possibly adaptive, cut-layer selection using online learning or reinforcement learning agents (Lee et al., 2023).
- Secure aggregation and robust DP mechanisms applied to both client activations and server-side gradients.
- Privacy-preserving extensions integrating with blockchain and secure multi-party protocols (Penedo et al., 10 Jul 2025).
- Scaling to arbitrarily heterogeneous and intermittent clients by dynamic, resource-driven supernet partitioning (Asif et al., 5 Jan 2026).
- Automated token fusion and split design in multimodal and hierarchical edge-cloud systems (Ni et al., 7 Oct 2025).
- Trade-offs between accuracy, privacy resilience (activation invertibility), computation/energy, and communication remain an active frontier.
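As a concrete instance of the cut-layer selection problem listed above, one round's latency can be modeled as client forward compute, activation upload, and server-side compute, and minimized over candidate cuts. All costs below (per-layer FLOPs, activation sizes, device speeds, link rate) are hypothetical; real systems would measure or learn them online.

```python
# Choose the cut layer minimizing one round's latency under a simple
# additive cost model: client compute + activation upload + server compute.
def best_cut(layer_cost, act_bytes, client_speed, server_speed, bandwidth):
    def latency(cut):
        client = sum(layer_cost[:cut]) / client_speed   # layers run on device
        upload = act_bytes[cut - 1] / bandwidth          # smashed-data transfer
        server = sum(layer_cost[cut:]) / server_speed    # remaining layers
        return client + upload + server
    cuts = range(1, len(layer_cost))                     # at least one layer each side
    return min(cuts, key=latency)

layer_cost = [10, 20, 40, 80, 80]   # per-layer compute cost (arbitrary units)
act_bytes  = [64, 32, 16, 8, 4]     # activation size after each layer
cut = best_cut(layer_cost, act_bytes,
               client_speed=1.0, server_speed=10.0, bandwidth=1.0)
```

Shallower cuts off-load more compute to the fast server but ship larger activations (and, per Section 2, leak more reconstructable information), so the optimum shifts with bandwidth, device speed, and the privacy constraint.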
FSL represents a convergent paradigm that delivers substantial efficiency, scalability, and privacy gains over both FL and SL, with rich ongoing development spanning architecture, privacy, resource-aware deployment, and theoretical guarantees (Ndeko et al., 2024, Yang et al., 2024, Asif et al., 5 Jan 2026, Penedo et al., 10 Jul 2025).