SuperSFL: Heterogeneous Federated Split Learning
- SuperSFL is a federated split learning framework that uses weight-sharing super-networks and resource-aware subnetwork generation to handle device heterogeneity.
- It introduces a Three-Phase Gradient Fusion mechanism that optimizes model updates by combining client and server contributions under varying network conditions.
- Fault-tolerant aggregation and performance-weighted updates ensure robust training and efficiency in dynamic, resource-diverse edge environments.
SuperSFL refers to “Resource-Heterogeneous Federated Split Learning with Weight-Sharing Super-Networks,” a framework that addresses the challenge of efficient, robust training in federated edge environments composed of clients with highly diverse computational and network capabilities. Building on existing SplitFed Learning paradigms, SuperSFL introduces a weight-sharing super-network architecture, a Three-Phase Gradient Fusion (TPGF) optimization mechanism, and fault-tolerant aggregation for resilient, high-throughput distributed learning (Asif et al., 5 Jan 2026).
1. Background: SplitFed Learning and Device Heterogeneity
SplitFed Learning (SFL) integrates the architectural principles of Federated Learning (FL) and Split Learning (SL). In SFL, the global model is partitioned at a designated "cut" point such that each client operates a shallow encoder (hosting the initial model layers), while a centralized server hosts the deeper layers (decoder). Clients execute forward passes up to the cut point, transmit "smashed data" activations to the server, which completes the forward and backward computation, and returns the cut-layer gradients. This structure reduces per-client computation and bandwidth demands compared to FL and SL individually.
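The smashed-data exchange can be sketched numerically. This is a minimal toy sketch, not the paper's ViT setup: the dimensions, the single linear layer per side, and the softmax cross-entropy head are all illustrative assumptions.

```python
import numpy as np

rng = np.random.default_rng(0)
W_client = rng.normal(size=(8, 4))   # client-side encoder (layers before the cut)
W_server = rng.normal(size=(4, 2))   # server-side head (layers after the cut)

def client_forward(x):
    # Client runs layers up to the cut point and emits "smashed data".
    return np.maximum(x @ W_client, 0.0)

def server_step(smashed, y):
    # Server completes the forward pass and computes softmax cross-entropy,
    # then returns the gradient of the loss w.r.t. the smashed data.
    logits = smashed @ W_server
    probs = np.exp(logits - logits.max(axis=1, keepdims=True))
    probs /= probs.sum(axis=1, keepdims=True)
    loss = -np.log(probs[np.arange(len(y)), y]).mean()
    dlogits = (probs - np.eye(2)[y]) / len(y)
    grad_cut = dlogits @ W_server.T   # cut-layer gradient sent back to client
    return loss, grad_cut

x = rng.normal(size=(16, 8))
y = rng.integers(0, 2, size=16)
smashed = client_forward(x)               # client -> server: activations only
loss, grad_cut = server_step(smashed, y)  # server -> client: cut gradients
```

The client would then back-propagate `grad_cut` through its own layers; note that neither raw data nor full model weights cross the boundary, only activations and their gradients.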
A major limitation in practical deployments is device heterogeneity, where client resources (CPU/GPU, memory, and network latency) vary significantly. Prior SFL approaches assume a uniform client split depth, which is either computationally excessive for weak devices or suboptimal for strong ones, and further lack fault tolerance, halting training during client or server disruption (Asif et al., 5 Jan 2026).
2. Weight-Sharing Super-Network Architecture and Resource-Aware Subnetwork Generation
SuperSFL maintains the model centrally as a "super-network" W = (W_1, ..., W_L), where L is the full model depth. Each client i is dynamically allocated a structurally compatible subnetwork defined by a contiguous prefix of layers (W_1, ..., W_{d_i}), where the depth d_i is determined by the client's resource profile.
At initialization, each client reports its available memory m_i and communication latency ℓ_i. The server computes the subnetwork depth d_i as an affine function of these quantities: a layers-per-GB coefficient scales m_i, a per-ms penalty discounts ℓ_i, and the result is clipped to the valid range [1, L]. This direct mapping produces resource-aware splits that optimize per-client throughput. Each client receives the model slice W_{1:d_i} and participates in training with minimal redundancy or underutilization (Asif et al., 5 Jan 2026).
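A minimal sketch of such a resource-to-depth mapping follows. The coefficients (layers per GB, latency penalty per ms) and the 12-layer full depth are illustrative placeholders; the paper's actual constants are not reproduced here.

```python
def subnetwork_depth(mem_gb, latency_ms, full_depth=12,
                     layers_per_gb=2.0, latency_penalty=0.01, min_depth=1):
    # Affine resource-to-depth rule: more memory buys layers, higher
    # latency removes them; the result is clipped to a valid prefix length.
    raw = layers_per_gb * mem_gb - latency_penalty * latency_ms
    return max(min_depth, min(full_depth, int(raw)))

# Stronger clients receive deeper contiguous prefixes of the super-network.
strong = subnetwork_depth(mem_gb=8.0, latency_ms=20)    # clipped to full_depth
weak = subnetwork_depth(mem_gb=1.0, latency_ms=200)     # clipped to min_depth
```

Because every subnetwork is a prefix of the same super-network, weight sharing is automatic: clients of different depths still train the same underlying parameters for the layers they hold.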
3. Three-Phase Gradient Fusion (TPGF): Optimization under Heterogeneous Splits
TPGF mediates the update of each client's encoder by fusing two supervision signals, a local classifier loss on the client and the server-side loss, across the client–server boundary:
- Phase 1: Local Supervision
- The client forward-passes its batch through the encoder slice, producing the cut-layer activations (smashed data).
- A local classifier predicts labels from these activations and computes the local loss.
- The client updates the classifier parameters and computes and clips the local encoder gradient g_local.
- Phase 2: Server Supervision
- The client sends the smashed data to the server.
- The server completes the forward pass, computes the server-side loss, updates its own parameters, and returns the cut-layer gradient.
- The client back-propagates this gradient through its encoder to obtain g_server.
- Phase 3: Loss-Weighted Gradient Fusion
- Fusion weights λ_local and λ_server are assigned based on both subnetwork depth and the two pathway losses.
- Fused gradient: g_fused = λ_local · g_local + λ_server · g_server, where g_local and g_server are the encoder gradients from Phases 1 and 2.
- Encoder parameters are updated via the fused gradient.
Each iteration executes these phases, permitting fully parallel client operation and preserving communication efficiency. No added protocol overhead is introduced beyond smashed data and its gradient (Asif et al., 5 Jan 2026).
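The fusion step can be sketched as a convex combination of the two pathway gradients. The inverse-loss weighting below is a simplified assumption (the paper's weights also factor in subnetwork depth):

```python
import numpy as np

def fuse_gradients(g_local, g_server, loss_local, loss_server, eps=1e-8):
    # Weight each supervision pathway by the inverse of its loss, so the
    # better-performing signal dominates; weights are normalized to sum to 1.
    w_local = 1.0 / (loss_local + eps)
    w_server = 1.0 / (loss_server + eps)
    total = w_local + w_server
    return (w_local * g_local + w_server * g_server) / total

g_local = np.array([1.0, 0.0])
g_server = np.array([0.0, 1.0])
# Server loss is lower here, so the fused gradient leans toward g_server.
fused = fuse_gradients(g_local, g_server, loss_local=1.0, loss_server=0.25)
```

Under this form, equal losses reduce to an equal-fusion average, which is exactly the ablation baseline the paper reports as inferior to full TPGF.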
4. Fault Tolerance and Collaborative Aggregation
To maintain progress during network outages or server unavailability, each client maintains a lightweight local classifier. If the server does not respond to a transmitted batch of smashed data within 5 seconds, the client enters fallback mode and continues with local-only updates as in Phase 1. Upon server re-availability, the client resynchronizes with the global state.
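The fallback behavior can be sketched as follows. The class, counters, and TimeoutError-based transport stubs are illustrative scaffolding, not the paper's implementation:

```python
class FallbackClient:
    def __init__(self):
        self.mode = "collaborative"
        self.local_steps = 0
        self.server_steps = 0

    def step(self, batch, send_to_server, timeout_s=5.0):
        # Phase 1 (local supervision) always runs, so progress never stalls.
        self.local_steps += 1
        try:
            _grad = send_to_server(batch, timeout=timeout_s)
        except TimeoutError:
            self.mode = "fallback"       # continue local-only training
            return
        self.mode = "collaborative"      # server reachable: resync and fuse
        self.server_steps += 1

def dead_server(batch, timeout):
    raise TimeoutError                   # simulated outage

def live_server(batch, timeout):
    return 0.0                           # stands in for the cut-layer gradient

c = FallbackClient()
c.step("batch-1", dead_server)           # outage: local-only update
c.step("batch-2", live_server)           # recovery: collaborative update
```

The key property is that local progress accumulates in both branches; only the server-supervised portion of the update is skipped during an outage.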
Model aggregation weights each client's contribution by a composite of its subnetwork depth and its recent performance. Layer-wise averaging of the shared prefix layers is combined with a server-consistency term in a convex objective.
Classifier parameters remain local, never subjected to global averaging (Asif et al., 5 Jan 2026).
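A sketch of the composite weighting for one shared layer follows. The equal-alpha mix of normalized depth and accuracy is an assumed form standing in for the paper's exact weighting, and only shared prefix layers would be averaged this way (classifiers stay local, per the text above):

```python
import numpy as np

def aggregate_layer(updates, depths, accuracies, alpha=0.5):
    # Composite client weights: alpha mixes structure-awareness (depth)
    # with performance-awareness (accuracy); weights sum to 1.
    depths = np.asarray(depths, dtype=float)
    accs = np.asarray(accuracies, dtype=float)
    w = alpha * depths / depths.sum() + (1.0 - alpha) * accs / accs.sum()
    # Layer-wise weighted average over client copies of this shared layer.
    return np.tensordot(w, np.stack(updates), axes=1)

updates = [np.full(3, v) for v in (1.0, 2.0, 3.0)]
uniform = aggregate_layer(updates, depths=[4, 4, 4], accuracies=[0.8, 0.8, 0.8])
skewed = aggregate_layer(updates, depths=[1, 1, 10], accuracies=[0.1, 0.1, 0.9])
```

With uniform resources and performance this collapses to plain FedAvg-style averaging; skewed depths and accuracies pull the aggregate toward the deeper, better-performing clients.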
5. Empirical Evaluation: Convergence, Efficiency, and Robustness
Experiments use a Vision Transformer (ViT-16) backbone on CIFAR-10 and CIFAR-100, under heterogeneous distributions of client memory (in GB) and network latency (in ms), with non-IID Dirichlet partitioning. In the tables below, SSFL denotes SuperSFL. Key results are summarized:
| Dataset | #Clients | Target Acc. | Rounds (SFL/DFL/SSFL) | Comm. (MB) (SFL/DFL/SSFL) | Time (s) (SFL/DFL/SSFL) |
|---|---|---|---|---|---|
| CIFAR-10 | 50 | 70% | 11 / 9 / 5 | 9075 / 2305 / 466 | 6127 / 2650 / 595 |
| CIFAR-10 | 100 | 75% | 19 / 16 / 12 | 21463 / 15472 / 939 | 12168 / 14368 / 1010 |
| CIFAR-100 | 50 | 75% | 35 / 27 / 15 | 28938 / 7909 / 7194 | 21284 / 9796 / 8766 |
| CIFAR-100 | 100 | 80% | 100 / 34 / 22 | 165358 / 13638 / 9719 | 114955 / 15328 / 8926 |
SuperSFL achieves 2–5× faster convergence, up to 20× reduced communication cost, and up to 13× shorter training time compared to baseline SFL (Asif et al., 5 Jan 2026).
Energy and carbon-footprint metrics show that SuperSFL reaches higher accuracy with lower power per accuracy point:
| Dataset | Clients | Method | Acc.(%) | Avg. Power (W) | Power/Acc (W/%) | CO₂ (g) |
|---|---|---|---|---|---|---|
| CIFAR-10 | 50 | SFL | 78.84 | 1165 | 14.78 | 466.19 |
| CIFAR-10 | 50 | DFL | 70.15 | 362 | 5.17 | 144.88 |
| CIFAR-10 | 50 | SSFL | 96.93 | 493 | 5.09 | 197.17 |
| CIFAR-100 | 100 | SSFL | 87.48 | 1539 | 17.60 | 615.52 |
Ablation studies demonstrate that full TPGF is necessary for optimal convergence (96.93% accuracy vs. 85.89% with equal fusion). Server-availability ablations show that SuperSFL retains >89% accuracy with only 10% server-supervised rounds and converges (86.36%) even with no server involvement (Asif et al., 5 Jan 2026).
6. Related Methodologies and Significance
SuperSFL extends previous SplitFed and federated optimization methods by embedding explicit architectural support for resource heterogeneity and communication disruption. The weight-sharing super-network and TPGF mechanisms generalize model personalization while preserving strict structural compatibility across clients. The composite aggregation scheme integrates both structure- and performance-awareness, which is absent in classical FedAvg and SplitFed paradigms.
A plausible implication is that SuperSFL reduces barriers to real-world federated deployments in IoT and mobile environments, particularly for computer vision tasks on resource-diverse clients. Its fallback mode and performance-weighted aggregation offer robustness against network volatility and client-side interruptions, supporting prolonged, stable training under non-ideal real-world conditions (Asif et al., 5 Jan 2026).
7. Summary and Outlook
SuperSFL provides a principled approach to federated split learning that systematically addresses client heterogeneity, network unreliability, and efficiency constraints. Its adoption of a weight-sharing super-network with resource-dependent subnetwork allocation, coupled with Three-Phase Gradient Fusion and client-server collaborative aggregation, yields state-of-the-art performance and robustness benchmarks. These advances position SuperSFL as a scalable and practical foundation for federated learning applications in highly variable edge computing ecosystems (Asif et al., 5 Jan 2026).