Sequential Federated Learning (SFL) Overview
- Sequential Federated Learning is a paradigm where a global model is updated sequentially by distributed clients with heterogeneous data.
- It employs multi-teacher knowledge distillation and complementary teacher selection to counteract catastrophic forgetting and knowledge dilution.
- SFL balances communication and computation trade-offs, enabling privacy-preserving, on-device continual learning and effective collaborative model training.
Sequential Federated Learning (SFL) is a federated learning paradigm in which a global model is trained in a strictly sequential manner across a set of distributed clients. Unlike the synchronous aggregation protocol of conventional Federated Averaging, SFL updates the global model one client at a time, passing the model through a sequence of clients, each with locally heterogeneous data. This design confers distinctive convergence properties under data heterogeneity but introduces challenges such as catastrophic forgetting, knowledge dilution, and nontrivial communication-computation trade-offs. Recent advances, notably SFedKD (Xu et al., 11 Jul 2025), have addressed these issues by integrating discrepancy-aware multi-teacher knowledge distillation and complementary-based teacher selection mechanisms. Below is a comprehensive treatment of the theoretical underpinnings, algorithmic structures, practical trade-offs, and empirical outcomes of contemporary SFL approaches.
1. Fundamental Principles and Model Definition
In the SFL framework, let $\mathcal{C} = \{1, \dots, N\}$ denote the set of clients. The global model at round $r$ is initialized and then transmitted sequentially through an ordered subset or permutation $\pi$ of the clients. At client $\pi(k)$, the model is locally updated using private data $\mathcal{D}_{\pi(k)}$, then passed to the next client. In each round, the global model sees local updates serially rather than the simultaneous updates of Federated Averaging.
Mathematically, if $w^{r}$ is the global model at the start of round $r$, the update sequence proceeds as:

$$w^{r}_{0} = w^{r}, \qquad w^{r}_{k} = \mathrm{ClientUpdate}\!\left(w^{r}_{k-1}, \mathcal{D}_{\pi(k)}\right) \;\; (k = 1, \dots, M), \qquad w^{r+1} = w^{r}_{M},$$

where $\mathrm{ClientUpdate}$ can involve SGD, local fine-tuning, or a more sophisticated learner. This approach is fundamentally sensitive to client ordering and heterogeneity, often resulting in catastrophic forgetting as local updates can overwrite previously acquired knowledge.
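As an illustration, the serial update loop can be sketched as follows. This is a minimal toy, not SFedKD's implementation: clients solve least-squares problems, heterogeneity is modeled only by differing input scales, and the names `client_update` and `sfl_round` are ours.

```python
import numpy as np

def client_update(w, data, lr=0.05):
    """One local SGD step on a least-squares objective (toy stand-in)."""
    X, y = data
    grad = X.T @ (X @ w - y) / len(y)   # gradient of 0.5*||Xw - y||^2 / n
    return w - lr * grad

def sfl_round(w, clients, rng):
    """One SFL round: the model visits clients in a random order, serially."""
    for k in rng.permutation(len(clients)):
        w = client_update(w, clients[k])
    return w

rng = np.random.default_rng(0)
d = 5
w_true = rng.normal(size=d)
# Heterogeneous clients: each draws inputs at a different scale.
clients = []
for i in range(4):
    X = rng.normal(size=(50, d)) * (1.0 + 0.5 * i)
    clients.append((X, X @ w_true + 0.01 * rng.normal(size=50)))

w = np.zeros(d)
for _ in range(30):
    w = sfl_round(w, clients, rng)
print(np.linalg.norm(w - w_true))  # error shrinks toward the noise floor
```

Note that each round touches clients one at a time, so a skewed client visited late can pull the model away from what earlier clients taught it, which is exactly the forgetting behavior discussed next.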
2. Catastrophic Forgetting, Knowledge Dilution, and SFL Limitations
Catastrophic forgetting in SFL arises since each client update is performed in isolation with non-i.i.d. data, causing recently acquired knowledge to dominate and potentially erase or degrade the global model’s performance on previously seen distributions (Xu et al., 11 Jul 2025). Knowledge dilution further manifests when the model cycles through a sequence of clients with highly skewed local distributions, leading to non-optimal convergence and subpar generalization.
Table 1: Core Problems in SFL
| Limitation | Cause | Manifestation |
|---|---|---|
| Catastrophic Forgetting | Local update overwrites | Loss of initial knowledge |
| Knowledge Dilution | Non-i.i.d. data, order sensitivity | Weak generalization |
| Communication Burden | Full-model transmission | Scalability bottlenecks |
These limitations motivate the integration of knowledge distillation and redundancy-aware teacher selection to retain and complement knowledge across client updates.
3. Knowledge Distillation in SFL: Multi-Teacher and Discrepancy-Aware Approaches
Recent work, notably SFedKD (Xu et al., 11 Jul 2025), extends single-teacher Decoupled Knowledge Distillation protocols to incorporate multiple teachers and adapt the knowledge transfer to the local class-distribution discrepancies. At each client update, rather than relying solely on the immediately preceding global model as a teacher, the current client leverages a set of teacher models from previous rounds or selected from a pool, where each teacher’s influence is modulated by the similarity (or discrepancy) between its class distribution and that of the local client.
Let $\mathcal{T} = \{T_1, \dots, T_K\}$ denote the selected set of teacher models, each with class-distribution vector $D_k \in \mathbb{R}^{C}$, $C$ being the number of classes. The knowledge distillation loss for the student is formed by weighting the soft logits from each teacher with a two-part weighting:
- Target-class weight: Favors teachers that overlap in class-support with the current client.
- Non-target-class weight: Downweighted if the teacher’s class support is disjoint or highly unbalanced relative to the student’s local data.
Distinct weighting schemes are parameterized by the statistical distance between the class distributions (e.g., KL-divergence, Total Variation). This enables the aggregation of knowledge from a diverse set of teachers without disproportionately introducing bias from irrelevant or redundant sources.
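A minimal sketch of such discrepancy-aware weighting, assuming a softmax over negated KL divergences between teacher and student class distributions (an illustrative parameterization, not SFedKD's exact formula; `teacher_weights` and `tau` are our names):

```python
import numpy as np

def kl(p, q, eps=1e-8):
    """KL divergence between two class-distribution vectors (smoothed)."""
    p, q = p + eps, q + eps
    p, q = p / p.sum(), q / q.sum()
    return float(np.sum(p * np.log(p / q)))

def teacher_weights(student_dist, teacher_dists, tau=1.0):
    """Softmax weights: teachers whose class distribution is closer to the
    student's local distribution receive larger distillation weight."""
    dist = np.array([kl(student_dist, t) for t in teacher_dists])
    w = np.exp(-dist / tau)
    return w / w.sum()

student = np.array([0.7, 0.2, 0.1, 0.0])          # skewed local data
teachers = [np.array([0.6, 0.3, 0.1, 0.0]),       # similar support
            np.array([0.0, 0.0, 0.5, 0.5]),       # disjoint support
            np.array([0.25, 0.25, 0.25, 0.25])]   # uniform
w = teacher_weights(student, teachers)
print(np.round(w, 3))  # the similar-support teacher dominates
```

The disjoint-support teacher receives near-zero weight, matching the non-target-class downweighting described above.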
4. Complementary-Based Teacher Selection via Maximum Coverage
To prevent knowledge dilution and redundant knowledge transfer, SFedKD formalizes multi-teacher selection as a variant of the maximum coverage problem:
- Objective: Select $K$ out of $|\mathcal{P}|$ candidate teachers such that the aggregate class distribution $\sum_{\pi \in \mathcal{T}} D_{\pi}$ is as close as possible to the uniform class distribution $U$.
The optimization is:

$$\mathcal{T}^{*} = \operatorname*{arg\,min}_{\mathcal{T} \subseteq \mathcal{P},\, |\mathcal{T}| = K} \; d\!\left(\sum_{\pi \in \mathcal{T}} D_{\pi},\; U\right),$$

with $d(\cdot, \cdot)$ being an appropriate metric (e.g., $\ell_2$ norm, KL divergence).
A standard greedy algorithm is used: sequentially select the teacher whose inclusion most reduces $d\big(\sum_{\pi \in \mathcal{T}} D_{\pi}, U\big)$, until $K$ teachers are chosen. The greedy selection provides a $(1 - 1/e)$-approximation when the set function is submodular, as is the case for classical maximum coverage in the one-hot (single-class) scenario.
Greedy Teacher Selection (runnable Python sketch, taking $d$ as the $\ell_2$ distance)

```python
import numpy as np

def greedy_select(dists, K):
    """Greedily pick K teachers whose summed (normalized) class
    distribution is closest to uniform, with d as the l2 norm."""
    C = len(dists[0])
    uniform = np.full(C, 1.0 / C)
    selected, agg = [], np.zeros(C)
    def gap(p):                      # distance to uniform if p were added
        s = agg + dists[p]
        return np.linalg.norm(s / s.sum() - uniform)
    for _ in range(K):
        best = min(set(range(len(dists))) - set(selected), key=gap)
        selected.append(best)
        agg = agg + dists[best]
    return selected
```
This method minimizes redundant knowledge transfer and achieves broad class coverage with a minimal number of teachers, optimizing both communication and computation.
5. Trade-Offs: Coverage vs. Cost and Computational Complexity
Selecting complementary teachers reduces server-to-client transmission (since teacher logits and models are fetched only from the selected teachers) and lowers on-device computation due to fewer required forward passes for distillation. Each greedy iteration scans the remaining candidates, requiring $O(|\mathcal{P}| \cdot C)$ computations per step and yielding $O(K |\mathcal{P}| C)$ total complexity per round, with storage scaling as $O(|\mathcal{P}| C)$ for the candidate class-distribution vectors.
Greedy selection’s approximation quality depends on the submodularity of the set function; in practice, near-optimal class coverage is achievable with substantially lower resource costs than naively using all available past models as teachers.
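As a back-of-envelope illustration (the counts are schematic, with each distance evaluation taken to cost $O(C)$ operations and example sizes $|\mathcal{P}| = 50$, $K = 5$, $C = 10$ chosen by us), greedy selection's operation count grows linearly in the candidate pool, while exhaustive subset search grows combinatorially:

```python
from math import comb

P, K, C = 50, 5, 10                # candidate pool, teachers picked, classes
greedy_ops = K * P * C             # greedy: K rounds x P candidates x O(C) each
brute_ops = comb(P, K) * C         # exhaustive search over all size-K subsets
print(greedy_ops, brute_ops)       # 2500 vs 21187600
```

Even at this modest scale the exhaustive search is four orders of magnitude more expensive, which is why the greedy heuristic is the practical choice despite its approximation guarantee holding only in special cases.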
Table 2: Complexity and Approximation
| Method | Time Complexity | Communication | Approximation |
|---|---|---|---|
| Greedy Selection | $O(K \lvert \mathcal{P} \rvert C)$ | $K$ selected teacher models per round | $(1 - 1/e)$-optimal (in special cases) |
A plausible implication is that further gains could be realized by integrating communication-aware selection heuristics, especially in bandwidth-constrained deployments.
6. Empirical Performance and SFL Application Domains
Extensive experiments demonstrate that discrepancy-aware multi-teacher distillation with complementary-based teacher selection in SFL (as in SFedKD (Xu et al., 11 Jul 2025)) alleviates catastrophic forgetting and achieves superior accuracy in the presence of highly heterogeneous data. Representative results show:
- Significant gains over state-of-the-art FL methods in benchmark settings.
- Near-uniform knowledge coverage with a fraction of teacher models.
- Effective mitigation of the trade-off between distillation efficacy and system overhead.
Application domains include privacy-preserving medical learning, on-device continual learning, and regulatory-constrained multi-institution model sharing.
7. Connections to Related Teacher Selection Paradigms
Teacher selection and knowledge distillation in SFL are closely related to broader multi-teacher selection strategies in distributed and reinforcement learning contexts. For example, information-theoretic teacher selection methods such as GRACE (Panigrahi et al., 4 Nov 2025) quantify alignment and diversity among candidates, offering scalable heuristics for distillation meta-selection. In reinforcement learning from human feedback, complementary teacher selection is operationalized through Bayesian/POMDP planning, leveraging teacher diversity as measured by KL divergence of feedback likelihoods or posterior entropy reduction (Freedman et al., 2023).
A plausible implication is that future SFL designs could integrate gradient-based or information-theoretic teacher evaluation as practical surrogates for class-distribution-based complementary metrics.
In summary, Sequential Federated Learning advances communication-efficient, privacy-preserving collaborative model training under heterogeneous data, with state-of-the-art protocols emphasizing discrepancy-aware, complementary multi-teacher distillation and submodular maximization strategies for teacher selection. Ongoing research focuses on further improving scalability, robustness to adversarial or drifting distributions, and adaptive teacher selection frameworks.