SFedKD: Sequential Federated Learning with Discrepancy-Aware Multi-Teacher Knowledge Distillation (2507.08508v1)

Published 11 Jul 2025 in cs.LG

Abstract: Federated Learning (FL) is a distributed machine learning paradigm which coordinates multiple clients to collaboratively train a global model via a central server. Sequential Federated Learning (SFL) is a newly-emerging FL training framework where the global model is trained in a sequential manner across clients. Since SFL can provide strong convergence guarantees under data heterogeneity, it has attracted significant research attention in recent years. However, experiments show that SFL suffers from severe catastrophic forgetting in heterogeneous environments, meaning that the model tends to forget knowledge learned from previous clients. To address this issue, we propose an SFL framework with discrepancy-aware multi-teacher knowledge distillation, called SFedKD, which selects multiple models from the previous round to guide the current round of training. In SFedKD, we extend the single-teacher Decoupled Knowledge Distillation approach to our multi-teacher setting and assign distinct weights to teachers' target-class and non-target-class knowledge based on the class distributional discrepancy between teacher and student data. Through this fine-grained weighting strategy, SFedKD can enhance model training efficacy while mitigating catastrophic forgetting. Additionally, to prevent knowledge dilution, we eliminate redundant teachers for the knowledge distillation and formalize it as a variant of the maximum coverage problem. Based on the greedy strategy, we design a complementary-based teacher selection mechanism to ensure that the selected teachers achieve comprehensive knowledge space coverage while reducing communication and computational costs. Extensive experiments show that SFedKD effectively overcomes catastrophic forgetting in SFL and outperforms state-of-the-art FL methods.


Summary

  • The paper introduces SFedKD, a novel sequential federated learning approach that mitigates catastrophic forgetting by leveraging discrepancy-aware multi-teacher knowledge distillation.
  • It employs a teacher selection mechanism based on class distribution discrepancies, extending decoupled knowledge distillation to a multi-teacher framework.
  • Experiments on diverse datasets demonstrate that SFedKD outperforms current FL methods in accuracy and stability under high data heterogeneity.

SFedKD: Sequential Federated Learning with Discrepancy-Aware Multi-Teacher Knowledge Distillation

Introduction

This paper addresses the challenge of catastrophic forgetting in Sequential Federated Learning (SFL) caused by data heterogeneity across clients. Traditional Federated Learning (FL) trains client models in parallel and aggregates them (e.g., by averaging) into a global model, whereas in SFL the global model is trained sequentially across clients, which strengthens convergence guarantees but often causes the model to forget knowledge learned from earlier clients. SFedKD counteracts this forgetting with discrepancy-aware multi-teacher knowledge distillation: it selects multiple teacher models from the previous round that jointly cover the knowledge space, weights each teacher's contribution according to the class distribution discrepancy between its local data and the current client's data, and extends single-teacher Decoupled Knowledge Distillation (DKD) to the multi-teacher setting. Additionally, the paper proposes a complementary-based teacher selection mechanism that optimizes teacher choice for comprehensive coverage and efficiency.

Problem and Framework

The proposed framework operates over several rounds. In each round, a sequence of clients is sampled, and the server applies a selection mechanism to determine which teacher models from the previous round will guide the current training process. The teacher models are chosen based on how well they complement each other to provide comprehensive knowledge across classes. This framework trains models sequentially and utilizes a discrepancy-aware weighting scheme for knowledge distillation. The overarching goal is to balance new learning and retention of prior knowledge, reducing catastrophic forgetting.

Figure 1: The overview of our proposed SFedKD, illustrating the sequential training and teacher selection process with discrepancy-aware multi-teacher knowledge distillation.
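The round structure described above can be summarized as a short sketch. The helpers below (select_teachers, local_train_with_kd) are hypothetical placeholders for the mechanisms detailed in the next two sections, not the authors' implementation.

```python
import random

def run_round(global_model, clients, prev_round_models, num_teachers,
              select_teachers, local_train_with_kd):
    """One SFedKD-style round (sketch): the global model is passed sequentially
    through a sampled client order, and at each hop it is trained locally while
    distilling from teachers chosen out of the previous round's client models."""
    order = random.sample(clients, k=len(clients))   # sample this round's client sequence
    round_models = []
    for client in order:
        # complementary teachers from the previous round guide this client's training
        teachers = select_teachers(prev_round_models, client, num_teachers)
        global_model = local_train_with_kd(global_model, client, teachers)
        round_models.append(global_model)            # kept as a candidate teacher for the next round
    return global_model, round_models
```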

Discrepancy-Aware Multi-Teacher KD

The paper extends DKD to accommodate multiple teacher models. SFedKD's novelty lies in the assignment of distinct weights to teachers' target and non-target class knowledge, allowing precise alignment of knowledge distillation to overcome catastrophic forgetting. Teacher model contributions during distillation are weighted based on the discrepancy between their local class distributions and the current client’s dataset distribution. This fine-grained approach ensures personalized knowledge assimilation while keeping the model consistently effective across diverse classes.
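A minimal PyTorch sketch of such a weighted multi-teacher decoupled loss is shown below. It assumes the per-teacher target-class and non-target-class weights (tckd_weights, nckd_weights) have already been derived from the class-distribution discrepancy; the paper's exact weighting formula is not reproduced here.

```python
import torch
import torch.nn.functional as F

def dkd_loss(student_logits, teacher_logits, target, alpha, beta, T=4.0):
    """Decoupled KD against one teacher: target-class term (TCKD) weighted by
    `alpha`, non-target-class term (NCKD) weighted by `beta`."""
    mask = F.one_hot(target, student_logits.size(1)).float()     # 1.0 at the true class

    p_s = F.softmax(student_logits / T, dim=1)
    p_t = F.softmax(teacher_logits / T, dim=1)

    # TCKD: KL over the binary split (target class vs. all non-target classes)
    s_tgt = (p_s * mask).sum(dim=1, keepdim=True)
    t_tgt = (p_t * mask).sum(dim=1, keepdim=True)
    b_s = torch.cat([s_tgt, 1.0 - s_tgt], dim=1).clamp_min(1e-8)
    b_t = torch.cat([t_tgt, 1.0 - t_tgt], dim=1)
    tckd = F.kl_div(b_s.log(), b_t, reduction="batchmean") * (T ** 2)

    # NCKD: KL over the non-target classes only (target logit masked out)
    log_s_nt = F.log_softmax(student_logits / T - 1000.0 * mask, dim=1)
    p_t_nt = F.softmax(teacher_logits / T - 1000.0 * mask, dim=1)
    nckd = F.kl_div(log_s_nt, p_t_nt, reduction="batchmean") * (T ** 2)

    return alpha * tckd + beta * nckd


def multi_teacher_dkd(student_logits, teacher_logits_list, target,
                      tckd_weights, nckd_weights):
    """Sum the decoupled terms over all selected teachers, with per-teacher
    target/non-target weights assumed to come from the discrepancy scores."""
    return sum(
        dkd_loss(student_logits, t_logits, target, a, b)
        for t_logits, a, b in zip(teacher_logits_list, tckd_weights, nckd_weights)
    )
```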

Teacher Selection Mechanism

The selection mechanism identifies redundant teachers—those whose knowledge overlaps excessively. By maximizing knowledge space coverage—the complementarity of teacher selections—SFedKD ensures diverse, yet relevant, distillation guidance. This approach is framed as a variant of the maximum coverage problem, solved via a greedy algorithm. The selected teachers provide diverse class coverage and avoid diluting distilled information, leading to efficient training with reduced computational overhead.
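A minimal sketch of such a greedy selection is given below, assuming each candidate teacher reports the set of classes present in its local data; the paper's coverage objective may differ in detail.

```python
def select_teachers(candidate_classes, k):
    """Greedily pick up to `k` teachers whose local class sets jointly cover as
    many classes as possible (greedy heuristic for the maximum coverage variant).

    candidate_classes: dict mapping teacher id -> set of class labels seen locally.
    Returns the list of selected teacher ids.
    """
    covered, selected = set(), []
    remaining = dict(candidate_classes)
    while remaining and len(selected) < k:
        # pick the teacher that adds the most still-uncovered classes
        best = max(remaining, key=lambda t: len(remaining[t] - covered))
        if not remaining[best] - covered:
            break                      # every remaining teacher is redundant
        covered |= remaining.pop(best)
        selected.append(best)
    return selected

# Example: with 3 slots, teacher 2 is skipped as redundant once 0 and 1 are chosen.
print(select_teachers({0: {0, 1, 2}, 1: {2, 3}, 2: {1, 2}, 3: {4}}, k=3))  # -> [0, 1, 3]
```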

Experimental Findings

Experiments were conducted across five datasets (Fashion-MNIST, CIFAR-10, CINIC-10, CIFAR-100, and HAM10000). SFedKD consistently outperformed state-of-the-art FL methods. The discrepancy-aware multi-teacher KD effectively mitigated catastrophic forgetting, showing superior accuracy, especially under conditions of high data heterogeneity.

Figure 2: Class-wise accuracy demonstrating SFedKD's efficacy in maintaining consistent performance across diverse classes.

Figure 3: Impact of the hyperparameter γ, exploring SFedKD's sensitivity to the weighting balance.

Conclusion

SFedKD presents a robust approach to mitigating catastrophic forgetting in sequential federated learning through innovative use of multi-teacher knowledge distillation. By leveraging class distribution discrepancies for fine-grained teacher guidance, and optimizing teacher selection for knowledge diversity, SFedKD significantly enhances model performance stability and efficacy across heterogeneous client datasets. The promising results indicate its potential for more resilient FL systems that maintain past knowledge while adapting to new client environments.

References

  • McMahan, B., et al. Communication-efficient learning of deep networks from decentralized data.
  • Zhao, B., et al. Decoupled knowledge distillation.
  • Kwon, K., et al. Adaptive knowledge distillation based on entropy.
  • Li, T., et al. Federated optimization in heterogeneous networks.
