
FedKD: Efficient Federated Distillation

Updated 25 November 2025
  • FedKD is a federated learning framework that leverages local teacher models and a shared student model to enable adaptive mutual distillation for efficient learning.
  • It integrates dynamic SVD-based gradient compression to significantly reduce the uplink overhead, achieving up to 10× lower communication costs compared to conventional methods.
  • FedKD maintains robust model performance on benchmarks like MIND and SMM4H while ensuring privacy and scalability in cross-device federated settings.

The FedKD framework is a knowledge-distillation-based federated learning paradigm designed to address the prohibitive communication overhead characteristic of standard parameter-averaging protocols in cross-device settings. FedKD couples an adaptive mutual distillation process between client-local teacher models and a shared student model with dynamic SVD-based gradient compression to maximize communication efficiency while preserving privacy and accuracy. The following sections delineate the foundational concepts, operational mechanisms, theoretical underpinnings, empirical results, comparative context, and practical implications of FedKD as established by Wu et al. (2021) and contextualized by subsequent surveys (Li et al., 2024).

1. Framework Overview and Problem Formulation

FedKD addresses federated learning with $N$ clients, each holding a private dataset $D_i$ (of size $n_i$), and a central server coordinating $R$ communication rounds. Each client $i$ maintains two models: a large, non-communicated teacher $T_i$ (parameters $\Theta_i^t$), optimized only on $D_i$, and a smaller, shared student $S$ (parameters $\Theta^s$), which is the only model exchanged and aggregated globally. The protocol proceeds as follows in each round $r$:

  1. The server broadcasts the current student parameters $\Theta^s$ to the selected clients.
  2. Each client initializes its local student, trains teacher and student on $D_i$ via mutual distillation, and computes the local student gradient $g_i$.
  3. Clients compress $g_i$ (via truncated SVD) and upload the factors to the server.
  4. The server aggregates $g = \sum_i n_i g_i / \sum_i n_i$ and updates $\Theta^s \leftarrow \Theta^s - \eta g$.
  5. Repeat until convergence.

The objective is to leverage local data to jointly learn high-quality teachers (for deployment) and an efficient global student, all while minimizing uplink/downlink bandwidth, enforcing no data leakage, and decoupling convergence from the parameter count of the large teacher networks.
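The round structure above can be sketched in a few lines of NumPy. The `server_round` function and the toy "clients" below are illustrative stand-ins for the real local training loop, and gradient compression is omitted here (it is covered in Section 3):

```python
import numpy as np

def server_round(theta_s, clients, lr=1.0):
    """One simplified FedKD communication round: broadcast the student,
    collect per-client student gradients, aggregate them weighted by
    local dataset size n_i, and update the shared parameters Theta^s."""
    grads, sizes = [], []
    for client in clients:
        g_i, n_i = client(theta_s)          # local training on D_i
        grads.append(g_i)
        sizes.append(n_i)
    g = sum(n * g for n, g in zip(sizes, grads)) / sum(sizes)
    return theta_s - lr * g

# Toy clients: each "trains" by measuring the gap to a private target,
# i.e. the gradient of 0.5 * ||theta - target||^2.
rng = np.random.default_rng(0)
targets = [rng.normal(size=4) for _ in range(3)]
sizes = [100, 50, 50]
clients = [(lambda theta, t=t, n=n: (theta - t, n))
           for t, n in zip(targets, sizes)]

theta = np.zeros(4)
for _ in range(50):
    theta = server_round(theta, clients, lr=0.5)

# The shared student converges to the size-weighted mean of client optima.
expected = sum(n * t for n, t in zip(sizes, targets)) / sum(sizes)
print(np.allclose(theta, expected, atol=1e-6))
```

Because only the small student's gradients cross the network, the per-round payload is decoupled from the teacher's parameter count, which is the crux of the efficiency argument.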

2. Adaptive Mutual Distillation Mechanism

Mutual knowledge transfer between teacher and student occurs via jointly optimized loss functions computed on each client. Denoting $\hat{y}^t$ and $\hat{y}^s$ as the teacher and student softmax outputs, and $y$ as the ground-truth one-hot label:

  • Task loss (cross-entropy):

$$\mathcal{L}_{task}^t = \mathrm{CE}(\hat{y}^t, y), \qquad \mathcal{L}_{task}^s = \mathrm{CE}(\hat{y}^s, y)$$

  • Distillation loss (KL with adaptive reliability): A reliability weight is computed from the task losses, $w = 1 / (\mathcal{L}_{task}^t + \mathcal{L}_{task}^s)$, giving

$$\mathcal{L}_{dist}^t = w \, \mathrm{KL}(\hat{y}^s \,\|\, \hat{y}^t), \qquad \mathcal{L}_{dist}^s = w \, \mathrm{KL}(\hat{y}^t \,\|\, \hat{y}^s)$$

  • Hidden-state and attention distillation: Denoting $H^t, H^s$ as hidden states and $A^t, A^s$ as attention maps of teacher and student, and $W_h$ as a dimension-matching adaptor,

$$\mathcal{L}_{hid} = \mathrm{MSE}(H^s W_h, H^t) + \mathrm{MSE}(A^s, A^t)$$

  • Total losses:

$$\mathcal{L}^t = \mathcal{L}_{task}^t + \mathcal{L}_{dist}^t + \mathcal{L}_{hid}$$

$$\mathcal{L}^s = \mathcal{L}_{task}^s + \mathcal{L}_{dist}^s + \mathcal{L}_{hid}$$

Gradients $\nabla_{\Theta_i^t} \mathcal{L}^t$ are used for the local teacher update, while the student gradient $g_i = \nabla_{\Theta^s} \mathcal{L}^s$ undergoes compression before being transmitted to the server. The reliability scaling ensures distillation is suppressed when predictions are unreliable.
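A minimal NumPy sketch of the mutual-distillation objective for a single example. The inverse-task-loss form of the reliability weight `w` is an assumption consistent with the description above; the paper's exact weighting may differ in detail:

```python
import numpy as np

def softmax(z):
    e = np.exp(z - z.max())
    return e / e.sum()

def cross_entropy(p, y):
    """CE against an integer class index y (one-hot label)."""
    return -np.log(p[y] + 1e-12)

def kl(p, q):
    return float(np.sum(p * (np.log(p + 1e-12) - np.log(q + 1e-12))))

def mutual_distillation_losses(logits_t, logits_s, y):
    """Adaptive mutual distillation (sketch): the KL terms are scaled by a
    reliability weight that shrinks when either model's task loss is large,
    suppressing distillation from unreliable predictions."""
    p_t, p_s = softmax(logits_t), softmax(logits_s)
    ce_t, ce_s = cross_entropy(p_t, y), cross_entropy(p_s, y)
    w = 1.0 / (ce_t + ce_s)              # assumed reliability weight
    loss_t = ce_t + w * kl(p_s, p_t)     # teacher also learns from student
    loss_s = ce_s + w * kl(p_t, p_s)     # student learns from teacher
    return loss_t, loss_s

# Confident, correct predictions -> large w -> strong distillation signal.
lt, ls = mutual_distillation_losses(np.array([4.0, 0.0, 0.0]),
                                    np.array([2.0, 0.0, 0.0]), y=0)
print(lt, ls)
```

Note that the transfer is symmetric: the teacher's loss includes a KL term toward the student as well, which is what "mutual" distillation means here.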

3. Communication-Efficient Gradient Compression

FedKD utilizes a truncated singular value decomposition (SVD) to compress student gradients prior to upload:

  • For a gradient matrix $G \in \mathbb{R}^{P \times Q}$, compute the truncated SVD $G \approx U \Sigma V^\top$ with $U \in \mathbb{R}^{P \times K}$, $\Sigma \in \mathbb{R}^{K \times K}$, and $V \in \mathbb{R}^{Q \times K}$.
  • The rank $K$ is chosen adaptively to capture an energy threshold $T$, which is itself dynamically scheduled (raised from $T_{start}$ toward $T_{end}$ over the $R$ rounds), tightening the approximation as training progresses.
  • The server reconstitutes gradients from the received SVD factors and aggregates as above.
  • Communication cost per client per round is reduced from $PQ$ to $K(P + Q + 1)$ scalars per weight matrix, yielding a typical compression ratio $\rho = PQ / K(P + Q + 1)$ on the order of 10.

This approach achieves an order-of-magnitude reduction in communication compared to uncompressed parameter or gradient exchange.
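The compression step can be sketched in NumPy as follows. The linear `dynamic_threshold` schedule and the synthetic gradient matrix are illustrative assumptions, not the paper's exact choices:

```python
import numpy as np

def svd_compress(G, energy=0.95):
    """Keep the smallest rank K whose singular values capture at least
    `energy` of the squared spectral mass of the gradient matrix G."""
    U, s, Vt = np.linalg.svd(G, full_matrices=False)
    cum = np.cumsum(s**2) / np.sum(s**2)
    K = int(np.searchsorted(cum, energy)) + 1
    return U[:, :K], s[:K], Vt[:K, :]

def svd_decompress(U, s, Vt):
    """Server-side reconstitution from the uploaded factors."""
    return (U * s) @ Vt

def dynamic_threshold(r, R, t_start=0.95, t_end=0.999):
    """Assumed linear schedule: threshold tightens as round r approaches R."""
    return t_start + (t_end - t_start) * r / R

# Illustrative gradient matrix with a known, fast-decaying spectrum.
rng = np.random.default_rng(0)
P, Q = 64, 32
Uo, _ = np.linalg.qr(rng.normal(size=(P, 4)))
Vo, _ = np.linalg.qr(rng.normal(size=(Q, 4)))
G = Uo @ np.diag([10.0, 5.0, 1.0, 0.1]) @ Vo.T

U, s, Vt = svd_compress(G, energy=0.95)
sent = U.size + s.size + Vt.size            # K*(P+Q+1) scalars uploaded
err = np.linalg.norm(G - svd_decompress(U, s, Vt)) / np.linalg.norm(G)
print(len(s), round(G.size / sent, 1), round(err, 3))
```

Because neural-network gradients are typically dominated by a few singular directions, a small $K$ already captures most of the energy, which is why the scheme pays a modest approximation error for a large bandwidth saving.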

4. Algorithmic Pseudocode and Workflow

Below is a high-level procedural pseudocode, matching the formal protocol (Wu et al., 2021):

for round r = 1, ..., R:
    server broadcasts student parameters Θ^s to the selected clients
    for each client i in parallel:
        train teacher T_i and student S on D_i with mutual distillation
        apply the teacher gradient to Θ_i^t locally; compute student gradient g_i
        compress g_i by truncated SVD at the round-r energy threshold
        upload the SVD factors to the server
    server reconstitutes the g_i and aggregates g = Σ_i n_i g_i / Σ_i n_i
    server updates Θ^s ← Θ^s − η g

5. Communication-Cost Analysis

On practical benchmarks (MIND, SMM4H), the per-client, per-round cost of FedAvg on the full teacher models is 2.05 GB (MIND) and 1.37 GB (SMM4H). FedKD with the standard student achieves 0.19 GB on MIND (a roughly 91% reduction) and 0.12 GB on SMM4H (likewise roughly 91%), while the smaller-student variant reduces costs further. Theoretical communication scales as $O(K(P + Q))$ per weight matrix, since $K \ll P$ and $K \ll Q$.

Method                     MIND (AUC / comm.)    SMM4H (F1 / comm.)
FedAvg (teacher)           70.9 / 2.05 GB        60.6 / 1.37 GB
FedKD (student)            71.0 / 0.19 GB        60.7 / 0.12 GB
FedKD (smaller student)    70.5 / 0.11 GB        59.8 / 0.07 GB

FedKD matches or slightly exceeds baseline accuracy at drastic communication savings.
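The reduction percentages quoted above follow directly from the table; a quick arithmetic check (the variant labels here are illustrative, not the paper's exact notation):

```python
# Reproduce the communication-reduction percentages implied by the table.
costs = {
    "MIND":  {"FedAvg": 2.05, "FedKD": 0.19, "FedKD (smaller)": 0.11},
    "SMM4H": {"FedAvg": 1.37, "FedKD": 0.12, "FedKD (smaller)": 0.07},
}
savings = {
    (task, variant): round(100 * (1 - c[variant] / c["FedAvg"]), 1)
    for task, c in costs.items()
    for variant in c if variant != "FedAvg"
}
for (task, variant), pct in savings.items():
    print(f"{task:5s} {variant}: {pct}% less uplink than FedAvg")
```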

6. Empirical Evaluation and Baselines

FedKD is benchmarked on MIND (large-scale news recommendation; metrics: AUC, MRR, nDCGs) and SMM4H (binary ADR tweet detection; metrics: Precision, Recall, F1). Compared baselines include centralized and local learning, distilled models (DistilBERT, TinyBERT), FetchSGD, FedDropout.

On MIND, FedKD matches centralized UniLM (AUC = 71.0) at roughly 91% lower bandwidth; on SMM4H, FedKD achieves F1 = 60.7 vs. 60.6 for FedAvg, again with roughly 91% communication savings. Smaller student variants demonstrate further savings with minimal degradation (−0.5 pp AUC on MIND).

Experiments confirm that adaptive mutual distillation and dynamic compression allow FedKD to operate at a fraction of the communication cost without sacrificing model quality, robustly outperforming FedAvg and other state-of-the-art baselines under equivalent resource constraints.

7. Position within the Federated Distillation Landscape

FedKD is situated as a leading communication-efficient solution in the federated distillation (FD) taxonomy, as surveyed in (Li et al., 2024). It uniquely integrates a client-local teacher, a single shared student proxy, SVD-compressed gradient transfer, and adaptive distillation scaling, supporting model heterogeneity and robust privacy—qualities lacking in naive soft-output exchange or full-parameter aggregation. Its two-network per-client framework incurs local compute overhead but empirically yields near-optimal convergence and strong final model performance.

FedKD's methodology contrasts with parameter-averaging protocols (FedAvg), co-distillation, and other FD variants, providing a systematically evaluated trade-off between uplink cost, model accuracy, and hardware flexibility, and is extensible toward adaptive compression, privacy enhancement, and complex tasks beyond standard classification (Wu et al., 2021, Li et al., 2024).
