FedKD: Efficient Federated Distillation

Updated 25 November 2025
  • FedKD is a federated learning framework that leverages local teacher models and a shared student model to enable adaptive mutual distillation for efficient learning.
  • It integrates dynamic SVD-based gradient compression to significantly reduce the uplink overhead, achieving up to 10× lower communication costs compared to conventional methods.
  • FedKD maintains robust model performance on benchmarks like MIND and SMM4H while ensuring privacy and scalability in cross-device federated settings.

The FedKD framework is a knowledge-distillation-based federated learning paradigm designed to address the prohibitive communication overhead characteristic of standard parameter-averaging protocols in cross-device settings. FedKD couples an adaptive mutual distillation process between client-local teacher models and a shared student model with dynamic SVD-based gradient compression to maximize communication efficiency while preserving privacy and accuracy. The following sections delineate the foundational concepts, operational mechanisms, theoretical underpinnings, empirical results, comparative context, and practical implications of FedKD as established in Wu et al. (Wu et al., 2021) and contextualized by subsequent surveys (Li et al., 2 Apr 2024).

1. Framework Overview and Problem Formulation

FedKD addresses federated learning with N clients, each holding a private dataset D_i (of size n_i), and a central server coordinating R communication rounds. Each client i maintains two models: a large, non-communicated teacher T_i (with parameters \Theta_i^t), optimized only on D_i, and a smaller, shared student S (with parameters \Theta^s), which is the only model exchanged and aggregated globally. The protocol proceeds as follows for each round r:

  1. The server broadcasts \Theta^s(r-1) to selected clients.
  2. Each client initializes its local student from \Theta^s(r-1), trains its teacher and student on D_i, and computes the local student gradient G_i^s.
  3. Clients compress G_i^s and upload it to the server.
  4. The server aggregates G^s(r) = \frac{1}{N}\sum_{i=1}^N G_i^s and updates \Theta^s(r).
  5. Steps 1-4 repeat until convergence.

The objective is to leverage local data to jointly learn high-quality teachers (for deployment) and an efficient global student, all while minimizing uplink/downlink bandwidth, enforcing no data leakage, and decoupling convergence from the parameter count of the large teacher networks.
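
For concreteness, the aggregation and update in step 4 of the protocol can be sketched as below (with decompression already applied). This is a minimal sketch under simplifying assumptions: a single flat gradient matrix per client, a hypothetical server_aggregate helper, and an explicit server learning rate.

import numpy as np

def server_aggregate(client_grads, theta_s, lr):
    # G^s(r) = (1/N) * sum_i G_i^s, then Theta^s(r) = Theta^s(r-1) - lr * G^s(r).
    g_global = np.mean(np.stack(client_grads, axis=0), axis=0)
    return theta_s - lr * g_global

# Example: three clients, each reporting a (decompressed) student gradient.
theta = np.zeros((4, 4))
grads = [np.random.randn(4, 4) for _ in range(3)]
theta = server_aggregate(grads, theta, lr=0.1)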

2. Adaptive Mutual Distillation Mechanism

Mutual knowledge transfer between teacher and student occurs via jointly optimized loss functions computed on each client. Denoting y_i^t = \sigma(f(\Theta_i^t; x)) and y_i^s = \sigma(f(\Theta^s; x)) as the teacher and student softmax outputs, and y as the ground-truth one-hot label:

  • Task loss (cross-entropy):

L_{t,i}^{task} = CE(y, y_i^t), \qquad L_{s,i}^{task} = CE(y, y_i^s)

  • Distillation loss (KL with adaptive reliability): A reliability weight is computed as w_i = 1/(L_{t,i}^{task} + L_{s,i}^{task}),

L_{t,i}^{dist} = w_i \cdot KL(y_i^s \| y_i^t), \qquad L_{s,i}^{dist} = w_i \cdot KL(y_i^t \| y_i^s)

  • Hidden-state and attention distillation: Denoting H_i^t, A_i^t and H^s, A^s as the hidden states and attention maps of teacher and student, and W_i^h as a dimension-matching adaptor,

L_{t,i}^{hid} = L_{s,i}^{hid} = w_i \cdot [MSE(H_i^t, W_i^h H^s) + MSE(A_i^t, A^s)]

  • Total losses:

L_{t,i} = L_{t,i}^{task} + L_{t,i}^{dist} + L_{t,i}^{hid}

L_{s,i} = L_{s,i}^{task} + L_{s,i}^{dist} + L_{s,i}^{hid}

Gradients G_i^t = \partial L_{t,i}/\partial \Theta_i^t are used for the local teacher update, while G_i^s = \partial L_{s,i}/\partial \Theta^s undergoes compression before being transmitted to the server. The reliability scaling ensures distillation is suppressed when predictions are unreliable.
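
A per-client loss computation along these lines can be sketched in PyTorch as follows; the helper names, the omission of a distillation temperature, and the single hidden-state/attention pair are simplifying assumptions for exposition, not the reference implementation.

import torch
import torch.nn.functional as F

def kl(p, q, eps=1e-8):
    # KL(p || q) for probability distributions along the last dimension.
    return (p * ((p + eps).log() - (q + eps).log())).sum(dim=-1).mean()

def mutual_distillation_losses(teacher_logits, student_logits, labels,
                               teacher_hidden, student_hidden, adaptor,
                               teacher_attn, student_attn):
    y_t = F.softmax(teacher_logits, dim=-1)   # y_i^t
    y_s = F.softmax(student_logits, dim=-1)   # y_i^s

    # Task (cross-entropy) losses for teacher and student.
    task_t = F.cross_entropy(teacher_logits, labels)
    task_s = F.cross_entropy(student_logits, labels)

    # Reliability weight w_i = 1 / (L_t^task + L_s^task), detached so it only
    # scales the distillation terms and is not itself optimized.
    w = 1.0 / (task_t + task_s).detach()

    # Adaptive mutual distillation: teacher learns from student and vice versa.
    dist_t = w * kl(y_s, y_t)                 # w_i * KL(y^s || y^t)
    dist_s = w * kl(y_t, y_s)                 # w_i * KL(y^t || y^s)

    # Hidden-state and attention distillation; the adaptor (W_i^h) projects the
    # student hidden states to the teacher's dimensionality, and attention maps
    # are assumed to have matching shapes.
    hid = w * (F.mse_loss(adaptor(student_hidden), teacher_hidden)
               + F.mse_loss(student_attn, teacher_attn))

    return task_t + dist_t + hid, task_s + dist_s + hid   # L_{t,i}, L_{s,i}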

3. Communication-Efficient Gradient Compression

FedKD utilizes a truncated singular value decomposition (SVD) to compress student gradients prior to upload:

  • For a gradient matrix G_i^s \in \mathbb{R}^{P \times Q}, compute the truncated SVD G_i^s \approx U_i \Sigma_i V_i^\top with K \ll \min(P, Q).
  • The rank K is chosen adaptively to capture an energy threshold T(r), which is itself dynamically scheduled as T(r) = T_{start} + (T_{end} - T_{start}) \cdot (r/R), tightening the approximation as training progresses.
  • The server reconstitutes gradients from the received SVD factors and aggregates as above.
  • Communication cost per client per round is reduced from P \cdot Q to K \cdot (P + Q + 1) scalars, yielding a typical compression ratio \rho \approx (P \cdot Q)/(K \cdot (P + Q + 1)).

This approach achieves an order-of-magnitude reduction in communication compared to uncompressed parameter or gradient exchange.
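
A minimal NumPy sketch of this compression step is given below; the threshold endpoints (0.95 and 0.98) and the function names are illustrative assumptions chosen for exposition, not the paper's settings.

import numpy as np

def energy_threshold(r, R, t_start=0.95, t_end=0.98):
    # Dynamic schedule T(r) = T_start + (T_end - T_start) * (r / R);
    # the endpoint values here are assumptions.
    return t_start + (t_end - t_start) * (r / R)

def compress_gradient(G, threshold):
    # Truncated SVD keeping the smallest rank K whose singular values capture
    # the requested fraction of the total gradient energy (sum of sigma^2).
    U, s, Vt = np.linalg.svd(G, full_matrices=False)
    energy = np.cumsum(s ** 2) / np.sum(s ** 2)
    K = int(np.searchsorted(energy, threshold)) + 1
    return U[:, :K], s[:K], Vt[:K, :]

def decompress_gradient(U, s, Vt):
    # Server-side reconstruction: G ≈ U diag(s) V^T.
    return (U * s) @ Vt

# Example with a synthetic low-rank 768 x 256 gradient at round r = 3 of R = 10.
G = np.random.randn(768, 8) @ np.random.randn(8, 256)
U, s, Vt = compress_gradient(G, energy_threshold(3, 10))
P, Q, K = G.shape[0], G.shape[1], len(s)
rho = (P * Q) / (K * (P + Q + 1))   # compression ratio; about 24x or more here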

4. Algorithmic Pseudocode and Workflow

Below is high-level procedural pseudocode matching the formal protocol (Wu et al., 2021):

Server: 
  initialize Θ^s(0)
  for r = 1 to R:
    broadcast Θ^s(r-1) to selected clients
    collect (U_i, Σ_i, V_i) from clients
    reconstruct G_i^s = U_i Σ_i V_i^⊤
    compute G^s = (1/N) ∑_i G_i^s
    (optional: SVD compress G^s for downlink)
    update Θ^s(r) = Θ^s(r-1) - η_s G^s

Client i:
  receive Θ^s(r-1)
  set local student Θ^s ← Θ^s(r-1)
  train teacher and student via L_{t,i}, L_{s,i}; update Θ_i^t and cache G_i^s
  compress G_i^s via SVD (threshold T(r)); send (U_i, Σ_i, V_i) to server
  receive global SVD-compressed G^s; update Θ^s ← Θ^s - η_s G^s

5. Communication-Cost Analysis

On practical benchmarks (MIND, SMM4H), the cost per client per round for FedAvg on full teacher models is 2.05 GB (MIND) and 1.37 GB (SMM4H). With a compact student model, FedKD_4 achieves 0.19 GB (MIND; roughly a 91% reduction) and 0.12 GB (SMM4H; roughly a 91% reduction), while FedKD_2 reduces costs further. Theoretical scaling is O(R \cdot |\Theta^s| / \rho) \ll O(R \cdot |\Theta^t|), since |\Theta^s| \ll |\Theta^t| and \rho > 1.

Method              MIND (AUC / comm.)     SMM4H (F1 / comm.)
FedAvg (teacher)    70.9 / 2.05 GB         60.6 / 1.37 GB
FedKD_4             71.0 / 0.19 GB         60.7 / 0.12 GB
FedKD_2             70.5 / 0.11 GB         59.8 / 0.07 GB

FedKD matches or slightly exceeds baseline accuracy at drastic communication savings.
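
As a quick sanity check, the savings implied by the table can be recomputed in a few lines of Python using only the per-round costs quoted above:

fedavg_mind, fedkd4_mind = 2.05, 0.19        # GB per client per round, MIND
fedavg_smm4h, fedkd4_smm4h = 1.37, 0.12      # GB per client per round, SMM4H
print(f"MIND uplink saving:  {1 - fedkd4_mind / fedavg_mind:.1%}")    # about 90.7%
print(f"SMM4H uplink saving: {1 - fedkd4_smm4h / fedavg_smm4h:.1%}")  # about 91.2%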

6. Empirical Evaluation and Baselines

FedKD is benchmarked on MIND (large-scale news recommendation; metrics: AUC, MRR, nDCG) and SMM4H (binary ADR tweet detection; metrics: precision, recall, F1). Compared baselines include centralized and local-only learning, distilled models (DistilBERT, TinyBERT), FetchSGD, and FedDropout.

On MIND, FedKD_4 matches centralized UniLM (AUC = 71.0) at an order of magnitude lower bandwidth; on SMM4H, FedKD_4 achieves F1 = 60.7 vs. 60.6 for FedAvg, again with roughly 91% lower per-round communication. The smaller student variant (FedKD_2) demonstrates further savings with minimal degradation (about 0.4 pp AUC on MIND).

Experiments confirm that adaptive mutual distillation and dynamic compression allow FedKD to operate at a fraction of the communication cost without sacrificing model quality, robustly outperforming FedAvg and other state-of-the-art baselines under equivalent resource constraints.

7. Position within the Federated Distillation Landscape

FedKD is situated as a leading communication-efficient solution in the federated distillation (FD) taxonomy, as surveyed in (Li et al., 2 Apr 2024). It uniquely integrates a client-local teacher, a single shared student proxy, SVD-compressed gradient transfer, and adaptive distillation scaling, supporting model heterogeneity and robust privacy—qualities lacking in naive soft-output exchange or full-parameter aggregation. Its two-network per-client framework incurs local compute overhead but empirically yields near-optimal convergence and strong final model performance.

FedKD's methodology contrasts with parameter-averaging protocols (FedAvg), co-distillation, and other FD variants, providing a systematically evaluated trade-off between uplink cost, model accuracy, and hardware flexibility, and is extensible toward adaptive compression, privacy enhancement, and complex tasks beyond standard classification (Wu et al., 2021, Li et al., 2 Apr 2024).
