FedKD: Efficient Federated Distillation

Updated 25 November 2025
  • FedKD is a federated learning framework that leverages local teacher models and a shared student model to enable adaptive mutual distillation for efficient learning.
  • It integrates dynamic SVD-based gradient compression to significantly reduce the uplink overhead, achieving up to 10× lower communication costs compared to conventional methods.
  • FedKD maintains robust model performance on benchmarks like MIND and SMM4H while ensuring privacy and scalability in cross-device federated settings.

The FedKD framework is a knowledge-distillation-based federated learning paradigm designed to address the prohibitive communication overhead characteristic of standard parameter-averaging protocols in cross-device settings. FedKD couples an adaptive mutual distillation process between client-local teacher models and a shared student model with dynamic SVD-based gradient compression to maximize communication efficiency while preserving privacy and accuracy. The following sections delineate the foundational concepts, operational mechanisms, theoretical underpinnings, empirical results, comparative context, and practical implications of FedKD as established in Wu et al. (Wu et al., 2021) and contextualized by subsequent surveys (Li et al., 2 Apr 2024).

1. Framework Overview and Problem Formulation

FedKD addresses federated learning with N clients, each holding a private dataset D_i (of size n_i), and a central server coordinating R communication rounds. Each client i maintains two models: a large, non-communicated teacher T_i (with parameters \Theta_i^t), optimized only on D_i, and a smaller, shared student S (with parameters \Theta^s), which is the only model exchanged and aggregated globally. The protocol proceeds as follows for each round r:

  1. The server broadcasts \Theta^s(r-1) to selected clients.
  2. Each client initializes its local student from \Theta^s(r-1), trains its teacher and student on D_i, and computes the local student gradient G_i^s.
  3. Clients compress G_i^s and upload it to the server.
  4. The server aggregates G^s(r) = \frac{1}{N}\sum_{i=1}^N G_i^s and updates \Theta^s(r).
  5. Steps 1-4 repeat until convergence.

The objective is to leverage local data to jointly learn high-quality teachers (for deployment) and an efficient global student, all while minimizing uplink/downlink bandwidth, enforcing no data leakage, and decoupling convergence from the parameter count of the large teacher networks.
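
For concreteness, the aggregation and update in step 4 of the protocol can be sketched as below (with decompression already applied). This is a minimal sketch under simplifying assumptions: a single flat gradient matrix per client, a hypothetical server_aggregate helper, and an explicit server learning rate.

import numpy as np

def server_aggregate(client_grads, theta_s, lr):
    # G^s(r) = (1/N) * sum_i G_i^s, then Theta^s(r) = Theta^s(r-1) - lr * G^s(r).
    g_global = np.mean(np.stack(client_grads, axis=0), axis=0)
    return theta_s - lr * g_global

# Example: three clients, each reporting a (decompressed) student gradient.
theta = np.zeros((4, 4))
grads = [np.random.randn(4, 4) for _ in range(3)]
theta = server_aggregate(grads, theta, lr=0.1)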

2. Adaptive Mutual Distillation Mechanism

Mutual knowledge transfer between teacher and student occurs via jointly optimized loss functions computed on each client. Denoting y_i^t = \sigma(f(\Theta_i^t; x)) and y_i^s = \sigma(f(\Theta^s; x)) as the teacher and student softmax outputs, and y as the ground-truth one-hot label:

  • Task loss (cross-entropy):

L_{t,i}^{task} = CE(y, y_i^t), \qquad L_{s,i}^{task} = CE(y, y_i^s)

  • Distillation loss (KL with adaptive reliability): A reliability weight is computed as w_i = 1/(L_{t,i}^{task} + L_{s,i}^{task}),

L_{t,i}^{dist} = w_i \cdot KL(y_i^s \| y_i^t), \qquad L_{s,i}^{dist} = w_i \cdot KL(y_i^t \| y_i^s)

  • Hidden-state and attention distillation: Denoting H_i^t, A_i^t and H^s, A^s as the hidden states and attention maps of teacher and student, and W_i^h as a dimension-matching adaptor,

L_{t,i}^{hid} = L_{s,i}^{hid} = w_i \cdot [MSE(H_i^t, W_i^h H^s) + MSE(A_i^t, A^s)]

  • Total losses:

L_{t,i} = L_{t,i}^{task} + L_{t,i}^{dist} + L_{t,i}^{hid}

L_{s,i} = L_{s,i}^{task} + L_{s,i}^{dist} + L_{s,i}^{hid}

Gradients G_i^t = \partial L_{t,i}/\partial \Theta_i^t are used for the local teacher update, while G_i^s = \partial L_{s,i}/\partial \Theta^s undergoes compression before being transmitted to the server. The reliability scaling ensures distillation is suppressed when predictions are unreliable.
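
A per-client loss computation along these lines can be sketched in PyTorch as follows; the helper names, the omission of a distillation temperature, and the single hidden-state/attention pair are simplifying assumptions for exposition, not the reference implementation.

import torch
import torch.nn.functional as F

def kl(p, q, eps=1e-8):
    # KL(p || q) for probability distributions along the last dimension.
    return (p * ((p + eps).log() - (q + eps).log())).sum(dim=-1).mean()

def mutual_distillation_losses(teacher_logits, student_logits, labels,
                               teacher_hidden, student_hidden, adaptor,
                               teacher_attn, student_attn):
    y_t = F.softmax(teacher_logits, dim=-1)   # y_i^t
    y_s = F.softmax(student_logits, dim=-1)   # y_i^s

    # Task (cross-entropy) losses for teacher and student.
    task_t = F.cross_entropy(teacher_logits, labels)
    task_s = F.cross_entropy(student_logits, labels)

    # Reliability weight w_i = 1 / (L_t^task + L_s^task), detached so it only
    # scales the distillation terms and is not itself optimized.
    w = 1.0 / (task_t + task_s).detach()

    # Adaptive mutual distillation: teacher learns from student and vice versa.
    dist_t = w * kl(y_s, y_t)                 # w_i * KL(y^s || y^t)
    dist_s = w * kl(y_t, y_s)                 # w_i * KL(y^t || y^s)

    # Hidden-state and attention distillation; the adaptor (W_i^h) projects the
    # student hidden states to the teacher's dimensionality, and attention maps
    # are assumed to have matching shapes.
    hid = w * (F.mse_loss(adaptor(student_hidden), teacher_hidden)
               + F.mse_loss(student_attn, teacher_attn))

    return task_t + dist_t + hid, task_s + dist_s + hid   # L_{t,i}, L_{s,i}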

3. Communication-Efficient Gradient Compression

FedKD utilizes a truncated singular value decomposition (SVD) to compress student gradients prior to upload:

  • For a gradient matrix G_i^s \in \mathbb{R}^{P \times Q}, compute the truncated SVD G_i^s \approx U_i \Sigma_i V_i^\top with K \ll \min(P, Q).
  • The rank K is chosen adaptively to capture an energy threshold T(r), which is itself dynamically scheduled as T(r) = T_{start} + (T_{end} - T_{start}) \cdot (r/R), tightening the approximation as training progresses.
  • The server reconstitutes gradients from the received SVD factors and aggregates as above.
  • Communication cost per client per round is reduced from P \cdot Q to K \cdot (P + Q + 1) scalars, yielding a typical compression ratio \rho \approx (P \cdot Q)/(K \cdot (P + Q + 1)).

This approach achieves an order-of-magnitude reduction in communication compared to uncompressed parameter or gradient exchange.
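
A minimal NumPy sketch of this compression step is given below; the threshold endpoints (0.95 and 0.98) and the function names are illustrative assumptions chosen for exposition, not the paper's settings.

import numpy as np

def energy_threshold(r, R, t_start=0.95, t_end=0.98):
    # Dynamic schedule T(r) = T_start + (T_end - T_start) * (r / R);
    # the endpoint values here are assumptions.
    return t_start + (t_end - t_start) * (r / R)

def compress_gradient(G, threshold):
    # Truncated SVD keeping the smallest rank K whose singular values capture
    # the requested fraction of the total gradient energy (sum of sigma^2).
    U, s, Vt = np.linalg.svd(G, full_matrices=False)
    energy = np.cumsum(s ** 2) / np.sum(s ** 2)
    K = int(np.searchsorted(energy, threshold)) + 1
    return U[:, :K], s[:K], Vt[:K, :]

def decompress_gradient(U, s, Vt):
    # Server-side reconstruction: G ≈ U diag(s) V^T.
    return (U * s) @ Vt

# Example with a synthetic low-rank 768 x 256 gradient at round r = 3 of R = 10.
G = np.random.randn(768, 8) @ np.random.randn(8, 256)
U, s, Vt = compress_gradient(G, energy_threshold(3, 10))
P, Q, K = G.shape[0], G.shape[1], len(s)
rho = (P * Q) / (K * (P + Q + 1))   # compression ratio; about 24x or more here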

4. Algorithmic Pseudocode and Workflow

Below is high-level procedural pseudocode matching the formal protocol (Wu et al., 2021):

Server: 
  initialize Θ^s(0)
  for r = 1 to R:
    broadcast Θ^s(r-1) to selected clients
    collect (U_i, Σ_i, V_i) from clients
    reconstruct G_i^s = U_i Σ_i V_i^⊤
    compute G^s = (1/N) ∑_i G_i^s
    (optional: SVD compress G^s for downlink)
    update Θ^s(r) = Θ^s(r-1) - η_s G^s

Client i:
  receive Θ^s(r-1)
  set local student Θ^s ← Θ^s(r-1)
  train teacher and student via L_{t,i}, L_{s,i}; update Θ_i^t and cache G_i^s
  compress G_i^s via SVD (threshold T(r)); send (U_i, Σ_i, V_i) to server
  receive global SVD-compressed G^s; update Θ^s ← Θ^s - η_s G^s

5. Communication-Cost Analysis

On practical benchmarks (MIND, SMM4H), the cost per client per round for FedAvg on full teacher models is 2.05 GB (MIND) and 1.37 GB (SMM4H). With a compact student model, FedKD_4 achieves 0.19 GB (MIND; roughly a 91% reduction) and 0.12 GB (SMM4H; roughly a 91% reduction), while FedKD_2 reduces costs further. Theoretical scaling is O(R \cdot |\Theta^s| / \rho) \ll O(R \cdot |\Theta^t|), since |\Theta^s| \ll |\Theta^t| and \rho > 1.

Method              MIND (AUC / comm.)     SMM4H (F1 / comm.)
FedAvg (teacher)    70.9 / 2.05 GB         60.6 / 1.37 GB
FedKD_4             71.0 / 0.19 GB         60.7 / 0.12 GB
FedKD_2             70.5 / 0.11 GB         59.8 / 0.07 GB

FedKD matches or slightly exceeds baseline accuracy at drastic communication savings.
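
As a quick sanity check, the savings implied by the table can be recomputed in a few lines of Python using only the per-round costs quoted above:

fedavg_mind, fedkd4_mind = 2.05, 0.19        # GB per client per round, MIND
fedavg_smm4h, fedkd4_smm4h = 1.37, 0.12      # GB per client per round, SMM4H
print(f"MIND uplink saving:  {1 - fedkd4_mind / fedavg_mind:.1%}")    # about 90.7%
print(f"SMM4H uplink saving: {1 - fedkd4_smm4h / fedavg_smm4h:.1%}")  # about 91.2%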

6. Empirical Evaluation and Baselines

FedKD is benchmarked on MIND (large-scale news recommendation; metrics: AUC, MRR, nDCG) and SMM4H (binary ADR tweet detection; metrics: precision, recall, F1). Compared baselines include centralized and local-only learning, distilled models (DistilBERT, TinyBERT), FetchSGD, and FedDropout.

On MIND, FedKD_4 matches centralized UniLM (AUC = 71.0) at an order of magnitude lower bandwidth; on SMM4H, FedKD_4 achieves F1 = 60.7 vs. 60.6 for FedAvg, again with roughly 91% lower per-round communication. The smaller student variant (FedKD_2) demonstrates further savings with minimal degradation (about 0.4 pp AUC on MIND).

Experiments confirm that adaptive mutual distillation and dynamic compression allow FedKD to operate at a fraction of the communication cost without sacrificing model quality, robustly outperforming FedAvg and other state-of-the-art baselines under equivalent resource constraints.

7. Position within the Federated Distillation Landscape

FedKD is situated as a leading communication-efficient solution in the federated distillation (FD) taxonomy, as surveyed in (Li et al., 2 Apr 2024). It uniquely integrates a client-local teacher, a single shared student proxy, SVD-compressed gradient transfer, and adaptive distillation scaling, supporting model heterogeneity and robust privacy—qualities lacking in naive soft-output exchange or full-parameter aggregation. Its two-network per-client framework incurs local compute overhead but empirically yields near-optimal convergence and strong final model performance.

FedKD's methodology contrasts with parameter-averaging protocols (FedAvg), co-distillation, and other FD variants, providing a systematically evaluated trade-off between uplink cost, model accuracy, and hardware flexibility, and is extensible toward adaptive compression, privacy enhancement, and complex tasks beyond standard classification (Wu et al., 2021, Li et al., 2 Apr 2024).
