FedKD: Efficient Federated Distillation
- FedKD is a federated learning framework that leverages local teacher models and a shared student model to enable adaptive mutual distillation for efficient learning.
- It integrates dynamic SVD-based gradient compression to significantly reduce the uplink overhead, achieving up to 10× lower communication costs compared to conventional methods.
- FedKD maintains robust model performance on benchmarks like MIND and SMM4H while ensuring privacy and scalability in cross-device federated settings.
The FedKD framework is a knowledge-distillation-based federated learning paradigm designed to address the prohibitive communication overhead characteristic of standard parameter-averaging protocols in cross-device settings. FedKD couples an adaptive mutual distillation process between client-local teacher models and a shared student model with dynamic SVD-based gradient compression to maximize communication efficiency while preserving privacy and accuracy. The following sections delineate the foundational concepts, operational mechanisms, theoretical underpinnings, empirical results, comparative context, and practical implications of FedKD as established in Wu et al. (Wu et al., 2021) and contextualized by subsequent surveys (Li et al., 2 Apr 2024).
1. Framework Overview and Problem Formulation
FedKD addresses federated learning with $N$ clients, each client $i$ holding a private dataset $D_i$ (of size $|D_i|$), and a central server coordinating $R$ communication rounds. Each client maintains two models: a large, non-communicated teacher $\Theta_i^t$, optimized only on $D_i$, and a smaller, shared student $\Theta^s$, which is the only model exchanged and aggregated globally. The protocol proceeds as follows for each round $r$:
- The server broadcasts the current global student $\Theta^s(r-1)$ to the selected clients.
- Each client $i$ initializes its local student with $\Theta^s(r-1)$, trains its teacher and student on $D_i$, and computes the local student gradient $G_i^s$.
- Clients compress $G_i^s$ via truncated SVD and upload the factors to the server.
- The server reconstructs and aggregates the client gradients and updates $\Theta^s$.
- Repeat until convergence.
The objective is to leverage local data to jointly learn high-quality teachers (for deployment) and an efficient global student, all while minimizing uplink/downlink bandwidth, enforcing no data leakage, and decoupling convergence from the parameter count of the large teacher networks.
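A minimal sketch of the per-client state and of the only payload that crosses the network is shown below, assuming PyTorch-style modules; the class and field names are illustrative rather than taken from the reference implementation.

```python
# Illustrative per-client state for FedKD (names are hypothetical).
from dataclasses import dataclass
import torch
from torch import nn
from torch.utils.data import Dataset

@dataclass
class ClientState:
    teacher: nn.Module   # large local teacher Θ_i^t, never leaves the device
    student: nn.Module   # local copy of the shared student Θ^s
    data: Dataset        # private dataset D_i, never leaves the device

@dataclass
class UplinkPayload:
    # Only compressed student gradients are uploaded (see Section 3):
    # one (U, Σ, V) factor triple per student weight matrix.
    svd_factors: list[tuple[torch.Tensor, torch.Tensor, torch.Tensor]]
```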
2. Adaptive Mutual Distillation Mechanism
Mutual knowledge transfer between teacher and student occurs via jointly optimized loss functions computed on each client. Denoting $\hat{y}^t$ and $\hat{y}^s$ as the teacher and student softmax outputs and $y$ as the ground-truth one-hot label:
- Task loss (cross-entropy): $L_{task}^t = \mathrm{CE}(y, \hat{y}^t)$ and $L_{task}^s = \mathrm{CE}(y, \hat{y}^s)$, computed on the local data.
- Distillation loss (KL with adaptive reliability): teacher and student distill from each other through KL-divergence terms between their output distributions, each scaled by a reliability weight derived from the current prediction quality of the two models.
- Hidden-state and attention distillation: denoting $H^t$, $H^s$ as the hidden states and $A^t$, $A^s$ as the attention maps of teacher and student, and $W_h$ as a dimension-matching adaptor, discrepancy losses between the adapted hidden states ($W_h H^s$ vs. $H^t$) and between the attention maps are added on both sides.
- Total losses: the teacher objective $L_{t,i}$ and the student objective $L_{s,i}$ each sum their task, distillation, and hidden-state/attention terms.
Gradients of $L_{t,i}$ are used for the local teacher update, while the student gradient $G_i^s$ from $L_{s,i}$ undergoes compression before being transmitted to the server. The reliability scaling ensures distillation is suppressed when predictions are unreliable.
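The sketch below, assuming PyTorch, shows how the per-client teacher and student objectives could be assembled. The reliability weight `w`, the temperature `tau`, and the single hidden-state term are simplified stand-ins for the exact formulation in Wu et al. (2021), and `adapt` is a hypothetical linear adaptor matching the student's hidden width to the teacher's.

```python
# Sketch of adaptive mutual distillation losses on one client (simplified).
import torch
import torch.nn.functional as F

def mutual_distillation_losses(t_logits, s_logits, labels,
                               t_hidden, s_hidden, adapt, tau=1.0):
    # Task (cross-entropy) losses for teacher and student.
    task_t = F.cross_entropy(t_logits, labels)
    task_s = F.cross_entropy(s_logits, labels)

    # Reliability weight (assumed form): shrink the distillation terms when
    # either model's task loss is large, i.e. when predictions are unreliable.
    w = 1.0 / (1.0 + task_t.detach() + task_s.detach())

    # Mutual KL distillation on softened outputs; each side treats the other
    # model's prediction as a fixed target for its own KL term.
    kl_t = F.kl_div(F.log_softmax(t_logits / tau, dim=-1),
                    F.softmax(s_logits.detach() / tau, dim=-1),
                    reduction="batchmean") * tau * tau
    kl_s = F.kl_div(F.log_softmax(s_logits / tau, dim=-1),
                    F.softmax(t_logits.detach() / tau, dim=-1),
                    reduction="batchmean") * tau * tau

    # Hidden-state distillation through the dimension-matching adaptor
    # (attention-map distillation would be added analogously).
    hid = F.mse_loss(adapt(s_hidden), t_hidden)

    loss_teacher = task_t + w * kl_t + hid   # L_{t,i}
    loss_student = task_s + w * kl_s + hid   # L_{s,i}
    return loss_teacher, loss_student
```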
3. Communication-Efficient Gradient Compression
FedKD utilizes a truncated singular value decomposition (SVD) to compress student gradients prior to upload:
- For each student gradient matrix $G \in \mathbb{R}^{m \times n}$, compute the truncated SVD $G \approx U_k \Sigma_k V_k^\top$ with $k \ll \min(m, n)$.
- The rank $k$ is chosen adaptively as the smallest rank whose retained singular values capture an energy threshold $T(r)$, i.e. $\sum_{j \le k} \sigma_j^2 \ge T(r) \sum_{j} \sigma_j^2$; the threshold is itself dynamically scheduled over rounds, tightening the approximation as training progresses.
- The server reconstitutes gradients from the received SVD factors and aggregates as above.
- Communication cost per client per round is reduced from $mn$ to $k(m+n+1)$ scalars per weight matrix, yielding a compression ratio of $k(m+n+1)/(mn) \ll 1$.
This approach achieves an order-of-magnitude reduction in communication compared to uncompressed parameter or gradient exchange.
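A minimal NumPy sketch of this compression step follows. The energy threshold is passed in by the caller because the exact schedule $T(r)$ is not reproduced here, and the matrix dimensions in the example are hypothetical.

```python
# Truncated-SVD gradient compression with an energy threshold (sketch).
import numpy as np

def compress_gradient(G: np.ndarray, energy_threshold: float):
    """Keep the smallest rank k whose singular values capture at least
    `energy_threshold` of the total squared-singular-value energy."""
    U, S, Vt = np.linalg.svd(G, full_matrices=False)
    energy = np.cumsum(S ** 2) / np.sum(S ** 2)
    k = int(np.searchsorted(energy, energy_threshold) + 1)
    return U[:, :k], S[:k], Vt[:k, :]

def decompress_gradient(U_k, S_k, Vt_k):
    # Server-side reconstruction G ≈ U_k diag(S_k) V_k^T.
    return (U_k * S_k) @ Vt_k

# Example: a synthetic low-rank-plus-noise "gradient" (real gradient matrices
# tend to have fast-decaying spectra), compressed at a 95% energy threshold.
rng = np.random.default_rng(0)
G = rng.standard_normal((3072, 32)) @ rng.standard_normal((32, 768)) \
    + 0.01 * rng.standard_normal((3072, 768))
U_k, S_k, Vt_k = compress_gradient(G, 0.95)
k = S_k.shape[0]
print(f"rank k = {k}, uplink ratio = {k * (G.shape[0] + G.shape[1] + 1) / G.size:.3f}")
```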
4. Algorithmic Pseudocode and Workflow
Below is high-level procedural pseudocode matching the formal protocol (Wu et al., 2021); a toy runnable sketch follows the listing.
```
Server:
    initialize Θ^s(0)
    for r = 1 to R:
        broadcast Θ^s(r-1) to selected clients
        collect (U_i, Σ_i, V_i) from clients
        reconstruct G_i^s = U_i Σ_i V_i^⊤
        compute G^s = (1/N) ∑_i G_i^s
        (optional: SVD-compress G^s for downlink)
        update Θ^s(r) = Θ^s(r-1) - η_s G^s

Client i (round r):
    receive Θ^s(r-1)
    set local student Θ^s ← Θ^s(r-1)
    train teacher and student via L_{t,i}, L_{s,i}; update Θ_i^t and cache G_i^s
    compress G_i^s via SVD (threshold T(r)); send (U_i, Σ_i, V_i) to server
    receive global SVD-compressed G^s; update Θ^s ← Θ^s - η_s G^s
```
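For concreteness, here is a toy, self-contained NumPy simulation of one round of the protocol above. The global student is reduced to a single weight matrix, the client gradients are synthetic stand-ins for those produced by local mutual-distillation training, and a fixed rank replaces the adaptive threshold sketched in Section 3; all dimensions and the learning rate are arbitrary.

```python
# Toy single-round simulation of the server/client exchange in the pseudocode.
import numpy as np

def compress(G, k):
    # Client side: rank-k truncated SVD (the adaptive choice of k via the
    # energy threshold T(r) is sketched in Section 3).
    U, S, Vt = np.linalg.svd(G, full_matrices=False)
    return U[:, :k], S[:k], Vt[:k, :]

def run_round(theta_s, client_grads, k=8, lr=0.1):
    reconstructed = []
    for G in client_grads:
        U_k, S_k, Vt_k = compress(G, k)             # uplink payload (U_i, Σ_i, V_i)
        reconstructed.append((U_k * S_k) @ Vt_k)    # server-side reconstruction of G_i^s
    G_global = np.mean(reconstructed, axis=0)       # G^s = (1/N) Σ_i G_i^s
    return theta_s - lr * G_global                  # Θ^s(r) = Θ^s(r-1) - η_s G^s

rng = np.random.default_rng(1)
theta_s = np.zeros((256, 64))                       # stand-in for the shared student
client_grads = [rng.standard_normal((256, 8)) @ rng.standard_normal((8, 64))
                for _ in range(4)]                  # synthetic low-rank client gradients
theta_s = run_round(theta_s, client_grads)
```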
5. Communication-Cost Analysis
On the practical benchmarks (MIND, SMM4H), the per-client per-round cost of FedAvg on the full teacher models is 2.05 GB (MIND) and 1.37 GB (SMM4H). With the standard student, FedKD reduces this to 0.19 GB (MIND) and 0.12 GB (SMM4H), a roughly 91% reduction in both cases, while the smaller-student variant reduces costs further (see the table and the check below). Theoretically, the per-round uplink scales with $k(m+n+1)$ per student weight matrix rather than $mn$ per teacher weight matrix, since $k \ll \min(m,n)$ and the shared student is much smaller than the local teachers.
| Method | MIND AUC | MIND comm. | SMM4H F1 | SMM4H comm. |
|---|---|---|---|---|
| FedAvg (teacher) | 70.9 | 2.05 GB | 60.6 | 1.37 GB |
| FedKD | 71.0 | 0.19 GB | 60.7 | 0.12 GB |
| FedKD (smaller student) | 70.5 | 0.11 GB | 59.8 | 0.07 GB |
FedKD matches or slightly exceeds baseline accuracy at drastic communication savings.
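The savings implied by the table can be checked directly from the reported per-round costs; the percentages printed below are derived from the table rather than quoted from the paper.

```python
# Communication reductions implied by the table above (derived figures).
costs_gb = {"MIND": (2.05, 0.19, 0.11), "SMM4H": (1.37, 0.12, 0.07)}
for task, (fedavg, fedkd, fedkd_small) in costs_gb.items():
    print(f"{task}: FedKD saves {100 * (1 - fedkd / fedavg):.1f}%, "
          f"smaller student saves {100 * (1 - fedkd_small / fedavg):.1f}%")
```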
6. Empirical Evaluation and Baselines
FedKD is benchmarked on MIND (large-scale news recommendation; metrics: AUC, MRR, nDCG) and SMM4H (binary ADR tweet detection; metrics: precision, recall, F1). Compared baselines include centralized and purely local learning, distilled models (DistilBERT, TinyBERT), FetchSGD, and FedDropout.
On MIND, FedKD matches centralized UniLM (AUC = 71.0) while communicating an order of magnitude less than FedAvg; on SMM4H, FedKD achieves F1 = 60.7 vs. 60.6 for FedAvg, again at roughly a tenth of the bandwidth. The smaller-student variant yields further savings with minimal degradation (about 0.5 pp AUC on MIND).
Experiments confirm that adaptive mutual distillation and dynamic compression allow FedKD to operate at a fraction of the communication cost without sacrificing model quality, robustly outperforming FedAvg and other state-of-the-art baselines under equivalent resource constraints.
7. Position within the Federated Distillation Landscape
FedKD is situated as a leading communication-efficient solution in the federated distillation (FD) taxonomy, as surveyed in (Li et al., 2 Apr 2024). It uniquely integrates a client-local teacher, a single shared student proxy, SVD-compressed gradient transfer, and adaptive distillation scaling, supporting model heterogeneity and robust privacy—qualities lacking in naive soft-output exchange or full-parameter aggregation. Its two-network per-client framework incurs local compute overhead but empirically yields near-optimal convergence and strong final model performance.
FedKD's methodology contrasts with parameter-averaging protocols (FedAvg), co-distillation, and other FD variants, providing a systematically evaluated trade-off between uplink cost, model accuracy, and hardware flexibility. It is extensible toward adaptive compression, further privacy enhancement, and complex tasks beyond standard classification (Wu et al., 2021; Li et al., 2 Apr 2024).