FedKD: Efficient Federated Distillation
- FedKD is a federated learning framework that leverages local teacher models and a shared student model to enable adaptive mutual distillation for efficient learning.
- It integrates dynamic SVD-based gradient compression to significantly reduce the uplink overhead, achieving up to 10× lower communication costs compared to conventional methods.
- FedKD maintains robust model performance on benchmarks like MIND and SMM4H while ensuring privacy and scalability in cross-device federated settings.
The FedKD framework is a knowledge-distillation-based federated learning paradigm designed to address the prohibitive communication overhead characteristic of standard parameter-averaging protocols in cross-device settings. FedKD couples an adaptive mutual distillation process between client-local teacher models and a shared student model with dynamic SVD-based gradient compression to maximize communication efficiency while preserving privacy and accuracy. The following sections delineate the foundational concepts, operational mechanisms, theoretical underpinnings, empirical results, comparative context, and practical implications of FedKD as established in Wu et al. (Wu et al., 2021) and contextualized by subsequent surveys (Li et al., 2024).
1. Framework Overview and Problem Formulation
FedKD addresses federated learning with N clients, each client i holding a private dataset D_i (of size |D_i|), and a central server coordinating communication rounds. Each client maintains two models: a large, non-communicated teacher T_i, optimized only on D_i, and a smaller, shared student S, which is the only model exchanged and aggregated globally. The protocol proceeds as follows for each round t = 1, 2, …:
- The server broadcasts the current student parameters θ_S to the selected clients.
- Each client i initializes its local student with θ_S, trains teacher and student on D_i, and computes the local student gradient g_i.
- Clients compress g_i via truncated SVD and upload the factors to the server.
- The server reconstructs and aggregates the gradients, g = (1/N) Σ_i g_i, and updates θ_S ← θ_S − η·g.
- Repeat until convergence.
The objective is to leverage local data to jointly learn high-quality teachers (for deployment) and an efficient global student, all while minimizing uplink/downlink bandwidth, enforcing no data leakage, and decoupling convergence from the parameter count of the large teacher networks.
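The server-side portion of this loop reduces to an averaged-gradient step. A minimal numpy sketch (function and variable names here are illustrative, not from the paper):

```python
import numpy as np

def server_round(theta_s, client_grads, lr=0.1):
    """Aggregate the clients' (decompressed) student gradients and
    apply one SGD step to the shared student parameters."""
    g = np.mean(client_grads, axis=0)    # uniform average over clients
    return theta_s - lr * g

theta = np.zeros(4)                      # toy student parameter vector
grads = [np.ones(4), 3 * np.ones(4)]     # gradients uploaded by two clients
theta = server_round(theta, grads)       # mean gradient is 2 -> step of -0.2
```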
2. Adaptive Mutual Distillation Mechanism
Mutual knowledge transfer between teacher and student occurs via jointly optimized loss functions computed on each client. Denoting y^t and y^s as the teacher and student softmax outputs, and y as the ground-truth one-hot label:
- Task loss (cross-entropy):

  L_task^t = CE(y, y^t),   L_task^s = CE(y, y^s)
- Distillation loss (KL with adaptive reliability): A reliability weight is computed as w = 1 / (L_task^t + L_task^s), so that distillation is down-weighted whenever either model's prediction is poor,

  L_distill^t = w · KL(y^s ‖ y^t),   L_distill^s = w · KL(y^t ‖ y^s)
- Hidden-state and attention distillation: Denoting H^t, H^s and A^t, A^s as the hidden states and attention maps of teacher and student, and W_h as a dimension-matching adaptor,

  L_hidden = MSE(W_h · H^s, H^t) + MSE(A^s, A^t)
- Total losses:

  L^t = L_task^t + L_distill^t + L_hidden

  L^s = L_task^s + L_distill^s + L_hidden
Gradients ∇L^t are used for the local teacher update, while the student gradient g_i = ∇L^s undergoes compression before being transmitted to the server. The reliability scaling ensures distillation is suppressed when predictions are unreliable.
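The output-level loss structure above can be sketched in plain numpy. This is a minimal illustration that omits the hidden-state and attention terms; `ce`, `kl`, and `mutual_distill_losses` are hypothetical names, and the exact weighting in the paper may differ in detail:

```python
import numpy as np

def ce(y_true, y_pred, eps=1e-12):
    """Cross-entropy of a predicted distribution against a one-hot target."""
    return float(-np.sum(y_true * np.log(y_pred + eps)))

def kl(p, q, eps=1e-12):
    """KL divergence KL(p || q) between two categorical distributions."""
    return float(np.sum(p * np.log((p + eps) / (q + eps))))

def mutual_distill_losses(y_true, y_teacher, y_student):
    """Mutual distillation with a reliability weight that suppresses
    the KL terms when either model's task loss is large."""
    l_task_t = ce(y_true, y_teacher)
    l_task_s = ce(y_true, y_student)
    w = 1.0 / (l_task_t + l_task_s)                   # reliability weight
    loss_t = l_task_t + w * kl(y_student, y_teacher)  # teacher learns from student
    loss_s = l_task_s + w * kl(y_teacher, y_student)  # student learns from teacher
    return loss_t, loss_s

y = np.array([1.0, 0.0, 0.0])
loss_t, loss_s = mutual_distill_losses(
    y, np.array([0.7, 0.2, 0.1]), np.array([0.5, 0.3, 0.2]))
```

When teacher and student agree exactly, both KL terms vanish and each total loss reduces to the corresponding task loss, which is the intended behavior of the mutual scheme.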
3. Communication-Efficient Gradient Compression
FedKD utilizes a truncated singular value decomposition (SVD) to compress student gradients prior to upload:
- For a gradient matrix G ∈ R^(P×Q), compute the truncated SVD G ≈ U·Σ·V^T with U ∈ R^(P×K), Σ ∈ R^(K×K), and V ∈ R^(Q×K).
- The rank K is chosen adaptively as the smallest value capturing an energy threshold T, i.e., Σ_{k≤K} σ_k² ≥ T · Σ_k σ_k²; the threshold itself is dynamically scheduled from T_start up to T_end over the course of training, tightening the approximation as training progresses.
- The server reconstitutes gradients from the received SVD factors and aggregates as above.
- Communication cost per client per round is reduced from P·Q to K·(P + Q + 1) scalars, yielding a typical compression ratio of about P·Q / (K·(P + Q + 1)).
This approach achieves an order-of-magnitude reduction in communication compared to uncompressed parameter or gradient exchange.
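A minimal numpy sketch of the compression step. The function names are illustrative, and the linear form of `threshold_schedule` is an assumption for illustration (the paper schedules the threshold dynamically, but the exact form may differ):

```python
import numpy as np

def compress(G, energy):
    """Truncated SVD: keep the smallest rank K whose squared singular
    values capture at least `energy` of the total spectral energy."""
    U, s, Vt = np.linalg.svd(G, full_matrices=False)
    cum = np.cumsum(s ** 2) / np.sum(s ** 2)
    K = int(np.searchsorted(cum, energy) + 1)
    return U[:, :K], s[:K], Vt[:K, :]          # K*(P+Q+1) scalars to upload

def decompress(U, s, Vt):
    """Server-side reconstruction of the approximate gradient matrix."""
    return (U * s) @ Vt

def threshold_schedule(t, T, t_start=0.95, t_end=0.98):
    """Tighten the energy threshold as training progresses
    (a linear schedule is assumed here for illustration)."""
    return t_start + (t_end - t_start) * t / T

rng = np.random.default_rng(0)
G = rng.standard_normal((64, 32))              # one gradient block
U, s, Vt = compress(G, energy=threshold_schedule(t=5, T=10))
G_hat = decompress(U, s, Vt)
```

By construction, the squared Frobenius error of `G_hat` is the discarded spectral energy, so it is bounded by (1 − T) times the total energy of `G`.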
4. Algorithmic Pseudocode and Workflow
Below is a high-level procedural pseudocode matching the formal protocol (Wu et al., 2021):

    Server: initialize student parameters θ_S
    for each round t = 1, 2, …:
        broadcast θ_S to the selected clients
        for each client i in parallel:
            load θ_S into the local student S
            for each batch of D_i:
                compute L^t and L^s (task + distillation + hidden losses)
                update the teacher T_i with ∇L^t
                accumulate the student gradient g_i = ∇L^s
            compress g_i by truncated SVD at energy threshold T(t); upload the factors
        Server: reconstruct each g_i, aggregate g = (1/N) Σ_i g_i, update θ_S ← θ_S − η·g
    Output: the global student S and the per-client teachers T_i
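This workflow can be exercised end-to-end on a toy least-squares problem. Everything below (the client data, learning rate, and 90% energy threshold) is an illustrative sketch, not the paper's implementation; each "client" computes a gradient for a shared linear student, compresses it by truncated SVD, and the server averages the reconstructions:

```python
import numpy as np

rng = np.random.default_rng(42)
P, Q, n_clients = 16, 8, 4
theta = np.zeros((P, Q))                      # shared student parameters
W_true = rng.standard_normal((P, Q)) / np.sqrt(P)

def svd_roundtrip(G, energy=0.9):
    """Compress a gradient by truncated SVD, then reconstruct it as the
    server would; K is the smallest rank capturing `energy` of the spectrum."""
    U, s, Vt = np.linalg.svd(G, full_matrices=False)
    cum = np.cumsum(s ** 2) / np.sum(s ** 2)
    K = int(np.searchsorted(cum, energy) + 1)
    return (U[:, :K] * s[:K]) @ Vt[:K, :]

# Toy local data: every client observes the same underlying mapping plus noise.
clients = []
for _ in range(n_clients):
    X = rng.standard_normal((32, P))
    clients.append((X, X @ W_true + 0.1 * rng.standard_normal((32, Q))))

for t in range(50):                           # communication rounds
    grads = []
    for X, Y in clients:
        g = X.T @ (X @ theta - Y) / len(X)    # local student gradient
        grads.append(svd_roundtrip(g))        # compressed upload, reconstructed
    theta -= 0.05 * np.mean(grads, axis=0)    # server aggregation + SGD step

init_loss = np.mean([np.mean(Y ** 2) for _, Y in clients])          # theta = 0
final_loss = np.mean([np.mean((X @ theta - Y) ** 2) for X, Y in clients])
```

Despite the lossy uplink, the averaged reconstructed gradients still drive the shared student toward the common solution, which is the core claim of the compression scheme.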
5. Communication-Cost Analysis
On practical benchmarks (MIND, SMM4H), the cost per client per round for FedAvg on full teacher models is 2.05 GB (MIND) and 1.37 GB (SMM4H). FedKD with a compressed student achieves 0.19 GB (MIND; a roughly 91% reduction) and 0.12 GB (SMM4H; a roughly 91% reduction), while a smaller student variant reduces costs further. Theoretical scaling is O(K·(P + Q)) rather than O(P·Q), since K ≪ min(P, Q).
| Method | MIND (AUC / comm. per round) | SMM4H (F1 / comm. per round) |
|---|---|---|
| FedAvg (teacher) | 70.9 / 2.05 GB | 60.6 / 1.37 GB |
| FedKD (larger student) | 71.0 / 0.19 GB | 60.7 / 0.12 GB |
| FedKD (smaller student) | 70.5 / 0.11 GB | 59.8 / 0.07 GB |
FedKD matches or slightly exceeds baseline accuracy at drastic communication savings.
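The scalar counts behind these savings are easy to check directly; for an illustrative 1024×1024 gradient block retained at rank K = 64 (sizes chosen for the example, not taken from the paper):

```python
# Uplink scalar counts before/after truncated-SVD compression (illustrative sizes).
P, Q, K = 1024, 1024, 64
full = P * Q                       # raw gradient matrix entries
compressed = K * (P + Q + 1)       # U: P*K scalars, singular values: K, V: Q*K
ratio = full / compressed          # ~8x fewer scalars at this rank
```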
6. Empirical Evaluation and Baselines
FedKD is benchmarked on MIND (large-scale news recommendation; metrics: AUC, MRR, nDCG) and SMM4H (binary adverse-drug-reaction tweet detection; metrics: precision, recall, F1). Compared baselines include centralized and local learning, distilled models (DistilBERT, TinyBERT), FetchSGD, and FedDropout.
On MIND, FedKD matches centralized UniLM (AUC = 71.0) at roughly an order of magnitude lower bandwidth; on SMM4H, FedKD achieves F1 = 60.7 vs. F1 = 60.6 for FedAvg, again at about 91% lower communication cost. Smaller student variants yield further savings with minimal degradation (about 0.5 pp AUC on MIND).
Experiments confirm that adaptive mutual distillation and dynamic compression allow FedKD to operate at a fraction of the communication cost without sacrificing model quality, robustly outperforming FedAvg and other state-of-the-art baselines under equivalent resource constraints.
7. Position within the Federated Distillation Landscape
FedKD is situated as a leading communication-efficient solution in the federated distillation (FD) taxonomy, as surveyed in (Li et al., 2024). It uniquely integrates a client-local teacher, a single shared student proxy, SVD-compressed gradient transfer, and adaptive distillation scaling, supporting model heterogeneity and robust privacy—qualities lacking in naive soft-output exchange or full-parameter aggregation. Its two-network per-client framework incurs local compute overhead but empirically yields near-optimal convergence and strong final model performance.
FedKD's methodology contrasts with parameter-averaging protocols (FedAvg), co-distillation, and other FD variants, providing a systematically evaluated trade-off between uplink cost, model accuracy, and hardware flexibility, and is extensible toward adaptive compression, privacy enhancement, and complex tasks beyond standard classification (Wu et al., 2021, Li et al., 2024).