
FedKD: Federated Knowledge Distillation

Updated 25 November 2025
  • FedKD is a federated learning paradigm that exchanges intermediate outputs (e.g., logits, features) via distillation rather than averaging parameters.
  • It utilizes teacher-student, peer-to-peer, and hybrid distillation strategies to overcome non-IID data challenges and support heterogeneous architectures.
  • Empirical results show that FedKD variants achieve comparable or superior accuracy to FedAvg while dramatically reducing communication costs.

Federated Knowledge Distillation (FedKD) denotes a family of federated learning (FL) algorithms in which the transfer of knowledge between clients and server is achieved through distillation rather than direct parameter sharing. In contrast to parameter averaging (FedAvg), FedKD frameworks exchange intermediate outputs (e.g., soft logits or feature activations) computed on a public or synthetic dataset. This enables significant reductions in communication cost, circumvents the need for homogeneous architectures, and often improves robustness to non-IID data. Theoretical analyses and experimental results across a range of scenarios demonstrate that FedKD and its variants deliver competitive or superior accuracy while dramatically lowering communication and relaxing system constraints (Li et al., 2 Apr 2024, Li et al., 2022, Seo et al., 2020, Wu et al., 2021).

1. Core Principles and Distillation Strategies

Classic FL aggregates local model parameters at the server via weighted averaging, as in FedAvg: $\theta_g \leftarrow \sum_{k=1}^K \frac{n_k}{N}\theta_k$, where $\theta_k$ and $n_k$ denote the local parameters and sample size for client $k$, and $N = \sum_k n_k$. This approach assumes identical architectures and is highly sensitive to client drift under non-IID data, leading to suboptimal global convergence.
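For concreteness, a minimal NumPy sketch of this aggregation step (the client count, parameter shapes, and sample sizes are illustrative):

```python
import numpy as np

def fedavg_aggregate(client_params, client_sizes):
    """Weighted average of client parameter vectors (FedAvg).

    client_params: list of 1-D numpy arrays, one flattened parameter vector per client
    client_sizes:  list of local sample counts n_k
    """
    total = sum(client_sizes)
    # theta_g = sum_k (n_k / N) * theta_k
    return sum((n / total) * theta for theta, n in zip(client_params, client_sizes))

# Toy usage: three clients with identical architectures (a requirement of FedAvg).
params = [np.random.randn(10) for _ in range(3)]
sizes = [100, 50, 150]
theta_g = fedavg_aggregate(params, sizes)
```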

In contrast, FedKD frameworks employ the following canonical protocol:

  • Each client evaluates its model on a (usually small) public dataset to produce logits, soft labels, or features.
  • These representations are uploaded to the server, which aggregates them (e.g., by simple mean, weighted mean, or robust ensemble).
  • The aggregate is returned as a "teacher" for local distillation, i.e., clients minimize a composite loss such as $L_\text{total} = L_\text{CE} + \alpha L_\text{distill}$, where $L_\text{distill}$ commonly takes the form of a temperature-scaled KL divergence between the teacher's and the client's outputs: $L_\text{distill} = \frac{1}{N}\sum_{n=1}^N D_{KL}\bigl(\sigma(z_n^{(\text{teacher})}/T)\,\|\,\sigma(z_n^{(\text{student})}/T)\bigr)$. Key strategies include teacher-student KD, co-distillation (peer-wise or global), adaptive temperature scaling, dynamic weighting of the KD loss, and local-global or bidirectional flows (Li et al., 2 Apr 2024, Li et al., 2022, Seo et al., 2020, Zhao et al., 25 Jun 2025, Zheng et al., 9 Mar 2025, Hossen et al., 26 Aug 2025, Yao et al., 2021). A minimal code sketch of this composite loss appears directly below.
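Assuming a PyTorch setting, the following sketch implements the composite loss above; the tensor names and the defaults for $\alpha$ and $T$ are illustrative choices rather than values prescribed by any particular FedKD paper.

```python
import torch
import torch.nn.functional as F

def fedkd_client_loss(student_logits, teacher_logits, labels, alpha=0.5, T=2.0):
    """Composite loss L_CE + alpha * L_distill used for local distillation.

    student_logits: (N, C) outputs of the local (student) model
    teacher_logits: (N, C) aggregated server/teacher outputs on the same batch
    labels:         (N,) hard labels for the supervised CE term (in many variants
                    CE is computed on the private local batch; shown here on one
                    batch for brevity)
    """
    ce = F.cross_entropy(student_logits, labels)
    # KL(teacher || student) with temperature scaling; the T**2 factor keeps
    # gradient magnitudes comparable across temperatures.
    distill = F.kl_div(
        F.log_softmax(student_logits / T, dim=1),
        F.softmax(teacher_logits / T, dim=1),
        reduction="batchmean",
    ) * (T ** 2)
    return ce + alpha * distill

# Toy usage
s = torch.randn(8, 10)           # student logits on a public batch
t = torch.randn(8, 10)           # aggregated teacher logits on the same batch
y = torch.randint(0, 10, (8,))   # hard labels
loss = fedkd_client_loss(s, t, y)
```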

2. Algorithmic Variants and Advanced Architectures

FedKD encompasses several methodological axes:

  • Server-client distillation: Each client updates its local model by distilling knowledge from aggregated server predictions (Li et al., 2 Apr 2024, Seo et al., 2020).
  • Peer-to-peer distillation: Clients distill from each other's outputs, sometimes directly, to promote consensus or diversity (Li et al., 2 Apr 2024).
  • Hybrid protocols: Some architectures (e.g., FedKD-hybrid) combine parameter averaging for a small subset of layers with logit-level distillation for the remainder, enabling richer functional adaptation under heterogeneity (Li et al., 7 Jan 2025).
  • Decentralized and data-free KD: Schemes such as FedDKD avoid transferring or requiring any data at the server; instead, the functional outputs (neural network maps) of local models are aligned via a decentralized divergence and the DKD loss (Li et al., 2022). Generator-based FedKD methods employ synthetic (GAN-generated) features when a trusted public dataset is unavailable (Zhao et al., 25 Jun 2025, Zheng et al., 9 Mar 2025).
  • Personalized and effective knowledge fusion: FedKD can personalize the distillation process by weighting peer knowledge contributions based on semantic similarity metrics (e.g., KL divergence between output distributions) rather than uniform averaging (Seyedmohammadi et al., 18 Mar 2024); a toy weighting sketch follows this list.
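By way of illustration of similarity-based weighting, here is a loose sketch in the spirit of the last item above (not the exact KnFu procedure; the exponential kernel and all function names are assumptions):

```python
import numpy as np

def softmax(z, axis=-1):
    z = z - z.max(axis=axis, keepdims=True)
    e = np.exp(z)
    return e / e.sum(axis=axis, keepdims=True)

def mean_kl(p, q, eps=1e-12):
    """Batch-averaged KL divergence between two (N, C) probability arrays."""
    return np.sum(p * (np.log(p + eps) - np.log(q + eps)), axis=-1).mean()

def fuse_peer_knowledge(local_logits, peer_logits_list, tau=1.0):
    """Fuse peer soft labels, down-weighting peers whose output distributions
    diverge from the local model's (semantic-similarity weighting).

    local_logits:     (N, C) local model outputs on the shared/public batch
    peer_logits_list: list of (N, C) peer outputs on the same batch
    """
    p_local = softmax(local_logits)
    weights = []
    for peer in peer_logits_list:
        p_peer = softmax(peer)
        sym_kl = 0.5 * (mean_kl(p_local, p_peer) + mean_kl(p_peer, p_local))
        weights.append(np.exp(-sym_kl / tau))
    weights = np.array(weights) / np.sum(weights)
    # The weighted average of peer soft labels serves as a personalized teacher.
    return sum(w * softmax(peer) for w, peer in zip(weights, peer_logits_list))
```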

Table 1 lists principal axes and example methods:

| Approach Category | Principle | Notable Instances |
| --- | --- | --- |
| Parameter-free (logit-based) | Only outputs exchanged; any architecture | FedMD, FedDF, DS-FL, FedDKD |
| Hybrid FL/KD | Shared-layer parameters + logit KD | FedKD-hybrid |
| Data-free KD | Generator synthesizes feature space | FedBKD, HFedCKD |
| Personalization via KD | Peer selection/fusion by semantic similarity | KnFu |
| Dual/distinct distillation | Logit + prototype, bidirectional, etc. | FedProtoKD, FedBKD |

3. Communication, Model, and Privacy Efficiency

A primary motivation for FedKD is bandwidth reduction. In parameter-based FL, per-round communication is $O(P)$ for $P$ parameters (often $10^6$–$10^9$), while FedKD reduces this to $O(|X^\text{pub}| \cdot C)$, where $|X^\text{pub}|$ is the number of public samples and $C$ the output class dimension, often a $10^2$–$10^4$-fold improvement (Seo et al., 2020, Li et al., 2 Apr 2024, Wu et al., 2021).
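A back-of-the-envelope comparison with illustrative sizes (a 25M-parameter model, 5,000 public samples, 100 classes, 4-byte floats):

```python
# Per-round upload cost in bytes (illustrative numbers only).
P = 25_000_000          # model parameters (e.g., a mid-sized ResNet)
X_pub, C = 5_000, 100   # public samples and output classes

param_bytes = 4 * P            # FedAvg-style parameter upload: O(P)
logit_bytes = 4 * X_pub * C    # FedKD-style logit upload: O(|X_pub| * C)

print(f"parameters: {param_bytes / 1e6:.1f} MB, logits: {logit_bytes / 1e6:.1f} MB, "
      f"ratio ~{param_bytes / logit_bytes:.0f}x")
# -> roughly a 50x reduction here; with larger models or fewer public samples
#    the gap reaches the 10^2-10^4 range cited above.
```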

FedKD accommodates heterogeneity in both model architecture and system capabilities:

  • Clients can employ different backbones or sizes, as logit/feature-based teacher signals are model-agnostic.
  • The server need not host the full set of client weights, nor enforce parameter compatibility.
  • FedKD is less susceptible to model inversion compared to raw parameter or gradient sharing, though output-based inversion attacks are still possible.

Some designs explicitly address non-homogeneous participation (e.g., IPWD in HFedCKD, which uses inverse-probability weighting to correct for sampling bias in low-participation regimes) or consider importance weighting for knowledge fusion (Zheng et al., 9 Mar 2025, Seyedmohammadi et al., 18 Mar 2024).
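A minimal sketch of the inverse-probability-weighting idea (a simplification, not the exact HFedCKD estimator; participation probabilities are assumed to be known or estimated elsewhere):

```python
import numpy as np

def ipw_fuse(client_logits, participation_probs):
    """Aggregate client logits while correcting for unequal participation.

    Clients that are rarely sampled receive weight proportional to 1/p_k, so the
    fused teacher is not biased toward frequently participating clients.

    client_logits:       list of (N, C) arrays from the clients seen this round
    participation_probs: list of per-client participation probabilities p_k
    """
    w = np.array([1.0 / p for p in participation_probs])
    w = w / w.sum()
    return sum(wk * logits for wk, logits in zip(w, client_logits))
```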

4. Theoretical Guarantees and Convergence Analyses

Theoretical understanding of FedKD is less mature than for FedAvg, due to the non-convexity and implicit regularization effects of KD. Partial results include:

  • In the neural tangent kernel (NTK) regime, co-distillation can yield vanishing estimation error as the number of federated peers increases (Seo et al., 2020).
  • Ensembled or stale teacher KD, as in FedGKD, can guarantee convergence to stationary points under standard smoothness, strong convexity, and bounded variance assumptions (Yao et al., 2021).
  • Function-space (output) averaging, rather than parameter merging, aligns models more naturally in the presence of permutation symmetries (non-uniqueness of parameters), which classical FedAvg cannot resolve (Li et al., 2022); a toy illustration follows this list.
  • Empirical analyses show that adaptively tuned distillation weights, temperature scaling, and semantic neighbor selection can mitigate accuracy loss under extreme heterogeneity (Zhang et al., 11 Aug 2024, Seyedmohammadi et al., 18 Mar 2024).
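The following toy example illustrates the permutation-symmetry point above: two one-hidden-layer networks that compute the same function but with permuted hidden units average cleanly in function space, whereas parameter averaging changes the function (all sizes are illustrative):

```python
import numpy as np

rng = np.random.default_rng(0)
x = rng.normal(size=(5, 4))                 # a small batch of inputs

# One-hidden-layer network f(x) = relu(x W1) W2
W1 = rng.normal(size=(4, 8))
W2 = rng.normal(size=(8, 3))
perm = rng.permutation(8)                   # permute hidden units
W1p, W2p = W1[:, perm], W2[perm, :]         # same function, different parameters

f = lambda a, b: np.maximum(x @ a, 0.0) @ b

# Parameter averaging (FedAvg) mixes mismatched hidden units and changes the function.
param_avg = f(0.5 * (W1 + W1p), 0.5 * (W2 + W2p))
# Output (function-space) averaging, as in distillation-style aggregation, is exact here.
output_avg = 0.5 * (f(W1, W2) + f(W1p, W2p))

print(np.allclose(output_avg, f(W1, W2)))   # True: identical functions average to themselves
print(np.allclose(param_avg, f(W1, W2)))    # Generally False: parameter averaging broke the function
```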

5. Empirical Results and Benchmarks

FedKD variants outperform or match FedAvg and recent FL enhancements under a range of settings:

  • On vision datasets (CIFAR-10, CIFAR-100) with ResNet backbones and Dirichlet $\alpha=0.1$ non-IID splits, logit-based KD (e.g., FedDF, FedDKD) improves accuracy by 2–5% and reduces communication by up to two orders of magnitude (Li et al., 2022, Li et al., 2 Apr 2024).
  • Domain-specific instantiations (FedKD-hybrid) yield state-of-the-art results in lithography hotspot detection on large-scale real-world FAB datasets (Li et al., 7 Jan 2025).
  • Dual-distillation schemes (FedProtoKD) and personalization-aware fusions (KnFu) achieve notable gains under both model and data heterogeneity, particularly in client test performance under severe non-IID distributions (Hossen et al., 26 Aug 2025, Seyedmohammadi et al., 18 Mar 2024).
  • Bidirectional and generator-based approaches deliver leading performance under privacy constraints and limited or unavailable public data (Zhao et al., 25 Jun 2025, Zheng et al., 9 Mar 2025).

Representative results are summarized below:

| Method | Setting | Acc. Improvement | Comm. Reduction | Reference |
| --- | --- | --- | --- | --- |
| FedKD/FedDF | CIFAR-10/100 | +2–5% vs FedAvg | ×10–100 | (Li et al., 2 Apr 2024) |
| FedDKD | FEMNIST | 92.7% vs 91.3% | 1/3 of comm. | (Li et al., 2022) |
| HFedCKD | Tiny-ImNet | +2.8 pp vs naive KD | | (Zheng et al., 9 Mar 2025) |
| FedBKD | CIFAR-10 | +2.7–5% | Equal/less | (Zhao et al., 25 Jun 2025) |
| KnFu | CIFAR-10 | +3–8% vs FedMD | ×5–10 | (Seyedmohammadi et al., 18 Mar 2024) |

6. Open Challenges and Future Directions

Major outstanding issues facing FedKD include:

  • Privacy leakage via logits: Inference and inversion attacks on output activations can compromise privacy, necessitating the integration of differential privacy, perturbed soft labels, or cryptographic aggregation (Li et al., 2 Apr 2024); a toy soft-label perturbation sketch follows this list.
  • Proxy/public data mismatch: Reliance on an auxiliary dataset for distillation may introduce bias or degrade convergence, especially when public data do not match client distributions. Generator-based and data-free KD partly address this challenge (Zhao et al., 25 Jun 2025, Zheng et al., 9 Mar 2025).
  • Scalability and heterogeneity: Robustness to client drop-out, asynchronous participation, and streaming or hierarchical topologies remains an active area (e.g., HFedCKD, hierarchical KD) (Zheng et al., 9 Mar 2025, Li et al., 2 Apr 2024).
  • Generalization to structured tasks: Extension of FedKD from classification to detection, segmentation, and transformer-based tasks is ongoing (Li et al., 2 Apr 2024).
  • Theory: Non-convexity, adaptive weighting, and function-space alignment in KD present analytic difficulties; convergence under arbitrary non-IID data and heterogeneous architectures remains largely open.
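As a mitigation for the first item above, one simple option is to perturb soft labels before upload; a minimal sketch with an uncalibrated noise scale (a real deployment would tune sigma for a formal differential-privacy guarantee):

```python
import numpy as np

def perturb_soft_labels(logits, sigma=0.1, temperature=1.0, rng=None):
    """Add Gaussian noise to soft labels before upload and re-normalize.

    This only illustrates the idea of perturbed outputs; sigma here is an
    illustrative choice, not a calibrated privacy parameter.
    """
    rng = rng or np.random.default_rng()
    z = logits / temperature
    probs = np.exp(z - z.max(axis=-1, keepdims=True))
    probs /= probs.sum(axis=-1, keepdims=True)
    noisy = np.clip(probs + rng.normal(scale=sigma, size=probs.shape), 1e-6, None)
    return noisy / noisy.sum(axis=-1, keepdims=True)
```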

7. Application Domains and Practical Recommendations

FedKD has been deployed in federated recommendation and search, medical imaging, edge vision, IoT sensor networks, and privacy-sensitive industrial settings.

FedKD is regarded as a generalization and extension of classical FL, offering bandwidth-scalable, privacy-conscious, and heterogeneity-resilient methods suitable for advanced collaborative learning deployments (Li et al., 2 Apr 2024, Li et al., 2022, Seo et al., 2020, Zheng et al., 9 Mar 2025, Hossen et al., 26 Aug 2025).
