Federated Heterogeneous Distillation

Updated 22 April 2026

Federated heterogeneous distillation is a framework that enables collaborative training across diverse model architectures through knowledge distillation rather than direct weight averaging.
It employs techniques such as logit, feature, and synthetic data distillation to overcome non-IID challenges and reduce communication bandwidth.
This approach improves personalization, scalability, and privacy by decoupling training from homogeneous model constraints.

Federated heterogeneous distillation is a family of methods for collaborative training and knowledge transfer in federated learning (FL) systems where participating clients employ heterogeneous (potentially arbitrarily different) model architectures, operate on statistically heterogeneous (often highly non-IID) data, and have diverse hardware or privacy constraints. This paradigm extends classical FL by replacing or augmenting parameter aggregation with knowledge distillation (KD) mechanisms, so that information is shared through auxiliary representations such as logits, features, or synthetic data rather than through direct weight averaging. The framework supports heterogeneity in local models, enables targeted personalization, reduces communication cost, and, depending on the implementation, can improve privacy, robustness, and scalability.

1. Core Principles of Federated Heterogeneous Distillation

In traditional FL, all clients are required to use the same model architecture so that local models can be directly averaged (e.g., FedAvg). This requirement severely limits flexibility in real-world applications where participants may have different computational resources, data modalities, and performance needs. Federated heterogeneous distillation removes this constraint by shifting aggregation and information sharing into the knowledge space, typically through forms of distillation such as logit-based KL divergence, feature matching, consensus prototypes, or data-free synthetic sets.

Canonical approaches, such as FedMD (Li et al., 2019), FedH2L (Li et al., 2021), and various server-driven logit or feature distillation paradigms, operate as follows:

Local training: Each client trains or fine-tunes its own model (architecture may be unique) on its private, potentially non-IID data.
Knowledge sharing: At communication rounds, clients share soft labels or features (on a public or proxy dataset), or use an auxiliary network for communication (e.g., messengers (Xie et al., 2024)).
Aggregation/Distillation: The server (or peers) aggregates knowledge via logit/feature averaging, ensembling, or contrastive objectives; this is followed by distilling the global knowledge into each client (or a global model).
Personalization: Clients incorporate global knowledge using distillation-regularized objectives, balancing local and federated knowledge.

This approach supports arbitrary model architectures, decentralizes aggregation (where needed), and detaches learning from architectural constraints.
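
The knowledge-sharing and aggregation steps above can be sketched in a few lines. This is a minimal illustration in the spirit of FedMD-style score averaging, not the exact procedure of any cited method; `aggregate_consensus`, `client_a`, `client_b`, and `public_set` are hypothetical names introduced here, and the two "clients" are toy functions standing in for arbitrary architectures.

```python
def aggregate_consensus(client_scores):
    """Average per-sample class scores across clients.

    client_scores[k][i] is client k's score vector for public sample i.
    Only these vectors are exchanged; model weights never leave a client,
    so the clients may use unrelated architectures.
    """
    num_clients = len(client_scores)
    num_samples = len(client_scores[0])
    consensus = []
    for i in range(num_samples):
        per_client = [client_scores[k][i] for k in range(num_clients)]
        # zip(*per_client) groups the scores class by class.
        consensus.append([sum(col) / num_clients for col in zip(*per_client)])
    return consensus


# Two hypothetical clients with unrelated "architectures".
def client_a(x):
    # A linear scorer.
    return [x[0] - x[1], x[1] - x[0]]


def client_b(x):
    # A hand-written threshold rule.
    return [2.0, 0.0] if x[0] > 0.5 else [0.0, 2.0]


public_set = [[1.0, 0.0], [0.0, 1.0]]  # assumed shared, non-private samples
client_scores = [[model(x) for x in public_set] for model in (client_a, client_b)]
consensus = aggregate_consensus(client_scores)
# consensus == [[1.5, -0.5], [-0.5, 1.5]]
```

Each client would then treat `consensus` as the teacher signal when it distills on the same public samples.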

2. Distillation Mechanisms and Communication Strategies

Federated heterogeneous distillation encompasses a diverse set of knowledge-sharing mechanisms and synchronization protocols, including:

Logit and soft-label distillation: Clients output class scores (logits, or softmax probabilities that are optionally temperature-scaled) on an agreed-upon public or synthetic dataset. The averaged outputs serve as the teacher signal for subsequent distillation steps (e.g., FedMD, FedH2L, FedAUXfdp (Hoech et al., 2022)).
Feature-level distillation and projection: Instead of logits, intermediate features are aligned using projection layers and orthogonal transformations to mitigate representation mismatch (FedFD (Li, 14 Jul 2025)).
Mutual and bidirectional distillation: Clients exchange posteriors with each other (peer-to-peer), or multiple models (of different capacities) are distilled bidirectionally (e.g., CoDA codistillation (Lichtarge et al., 2023)).
Synthetic/public dataset mediation: Auxiliary datasets, either public or server-synthesized (e.g., via conditional generators, as in DFRD (Luo et al., 2023)), serve as the medium for distillation, allowing for knowledge transfer without exposing private data.
Messenger networks: Lightweight, homogeneous messenger models mediate knowledge exchange between entirely heterogeneous backbones, as in MH-pFLID (Xie et al., 2024).
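
As an illustration of the logit-based mechanism in the list above, the following sketch computes a temperature-scaled KL distillation loss and mixes it with a local cross-entropy term. It assumes the standard Hinton-style formulation with a `temperature ** 2` scaling factor; the function names, `kd_weight`, and the default values are assumptions made for this example and are not taken from any of the cited methods.

```python
import math


def log_softmax(logits, temperature=1.0):
    """Numerically stable log-softmax of temperature-scaled logits."""
    scaled = [z / temperature for z in logits]
    peak = max(scaled)
    log_norm = peak + math.log(sum(math.exp(s - peak) for s in scaled))
    return [s - log_norm for s in scaled]


def kd_loss(student_logits, teacher_logits, temperature=1.0):
    """KL(teacher || student) on softened distributions, scaled by T^2."""
    log_p_teacher = log_softmax(teacher_logits, temperature)
    log_p_student = log_softmax(student_logits, temperature)
    kl = sum(
        math.exp(lt) * (lt - ls)
        for lt, ls in zip(log_p_teacher, log_p_student)
    )
    return temperature ** 2 * kl


def cross_entropy(logits, label):
    """Negative log-likelihood of the true class on private data."""
    return -log_softmax(logits)[label]


def client_objective(student_logits, label, teacher_logits,
                     kd_weight=0.5, temperature=2.0):
    """Convex mix of the local loss and the distillation loss.

    kd_weight is a hypothetical hyperparameter: 0 ignores the federated
    teacher, 1 ignores the local label.
    """
    local = cross_entropy(student_logits, label)
    shared = kd_loss(student_logits, teacher_logits, temperature)
    return (1.0 - kd_weight) * local + kd_weight * shared
```

In a real system the teacher logits would be the aggregated outputs on a public or synthetic sample, and the loss would be minimized with an automatic-differentiation framework rather than evaluated by hand.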

Communication efficiency is achieved by transmitting only soft-label vectors (often a few kilobytes), as opposed to millions of weight parameters, resulting in orders-of-magnitude reduction in bandwidth (Li et al., 2021, Liu et al., 2022).
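
A back-of-the-envelope calculation makes the bandwidth gap concrete. The numbers below are purely illustrative assumptions (100 public samples, 10 classes, a hypothetical 11-million-parameter model, 32-bit floats, no compression) and are not figures reported by the cited papers.

```python
BYTES_PER_FLOAT32 = 4


def soft_label_bytes(num_public_samples, num_classes):
    """Payload when a client uploads one score vector per public sample."""
    return num_public_samples * num_classes * BYTES_PER_FLOAT32


def weight_bytes(num_parameters):
    """Payload when a client uploads its full set of model weights."""
    return num_parameters * BYTES_PER_FLOAT32


labels = soft_label_bytes(100, 10)   # 4_000 bytes, about 4 KB
weights = weight_bytes(11_000_000)   # 44_000_000 bytes, about 44 MB
ratio = weights / labels             # 11000.0
```

Under these assumptions the soft-label payload is about four orders of magnitude smaller per round; the gap narrows as the public set or the number of classes grows.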

3. Advanced Optimization, Scalability, and Security

State-of-the-art federated heterogeneous distillation frameworks incorporate algorithmic enhancements for efficiency, scalability, and security, particularly in resource-constrained or adversarial settings.

Joint model selection and resource allocation: Digital twin–assisted knowledge distillation (with server-based big teacher networks) permits adaptive selection of student model sizes, training location (local vs offloaded), and bandwidth/CPU allocation using joint Q-learning and convex optimization (Wang et al., 2023).
Scalability: FedSDD decouples server-side distillation cost from the total number of clients by limiting the distillation ensemble to a small set of aggregated global models and employing temporal ensembling, making the method scalable to large client populations (Kwan et al., 2023).
Peer-to-peer protocols: FedSKD uses round-robin model circulation combined with multi-dimensional feature similarity distillation (at the batch, pixel, and region levels), avoiding central aggregation entirely in order to maximize scalability and support for heterogeneity (Weng et al., 23 Mar 2025).
Secure and verifiable aggregation: SVAFD employs client-led Lagrange-coded co-aggregation of logits (rather than server-dominated aggregation), together with cryptographic verification, to deliver robust privacy and integrity for FD in the presence of stragglers and colluding adversaries (Wen et al., 19 May 2025).
Differential privacy: Incorporation of certified DP mechanisms in the head parameters and output soft-labels ensures privacy even under strong adversarial assumptions (FedAUXfdp (Hoech et al., 2022)).
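
To make the idea of privatizing shared soft labels concrete, here is a minimal sketch of the textbook Laplace mechanism applied to a single soft-label vector. It is not the mechanism used by FedAUXfdp or any other cited method. The default `l1_sensitivity=2.0` is the worst-case L1 distance between two probability vectors, which is a conservative assumption; a real deployment would need its own sensitivity analysis and privacy accounting across samples and rounds. All function names are hypothetical.

```python
import random


def laplace_noise(scale, rng):
    """Sample Laplace(0, scale) as the difference of two exponentials."""
    return rng.expovariate(1.0 / scale) - rng.expovariate(1.0 / scale)


def privatize_soft_label(probs, epsilon, l1_sensitivity=2.0, rng=None):
    """Add Laplace noise to a soft label, then map it back to a distribution.

    The noise scale is l1_sensitivity / epsilon per coordinate. Clipping and
    renormalizing are post-processing steps, so they do not weaken the
    guarantee provided by the noise itself.
    """
    rng = rng or random.Random()
    scale = l1_sensitivity / epsilon
    noisy = [p + laplace_noise(scale, rng) for p in probs]
    clipped = [max(value, 0.0) for value in noisy]
    total = sum(clipped)
    if total == 0.0:
        # Everything was clipped away: fall back to the uniform distribution.
        return [1.0 / len(probs)] * len(probs)
    return [value / total for value in clipped]


# Usage with a fixed seed so that the run is repeatable.
released = privatize_soft_label([0.7, 0.2, 0.1], epsilon=1.0, rng=random.Random(0))
```

Smaller values of `epsilon` add more noise, so the released vector carries less information about the client's model and, indirectly, its private data.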

4. Theoretical Analysis and Convergence Guarantees

Multiple works provide theoretical foundations for the convergence and statistical robustness of federated heterogeneous distillation:

Convergence rates: Under standard smoothness and bounded variance assumptions, mutual distillation frameworks (e.g., FedH2L, MH-pFLID) guarantee sublinear convergence rates ($O(1/T)$), even when private model architectures are completely unrelated (Xie et al., 2024, Li et al., 2021).
Variance reduction: In reinforcement learning, action-distribution distillation can provably reduce policy-gradient estimator variance, thereby accelerating convergence and improving sample efficiency (Jiang et al., 2 Feb 2025).
Generalization: Prompt-tuning with logit distillation (FedHPL) admits a multi-term generalization error bound, incorporating prompt-tuning, distillation, and domain shift terms (Ma et al., 2024).
Analytical bounds: Under Gaussian mixture models, asymptotic bounds for self-training performance under FD reveal that larger distillation sets and lower-entropy sampling improve angle alignment to the Bayes classifier (Liu et al., 2022).

Impossibility (no-go) results still apply to settings that combine extreme model and data heterogeneity with no suitable mediating data; nevertheless, empirical and analytical results consistently indicate that distillation-based aggregation preserves or improves both system-level and client-level performance under substantial heterogeneity.

5. Empirical Performance Across Modalities

Federated heterogeneous distillation has been extensively validated on diverse datasets and tasks, spanning computer vision (CIFAR-10/100, Tiny-ImageNet), medical diagnostics (fMRI, skin lesion, breast histopathology), natural language processing (AG News, SST-5), time-series (sleep EEG), and reinforcement learning control.

Accuracy improvements: Across benchmarks and under severe non-IID or model-heterogeneous regimes, KD-based methods consistently surpass FedAvg and other aggregation-based baselines. Examples include (i) an 18-point improvement in global accuracy under strong non-IID and model-heterogeneous settings on CIFAR-10 (DFRD (Luo et al., 2023)), (ii) +20% gains over isolated training on MNIST/FEMNIST/CIFAR-100 (FedMD (Li et al., 2019)), and (iii) 1.13–34.13% improvements in pathological splits for prototype-based dual-KD (FedProtoKD (Hossen et al., 26 Aug 2025)).
Personalization: Messenger-based and dual-distillation architectures deliver strong client-specific accuracy without global model compromise (Xie et al., 2024, Weng et al., 23 Mar 2025, Hossen et al., 26 Aug 2025).
Resource and communication efficiency: Prompt-tuning with logit distillation (FedHPL) achieves state-of-the-art performance while reducing trainable parameters by more than 99% and compressing communication by a similar factor (Ma et al., 2024).
Scalability: Frameworks with ensemble-based or peer-to-peer protocols (FedSDD, FedSKD) demonstrate plateaued compute costs as client count increases (in contrast to linear scaling in classical KD/parameter sharing) (Kwan et al., 2023, Weng et al., 23 Mar 2025).
Secured aggregation: SVAFD delivers robust accuracy and low attack success rates under poisoning and inference attacks, maintaining privacy and integrity (Wen et al., 19 May 2025).

6. Representative Methodological Variants

Method	Communication	Model Heterogeneity	Key KD Scheme	Distillation Medium	Security/Scalability
FedMD (Li et al., 2019)	Central server	Arbitrary	Logit KL	Public dataset	Basic; no explicit secure aggregation
FedH2L (Li et al., 2021)	Decentralized	Arbitrary	Mutual distill	Public "seed" set	Efficient, no server
DFRD (Luo et al., 2023)	Central server	Arbitrary	Teacher ensemble	Synthetic generator (no data)	Data-free, EMA stability
MH-pFLID (Xie et al., 2024)	Central server	Arbitrary	Dual inject/distill	Messenger network	Personalized, theoretical
FedFD (Li, 14 Jul 2025)	Central server	Grouped	Orthogonal feature	Feature vectors/projection	Bias-corrected, stable
FedGKD (Yao et al., 2021)	Central server	Homogeneous	Historical model KD	Past model ensemble	Theoretically analyzed
FedAUXfdp (Hoech et al., 2022)	Central server	Arbitrary	Certainty-weighted	Logits on public data	Full DP, one-shot
FedSKD (Weng et al., 23 Mar 2025)	Peer-to-peer	Arbitrary	Multi-dimensional	Feature space	Decentralized
FedSDD (Kwan et al., 2023)	Central server	Sub-grouped	Main-only distill	Ensemble of global models	Highly scalable
SVAFD (Wen et al., 19 May 2025)	Hybrid	Arbitrary	Secure distill	Logits, LCC-coded	Co-aggregation, verifiable
FedHPL (Ma et al., 2024)	Central server	Arbitrary	Prompt-logit distill	Prompted backbone logits	Gen. bound, efficient

Each approach operationalizes federated heterogeneous distillation with design choices tailored to its task requirements, resource limits, and privacy constraints.

7. Limitations, Future Directions, and Open Challenges

While federated heterogeneous distillation yields substantial empirical and theoretical advances, several challenges remain:

Public data dependence: Most methods require a public, unlabeled (or weakly labeled) dataset as the medium for communication. Data-free synthetic generation mitigates but does not eliminate this dependency (Luo et al., 2023, Huang et al., 2023).
Privacy risks in soft-label sharing: Transmitting soft labels or features can leak private information, which calls for safeguards such as noise addition, DP, or secure aggregation (Wen et al., 19 May 2025, Hoech et al., 2022).
Architecture-agnostic feature space alignment: Tackling misaligned/high-dimensional representations in deep heterogeneous models is nontrivial; recent solutions include feature projection (FedFD (Li, 14 Jul 2025)) and messenger networks (Xie et al., 2024).
Optimization and parameter tuning: Balancing local and federated objectives, and setting hyperparameters such as distillation weights, remains largely empirical and domain-specific.
Scalability in ultra-large systems: While group-ensemble (FedSDD) and peer-to-peer (FedSKD) approaches improve scalability, further progress is required for massive-scale federated settings.

Potential directions include enhancing privacy guarantees (via advanced cryptography or DP), extending distillation frameworks beyond classification (to regression, segmentation, and reinforcement learning (Jiang et al., 2 Feb 2025)), optimizing device-to-device protocols, and reducing reliance on auxiliary data through online generator training and active data selection.

Federated heterogeneous distillation unifies, extends, and strengthens the robustness of distributed learning across model, statistical, and system heterogeneity, delivering scalable, communication-efficient, privacy-aware collaborative learning in real-world, resource-diverse environments.