Over-the-Air Federated Distillation
- Over-the-Air Federated Distillation (OTA-FD) is an edge learning approach that aggregates low-dimensional soft outputs through wireless channel superposition to reduce communication overhead.
- It leverages both coherent and noncoherent OTA aggregation methods to optimize beamforming, privacy mechanisms, and convergence speed in federated distillation systems.
- OTA-FD balances trade-offs between communication, privacy, and learning by integrating differential privacy, power control, and robust aggregation strategies for diverse wireless scenarios.
Over-the-Air Federated Distillation (OTA-FD) is an edge learning paradigm that combines federated distillation (FD) with over-the-air (OTA) wireless aggregation. In OTA-FD, edge devices cooperate by exchanging low-dimensional model outputs, typically class-wise soft-output vectors, which are aggregated directly through the superposition property of the wireless channel. The design thereby combines communication, computation, and privacy advantages: it sharply reduces communication overhead compared to conventional federated learning (FL), lessens the impact of limited channel resources and channel non-idealities, and enables privacy protection via distributed differential privacy mechanisms.
1. System Model and Protocol Structure
An OTA-FD system consists of $K$ wireless devices (WDs), each holding private data drawn from $C$ classes, coordinated over $T$ training rounds by a parameter server (PS). At each round:
- Local Knowledge Extraction: WD $k$ computes, for each class $c$, the mean soft-prediction vector
$$\bar{\mathbf{p}}_{k,c} = \frac{1}{|\mathcal{D}_{k,c}|} \sum_{\mathbf{x} \in \mathcal{D}_{k,c}} F_k(\mathbf{x}),$$
where $\mathcal{D}_{k,c}$ is the set of local samples of class $c$ at WD $k$ and $F_k(\mathbf{x})$ is the softmax output of the local model on input $\mathbf{x}$.
- Signal Preparation and Transmission: Each WD encodes its mean vectors, possibly injecting additive noise (e.g., for privacy), forming a transmit vector $\mathbf{x}_{k,c}$.
- OTA Aggregation: All WDs transmit $\mathbf{x}_{k,c}$ simultaneously. The PS receives a linear superposition corrupted by fading and noise:
$$\mathbf{y}_c = \sum_{k=1}^{K} h_k \mathbf{x}_{k,c} + \mathbf{n}_c,$$
where $h_k$ captures fading and $\mathbf{n}_c$ is noise.
- Global Knowledge Estimation: The PS applies a linear estimate (e.g., dividing by a normalization scalar $\eta$) to recover a noisy global soft-label vector $\hat{\mathbf{p}}_c$ per class.
- Broadcast and Local Update: The PS broadcasts $\hat{\mathbf{p}}_c$ to all WDs, which update their models by minimizing an FD loss that blends empirical and distillation (teacher-student) components.
This sequence leverages the multiple-access channel to compute ensemble model outputs directly in the physical layer, with OTA aggregation acting as an analog compute primitive (Seo et al., 2020, Ahn et al., 2020, Hu et al., 21 Jul 2025, Hu et al., 6 Aug 2025).
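The round described above can be sketched in a few lines of plain Python. This is a minimal illustration under simplifying assumptions that are not taken from the cited works: single-antenna Rayleigh fading, perfect CSI at the devices, channel-inversion precoding with a common gain, and an FD loss written as a weighted sum of two cross-entropy terms. All function names and the `weight` parameter are hypothetical.

```python
import math
import random


def softmax(logits):
    """Numerically stable softmax of a list of logits."""
    peak = max(logits)
    exps = [math.exp(v - peak) for v in logits]
    total = sum(exps)
    return [e / total for e in exps]


def class_mean_soft_predictions(logit_rows, labels, num_classes):
    """Mean softmax output per ground-truth class (zero vector if a class is absent)."""
    sums = [[0.0] * num_classes for _ in range(num_classes)]
    counts = [0] * num_classes
    for logits, label in zip(logit_rows, labels):
        counts[label] += 1
        for j, p in enumerate(softmax(logits)):
            sums[label][j] += p
    return [[s / counts[c] if counts[c] else 0.0 for s in sums[c]]
            for c in range(num_classes)]


def complex_gaussian(rng, variance):
    """Circularly symmetric complex Gaussian sample with the given variance."""
    std = math.sqrt(variance / 2.0)
    return complex(rng.gauss(0.0, std), rng.gauss(0.0, std))


def ota_round(device_vectors, noise_var, rng):
    """Average one class's soft-label vectors over a simulated fading channel."""
    num_devices = len(device_vectors)
    channels = [complex_gaussian(rng, 1.0) for _ in range(num_devices)]
    # Assumed design: every device inverts its channel down to a common gain,
    # so the weakest channel sets the gain and no precoder exceeds unit amplitude.
    gain = min(abs(h) for h in channels)
    estimate = []
    for j in range(len(device_vectors[0])):
        received = sum(h * (gain / h) * vec[j]
                       for h, vec in zip(channels, device_vectors))
        received += complex_gaussian(rng, noise_var)
        # Linear estimate at the server: undo the common gain and the device count.
        estimate.append((received / (gain * num_devices)).real)
    return estimate


def fd_loss(student_probs, label, global_soft_labels, weight):
    """Blend of hard-label cross-entropy and cross-entropy to the global teacher."""
    eps = 1e-12
    hard = -math.log(student_probs[label] + eps)
    teacher = global_soft_labels[label]
    soft = -sum(t * math.log(p + eps) for t, p in zip(teacher, student_probs))
    return (1.0 - weight) * hard + weight * soft


rng = random.Random(0)
local_vectors = [[0.7, 0.2, 0.1], [0.5, 0.3, 0.2], [0.6, 0.1, 0.3]]
print(ota_round(local_vectors, noise_var=0.01, rng=rng))
```

With `noise_var=0.0` the estimate reduces to the exact arithmetic mean of the device vectors, which makes the role of the channel as an analog averaging primitive easy to see.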
2. OTA Aggregation Mechanisms: Coherent and Noncoherent
Coherent OTA-FD requires uplink channel state information (CSI) and phase-aligned transmissions. Each device precodes its signal to compensate for its channel phase, so encoded model vectors add constructively at the PS (Seo et al., 2020, Hu et al., 21 Jul 2025). Beamforming and power control at the transmitter and receiver optimize alignment and minimize aggregation mean-squared error, subject to peak power constraints.
Noncoherent OTA-FD avoids uplink pilots and per-round CSI. Each device maps its soft-label probabilities to transmit energies (typically with constant-envelope/constant-power waveforms), enabling unbiased aggregation at the PS via energy detection. The SCENE estimator, for example, removes the noise-induced offset and normalizes per-class energies to achieve unbiasedness and low variance, with MSE scaling as $O(1/(SM))$, where $S$ is the repetition factor and $M$ the number of antennas (Chen et al., 17 Feb 2026).
| Variant | Uplink Alignment | Aggregation Primitive | CSI Requirement |
|---|---|---|---|
| Coherent OTA-FD | Phase-aligned | Weighted sum (complex) | Per-round, per-user |
| Noncoherent OTA-FD (SCENE) | No alignment | Energy-based averaging | None (pilot-free) |
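The energy-based row of the table can be illustrated with a generic energy-detection simulation. The sketch below is not the SCENE estimator itself; it only shows the underlying idea under assumptions introduced here: i.i.d. unit-variance Rayleigh fading that is redrawn for every antenna and repetition, a noise variance known at the server, and amplitudes equal to the square root of each probability. The function names are hypothetical.

```python
import math
import random


def complex_gaussian(rng, variance):
    """Circularly symmetric complex Gaussian sample with the given variance."""
    std = math.sqrt(variance / 2.0)
    return complex(rng.gauss(0.0, std), rng.gauss(0.0, std))


def energy_detect_average(device_probs, noise_var, repetitions, antennas, rng):
    """Estimate the per-class average probability from received energies only."""
    num_devices = len(device_probs)
    num_classes = len(device_probs[0])
    samples = repetitions * antennas
    estimates = []
    for c in range(num_classes):
        # Each device encodes probability p as transmit energy p (amplitude sqrt(p)).
        amplitudes = [math.sqrt(p[c]) for p in device_probs]
        energy = 0.0
        for _ in range(samples):
            # No phase alignment: each device sees an independent random channel.
            y = sum(complex_gaussian(rng, 1.0) * a for a in amplitudes)
            y += complex_gaussian(rng, noise_var)
            energy += abs(y) ** 2
        mean_energy = energy / samples
        # E[|y|^2] = sum of probabilities + noise variance, so subtracting the
        # noise offset and dividing by the device count gives an unbiased average.
        estimates.append((mean_energy - noise_var) / num_devices)
    return estimates


rng = random.Random(1)
probs = [[0.7, 0.2, 0.1], [0.5, 0.3, 0.2], [0.6, 0.1, 0.3]]
print(energy_detect_average(probs, noise_var=0.1, repetitions=50, antennas=80, rng=rng))
```

Because the estimate averages over `repetitions * antennas` independent energy samples, its variance shrinks in proportion to the inverse of that product, which matches the scaling quoted for the noncoherent scheme.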
3. Communication–Learning–Privacy Co-Design
OTA-FD modifies the canonical FL communication-optimization landscape in several respects (Hu et al., 6 Aug 2025, Hu et al., 21 Jul 2025):
- Optimization Variables:
  - Number of FD rounds $T$ (long-term design),
  - Transceiver parameters per round: transmit scalars/powers (split between the information signal and injected noise) and receiver combiners or estimators.
- Joint Objective: Minimize the asymptotic expected squared gradient norm (a proxy for convergence rate) while satisfying:
  - Peak power constraints,
  - Differential privacy (DP) constraints per user and class (quantified via the Gaussian mechanism and the moments accountant),
  - Alignment and aggregation quality, measured via mismatch and noise-variance terms.
Closed-form design solutions exist, particularly when DP noise can be shared among users. For instance, the alignment error can be driven exactly to zero by a closed-form choice of the transceiver parameters, with the remaining degrees of freedom allocated to privacy and channel noise. The number of training rounds is optimized to balance diminishing returns in loss against the accumulation of DP noise (Hu et al., 6 Aug 2025).
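One standard way to obtain zero alignment error is channel inversion with a common receive gain. The sketch below uses that textbook construction under a per-device peak-power limit; it is an assumption made for illustration and is not claimed to be the closed-form solution of the cited work. The names `aligned_precoders` and `alignment_error` are hypothetical.

```python
import math


def aligned_precoders(channels, peak_power):
    """Channel-inversion precoders that give every device the same effective gain."""
    # The weakest channel limits the common gain: inverting it uses the full peak power.
    gain = math.sqrt(peak_power) * min(abs(h) for h in channels)
    precoders = [gain / h for h in channels]
    return gain, precoders


def alignment_error(channels, precoders, gain):
    """Sum of squared deviations of each effective channel from the common gain."""
    return sum(abs(h * b - gain) ** 2 for h, b in zip(channels, precoders))


channels = [complex(0.8, -0.3), complex(-0.2, 1.1), complex(0.05, 0.4)]
gain, precoders = aligned_precoders(channels, peak_power=2.0)
print(alignment_error(channels, precoders, gain))
print([abs(b) ** 2 for b in precoders])
```

The design choice is visible in the printed powers: only the device with the weakest channel transmits at the peak, and a single deep fade lowers the common gain for everyone, which is why fallback strategies for badly faded devices matter in practice.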
4. Analytical Performance Bounds
Rigorous analysis provides convergence-rate bounds and privacy loss estimates:
- Convergence Rate: Under standard assumptions and a diminishing step-size schedule, the expected gradient norm decays with the number of rounds $T$, up to additive gaps induced by alignment error and noise.
- Aggregation Error: The mismatch term quantifies how well the air-aggregated vector approximates the target averages, depending on transceiver choices and channel conditions.
- Privacy Guarantee: DP is maintained via per-user noise injection. To ensure $(\epsilon, \delta)$-DP over $T$ rounds, the total noise variance (including channel noise) must exceed a threshold set by the privacy parameters and the number of rounds (Hu et al., 6 Aug 2025).
Noncoherent variants (e.g., SCENE) maintain unbiasedness for global model outputs regardless of channel phase, and achieve variance scaling as $O(1/(SM))$ (Chen et al., 17 Feb 2026).
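To give a feel for how a noise-variance requirement arises from a privacy target, the following sketch calibrates Gaussian noise with the classic single-release Gaussian mechanism and splits the budget evenly across rounds by basic composition. This is a deliberately conservative stand-in and not the moments-accountant bound used in the cited analysis; the sensitivity value and all function names are assumptions introduced here.

```python
import math


def gaussian_sigma(epsilon, delta, sensitivity):
    """Noise std of the classic Gaussian mechanism (valid for 0 < epsilon < 1)."""
    if not (0.0 < epsilon < 1.0) or not (0.0 < delta < 1.0):
        raise ValueError("classic bound requires 0 < epsilon < 1 and 0 < delta < 1")
    return sensitivity * math.sqrt(2.0 * math.log(1.25 / delta)) / epsilon


def per_round_sigma(total_epsilon, total_delta, rounds, sensitivity):
    """Basic composition: split the (epsilon, delta) budget evenly over the rounds."""
    return gaussian_sigma(total_epsilon / rounds, total_delta / rounds, sensitivity)


def required_injected_variance(total_variance, channel_noise_variance):
    """Artificial noise still needed after crediting channel noise (same units)."""
    return max(0.0, total_variance - channel_noise_variance)


sigma = per_round_sigma(total_epsilon=0.9, total_delta=1e-5, rounds=10, sensitivity=0.1)
print(sigma, required_injected_variance(sigma ** 2, channel_noise_variance=0.5))
```

The per-round noise grows with the number of rounds, which illustrates why the round count appears as a design variable: more rounds improve learning but tighten the per-round privacy budget.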
5. Trade-offs: Communication, Privacy, and Learning
OTA-FD presents a multifaceted design trade-off space described in multiple studies (Hu et al., 6 Aug 2025, Hu et al., 21 Jul 2025, Seo et al., 2020, Ahn et al., 2020, Ahn et al., 2019, Chen et al., 17 Feb 2026):
- Communication Overhead: OTA-FD drastically reduces the channel uses required per round. Example: for MNIST (10 classes), each round carries only class-wise soft-output vectors, whose size depends on the number of classes, rather than the full set of model parameters exchanged in FL.
- Privacy-Utility: DP noise for FD scales with the number of classes $C$, whereas in FL the noise must scale with the model dimension; FD therefore tolerates stricter privacy regimes before significant utility loss.
- Learning Speed: OTA-FD converges quickly relative to channel- and privacy-constrained FL. Increasing DP strength (lower $\epsilon$) or data imbalance shrinks the optimal number of rounds $T$.
- Robustness: Noncoherent OTA-FD (SCENE) outperforms coherent algorithms under high pilot overhead or short channel coherence, benefiting from hardware-friendliness and pilot-free operation (Chen et al., 17 Feb 2026).
- Spectral Efficiency and SNR: At low SNR or tight bandwidth, analog OTA-FD surpasses digital FL and digital FD; at high SNR and wide bandwidth, digital FD and FL regain the advantage (Ahn et al., 2020, Ahn et al., 2019).
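The communication-overhead point can be made concrete with simple counting. The sketch assumes that each device reports one length-$C$ soft-output vector per class, giving $C \times C$ entries per round, and compares this with the parameter count of a small fully connected network. The 784-128-10 layer sizes are a hypothetical example chosen for illustration, not a model taken from the cited studies.

```python
def fd_payload_entries(num_classes):
    """Entries per round when one length-C soft-output vector is sent per class."""
    return num_classes * num_classes


def mlp_parameter_count(layer_sizes):
    """Weights plus biases of a fully connected network with the given layer widths."""
    return sum(n_in * n_out + n_out
               for n_in, n_out in zip(layer_sizes, layer_sizes[1:]))


fd_entries = fd_payload_entries(10)               # 10-class task such as MNIST
fl_entries = mlp_parameter_count([784, 128, 10])  # hypothetical small MLP
print(fd_entries, fl_entries, fl_entries / fd_entries)
```

The FD payload is independent of the model size, so the gap widens further for deeper or wider networks.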
6. Practical Implementations and Variants
Several OTA-FD realizations are reported:
- FedAvg/FL vs. FD vs. Hybrid FD (HFD): FL transmits full models or gradients; FD exchanges only soft outputs; HFD enhances FD by mixing “covariate vectors” before distillation for improved generalization (Ahn et al., 2020, Ahn et al., 2019).
- Analog OTA-FD: (Joint source–channel coding) Direct, redundancy-coded transmission of logits/soft-labels using analog modulation, with repetition coding for MSE reduction (Ahn et al., 2020, Ahn et al., 2019).
- Digital OTA-FD: (Source–channel separated) Quantized, possibly sparsified transmission, relying on digital communication, less robust to strict spectral constraints.
- SCENE Noncoherent OTA-FD: Estimator with constant-envelope signaling and energy-based detection for pilot-free and hardware-friendly operation (Chen et al., 17 Feb 2026).
PHY-layer considerations include synchronization/beaconing, channel estimation for coherent schemes, power control, and fallback strategies to digital aggregation for devices with deep fades (Seo et al., 2020). Beamforming optimization on the PS can further improve learning performance under MIMO architectures (Hu et al., 21 Jul 2025).
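The MSE benefit of repetition coding in the analog variant can be checked with a short Monte Carlo experiment. The sketch assumes a real-valued additive white Gaussian noise channel with unit gain and independent noise across repetitions; the function names and parameter values are hypothetical.

```python
import random


def repeated_awgn_estimate(value, noise_std, repetitions, rng):
    """Send the same analog value several times and average the noisy copies."""
    received = [value + rng.gauss(0.0, noise_std) for _ in range(repetitions)]
    return sum(received) / repetitions


def empirical_mse(value, noise_std, repetitions, trials, rng):
    """Monte Carlo estimate of the mean-squared error of the averaged estimate."""
    errors = [(repeated_awgn_estimate(value, noise_std, repetitions, rng) - value) ** 2
              for _ in range(trials)]
    return sum(errors) / trials


rng = random.Random(3)
for reps in (1, 4, 16):
    print(reps, empirical_mse(0.3, noise_std=0.5, repetitions=reps, trials=4000, rng=rng))
```

Averaging independent copies divides the noise variance by the number of repetitions, so each quadrupling of the repetition factor should cut the measured MSE by roughly a factor of four, at the cost of proportionally more channel uses.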
7. Summary Table of OTA-FD Techniques
| Reference | OTA Aggregation Mechanism | Privacy | Transceiver Optimization | Empirical Observations |
|---|---|---|---|---|
| (Hu et al., 6 Aug 2025) | Coherent, CSI-based, low-dimensional soft outputs | Gaussian DP, joint | Closed-form, 2-timescale | Superior privacy/utility tradeoff, order-of-magnitude speedup over FL |
| (Seo et al., 2020) | Coherent, analog summation | None | Power/phase alignment | O(M)-size payload, fast convergence, SNR-dependent accuracy |
| (Hu et al., 21 Jul 2025) | Uplink power/beamformer opt. | None | SDR-based beamforming | Near-FD performance, drastic comm. savings, robust to imperfect CSI |
| (Ahn et al., 2020) | Both analog and digital, HFD | None | Compression/repetition | HFD outperforms FL/FD at low SNR, analog robust at low bandwidth |
| (Chen et al., 17 Feb 2026) | Noncoherent, energy detection | None | Constant-envelope HW | Pilot-free, O(1/(SM)) MSE, preferred for short coherence/hardware limits |
8. Concluding Remarks
Over-the-Air Federated Distillation, including its privacy-protected and noncoherent variants, enables large reductions in wireless edge learning communication costs relative to FL, with minor or no compromise in learning performance. The co-design of communication, privacy, and learning variables is both analytically tractable and practically critical: the best protocol configuration depends on the device and channel environment, the privacy regime, and the resource constraints. Noncoherent schemes such as SCENE are especially suited to short-coherence and hardware-restricted deployments, while fully coherent OTA-FD with beamforming and DP noise injection delivers near-ideal trade-offs in lower-mobility, high-SNR wireless networks. These results position OTA-FD as a strong candidate for scalable and privacy-preserving federated learning at the wireless edge (Hu et al., 6 Aug 2025, Hu et al., 21 Jul 2025, Seo et al., 2020, Ahn et al., 2020, Ahn et al., 2019, Chen et al., 17 Feb 2026).