FedKD: Communication Efficient Federated Learning via Knowledge Distillation (2108.13323v2)

Published 30 Aug 2021 in cs.LG and cs.CL

Abstract: Federated learning is widely used to learn intelligent models from decentralized data. In federated learning, clients need to communicate their local model updates in each iteration of model learning. However, model updates are large in size if the model contains numerous parameters, and many rounds of communication are usually needed until the model converges. Thus, the communication cost in federated learning can be quite heavy. In this paper, we propose a communication efficient federated learning method based on knowledge distillation. Instead of directly communicating the large models between clients and server, we propose an adaptive mutual distillation framework to reciprocally learn a student and a teacher model on each client, where only the student model is shared by different clients and updated collaboratively to reduce the communication cost. Both the teacher and student on each client are learned on its local data and the knowledge distilled from each other, where their distillation intensities are controlled by their prediction quality. To further reduce the communication cost, we propose a dynamic gradient approximation method based on singular value decomposition to approximate the exchanged gradients with dynamic precision. Extensive experiments on benchmark datasets in different tasks show that our approach can effectively reduce the communication cost and achieve competitive results.

Citations (316)


Summary

The paper introduces FedKD, which uses adaptive mutual knowledge distillation to reduce communication overhead in federated learning.
It employs dynamic gradient approximation with singular value decomposition to compress updates, significantly lowering data transmission (e.g., from 2.05GB to 0.19GB).
The approach maintains competitive predictive performance on benchmark tasks, enabling efficient deployment on resource-constrained devices.

Communication Efficient Federated Learning via Knowledge Distillation: A Summary of FedKD

The paper "FedKD: Communication Efficient Federated Learning via Knowledge Distillation" addresses the communication cost of federated learning (FL). FL trains models on decentralized data, so raw data stays with the clients, which helps protect user privacy. In exchange, clients must communicate their local model updates in every training round, and training usually takes many rounds to converge. When the model has many parameters, this repeated exchange becomes a significant overhead.

Overview of Contributions

To tackle this challenge, the authors propose FedKD, which applies knowledge distillation within federated learning. Each client learns a large teacher model and a smaller student model together; only the student model is shared among clients and updated collaboratively, so the large model is never communicated. Specifically, the paper presents the following components:

Adaptive Mutual Knowledge Distillation: On each client, the teacher and student models learn reciprocally from each other as well as from the local data. The distillation intensity is controlled by the quality of the models' predictions, with the aim of improving both the teacher and the student.
Dynamic Gradient Approximation: Using singular value decomposition (SVD), the authors approximate the exchanged gradients and adjust the approximation precision dynamically, which further reduces communication overhead.
Federated Learning Integration: Because only the student model is exchanged, FedKD combines knowledge distillation with federated learning to achieve competitive task performance at markedly lower communication cost.
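To make the adaptive mutual distillation idea more concrete, the following is a minimal NumPy sketch of how per-client loss values might be computed. It is an illustration under stated assumptions, not the authors' implementation: the weighting used here (each distillation term divided by the sum of the two task losses, so that poor predictions weaken the distillation) is one plausible way to let prediction quality control the intensity, and the function names are hypothetical. The sketch only evaluates loss values; a real client would compute them in an autodiff framework to obtain gradients.

```python
import numpy as np

def softmax(logits):
    # Numerically stable softmax over the last axis.
    z = logits - logits.max(axis=-1, keepdims=True)
    e = np.exp(z)
    return e / e.sum(axis=-1, keepdims=True)

def cross_entropy(probs, labels):
    # Mean negative log-likelihood of the true class.
    picked = probs[np.arange(len(labels)), labels]
    return float(-np.log(picked + 1e-12).mean())

def kl_divergence(p, q):
    # Mean KL(p || q) over the batch.
    per_example = (p * (np.log(p + 1e-12) - np.log(q + 1e-12))).sum(axis=-1)
    return float(per_example.mean())

def adaptive_mutual_distillation(teacher_logits, student_logits, labels):
    """Return (teacher_loss, student_loss) for one local batch.

    Assumption: the distillation terms are scaled by 1 / (sum of task losses),
    so distillation is weaker when the models' predictions are poor.
    """
    p_t = softmax(teacher_logits)
    p_s = softmax(student_logits)
    task_t = cross_entropy(p_t, labels)
    task_s = cross_entropy(p_s, labels)
    scale = 1.0 / (task_t + task_s + 1e-12)
    distill_t = kl_divergence(p_s, p_t) * scale  # teacher learns from student
    distill_s = kl_divergence(p_t, p_s) * scale  # student learns from teacher
    return task_t + distill_t, task_s + distill_s
```

When both models produce identical predictions, the distillation terms vanish and each loss reduces to the ordinary task loss; the further the two models disagree, the larger the distillation terms become, scaled down when the task losses are high.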

Experimental Validation

Extensive experiments evaluate FedKD on benchmark datasets from different tasks, including MIND for news recommendation and SMM4H for adverse drug reaction detection. The results indicate that FedKD achieves predictive performance comparable to, if not better than, that of traditional FL methods such as FedAvg, while significantly reducing the communication burden. In one reported example, the communication cost drops from 2.05GB to 0.19GB.
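The communication savings discussed above depend partly on approximating gradients with SVD. The sketch below shows, under stated assumptions, how a gradient matrix might be truncated to the smallest rank that retains a chosen fraction of its squared singular values, and how the number of transmitted values compares with the dense matrix. The function names, the energy-based rank selection, and the linear threshold schedule with its default values are illustrative assumptions and not necessarily the paper's exact procedure.

```python
import numpy as np

def energy_threshold_schedule(progress, start=0.95, end=0.98):
    # Hypothetical linear schedule: precision rises as training progresses
    # (progress in [0, 1]). The default values are illustrative only.
    return start + (end - start) * progress

def compress_gradient(grad, energy_threshold):
    """Keep the fewest singular components whose squared singular values
    cover at least `energy_threshold` of the total. Assumes grad is nonzero."""
    u, s, vt = np.linalg.svd(grad, full_matrices=False)
    energy = np.cumsum(s ** 2) / np.sum(s ** 2)
    k = int(np.searchsorted(energy, energy_threshold)) + 1
    k = min(k, len(s))  # guard against rounding at threshold ~ 1.0
    return u[:, :k], s[:k], vt[:k, :]

def decompress_gradient(u, s, vt):
    # Rebuild the approximate gradient from its factors.
    return (u * s) @ vt

def payload_size(u, s, vt):
    # Number of scalar values that would be transmitted.
    return u.size + s.size + vt.size

# Usage: a 64 x 32 gradient that happens to have rank 2.
rng = np.random.default_rng(0)
grad = rng.standard_normal((64, 2)) @ rng.standard_normal((2, 32))
u, s, vt = compress_gradient(grad, energy_threshold_schedule(0.5))
approx = decompress_gradient(u, s, vt)
print(payload_size(u, s, vt), "values sent instead of", grad.size)
```

For an m x n gradient truncated to rank k, the factors hold k(m + n + 1) values rather than mn, so the saving grows as the retained rank shrinks relative to the matrix dimensions. Raising the threshold over time trades some of that saving for a more precise approximation.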

Implications and Future Directions

FedKD has both practical and theoretical implications. Practically, it offers a viable path for deploying large-scale, privacy-preserving models on mobile or resource-constrained devices, where communication cost and energy efficiency are critical. Theoretically, it opens new avenues for combining FL with model compression techniques such as knowledge distillation.

Looking forward, future work could explore distillation techniques that further refine the mutual learning dynamics between teacher and student models, or gradient compression techniques that preserve more gradient information while transmitting less data. The paper sets the stage for research on more efficient, scalable FL systems for domains that demand strong privacy protection but are constrained by infrastructure.

In summary, this work contributes significantly to the field of federated learning by offering a novel solution to its inherent communication inefficiencies, thus advancing the practical applicability of FL methodologies in real-world scenarios.
