Homogenizing Non-IID datasets via In-Distribution Knowledge Distillation for Decentralized Learning (2304.04326v2)
Abstract: Decentralized learning enables serverless training of deep neural networks (DNNs) in a distributed manner across multiple nodes. This allows the use of large datasets as well as training from a wide variety of data sources. However, a key challenge in decentralized learning is heterogeneity of the data distribution across the nodes. In this paper, we propose In-Distribution Knowledge Distillation (IDKD) to address the challenge of heterogeneous data distribution. The goal of IDKD is to homogenize the data distribution across the nodes. While such homogenization could be achieved by exchanging data among the nodes at the cost of privacy, IDKD achieves the same objective using a common public dataset across nodes without violating the privacy constraint. This public dataset is distinct from the training dataset and is used to distill the knowledge from each node and communicate it to the node's neighbors through the generated labels. With traditional knowledge distillation, the generalization of the distilled model suffers because all public dataset samples are used irrespective of their similarity to the local dataset. We therefore introduce an Out-of-Distribution (OoD) detector at each node to label only the subset of the public dataset that maps close to the local training data distribution. Finally, only the labels corresponding to these subsets are exchanged among the nodes; after appropriate label averaging, each node is fine-tuned on these data subsets along with its local data. Our experiments on multiple image classification datasets and graph topologies show that the proposed IDKD scheme is more effective than traditional knowledge distillation and achieves state-of-the-art generalization performance on heterogeneously distributed data with minimal communication overhead.
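To make the pipeline described in the abstract concrete, the following is a minimal PyTorch sketch of one node's side of IDKD. It uses a maximum-softmax-probability score as a stand-in OoD detector and assumes, for simplicity, that all neighbors return soft labels for the same public subset; the function names (`select_in_distribution`, `finetune_with_neighbor_labels`), the threshold `tau`, and the training hyperparameters are illustrative assumptions rather than the authors' implementation.

```python
# Illustrative sketch of the IDKD steps described in the abstract.
# Names, thresholds, and the exact label-exchange protocol are assumptions.
import torch
import torch.nn.functional as F
from torch.utils.data import DataLoader, TensorDataset

@torch.no_grad()
def select_in_distribution(model, public_loader, tau=0.9):
    """Keep public samples the local model scores as in-distribution
    (maximum softmax probability above `tau`) and return them together
    with the local soft labels."""
    model.eval()
    kept_x, kept_p = [], []
    for x, _ in public_loader:
        probs = F.softmax(model(x), dim=1)
        score, _ = probs.max(dim=1)        # simple MSP-style OoD score (assumed detector)
        mask = score > tau
        kept_x.append(x[mask])
        kept_p.append(probs[mask])
    return torch.cat(kept_x), torch.cat(kept_p)

def finetune_with_neighbor_labels(model, local_loader, public_x, label_sets,
                                  epochs=1, lr=1e-3, batch_size=64):
    """Average the soft labels exchanged by neighboring nodes on the shared
    public subset, then fine-tune on the local data plus that subset."""
    avg_labels = torch.stack(label_sets).mean(dim=0)   # label averaging across nodes
    distill_loader = DataLoader(TensorDataset(public_x, avg_labels),
                                batch_size=batch_size, shuffle=True)
    opt = torch.optim.SGD(model.parameters(), lr=lr)
    model.train()
    for _ in range(epochs):
        for x, y in local_loader:          # supervised loss on private local data
            opt.zero_grad()
            F.cross_entropy(model(x), y).backward()
            opt.step()
        for x, p in distill_loader:        # distillation loss on the public subset
            opt.zero_grad()
            log_q = F.log_softmax(model(x), dim=1)
            F.kl_div(log_q, p, reduction="batchmean").backward()
            opt.step()
```

In this sketch, each node would first call `select_in_distribution` on the shared public dataset, exchange the resulting soft labels with its neighbors, and then call `finetune_with_neighbor_labels`; the paper's exact subset-selection, OoD-scoring, and averaging details may differ.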