Understanding the Role of Layer Normalization in Label-Skewed Federated Learning (2308.09565v2)

Published 18 Aug 2023 in cs.LG and stat.ML

Abstract: Layer normalization (LN) is a widely adopted deep learning technique, especially in the era of foundation models. Recently, LN has been shown to be surprisingly effective in federated learning (FL) with non-i.i.d. data. However, exactly why and how it works remains mysterious. In this work, we reveal the profound connection between layer normalization and the label shift problem in federated learning. To understand layer normalization better in FL, we identify the key contributing mechanism of normalization methods in FL, called feature normalization (FN), which applies normalization to the latent feature representation before the classifier head. Although LN and FN do not improve expressive power, they control feature collapse and local overfitting to heavily skewed datasets, and thus accelerate global training. Empirically, we show that normalization leads to drastic improvements on standard benchmarks under extreme label shift. Moreover, we conduct extensive ablation studies to understand the critical factors of layer normalization in FL. Our results verify that FN is an essential ingredient inside LN to significantly improve the convergence of FL while remaining robust to learning rate choices, especially under extreme label shift where each client has access to only a few classes. Our code is available at \url{https://github.com/huawei-noah/Federated-Learning/tree/main/Layer_Normalization}.
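
The mechanism the abstract highlights, feature normalization (FN), amounts to normalizing the latent feature vector immediately before the classifier head. The snippet below is a minimal sketch of that idea in PyTorch; the backbone architecture, feature dimension, input shape, and the choice of a parameter-free layer-norm-style standardization are illustrative assumptions rather than the authors' exact implementation (see the repository linked in the abstract for the actual code).

```python
# Minimal sketch of feature normalization (FN): standardize the latent feature
# representation right before the classifier head. The backbone, dimensions, and
# the exact normalization used here (parameter-free layer norm) are assumptions.
import torch
import torch.nn as nn
import torch.nn.functional as F

class FNClassifier(nn.Module):
    def __init__(self, feature_dim: int = 512, num_classes: int = 10):
        super().__init__()
        # Stand-in feature extractor; any backbone producing a feature vector works.
        self.backbone = nn.Sequential(
            nn.Flatten(),
            nn.Linear(3 * 32 * 32, feature_dim),
            nn.ReLU(),
        )
        self.head = nn.Linear(feature_dim, num_classes)

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        z = self.backbone(x)                # latent feature representation
        z = F.layer_norm(z, z.shape[-1:])   # FN: normalize features before the head
        return self.head(z)

model = FNClassifier()
logits = model(torch.randn(4, 3, 32, 32))   # e.g., a CIFAR-10-sized batch
print(logits.shape)                          # torch.Size([4, 10])
```

In a FedAvg-style loop, each client would train such a model locally on its label-skewed data and the server would average the weights; the paper's claim is that this normalization step controls feature collapse and local overfitting, which is what accelerates global convergence.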
