Improved Quantization Strategies for Managing Heavy-tailed Gradients in Distributed Learning (2402.01798v1)

Published 2 Feb 2024 in cs.LG and cs.DC

Abstract: Gradient compression has surfaced as a key technique to address the challenge of communication efficiency in distributed learning. In distributed deep learning, however, it is observed that gradient distributions are heavy-tailed, with outliers significantly influencing the design of compression strategies. Existing parameter quantization methods experience performance degradation when this heavy-tailed feature is ignored. In this paper, we introduce a novel compression scheme specifically engineered for heavy-tailed gradients, which effectively combines gradient truncation with quantization. This scheme is adeptly implemented within a communication-limited distributed Stochastic Gradient Descent (SGD) framework. Considering a general family of heavy-tailed gradients that follow a power-law distribution, we aim to minimize the error resulting from quantization, thereby determining optimal values for two critical parameters: the truncation threshold and the quantization density. We provide a theoretical analysis of the convergence error bound under both uniform and non-uniform quantization scenarios. Comparative experiments with other benchmarks demonstrate the effectiveness of our proposed method in managing heavy-tailed gradients in a distributed learning environment.
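
To make the "truncate, then quantize" idea concrete, here is a minimal Python sketch of clipping heavy-tailed gradient entries at a threshold and then applying unbiased uniform quantization before communication. The function name, the fixed threshold, the stochastic rounding, and the Student-t toy gradients are illustrative assumptions; the paper's actual scheme additionally optimizes the truncation threshold and quantization density under a power-law gradient model and also covers non-uniform quantization.

```python
import numpy as np

def truncate_and_quantize(grad, threshold, num_levels):
    """Illustrative sketch (not the paper's exact scheme):
    clip gradient entries to [-threshold, threshold], then apply
    unbiased uniform quantization with `num_levels` levels."""
    clipped = np.clip(grad, -threshold, threshold)
    # Uniform step size over the truncated range [-threshold, threshold].
    step = 2.0 * threshold / (num_levels - 1)
    # Stochastic rounding keeps the quantizer unbiased w.r.t. the clipped values.
    scaled = (clipped + threshold) / step
    lower = np.floor(scaled)
    prob_up = scaled - lower
    levels = lower + (np.random.rand(*np.shape(grad)) < prob_up)
    return levels * step - threshold

# Toy usage: heavy-tailed (Student-t) gradient entries, 16-level (4-bit) quantization.
grad = np.random.standard_t(df=2.0, size=10_000)
q = truncate_and_quantize(grad, threshold=3.0, num_levels=16)
print("mean abs quantization error:", np.abs(q - np.clip(grad, -3.0, 3.0)).mean())
```

In a distributed SGD loop, each worker would apply such a quantizer to its local gradient and transmit only the level indices plus the threshold, with the server dequantizing and averaging; the trade-off the paper studies is that a smaller threshold reduces quantization error on the bulk of the distribution but discards more information from the tail.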

