Communication Efficient Distributed Training with Distributed Lion (2404.00438v1)

Published 30 Mar 2024 in cs.DC, cs.AI, cs.LG, math.OC, and stat.ML

Abstract: The Lion optimizer has been a promising competitor to AdamW for training large AI models, with advantages in memory, computation, and sample efficiency. In this paper, we introduce Distributed Lion, an innovative adaptation of Lion for distributed training environments. Leveraging the sign operator in Lion, our Distributed Lion only requires communicating binary or lower-precision vectors from the workers to the center server, significantly reducing the communication cost. Our theoretical analysis confirms Distributed Lion's convergence properties. Empirical results demonstrate its robustness across a range of tasks, worker counts, and batch sizes, on both vision and language problems. Notably, Distributed Lion attains performance comparable to standard Lion or AdamW optimizers applied on aggregated gradients, but with significantly reduced communication bandwidth. This feature is particularly advantageous for training large models. In addition, we demonstrate that Distributed Lion presents a more favorable performance-bandwidth trade-off than existing communication-efficient distributed methods such as deep gradient compression and ternary gradients.
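
The abstract describes the core mechanism: each worker applies Lion's sign operator locally and ships only a binary vector to the center server, which aggregates the workers' votes and broadcasts the result. The sketch below illustrates that communication pattern; it is a minimal simulation, not the authors' reference implementation, and the majority-vote aggregation, hyperparameter values, helper names (worker_step, server_aggregate, apply_update), and toy quadratic objective are assumptions made for illustration.

```python
# Minimal NumPy sketch of the Distributed Lion communication pattern sketched
# in the abstract: each worker communicates only a binary sign vector to the
# server. Hyperparameters, the majority-vote aggregation, and the toy gradient
# oracle below are illustrative assumptions, not the paper's reference code.
import numpy as np

def worker_step(grad, momentum, beta1=0.9, beta2=0.99):
    """Local Lion-style step on one worker.

    Returns the binary update direction to communicate (entries in {-1, 0, +1})
    and the worker's updated local momentum, which never leaves the worker.
    """
    direction = np.sign(beta1 * momentum + (1.0 - beta1) * grad)  # ~1 bit/coordinate
    new_momentum = beta2 * momentum + (1.0 - beta2) * grad
    return direction, new_momentum

def server_aggregate(directions):
    """Majority vote over the workers' binary vectors (one assumed variant)."""
    return np.sign(np.sum(directions, axis=0))

def apply_update(params, agg_direction, lr=1e-2, weight_decay=0.1):
    """Aggregated sign update plus decoupled weight decay, as in Lion."""
    return params - lr * (agg_direction + weight_decay * params)

# Toy simulation: K workers optimize f(x) = ||x||^2 / 2 with noisy gradients.
rng = np.random.default_rng(0)
K, d = 8, 1000
params = rng.normal(size=d)
momenta = [np.zeros(d) for _ in range(K)]

for step in range(200):
    directions = []
    for i in range(K):
        grad = params + 0.1 * rng.normal(size=d)   # stand-in stochastic gradient
        direction, momenta[i] = worker_step(grad, momenta[i])
        directions.append(direction)               # only sign vectors are "sent"
    params = apply_update(params, server_aggregate(directions))

print("final loss:", 0.5 * float(params @ params))
```

If the server instead averaged the workers' sign vectors, its broadcast would be a low-precision rather than binary vector, which is consistent with the abstract's "binary or lower-precision vectors"; in this sketch only server_aggregate would change.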

References (40)
  1. Sparse communication for distributed gradient descent. arXiv preprint arXiv:1704.05021, 2017.
  2. QSGD: Communication-efficient SGD via gradient quantization and encoding. Advances in Neural Information Processing Systems, 30, 2017.
  3. Layer normalization. arXiv preprint arXiv:1607.06450, 2016.
  4. SignSGD: Compressed optimisation for non-convex problems. In International Conference on Machine Learning, pp. 560–569. PMLR, 2018a.
  5. SignSGD: Compressed optimisation for non-convex problems. arXiv preprint arXiv:1802.04434, August 2018b. URL http://arxiv.org/abs/1802.04434.
  6. SignSGD with majority vote is communication efficient and fault tolerant. arXiv preprint arXiv:1810.05291, 2018c.
  7. PIQA: Reasoning about physical commonsense in natural language. In Proceedings of the AAAI Conference on Artificial Intelligence, volume 34, pp. 7432–7439, 2020.
  8. Optimization methods for large-scale machine learning. SIAM review, 60(2):223–311, 2018.
  9. Revisiting distributed synchronous SGD. arXiv preprint arXiv:1604.00981, 2016.
  10. Lion secretly solves constrained optimization: As lyapunov predicts. arXiv preprint arXiv:2310.05898, 2023a.
  11. Symbolic discovery of optimization algorithms. arXiv preprint arXiv:2302.06675, 2023b.
  12. BoolQ: Exploring the surprising difficulty of natural yes/no questions. In NAACL, 2019.
  13. Think you have solved question answering? Try ARC, the AI2 Reasoning Challenge. arXiv preprint arXiv:1803.05457, 2018. URL https://api.semanticscholar.org/CorpusID:3922816.
  14. Robustness to unbounded smoothness of generalized SignSGD. arXiv preprint arXiv:2208.11195, 2022.
  15. AutoAugment: Learning augmentation policies from data. arXiv preprint arXiv:1805.09501, 2018.
  16. An image is worth 16x16 words: Transformers for image recognition at scale. arXiv preprint arXiv:2010.11929, 2020.
  17. OpenWebText corpus. http://Skylion007.github.io/OpenWebTextCorpus, 2019.
  18. Training compute-optimal large language models. arXiv preprint arXiv:2203.15556, 2022.
  19. Adam: A method for stochastic optimization. arXiv preprint arXiv:1412.6980, 2014.
  20. Segment anything. arXiv preprint arXiv:2304.02643, 2023.
  21. Stingy sketch: a sketch framework for accurate and fast frequency estimation. Proceedings of the VLDB Endowment, 15(7):1426–1438, 2022.
  22. Chainedfilter: Combining membership filters by chain rule. Proceedings of the ACM on Management of Data, 1(4):1–27, 2023.
  23. Accelerating distributed deep learning using lossless homomorphic compression, 2024.
  24. Deep gradient compression: Reducing the communication bandwidth for distributed training. arXiv preprint arXiv:1712.01887, 2017.
  25. Asynchronous local-sgd training for language modeling. arXiv preprint arXiv:2401.09135, 2024.
  26. Decoupled weight decay regularization. arXiv preprint arXiv:1711.05101, 2017.
  27. Can a suit of armor conduct electricity? A new dataset for open book question answering. In Conference on Empirical Methods in Natural Language Processing, 2018. URL https://api.semanticscholar.org/CorpusID:52183757.
  28. OpenAI. GPT-4 technical report. arXiv preprint arXiv:2303.08774, 2023. URL https://api.semanticscholar.org/CorpusID:257532815.
  29. Language models are unsupervised multitask learners. OpenAI blog, 1(8):9, 2019.
  30. A direct adaptive method for faster backpropagation learning: The RPROP algorithm. In IEEE International Conference on Neural Networks, pp. 586–591. IEEE, 1993.
  31. ImageNet Large Scale Visual Recognition Challenge. International Journal of Computer Vision (IJCV), 115(3):211–252, 2015. doi: 10.1007/s11263-015-0816-y.
  32. SocialIQA: Commonsense reasoning about social interactions. arXiv preprint arXiv:1904.09728, 2019.
  33. 1-bit stochastic gradient descent and its application to data-parallel distributed training of speech DNNs. In Fifteenth Annual Conference of the International Speech Communication Association, 2014.
  34. Momentum ensures convergence of SignSGD under weaker assumptions. In International Conference on Machine Learning, pp. 33077–33099. PMLR, 2023.
  35. LLaMA: Open and efficient foundation language models. arXiv preprint arXiv:2302.13971, 2023.
  36. TernGrad: Ternary gradients to reduce communication in distributed deep learning. Advances in Neural Information Processing Systems, 30, 2017.
  37. HellaSwag: Can a machine really finish your sentence? arXiv preprint arXiv:1905.07830, 2019.
  38. Root mean square layer normalization. Advances in Neural Information Processing Systems, 32, 2019.
  39. mixup: Beyond empirical risk minimization. arXiv preprint arXiv:1710.09412, 2017.
  40. Asynchronous stochastic gradient descent with delay compensation. In International Conference on Machine Learning, pp. 4120–4129. PMLR, 2017.