
Empowering Distributed Training with Sparsity-driven Data Synchronization (2309.13254v2)

Published 23 Sep 2023 in cs.LG and cs.DC

Abstract: Distributed training is the de facto standard to scale up the training of deep learning models with multiple GPUs. Its performance bottleneck lies in communication for gradient synchronization. Although high tensor sparsity is widely observed, the optimal communication scheme to fully leverage sparsity is still missing. This paper aims to bridge this gap. We first analyze the characteristics of sparse tensors in popular models to understand the fundamentals of sparsity. We then systematically explore the design space of communication schemes for sparse tensors and find the optimal ones. These findings give a new understanding and inspire us to develop a holistic gradient synchronization system called Zen for sparse tensors. We demonstrate that Zen can achieve up to 5.09x speedup in communication time and up to 2.48x speedup in training throughput compared to state-of-the-art methods.
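
To illustrate the kind of sparsity-aware synchronization the abstract refers to, the sketch below contrasts a dense all-reduce with a simple sparse scheme that communicates only nonzero (index, value) pairs. This is a minimal illustration using PyTorch's torch.distributed, not the paper's Zen system or its optimal communication scheme; it assumes a 1-D flattened gradient tensor, an already initialized default process group, and a padded all-gather chosen purely for simplicity.

import torch
import torch.distributed as dist

def dense_allreduce(grad: torch.Tensor) -> torch.Tensor:
    # Baseline: every worker communicates the full dense gradient.
    dist.all_reduce(grad, op=dist.ReduceOp.SUM)
    return grad

def sparse_allgather(grad: torch.Tensor) -> torch.Tensor:
    # Sparsity-aware variant: pack nonzeros as (index, value) pairs,
    # all-gather the smaller payloads, then accumulate locally.
    # Assumes grad is 1-D and dist.init_process_group() was called.
    idx = grad.nonzero(as_tuple=False).flatten()   # nonzero coordinates
    val = grad[idx]                                # corresponding values
    world = dist.get_world_size()

    # Exchange per-worker nonzero counts, since they differ across ranks.
    size = torch.tensor([idx.numel()], device=grad.device)
    sizes = [torch.zeros_like(size) for _ in range(world)]
    dist.all_gather(sizes, size)

    # Pad payloads to a common length so all_gather sizes match.
    max_n = int(max(s.item() for s in sizes))
    pad_idx = torch.zeros(max_n, dtype=idx.dtype, device=grad.device)
    pad_val = torch.zeros(max_n, dtype=val.dtype, device=grad.device)
    pad_idx[: idx.numel()] = idx
    pad_val[: val.numel()] = val
    all_idx = [torch.zeros_like(pad_idx) for _ in range(world)]
    all_val = [torch.zeros_like(pad_val) for _ in range(world)]
    dist.all_gather(all_idx, pad_idx)
    dist.all_gather(all_val, pad_val)

    # Scatter-add each worker's sparse contribution into a dense result.
    out = torch.zeros_like(grad)
    for n, i, v in zip(sizes, all_idx, all_val):
        k = int(n.item())
        out.index_add_(0, i[:k], v[:k])
    return out

When gradients are highly sparse, the padded (index, value) payloads are far smaller than the dense tensor, which is the communication saving the paper sets out to exploit systematically; the paper itself studies which scheme is optimal, whereas this padded all-gather is only one naive point in that design space.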

Authors (6)
  1. Zhuang Wang (21 papers)
  2. Zhaozhuo Xu (43 papers)
  3. Anshumali Shrivastava (102 papers)
  4. T. S. Eugene Ng (7 papers)
  5. Jingyi Xi (5 papers)
  6. Yuke Wang (23 papers)
Citations (1)
