AB-Training: A Communication-Efficient Approach for Distributed Low-Rank Learning

Published 2 May 2024 in cs.LG, cs.AI, and cs.DC (arXiv:2405.01067v2)

Abstract: Communication bottlenecks severely hinder the scalability of distributed neural network training, particularly in high-performance computing (HPC) environments. We introduce AB-training, a novel data-parallel method that leverages low-rank representations and independent training groups to significantly reduce communication overhead. Our experiments demonstrate an average reduction in network traffic of approximately 70.31% across various scaling scenarios, increasing the training potential of communication-constrained systems and accelerating convergence at scale. AB-training also exhibits a pronounced regularization effect at smaller scales, leading to improved generalization while maintaining or even reducing training time. We achieve a remarkable 44.14:1 compression ratio on VGG16 trained on CIFAR-10 with minimal accuracy loss, and outperform traditional data-parallel training by 1.55% on ResNet-50 trained on ImageNet-2012. While AB-training is promising, our findings also reveal that large-batch effects persist even in low-rank regimes, underscoring the need for further research into optimized update mechanisms for massively distributed training.
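To make the core idea concrete, the sketch below illustrates the mechanism the abstract describes: a weight matrix is held as a low-rank product W ≈ A·B, and independent worker groups each train only one factor, so per-round synchronization traffic scales with the factor sizes rather than the dense matrix. This is a minimal single-process sketch under stated assumptions, not the paper's implementation; the rank r, the SVD-based initialization split, and the group_step helper are all illustrative choices.

```python
# Minimal sketch of the AB-training idea from the abstract (assumptions,
# not the paper's actual API): hold W as a low-rank product A @ B, let
# two independent groups each train one factor, then recombine.
import torch

torch.manual_seed(0)

d_out, d_in, r = 256, 512, 16           # assumed layer shape and rank
W = torch.randn(d_out, d_in)

# One plausible low-rank initialization: truncated SVD, with the singular
# values split evenly between the two factors.
U, S, Vt = torch.linalg.svd(W, full_matrices=False)
A = U[:, :r] * S[:r].sqrt()             # shape (d_out, r)
B = S[:r].sqrt().unsqueeze(1) * Vt[:r]  # shape (r, d_in)

def group_step(factor, grad, lr=1e-2):
    """One local update of a single factor. In the distributed setting,
    each group would all-reduce gradients for its own factor only, so
    traffic shrinks roughly in proportion to r / min(d_out, d_in)."""
    return factor - lr * grad

# Each group trains its factor independently for some local steps
# (random tensors stand in for gradients backpropagated through A @ B).
for _ in range(10):
    A = group_step(A, torch.randn_like(A))
    B = group_step(B, torch.randn_like(B))

# At the synchronization point the full weight is rebuilt from the
# updated factors (and would be re-factorized for the next round).
W_new = A @ B
dense = d_out * d_in
lowrank = r * (d_out + d_in)
print(f"params synced per round: {lowrank} vs dense {dense} "
      f"({100 * (1 - lowrank / dense):.1f}% reduction)")
```

The printed ratio shows why the savings grow with layer size: the factors carry r·(d_out + d_in) parameters versus d_out·d_in for the dense matrix, which is where the roughly 70% traffic reduction reported in the abstract comes from once ranks are chosen per layer.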

