GRAWA: Gradient-based Weighted Averaging for Distributed Training of Deep Learning Models (2403.04206v1)
Abstract: We study distributed training of deep learning models in time-constrained environments. We propose a new algorithm that periodically pulls workers towards a center variable computed as a weighted average of the workers, where the weights are inversely proportional to the workers' gradient norms so that recovery of flat regions in the optimization landscape is prioritized. We develop two asynchronous variants of the proposed algorithm, Model-level and Layer-level Gradient-based Weighted Averaging (MGRAWA and LGRAWA, respectively), which differ in their weighting scheme: the former weights entire models, while the latter applies the weighting layer-wise. On the theoretical front, we prove convergence guarantees for the proposed approach in both convex and non-convex settings. We then experimentally demonstrate that our algorithms outperform competing methods by converging faster and recovering better-quality, flatter local optima. We also carry out an ablation study analyzing the scalability of the proposed algorithms in more crowded distributed training environments. Finally, we report that our approach requires less frequent communication and fewer distributed updates than state-of-the-art baselines.
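The weighting rule described in the abstract amounts to giving worker i a weight proportional to 1 / ||g_i||, normalized so the weights sum to one, and taking the center as the resulting convex combination of worker parameters. Below is a minimal PyTorch-style sketch of this gradient-norm-weighted averaging, assuming each worker's parameters and gradients are available as flattened vectors; the names (`compute_center`, `worker_params`, `worker_grads`, `eps`) are illustrative and not taken from the paper's implementation.

```python
import torch

def compute_center(worker_params, worker_grads, eps=1e-12):
    """Weighted average of worker parameter vectors with weights inversely
    proportional to each worker's gradient norm, so that workers located in
    flatter regions (smaller gradient norm) contribute more to the center."""
    # One scalar weight per worker: w_i proportional to 1 / ||g_i||
    inv_norms = torch.stack([1.0 / (g.norm() + eps) for g in worker_grads])
    weights = inv_norms / inv_norms.sum()  # normalize so the weights sum to 1
    # Center variable: convex combination of the worker parameter vectors
    return sum(w * p for w, p in zip(weights, worker_params))

# Toy example with 3 workers, each holding a flattened parameter vector.
torch.manual_seed(0)
params = [torch.randn(10) for _ in range(3)]
grads = [torch.randn(10) * s for s in (0.1, 1.0, 5.0)]  # worker 0 is "flattest"
center = compute_center(params, grads)
# Workers would then be periodically pulled towards `center`; the layer-wise
# variant (LGRAWA) would apply the same weighting per layer instead of per model.
```

This sketch only illustrates how the center variable is formed; the paper's asynchronous update schedule and the pulling force applied to each worker are not shown here.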