Momentum Tracking: Momentum Acceleration for Decentralized Deep Learning on Heterogeneous Data (2209.15505v2)
Abstract: SGD with momentum is one of the key components for improving the performance of neural networks. For decentralized learning, a straightforward way to use momentum is Distributed SGD (DSGD) with momentum (DSGDm). However, DSGDm performs worse than DSGD when the data distributions are statistically heterogeneous. Recently, several studies have addressed this issue and proposed momentum-based methods that are more robust to data heterogeneity than DSGDm, although their convergence rates still depend on data heterogeneity and deteriorate when the data distributions are heterogeneous. In this study, we propose Momentum Tracking, a momentum-based method whose convergence rate is provably independent of data heterogeneity. More specifically, we analyze the convergence rate of Momentum Tracking in the setting where the objective function is non-convex and stochastic gradients are used, and show that it is independent of data heterogeneity for any momentum coefficient $\beta \in [0, 1)$. Through experiments, we demonstrate that Momentum Tracking is more robust to data heterogeneity than existing decentralized learning methods with momentum and consistently outperforms them when the data distributions are heterogeneous.
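To make the high-level description concrete, below is a minimal, hedged sketch of a Momentum Tracking-style update on a toy decentralized problem: each node mixes parameters with its neighbors, applies heavy-ball momentum to a gradient-tracking variable rather than to its raw local gradient, and updates the tracking variable with the difference of consecutive local gradients. This is not the authors' reference implementation; the exact ordering of the mixing, momentum, and tracking steps in the paper may differ, and names such as `W`, `beta`, and `lr` are illustrative assumptions.

```python
# Hedged sketch of a Momentum Tracking-style update on a toy quadratic problem.
# Assumptions: ring topology, doubly stochastic mixing matrix, momentum applied
# to the tracked gradient. The paper's precise update ordering may differ.
import numpy as np

rng = np.random.default_rng(0)
n_nodes, dim = 4, 5
lr, beta, steps = 0.05, 0.9, 200

# Heterogeneous local objectives: f_i(x) = 0.5 * ||x - b_i||^2 with distinct b_i.
targets = rng.normal(size=(n_nodes, dim)) * 3.0

def local_grad(i, x_i):
    return x_i - targets[i]

# Doubly stochastic mixing matrix for a ring topology.
W = np.zeros((n_nodes, n_nodes))
for i in range(n_nodes):
    W[i, i] = 0.5
    W[i, (i - 1) % n_nodes] = 0.25
    W[i, (i + 1) % n_nodes] = 0.25

x = np.zeros((n_nodes, dim))                                    # local parameters
g_prev = np.stack([local_grad(i, x[i]) for i in range(n_nodes)])
c = g_prev.copy()                                               # gradient-tracking variables
u = np.zeros_like(x)                                            # momentum buffers

for t in range(steps):
    # Momentum is applied to the *tracked* gradient, not the raw local gradient.
    u = beta * u + c
    # Local descent step followed by gossip averaging with neighbors.
    x = W @ (x - lr * u)
    # Gradient-tracking correction with the new local gradients.
    g_new = np.stack([local_grad(i, x[i]) for i in range(n_nodes)])
    c = W @ c + g_new - g_prev
    g_prev = g_new

# All nodes should approach the minimizer of the average objective (mean of targets).
print("consensus error:", np.max(np.abs(x - x.mean(axis=0))))
print("distance to global optimum:", np.linalg.norm(x.mean(axis=0) - targets.mean(axis=0)))
```

In this sketch the tracking variable `c` estimates the average gradient across nodes, so the momentum buffer accumulates a heterogeneity-corrected direction, which is the intuition behind the heterogeneity-independent convergence rate claimed in the abstract.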
Authors: Yuki Takezawa, Han Bao, Kenta Niwa, Ryoma Sato, Makoto Yamada