Asynchronous Local-SGD Training for Language Modeling (2401.09135v2)

Published 17 Jan 2024 in cs.LG and cs.CL

Abstract: Local stochastic gradient descent (Local-SGD), also referred to as federated averaging, is an approach to distributed optimization where each device performs more than one SGD update per communication. This work presents an empirical study of asynchronous Local-SGD for training LLMs; that is, each worker updates the global parameters as soon as it has finished its SGD steps. We conduct a comprehensive investigation by examining how worker hardware heterogeneity, model size, number of workers, and optimizer could impact the learning performance. We find that with naive implementations, asynchronous Local-SGD takes more iterations to converge than its synchronous counterpart despite updating the (global) model parameters more frequently. We identify momentum acceleration on the global parameters when worker gradients are stale as a key challenge. We propose a novel method that utilizes a delayed Nesterov momentum update and adjusts the workers' local training steps based on their computation speed. This approach, evaluated with models up to 150M parameters on the C4 dataset, matches the performance of synchronous Local-SGD in terms of perplexity per update step, and significantly surpasses it in terms of wall clock time.

Introduction to Asynchronous Local-SGD

LLMs have become central to advances in machine learning, particularly in natural language processing. Training them typically spreads the work across many devices taking synchronous update steps, which can be inefficient: every device must wait for communication and for the slowest worker before the next step can begin.

Understanding Local-SGD and Its Asynchronous Variant

Local Stochastic Gradient Descent (Local-SGD) mitigates the communication bottleneck in distributed training by letting each device take several gradient steps locally before synchronizing. Asynchronous Local-SGD goes a step further: each device updates the global model parameters as soon as it finishes its local steps, avoiding the idle time of the synchronous scheme. However, the paper finds that naive implementations of asynchronous Local-SGD converge more slowly per update than their synchronous counterpart, even though the global parameters are updated more frequently. The sketch below illustrates the difference between the two schemes.
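To make the contrast concrete, here is a minimal Python/NumPy sketch on a toy least-squares problem. It is not the authors' implementation; names such as local_sgd_steps, sync_round, and async_rounds are illustrative. The synchronous round waits for every worker and averages their deltas, while the asynchronous loop applies each worker's delta the moment it arrives, so a slow worker's delta may be computed from parameters that are already out of date.

```python
# Toy sketch contrasting synchronous and asynchronous Local-SGD.
# All names (local_sgd_steps, sync_round, async_rounds) are illustrative,
# not the paper's implementation.
import numpy as np

def local_sgd_steps(theta, data, lr=0.1, steps=8):
    """Run several local SGD steps on a toy squared loss and return the delta."""
    local = theta.copy()
    for x, y in data[:steps]:
        grad = 2 * (local @ x - y) * x      # gradient of (x . w - y)^2
        local -= lr * grad
    return local - theta                    # "outer gradient" sent to the server

def sync_round(theta, worker_data):
    """Synchronous Local-SGD: wait for every worker, then average the deltas."""
    deltas = [local_sgd_steps(theta, d) for d in worker_data]
    return theta + np.mean(deltas, axis=0)

def async_rounds(theta, worker_data, finish_order):
    """Asynchronous Local-SGD: apply each delta as soon as a worker finishes.
    A worker's delta is computed from the parameters it last pulled, so by the
    time it is applied the global parameters may already have moved (staleness)."""
    snapshots = {w: theta.copy() for w in range(len(worker_data))}
    for w in finish_order:                  # order in which workers complete
        delta = local_sgd_steps(snapshots[w], worker_data[w])
        theta = theta + delta               # immediate, unaveraged update
        snapshots[w] = theta.copy()         # worker restarts from newest params
    return theta

if __name__ == "__main__":
    rng = np.random.default_rng(0)
    dim = 4
    true_w = rng.normal(size=dim)
    data = [[(x, x @ true_w) for x in rng.normal(size=(32, dim))] for _ in range(3)]
    theta = np.zeros(dim)
    print("sync :", sync_round(theta, data))
    print("async:", async_rounds(theta, data, finish_order=[1, 0, 2, 1]))
```

In the asynchronous loop, worker 1 contributes twice: its second delta starts from newer parameters, while the deltas from workers 0 and 2 were computed against the initial snapshot and land on parameters that have already changed.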

Momentum and Heterogeneity in Asynchronous Training

The paper identifies a key issue in asynchronous Local-SGD: stale gradients interact badly with momentum. A gradient is stale when a worker computes its update against an older copy of the global parameters, which is unavoidable under asynchrony. Server-side momentum, which accelerates training by combining past gradients with the current one, compounds the problem when those incoming gradients are stale. To address this, the researchers propose two techniques: a Delayed Nesterov (DN) momentum update, which buffers incoming worker updates and applies the Nesterov momentum step only periodically, and Dynamic Local Updates (DyLU), which scales each worker's number of local steps to its computation speed. Together, these are designed to stabilize and improve asynchronous Local-SGD for LLMs; a rough sketch of both ideas follows.
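The following is a toy Python/NumPy illustration of the intent of the two fixes, not the paper's exact algorithm: the DelayedNesterovServer applies incoming deltas without momentum and folds them into the Nesterov momentum only every few server updates, and dynamic_local_steps assigns each worker local steps proportional to its measured speed. The class and function names, constants, and update details are assumptions made for illustration.

```python
# Sketch of Delayed Nesterov (DN) and Dynamic Local Updates (DyLU).
# The buffering constants and exact update rule here are simplifying
# assumptions, not the paper's algorithm.
import numpy as np

class DelayedNesterovServer:
    """Apply incoming (possibly stale) deltas without momentum, and refresh
    the Nesterov momentum only every `delay` server updates, so each stale
    delta is not individually accelerated."""
    def __init__(self, theta, lr=1.0, beta=0.9, delay=4):
        self.theta = theta.astype(float).copy()
        self.momentum = np.zeros_like(self.theta)
        self.buffer = np.zeros_like(self.theta)
        self.lr, self.beta, self.delay = lr, beta, delay
        self.step = 0

    def update(self, delta):
        # Apply the incoming delta right away, without any momentum term.
        self.theta += self.lr * delta
        self.buffer += delta
        self.step += 1
        if self.step % self.delay == 0:
            # Fold the averaged buffer into the momentum, and apply only the
            # momentum correction here.
            self.momentum = self.beta * self.momentum + self.buffer / self.delay
            self.theta += self.lr * self.beta * self.momentum
            self.buffer = np.zeros_like(self.theta)
        return self.theta

def dynamic_local_steps(worker_speeds, base_steps=64):
    """Dynamic Local Updates: give each worker a number of local steps
    proportional to its speed, so all workers finish a round at roughly
    the same wall-clock time."""
    fastest = max(worker_speeds)
    return [max(1, round(base_steps * s / fastest)) for s in worker_speeds]

if __name__ == "__main__":
    server = DelayedNesterovServer(np.zeros(4))
    for _ in range(8):
        server.update(np.full(4, 0.1))        # pretend each worker sends delta 0.1
    print(server.theta)
    print(dynamic_local_steps([100, 50, 25]))  # -> [64, 32, 16]
```

The design intuition, as described in the paper, is that buffering decouples the frequency of parameter updates from the frequency of momentum updates, while speed-proportional local steps reduce how stale the slowest workers' contributions become.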

Experimenting with Novel Techniques

Extensive experiments show that DN and DyLU match synchronous Local-SGD in perplexity per update step and significantly surpass it in wall-clock time, evaluated with models of up to 150M parameters on the C4 dataset. The experiments also examine how the techniques cope with heterogeneous device speeds and with varying numbers of workers and model sizes, indicating that the methods are robust and have potential to scale.

Concluding Thoughts

In conclusion, asynchronous Local-SGD presents an attractive alternative for efficiently training LLMs across distributed systems. The paper contributes to this burgeoning domain by addressing key challenges and proposing viable solutions that have been empirically validated. The research opens doors to further enhancements in distributed learning, aiming for greater scalability and reduced training time without compromising the quality of LLMs.

Authors (8)
  1. Bo Liu
  2. Rachita Chhaparia
  3. Arthur Douillard
  4. Satyen Kale
  5. Andrei A. Rusu
  6. Jiajun Shen
  7. Arthur Szlam
  8. Marc'Aurelio Ranzato