NoLoCo: No-all-reduce Low Communication Training Method for Large Models
Training large language models typically relies on massive compute clusters with high-bandwidth interconnects to keep tens of thousands of accelerators synchronized. The cost and limited scalability of such clusters motivate training methods that need less communication. Recent low-communication approaches reduce bandwidth requirements, but they still depend on collective synchronization steps that are expensive over low-bandwidth, high-latency links.
NoLoCo is a training method that never explicitly synchronizes all model parameters and therefore requires no collective communication at all. Instead, a novel variant of Nesterov momentum synchronizes weights implicitly by partially averaging each model replica with one randomly selected peer. The authors give a theoretical convergence analysis for this optimizer and validate it empirically across a range of model sizes.
Key Methodology
NoLoCo avoids costly global synchronization through three ingredients:
- Decentralized Synchronization: Parameter synchronization is local; each replica averages weights with a single peer rather than participating in the global all-reduce used by standard data-parallel training.
- Dynamic Pipeline Routing: Pipeline stages are wired to randomly chosen peers across accelerators, which mixes information between replicas and reduces blocking communication.
- Modified Nesterov Momentum: A modified Nesterov momentum update makes the weights of different data-parallel replicas progressively align, so training converges without ever performing a full synchronization (see the sketch after this list).
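To make the mechanism concrete, below is a minimal, self-contained toy simulation of this style of synchronization: replicas run a few local steps on their own data shard, then each averages its weights with a single randomly chosen peer and applies a Nesterov-style momentum update to the result. All names, hyperparameters, and the exact update rule here are illustrative assumptions; the paper's optimizer differs in its details.

```python
import numpy as np

# Toy in-process simulation of pairwise (no-all-reduce) synchronization.
# Each data-parallel replica minimizes a quadratic loss on its own "shard",
# runs H local steps, then averages weights with ONE randomly chosen peer and
# applies a Nesterov-style momentum update to the resulting outer step.
# Hyperparameters and the exact update rule are illustrative assumptions.
rng = np.random.default_rng(0)
R, D, H = 8, 16, 10                          # replicas, parameter dim, inner steps
SHARD_OPTIMA = rng.normal(size=(R, D))       # each replica's local (toy) optimum
consensus = SHARD_OPTIMA.mean(axis=0)        # solution all replicas should agree on

weights = np.tile(rng.normal(size=D), (R, 1))    # identical initialization
outer_momentum = np.zeros_like(weights)
inner_lr, outer_lr, beta = 0.2, 0.7, 0.9

for outer_step in range(61):
    start = weights.copy()

    # Inner phase: plain gradient steps on each replica's own shard loss.
    for _ in range(H):
        weights = weights - inner_lr * (weights - SHARD_OPTIMA)

    # Outer phase: average with one random peer (point-to-point, no all-reduce),
    # then apply Nesterov-style momentum to the outer "pseudo-gradient".
    peers = np.array([rng.choice([j for j in range(R) if j != i]) for i in range(R)])
    averaged = 0.5 * (weights + weights[peers])
    outer_grad = start - averaged
    outer_momentum = beta * outer_momentum + outer_grad
    weights = start - outer_lr * (outer_grad + beta * outer_momentum)

    if outer_step % 20 == 0:
        spread = np.linalg.norm(weights - weights.mean(axis=0), axis=1).mean()
        err = np.linalg.norm(weights.mean(axis=0) - consensus)
        print(f"outer {outer_step:2d}  replica spread {spread:.3f}  dist to consensus {err:.3f}")
```

In this toy setting the replicas' average rapidly approaches the consensus optimum and the spread between replicas stays bounded, even though each outer step only ever communicates between pairs of replicas.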
Numerical Findings
Benchmarks spanned models from 125M to 6.8B parameters and showed:
- NoLoCo incurs notably less communication overhead than fully sharded data parallel training or DiLoCo.
- Its synchronization step is an estimated order of magnitude faster than DiLoCo's, particularly in configurations with hundreds of accelerators (a rough latency model follows this list).
- Convergence is up to 4% faster than DiLoCo across a wide range of model sizes and worker counts, indicating that the implicit synchronization is effective.
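To see why a pairwise exchange is so much cheaper in latency-bound settings, consider a back-of-the-envelope cost model (illustrative assumptions only, not measurements from the paper): a ring all-reduce pays a latency penalty that grows with the number of workers, while an exchange with a single peer pays one round of latency regardless of scale.

```python
# Rough, illustrative cost model (assumed shard size, bandwidth, and latency;
# not data from the paper): ring all-reduce vs. a single pairwise exchange.
def ring_allreduce_time(size_bytes, n_workers, bandwidth_bps, latency_s):
    # Standard ring all-reduce cost: 2*(N-1) latency hops plus ~2*(N-1)/N of the data per link.
    return 2 * (n_workers - 1) * latency_s + 2 * (n_workers - 1) / n_workers * size_bytes * 8 / bandwidth_bps

def pairwise_exchange_time(size_bytes, bandwidth_bps, latency_s):
    # One send/receive with a single peer: independent of the worker count.
    return latency_s + size_bytes * 8 / bandwidth_bps

SHARD_BYTES = 100e6      # assumed per-worker shard of weights (~100 MB)
BW, LAT = 1e9, 0.05      # assumed 1 Gbit/s links with 50 ms latency

for n in (16, 64, 256):
    allreduce = ring_allreduce_time(SHARD_BYTES, n, BW, LAT)
    pairwise = pairwise_exchange_time(SHARD_BYTES, BW, LAT)
    print(f"N={n:3d}  all-reduce ~{allreduce:5.1f}s  pairwise ~{pairwise:5.1f}s")
```

Under these assumed numbers the all-reduce latency term grows linearly with the worker count while the pairwise exchange stays constant, which is the qualitative effect behind the speedups reported in the paper.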
Implications and Future Directions
NoLoCo's approach has both practical and theoretical implications:
- Practical Aspects: For practitioners training over bandwidth-constrained or geographically distributed networks, NoLoCo cuts communication overhead and latency, accelerating training while lowering infrastructure requirements.
- Theoretical Horizons: The modified momentum update may inspire adaptations of existing decentralized optimization algorithms that reduce their reliance on synchronization even further.
Future research into decentralized training may benefit from exploring:
- Tuning the parametrization of the proposed synchronization mechanism to further improve convergence and adapting it to diverse model architectures.
- Evaluating its scalability and performance in real-world, geo-distributed networks, where latency is high and variable.
In summary, NoLoCo offers a practical alternative to synchronization-heavy training methods by relying on local, point-to-point communication and dynamic routing, paving the way for efficient training of large models in communication-constrained environments.