DiPaCo: Distributed Path Composition (2403.10616v1)

Published 15 Mar 2024 in cs.LG and cs.CL

Abstract: Progress in ML has been fueled by scaling neural network models. This scaling has been enabled by ever more heroic feats of engineering, necessary for accommodating ML approaches that require high bandwidth communication between devices working in parallel. In this work, we propose a co-designed modular architecture and training approach for ML models, dubbed DIstributed PAth COmposition (DiPaCo). During training, DiPaCo distributes computation by paths through a set of shared modules. Together with a Local-SGD inspired optimization (DiLoCo) that keeps modules in sync with drastically reduced communication, our approach facilitates training across poorly connected and heterogeneous workers, with a design that ensures robustness to worker failures and preemptions. At inference time, only a single path needs to be executed for each input, without the need for any model compression. We consider this approach as a first prototype towards a new paradigm of large-scale learning, one that is less synchronous and more modular. Our experiments on the widely used C4 benchmark show that, for the same amount of training steps but less wall-clock time, DiPaCo exceeds the performance of a 1 billion-parameter dense transformer LLM by choosing one of 256 possible paths, each with a size of 150 million parameters.

Distributed Path Composition (DiPaCo) for Large-Scale Modular Deep Learning

Introduction

The emergence of Distributed Path Composition (DiPaCo), a novel architecture and training paradigm, hints at a significant shift in the landscape of distributed learning. By integrating modularity into the design and optimization of machine learning models, DiPaCo not only addresses the challenges posed by the current synchronous and monolithic training regimes but also paves the way for a scalable and flexible paradigm that promises to exploit distributed computational resources more efficiently.

The DiPaCo Framework

DiPaCo stands out for its approach to distributing computation across a network of loosely connected computational units, referred to as paths. Each path is composed of a sequence of shared modules, so the model can efficiently handle a diverse set of inputs by reconfiguring which modules a path traverses. The core of DiPaCo lies in a two-fold strategy:
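
As a rough illustration (not the paper's implementation), a path can be viewed as picking one module from a shared pool at each level and composing them sequentially; any paths that pick the same module share its parameters. The class names, layer shapes, and pool sizes below are purely illustrative:

```python
import torch
import torch.nn as nn

class Path(nn.Module):
    """Composes one module per level; parameters are shared with every
    other path that selects the same module indices."""
    def __init__(self, pools, choices):
        super().__init__()
        self.pools = pools        # module pools, shared across all paths
        self.choices = choices    # e.g. (2, 7): module 2 at level 0, module 7 at level 1

    def forward(self, x):
        for pool, idx in zip(self.pools, self.choices):
            x = pool[idx](x)
        return x

# Two levels with 16 candidate modules each give 16 * 16 = 256 possible paths,
# while any single path stays small.
dim = 512
pools = nn.ModuleList(
    nn.ModuleList(nn.Sequential(nn.Linear(dim, dim), nn.ReLU()) for _ in range(16))
    for _ in range(2)
)
path = Path(pools, choices=(2, 7))       # one of the 256 paths
out = path(torch.randn(4, dim))          # inference executes only this path
```

Because each input only ever touches one path's worth of parameters, inference needs no model compression, as noted in the abstract.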

Coarse Routing

The system employs a coarse routing mechanism in which each document is assigned a path based on its content, so the model handles it with a specifically tailored computational flow. In practice, this involves k-means clustering for the initial assignment, followed by more sophisticated discriminative routing that refines path assignments based on model performance metrics.
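
A minimal, self-contained sketch of the coarse (k-means) stage, assuming off-the-shelf scikit-learn and tf-idf features as a stand-in for whatever document representation is actually used; the toy corpus and cluster count are placeholders:

```python
from sklearn.cluster import KMeans
from sklearn.feature_extraction.text import TfidfVectorizer

# Toy corpus standing in for C4 documents.
documents = [
    "the cat sat on the mat",
    "stocks rallied after the earnings report",
    "the dog chased the ball in the park",
    "the central bank raised interest rates",
]

# Featurize document content; tf-idf is just one convenient stand-in here.
vectorizer = TfidfVectorizer()
features = vectorizer.fit_transform(documents)

# Coarse routing: cluster documents and treat the cluster id as a path id,
# so each path's worker only sees its own shard of the corpus. At paper
# scale, the number of routing choices is much larger (256 paths overall).
router = KMeans(n_clusters=2, random_state=0).fit(features)
train_path_ids = router.labels_

def route(doc):
    """Assign a new document to a path before executing only that path."""
    return int(router.predict(vectorizer.transform([doc]))[0])

print(route("the puppy played with a ball"))
```

Because each document's path is fixed ahead of time, the corpus can be sharded by path and each shard shipped to the worker training that path, which is what makes the low-communication training loop described next possible.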

Distributed Low-Communication Optimization (DiLoCo)

Optimizing a model with shared modules across different paths introduces synchronization challenges, especially under the assumption of costly communication. DiPaCo addresses this through DiLoCo, a variation of Local-SGD that synchronizes module parameters only intermittently, significantly reducing the necessity for high-bandwidth inter-device communication.
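
A toy sketch of this schedule follows (NumPy, with plain parameter averaging as the outer step; the actual DiLoCo recipe pairs inner optimizer steps with a dedicated outer optimizer, and in DiPaCo only the modules shared between paths need this synchronization). `worker_grad_fn` is a hypothetical callback standing in for a worker's local gradient computation:

```python
import numpy as np

def diloco_like_round(worker_params, worker_grad_fn, inner_steps, inner_lr, outer_lr=1.0):
    """One communication round: every worker takes many local SGD steps,
    then the shared parameters are synchronized exactly once."""
    start = worker_params[0].copy()                 # workers begin each round in sync
    local = [p.copy() for p in worker_params]

    # Inner phase: no communication at all between workers.
    for _ in range(inner_steps):
        for w, p in enumerate(local):
            p -= inner_lr * worker_grad_fn(w, p)    # worker-local gradient step

    # Outer phase: a single low-bandwidth sync of the shared parameters.
    outer_grad = np.mean([start - p for p in local], axis=0)
    new_shared = start - outer_lr * outer_grad      # outer_lr = 1.0 reduces to averaging
    return [new_shared.copy() for _ in local]

# Toy usage: three workers minimizing slightly different quadratics.
targets = [1.0, 2.0, 3.0]
grad = lambda w, p: 2.0 * (p - targets[w])
params = [np.zeros(1) for _ in range(3)]
for _ in range(10):                                 # 10 rounds, i.e. only 10 syncs total
    params = diloco_like_round(params, grad, inner_steps=50, inner_lr=0.05)
print(params[0])                                    # settles near the consensus value 2.0
```

The point of the schedule is that communication cost scales with the number of rounds, not with the number of gradient steps, which is why poorly connected workers remain usable.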

Experimental Results

DiPaCo's efficacy was evaluated on the C4 benchmark against both a dense transformer baseline and a flat mixture-of-experts baseline. Notably, a DiPaCo model that chooses one of 256 paths for each input, with each path comprising 150 million parameters, surpasses a 1 billion-parameter dense transformer in perplexity while requiring less wall-clock training time.

Key Findings

  • Full vs. Partial Synchronization: DiPaCo's partially synchronized training matches or slightly exceeds the performance of fully synchronized training while communicating far less often.
  • Modular Scalability: The ability to scale the model by increasing the number of paths or the capacity of individual paths while maintaining similar computational footprints for training and inference illustrates the framework's flexibility.
  • Routing Flexibility: Performance benefits from more sophisticated routing strategies, including discriminative routing and more frequent routing re-evaluation at inference time (a minimal routing sketch follows this list).
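
For the routing point above, a minimal sketch of discriminative routing: each sequence is (re-)assigned to whichever path currently models it best. `path_loss` is a hypothetical hook returning a path's language-modeling loss on a token prefix; the paper's exact scoring rule and re-routing schedule may differ:

```python
def discriminative_route(tokens, num_paths, path_loss, prefix_len=256):
    """Pick the path that fits the sequence prefix best (lowest loss).

    path_loss(k, prefix) -> float is a hypothetical hook that scores
    path k on the prefix; only the chosen path then processes the input.
    """
    prefix = tokens[:prefix_len]
    return min(range(num_paths), key=lambda k: path_loss(k, prefix))

# Toy usage with a made-up scoring function.
fake_loss = lambda k, prefix: abs(k - len(prefix) % 4)
print(discriminative_route(list(range(10)), num_paths=4, path_loss=fake_loss))  # -> 2
```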

Implications and Future Directions

DiPaCo's introduction is a step towards realizing highly scalable and modular learning systems. Its modular nature not only facilitates handling vast and diverse data sets but also enables continuous model expansion and adaptation without retraining from scratch, which highlights the potential for collaborative model building across different computational resources.

Theoretical Implications

At a theoretical level, DiPaCo's architecture challenges the prevailing norms of monolithic and synchronous model training, suggesting that modular and partially synchronized systems can achieve comparable or superior performance. This opens up new avenues of research into the convergence properties and efficiency of such distributed systems.

Practical Implications

Practically, DiPaCo offers a blueprint for future large-scale machine learning systems that can efficiently leverage distributed computational resources, including heterogeneous and geographically dispersed devices. This has significant implications for reducing the computational cost and environmental impact of training large models.

Concluding Remarks

DiPaCo represents a novel paradigm in distributed learning, emphasizing modularity, scalability, and efficiency. By successfully navigating the trade-offs between synchronization frequency, communication costs, and model performance, DiPaCo marks an important milestone towards more sustainable and scalable machine learning practices. Future developments in this field are poised to further dismantle the barriers to entry for engaging in large-scale model training and collaborative machine learning endeavors.

Authors (10)
  1. Arthur Douillard (20 papers)
  2. Qixuan Feng (5 papers)
  3. Andrei A. Rusu (18 papers)
  4. Adhiguna Kuncoro (18 papers)
  5. Yani Donchev (3 papers)
  6. Rachita Chhaparia (5 papers)
  7. Ionel Gog (5 papers)
  8. Marc'Aurelio Ranzato (53 papers)
  9. Jiajun Shen (35 papers)
  10. Arthur Szlam (86 papers)