A Novel Approach to Enhancing Pipeline Parallelism in Deep Neural Network Training
Overview
The paper introduces 2-stage backpropagation (2BP), a method aimed at mitigating the bottlenecks that limit current implementations of pipeline parallelism in DNN training. By decomposing the backpropagation step into two distinct stages, backward-p1 and backward-p2, the authors provide a systematic way to reduce idle compute time. The technique is intended to improve throughput when large DNN models are distributed across multiple accelerators.
Core Contributions and Methodology
The primary contribution of this paper is the conceptual and practical development of 2BP, which modifies the traditional backpropagation algorithm employed by ML frameworks such as PyTorch and TensorFlow. Traditional pipeline parallelism methods are constrained by the sequential computation of gradients, leading to significant idle time across accelerators; breaking down the backward pass addresses this issue directly.
In detail, the backward-p1 stage computes the gradient of the loss with respect to a layer's intermediate activations (∂L/∂X), while the backward-p2 stage computes the gradient with respect to the layer's parameters (∂L/∂W). By deliberately delaying the backward-p2 computation until after backward-p1 has been computed and its result passed to the next accelerator in the backward sweep, the paper demonstrates significant reductions in idle compute time, quantified as decreases in the pipeline bubble ratio.
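To make the decomposition concrete, the following minimal sketch shows the two stages for a single linear layer. The class and method names (TwoStageLinear, backward_p1, backward_p2) are illustrative assumptions, not the authors' implementation.

```python
import torch

class TwoStageLinear:
    # Illustrative linear layer (y = x @ W.T) with the backward pass split
    # into the two stages described above.

    def __init__(self, in_features, out_features):
        self.weight = 0.01 * torch.randn(out_features, in_features)
        self.weight_grad = None
        self._saved_input = None

    def forward(self, x):
        # Keep the input activation; backward-p2 will need it later.
        self._saved_input = x
        return x @ self.weight.T

    def backward_p1(self, grad_output):
        # dL/dX: computed (and sent to the neighbouring stage) immediately.
        return grad_output @ self.weight

    def backward_p2(self, grad_output):
        # dL/dW: deferred to fill what would otherwise be idle pipeline time.
        self.weight_grad = grad_output.T @ self._saved_input


layer = TwoStageLinear(16, 8)
x = torch.randn(4, 16)
y = layer.forward(x)
grad_out = torch.ones_like(y)           # stand-in for the upstream gradient
grad_in = layer.backward_p1(grad_out)   # backward-p1: propagated right away
layer.backward_p2(grad_out)             # backward-p2: executed later
```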
A custom PyTorch implementation was written that disables the usual torch.autograd machinery, giving the authors finer control over the backward pass. Various model architectures were benchmarked to validate the effectiveness of 2BP, including transformer-based models such as a LLaMa-like model and BERT, as well as models with non-uniform computational graphs such as ResNet.
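As a rough approximation of the same split using only stock PyTorch APIs (rather than the authors' custom machinery), the two stages can be emulated with separate torch.autograd.grad calls; the stage definition and tensor shapes below are assumptions for illustration.

```python
import torch
import torch.nn as nn

# A toy pipeline stage; the real benchmarks use LLaMa/BERT/ResNet-style blocks.
stage = nn.Sequential(nn.Linear(32, 32), nn.ReLU(), nn.Linear(32, 32))
x = torch.randn(8, 32, requires_grad=True)
y = stage(x)
grad_output = torch.randn_like(y)  # gradient arriving from the next stage

# backward-p1: gradient w.r.t. the stage input, needed upstream as soon as possible.
(grad_input,) = torch.autograd.grad(y, x, grad_output, retain_graph=True)

# backward-p2: parameter gradients, computed later to fill idle time.
param_grads = torch.autograd.grad(y, list(stage.parameters()), grad_output)
```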
Empirical Results
The paper provides empirical evidence of throughput improvements from 2BP across different pipeline schedules. For instance, 2BP yielded a 1.70x increase in throughput when training a 7 billion parameter LLaMa-like transformer across 4 GPUs under the 1F1B-1 schedule. For other architectures such as BERT-Large and ResNet-152, throughput improvements ranged from 1.10x to 1.26x.
Memory Considerations: Notably, the application of 2BP incurs increased memory consumption due to the need to store intermediate activations and derivatives longer. For instance, the Mamba-1.4B model with the 1F1B-2 schedule experienced a 2.67x increase in memory usage. The authors acknowledge this trade-off and suggest potential memory optimization strategies like intermediate derivative checkpointing and storage offloading to host or NVMe memory.
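As an illustration of the offloading idea (not code from the paper), saved intermediate derivatives could be staged in pinned host memory between backward-p1 and backward-p2; the helper names below are hypothetical and assume a CUDA device is available.

```python
import torch

def offload_to_host(tensor):
    # Copy into pinned CPU memory so the device-to-host transfer can run
    # asynchronously and the accelerator memory can be freed until backward-p2.
    host_copy = torch.empty(tensor.shape, dtype=tensor.dtype, pin_memory=True)
    host_copy.copy_(tensor, non_blocking=True)
    return host_copy

def restore_to_device(host_tensor, device="cuda"):
    # Bring the saved derivative back just before backward-p2 needs it.
    return host_tensor.to(device, non_blocking=True)
```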
Scaling and Robustness
Scalability tests were also undertaken. They show that the performance gains of 2BP tend to diminish slightly as the number of GPUs increases, due to higher communication overheads, particularly during inter-node communication. Specifically, in the experiments where the global model size was varied, the throughput gains decreased modestly as the number of GPUs grew from 4 to 16.
Future Work and Implications
This research has implications for the training efficiency of large DNNs, enabling faster and more scalable training, particularly in resource-constrained environments. The proposed 2BP method can be pivotal for future advances in optimizing pipeline parallelism.
The paper outlines several avenues for future work:
- Implementation of intermediate derivative checkpointing to reduce memory consumption.
- Exploration of storage offloading techniques to balance memory and compute performance.
- Further investigation into the unified implementation of data parallelism with 2BP to optimize communication overlaps.
The suggestion that ML frameworks such as PyTorch expose more granular control over the backward pass is a significant call to action, as it could ease broader adoption and customization of this and similar methods.
In conclusion, 2BP represents a significant methodological enhancement in the domain of pipeline parallelism for DNNs. While the paper underscores the need for further memory optimization, the empirical results and accompanying analysis support the efficacy of 2BP in improving training throughput across multiple architectures and pipeline schedules. It stands as a promising strategy for the ever-growing field of large-scale AI model training.