1-bit LAMB: Communication Efficient Large-Scale Large-Batch Training with LAMB's Convergence Speed (2104.06069v2)

Published 13 Apr 2021 in cs.LG and cs.DC

Abstract: To train large models (like BERT and GPT-3) on hundreds of GPUs, communication has become a major bottleneck, especially on commodity systems with limited-bandwidth TCP network. On one side large batch-size optimization such as LAMB algorithm was proposed to reduce the frequency of communication. On the other side, communication compression algorithms such as 1-bit Adam help to reduce the volume of each communication. However, we find that simply using one of the techniques is not sufficient to solve the communication challenge, especially under low network bandwidth. Motivated by this we aim to combine the power of large-batch optimization and communication compression, but we find that existing compression strategies cannot be directly applied to LAMB due to its unique adaptive layerwise learning rates. To this end, we design a new communication-efficient algorithm, 1-bit LAMB, which introduces a novel way to support adaptive layerwise learning rates under compression. In addition, we introduce a new system implementation for compressed communication using the NCCL backend of PyTorch distributed, which improves both usability and performance. For BERT-Large pre-training task with batch sizes from 8K to 64K, our evaluations on up to 256 GPUs demonstrate that 1-bit LAMB with NCCL-based backend is able to achieve up to 4.6x communication volume reduction, up to 2.8x end-to-end time-wise speedup, and the same sample-wise convergence speed (and same fine-tuning task accuracy) compared to uncompressed LAMB.

An Analysis of "1-bit LAMB: Communication Efficient Large-Scale Large-Batch Training with LAMB's Convergence Speed"

The paper "1-bit LAMB: Communication Efficient Large-Scale Large-Batch Training with LAMB's Convergence Speed" introduces a novel approach to distributed training of large-scale neural networks by addressing a key bottleneck: communication overhead. On systems with constrained network bandwidth, such as those using TCP, this overhead can substantially hinder training efficiency. Two strategies previously proposed to mitigate this issue are large batch-size optimization via the LAMB algorithm and communication compression using techniques like 1-bit Adam. This paper combines these strategies to form the 1-bit LAMB algorithm, yielding promising results.

Key Contributions

  1. Algorithmic Innovation: 1-bit LAMB:
    • The paper introduces 1-bit LAMB, a two-stage algorithm. An initial warmup stage runs vanilla (uncompressed) LAMB to pre-condition the optimizer state. The subsequent compression stage communicates 1-bit compressed, error-compensated updates while still applying LAMB's layerwise adaptive learning rates, something existing compression strategies could not support directly; a minimal sketch of this two-stage structure is given after the list.
  2. System-Level Implementation:
    • The authors implement a new backend for compressed communication on top of the NCCL library within PyTorch distributed. This improves usability compared to the prior MPI-based implementation and achieves comparable or better performance, particularly on Ethernet-based clusters; a simplified communication sketch also follows the list.
  3. Empirical Evaluation:
    • The paper evaluates BERT-Large pre-training with batch sizes from 8K to 64K on clusters of up to 256 GPUs. With 1-bit LAMB, communication volume drops by up to 4.6x and end-to-end training time by up to 2.8x, with the same sample-wise convergence speed and the same fine-tuning task accuracy as uncompressed LAMB.
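
To make the two-stage structure concrete, here is a minimal single-process sketch (not the authors' implementation): it stands in a plain momentum update for LAMB's full Adam-style statistics, uses a toy quadratic objective, and treats `scaling_coeff` as a placeholder for the per-layer coefficient that 1-bit LAMB calibrates during the warmup stage. The compressed tensor produced in the second stage is what would actually be communicated between workers.

```python
import torch

def one_bit_compress(tensor, error_feedback):
    """Error-feedback 1-bit compression: keep only the sign plus one scale per
    tensor, and carry the quantization residual into the next step."""
    corrected = tensor + error_feedback
    scale = corrected.abs().mean()
    compressed = scale * corrected.sign()
    return compressed, corrected - compressed

torch.manual_seed(0)
param = torch.randn(1024, requires_grad=True)
momentum = torch.zeros_like(param)
error_feedback = torch.zeros_like(param)
lr, beta = 0.01, 0.9
warmup_steps, total_steps = 20, 100
scaling_coeff = 1.0  # stand-in for the per-layer coefficient 1-bit LAMB maintains

for step in range(total_steps):
    loss = (param ** 2).sum()                     # toy objective
    grad, = torch.autograd.grad(loss, param)
    momentum.mul_(beta).add_(grad, alpha=1 - beta)

    if step < warmup_steps:
        # Warmup stage: plain (uncompressed) momentum, standing in for full LAMB,
        # so that optimizer statistics are well conditioned before compressing.
        update = momentum
    else:
        # Compression stage: only a 1-bit, error-compensated version of the
        # momentum would be communicated between workers.
        update, error_feedback = one_bit_compress(momentum, error_feedback)

    with torch.no_grad():
        # LAMB-style layerwise trust ratio applied to the (possibly compressed) update.
        trust_ratio = param.norm() / (update.norm() + 1e-6)
        param -= lr * scaling_coeff * trust_ratio * update
```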
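
The sketch below illustrates the kind of compressed exchange the NCCL-based backend performs, using only the public `torch.distributed` API; it is a simplified stand-in, not the paper's backend. Each rank sends a sign tensor (stored here as int8 for simplicity) plus one scale and averages the reconstructions; a production implementation would pack eight signs per byte, apply error compensation per chunk, and use a more communication-efficient compressed allreduce. The helper name `one_bit_all_average` is made up for this example.

```python
import os
import torch
import torch.distributed as dist

def one_bit_all_average(tensor):
    """Simplified compressed exchange: each rank contributes sign (int8) plus one
    scale, then every rank reconstructs and averages all workers' tensors."""
    world_size = dist.get_world_size()
    scale = tensor.abs().mean().reshape(1)
    signs = tensor.sign().to(torch.int8)

    gathered_signs = [torch.empty_like(signs) for _ in range(world_size)]
    gathered_scales = [torch.empty_like(scale) for _ in range(world_size)]
    dist.all_gather(gathered_signs, signs)
    dist.all_gather(gathered_scales, scale)

    out = torch.zeros_like(tensor)
    for s, sc in zip(gathered_signs, gathered_scales):
        out += sc * s.to(tensor.dtype)
    return out / world_size

if __name__ == "__main__":
    # Launch with: torchrun --nproc_per_node=<num_gpus> this_script.py
    dist.init_process_group(backend="nccl")
    local_rank = int(os.environ["LOCAL_RANK"])
    torch.cuda.set_device(local_rank)

    x = torch.randn(1 << 20, device="cuda")
    averaged = one_bit_all_average(x)
    if dist.get_rank() == 0:
        print("compressed-average norm:", averaged.norm().item())
    dist.destroy_process_group()
```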

Numerical Results and Implications

The strong empirical performance reported in the paper highlights several implications for distributed deep learning:

  • Scalability: The efficient handling of communications makes 1-bit LAMB well-suited for scaling up to large GPU clusters, which is increasingly relevant as models grow in size.
  • Practical Usability: By reducing communication overhead significantly, clusters with commodity hardware and network configurations can be better utilized, broadening the accessibility of large-scale training.
  • Adaptation of Compression: The method demonstrates that adaptive layerwise learning rates can be integrated effectively with aggressive compression, which has broader implications for communication-efficient variants of other adaptive optimizers.
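
As a back-of-the-envelope illustration of where the reported up-to-4.6x end-to-end volume reduction can come from: the compression stage sends roughly 1 bit per element instead of a full-precision value, but the warmup stage remains uncompressed, so the end-to-end ratio is a blend of the two stages. The warmup fraction below is an assumed illustrative value, not a number taken from the paper.

```python
# Illustrative accounting only; the warmup fraction is an assumed value,
# not a number reported in the paper.
bits_full = 32          # uncompressed element (fp32)
bits_compressed = 1     # compression stage sends ~1 bit per element
warmup_fraction = 0.20  # assumed fraction of steps run uncompressed

avg_bits = warmup_fraction * bits_full + (1 - warmup_fraction) * bits_compressed
print(f"end-to-end volume reduction ~ {bits_full / avg_bits:.1f}x")
# With these assumed numbers this prints ~4.4x, in the same ballpark as the
# paper's reported up-to-4.6x reduction for BERT-Large pre-training.
```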

Theoretical Considerations

The paper also outlines a theoretical framework underpinning 1-bit LAMB's convergence properties. It provides a convergence guarantee whose asymptotic rate matches that of uncompressed distributed SGD, indicating that error-compensated 1-bit compression does not degrade asymptotic convergence behavior.
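
Convergence analyses for this family of error-compensated compressed optimizers typically establish a bound of the following form (the notation here is generic rather than quoted from the paper): for n workers and T iterations, under standard smoothness and bounded-variance assumptions,

```latex
\frac{1}{T} \sum_{t=1}^{T} \mathbb{E}\,\big\| \nabla f(x_t) \big\|^2
\;\le\; O\!\left(\frac{1}{\sqrt{nT}}\right),
```

which matches the asymptotic rate of uncompressed distributed SGD, so the effect of compression shows up only in lower-order terms.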

Future Directions

The paper opens several avenues for future research:

  • Algorithmic Extensions: Extending the approach to other adaptive optimizers beyond LAMB could enhance its applicability across various architectures and domains.
  • Further Optimizations: Investigating different communication compression techniques could yield additional performance gains and potentially simplify implementation.
  • Real-world Applications: Applying 1-bit LAMB in diverse, large-scale machine learning applications could uncover further insights and areas for improvement.

In summary, the 1-bit LAMB algorithm provides a substantial enhancement in the efficient training of large-scale models, particularly in environments with constrained network capabilities. Its strategic integration of large-batch optimization with communication compression represents a significant step forward in distributed training methodologies. The paper effectively combines theoretical rigor with experimental validation, contributing meaningfully to the field of scalable deep learning systems.

Authors (5)
  1. Conglong Li (15 papers)
  2. Ammar Ahmad Awan (15 papers)
  3. Hanlin Tang (34 papers)
  4. Samyam Rajbhandari (21 papers)
  5. Yuxiong He (59 papers)
Citations (31)