Overview of the Highly Scalable Deep Learning Training System
This paper presents a highly scalable deep learning training system designed to achieve rapid training on dense GPU clusters, with a focus on minimizing training time for ImageNet classification models such as AlexNet and ResNet-50. The authors combine mixed-precision arithmetic with optimized communication strategies to improve both computational throughput and system scalability.
Key Contributions and Methodologies
This paper's primary contributions can be broken down as follows:
- Mixed-Precision Training: The system exploits mixed-precision training to maximize GPU throughput, using half-precision (FP16) arithmetic for the forward and backward passes while maintaining a full-precision (FP32) master copy of the weights for the parameter update. This retains accuracy while boosting computational efficiency.
- Large Mini-Batch Optimization: The authors scale the mini-batch size up to 64K for both AlexNet and ResNet-50 without loss of accuracy. This is accomplished through techniques such as Layer-wise Adaptive Rate Scaling (LARS), eliminating weight decay on batch-normalization parameters, and carefully inserting batch-normalization layers to prevent underfitting and maintain model generalization.
- Optimized All-Reduce Algorithms: To address the communication bottlenecks inherent in distributed training, the authors introduce a hybrid all-reduce strategy that combines tensor fusion with hierarchical communication. These enhancements markedly increase the scalability and efficiency of gradient aggregation across large GPU clusters.
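The mixed-precision recipe above can be made concrete with a minimal NumPy sketch of one SGD step for a linear model: compute in FP16, apply loss scaling so small gradients survive FP16's limited range, and update an FP32 master copy of the weights. This is an illustration, not the paper's implementation; the function name, loss-scale value, and toy model are all hypothetical.

```python
import numpy as np

def mixed_precision_step(master_w, x, y, lr=0.01, loss_scale=1024.0):
    """One SGD step for a linear least-squares model, sketching FP16 compute
    with an FP32 master weight copy (loss_scale is an illustrative constant)."""
    w16 = master_w.astype(np.float16)            # cast master weights down for compute
    x16 = x.astype(np.float16)
    pred = x16 @ w16                              # forward pass in FP16
    err = pred.astype(np.float32) - y             # residual accumulated in FP32
    # backward pass in FP16; gradients are scaled up so tiny values do not
    # underflow to zero in FP16
    grad16 = (x16.T @ (err * loss_scale).astype(np.float16)) / len(y)
    grad = grad16.astype(np.float32) / loss_scale  # unscale back in FP32
    return master_w - lr * grad                    # FP32 master-weight update
```

Keeping the update in FP32 matters because a small `lr * grad` added to an FP16 weight can be rounded away entirely, stalling training.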
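The core of LARS fits in a few lines: each layer gets its own learning-rate multiplier, proportional to the ratio of its weight norm to its gradient norm, so no single layer's update overwhelms its weights at very large batch sizes. A hedged sketch follows; the trust coefficient and weight-decay constants are illustrative defaults, not the paper's tuned values.

```python
import numpy as np

def lars_lr(weights, grad, base_lr, trust_coef=0.001, weight_decay=5e-5, eps=1e-9):
    """Layer-wise Adaptive Rate Scaling: scale the global LR for one layer by
    trust_coef * ||w|| / (||g|| + weight_decay * ||w||)."""
    w_norm = np.linalg.norm(weights)
    g_norm = np.linalg.norm(grad)
    local_lr = trust_coef * w_norm / (g_norm + weight_decay * w_norm + eps)
    return base_lr * local_lr
```

A layer whose gradients are large relative to its weights automatically receives a smaller step, which is what stabilizes training at 64K mini-batches.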
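The hierarchical all-reduce can be pictured as a two-level reduction: gradients are first reduced within each node, the per-node partial sums are reduced across nodes (e.g. by a ring all-reduce among node leaders), and the result is broadcast back to every GPU. Below is a toy single-process simulation of that data flow, assuming each GPU's many small gradients have already been fused into one flat array; it models the arithmetic, not the actual network transport.

```python
import numpy as np

def hierarchical_allreduce(grads_per_gpu, gpus_per_node=4):
    """Simulate a two-level all-reduce over a list of 1-D gradient arrays,
    one per GPU. Real systems overlap these phases with communication."""
    # group GPUs into nodes (tensor fusion would have produced each flat array)
    nodes = [grads_per_gpu[i:i + gpus_per_node]
             for i in range(0, len(grads_per_gpu), gpus_per_node)]
    node_sums = [np.sum(node, axis=0) for node in nodes]   # intra-node reduce
    total = np.sum(node_sums, axis=0)                      # inter-node reduce
    return [total.copy() for _ in grads_per_gpu]           # broadcast to all GPUs
```

Fusing small tensors before the reduction matters because launching one all-reduce per tiny gradient is latency-bound; one large fused transfer amortizes that cost.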
Experimental Results
The performance improvements achieved by this system are significant:
- The training system attains 75.8% top-1 accuracy on ResNet-50 in 6.6 minutes using 2,048 Tesla P40 GPUs, surpassing prior state-of-the-art results in both speed and accuracy.
- For AlexNet, the system reaches 58.7% top-1 accuracy in 4 minutes with 1,024 GPUs, again setting a new benchmark for training speed and efficiency.
The experimental findings underscore the effectiveness of the system, particularly in large-scale distributed environments that necessitate high throughput and efficiency.
Theoretical and Practical Implications
The implications of this research are both theoretical and practical. Theoretically, the integration of mixed-precision training with LARS sharpens our understanding of how model accuracy can be maintained under aggressive hardware-aware optimizations. Practically, the system shows how to harness GPU clusters for highly efficient deep learning training, lowering the time and resource costs of model training and thereby reducing barriers to entry for large-scale AI research and industry deployment.
Future Directions
This work opens several avenues for further exploration. Future research could refine the communication strategies to handle ever-larger model parameters more effectively, or extend these techniques to other neural network architectures and tasks beyond image classification. Additionally, studying how mixed-precision training interacts with other adaptive learning-rate schedulers and distributed optimization algorithms could yield even more robust and scalable solutions.
In conclusion, this paper makes substantial contributions to distributed deep learning by proposing scalable strategies that improve both computational and communication efficiency on dense GPU clusters. These contributions have meaningful implications for accelerating AI research and deployment in computationally intensive environments.