
One weird trick for parallelizing convolutional neural networks (1404.5997v2)

Published 23 Apr 2014 in cs.NE, cs.DC, and cs.LG

Abstract: I present a new way to parallelize the training of convolutional neural networks across multiple GPUs. The method scales significantly better than all alternatives when applied to modern convolutional neural networks.

Authors (1)
  1. Alex Krizhevsky (4 papers)
Citations (1,249)

Summary

  • The paper introduces a hybrid parallelization method that optimizes CNN training by applying data parallelism to convolutional layers and model parallelism to fully-connected layers.
  • It employs schemes that aggregate large batches and overlap communication, reducing training time from 98.05 hours on a single GPU to 15.7 hours on eight GPUs.
  • Experimental results show only a slight increase in top-1 error from 42.33% to 42.86%, demonstrating effective scaling with minimal accuracy degradation.

Algorithmic Development for Efficient Parallelization of Convolutional Neural Networks

The paper by Alex Krizhevsky introduces a novel method for parallelizing the training process of Convolutional Neural Networks (CNNs) across multiple GPUs. The proposed approach intelligently combines both data parallelism and model parallelism to leverage the strengths of each, achieving superior scaling performance compared to existing methods.

Introduction

Training CNNs with Stochastic Gradient Descent (SGD) on large datasets is computationally intensive, making parallelization necessary. The two common approaches are model parallelism, in which different workers train different parts of the model, and data parallelism, in which different workers process different batches of data independently. The paper proposes a hybrid approach that improves training efficiency by applying each scheme to the type of layer it suits best.
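
As a point of reference, the sketch below contrasts the two pure schemes for a single linear layer. It is a minimal single-process numpy simulation, not the paper's GPU implementation; the worker count and layer sizes are arbitrary illustrative values.

```python
import numpy as np

rng = np.random.default_rng(0)
K = 4                      # number of (simulated) GPUs
batch, d_in, d_out = 64, 256, 128
X = rng.standard_normal((batch, d_in))
W = rng.standard_normal((d_in, d_out))

# Data parallelism: every worker holds a full copy of W and
# processes its own shard of the batch.
x_shards = np.split(X, K, axis=0)
y_data_parallel = np.concatenate([x @ W for x in x_shards], axis=0)

# Model parallelism: every worker holds a slice of W's output
# columns and processes the full batch.
w_slices = np.split(W, K, axis=1)
y_model_parallel = np.concatenate([X @ w for w in w_slices], axis=1)

# Both decompositions reproduce the single-worker result.
assert np.allclose(y_data_parallel, X @ W)
assert np.allclose(y_model_parallel, X @ W)
```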

Proposed Method

The key insight is to parallelize convolutional and fully-connected layers differently, according to their computational characteristics:

  • Convolutional Layers: These layers account for the bulk of the computation but hold relatively few parameters. Data parallelism is therefore more effective: each GPU processes its own batch shard with a replicated copy of the weights, and synchronizing the small set of parameters keeps inter-GPU communication low.
  • Fully-Connected Layers: These layers hold most of the parameters but account for comparatively little computation. Model parallelism is better suited here, since distributing slices of the large weight matrices across GPUs avoids synchronizing them and spreads the memory load; only the much smaller activations need to be communicated. A toy sketch of this split follows the list.
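
To make the hybrid split concrete, here is a toy single-process numpy sketch: the "convolutional" stage is data-parallel over batch shards with replicated weights, its outputs are gathered into one large batch, and the fully-connected stage is model-parallel over column slices of its weight matrix. The stand-in conv computation and all sizes are illustrative assumptions, not the paper's architecture.

```python
import numpy as np

rng = np.random.default_rng(1)
K = 4                               # simulated GPUs
per_gpu_batch, d_img, d_feat, d_fc = 32, 512, 256, 128

# Replicated "convolutional" weights and column-sliced FC weights.
W_conv = rng.standard_normal((d_img, d_feat))          # same copy on every GPU
W_fc_slices = np.split(rng.standard_normal((d_feat, d_fc)), K, axis=1)

# Each GPU gets its own batch shard (data parallelism in the conv stage).
shards = [rng.standard_normal((per_gpu_batch, d_img)) for _ in range(K)]
conv_out = [np.maximum(x @ W_conv, 0.0) for x in shards]   # stand-in for the conv layers

# All-gather: every GPU now sees the full batch of conv activations.
full_batch = np.concatenate(conv_out, axis=0)              # shape (K*per_gpu_batch, d_feat)

# Model parallelism in the FC stage: GPU k computes only its slice of the output.
fc_out = np.concatenate([full_batch @ w for w in W_fc_slices], axis=1)
print(fc_out.shape)    # (128, 128): full batch, full FC output assembled from slices
```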

Detailed Algorithm

The paper details three schemes for transitioning from data parallelism (in the convolutional layers) to model parallelism (in the fully-connected layers):

  1. Scheme (a): Assemble a large batch of 128K examples (K being the number of GPUs, each contributing its 128-example batch) before performing computation in the fully-connected layers. Despite high memory usage, this method ensures efficient GPU utilization.
  2. Scheme (b): GPUs take turns sending their last convolutional layer's output to all other GPUs. This scheme hides communication latency by overlapping it with computation, achieving better utilization of resources.
  3. Scheme (c): Similar to Scheme (b) but distributes the communication more evenly among the GPUs, making it scale better with the number of GPUs by maintaining a constant communication-to-computation ratio.

In the backward pass, gradients are computed and exchanged accordingly, ensuring coherent updates across all GPUs.
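
The following toy numpy simulation sketches scheme (b): each GPU's chunk of conv activations is broadcast in turn, and every GPU applies its FC weight slice to that chunk; processing the chunks sequentially reproduces the big-batch result of scheme (a). The overlap of communication with computation is only indicated in comments, since a single-process simulation cannot show it, and all sizes are illustrative assumptions.

```python
import numpy as np

rng = np.random.default_rng(2)
K, per_gpu_batch, d_feat, d_fc = 4, 32, 256, 128

# Conv activations already computed on each GPU (data-parallel stage),
# and column-sliced FC weights (model-parallel stage).
conv_out = [rng.standard_normal((per_gpu_batch, d_feat)) for _ in range(K)]
W_fc_slices = np.split(rng.standard_normal((d_feat, d_fc)), K, axis=1)

# Scheme (a): gather everything first, then run the FC stage on one big batch.
big_batch = np.concatenate(conv_out, axis=0)
out_a = np.concatenate([big_batch @ w for w in W_fc_slices], axis=1)

# Scheme (b): GPUs take turns broadcasting their chunk; everyone runs the FC
# forward pass on that chunk while, in the real system, the next broadcast
# overlaps with this computation.
chunks = []
for k in range(K):
    chunk = conv_out[k]                                      # broadcast from GPU k
    chunks.append(np.concatenate([chunk @ w for w in W_fc_slices], axis=1))
out_b = np.concatenate(chunks, axis=0)

assert np.allclose(out_a, out_b)   # same result, but (b) can hide communication
```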

Experimental Setup and Results

The experiments use the ImageNet 2012 dataset and a slightly modified version of the winning ILSVRC 2012 model. They show that larger effective batch sizes do incur some accuracy cost, which can be mitigated with a variable batch size technique: convolutional layers operate on a large effective batch while fully-connected layers are updated on smaller batches, improving convergence and the quality of the final solution.
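
As a rough illustration of the variable batch size idea, the sketch below uses a toy linear least-squares model rather than a CNN: "conv-like" parameters accumulate gradients over all K chunks and take one step per effective batch of 128K examples, while "FC-like" parameters take a small step on every 128-example chunk. The model, learning rate, and sizes are assumptions for illustration only.

```python
import numpy as np

rng = np.random.default_rng(3)
K, chunk, d = 8, 128, 64                  # 8 GPUs, 128 examples per chunk

# Toy quadratic loss 0.5*||X w - y||^2, so gradients are easy to write down.
X = rng.standard_normal((K * chunk, d))
y = rng.standard_normal(K * chunk)
w_conv_like = np.zeros(d)                 # updated once per K chunks (batch 128K)
w_fc_like = np.zeros(d)                   # updated once per chunk   (batch 128)
lr = 1e-3

grad_accum = np.zeros(d)
for k in range(K):
    Xk = X[k * chunk:(k + 1) * chunk]
    yk = y[k * chunk:(k + 1) * chunk]
    # "FC-like" parameters: take a small step on every 128-example chunk.
    g_fc = Xk.T @ (Xk @ w_fc_like - yk) / chunk
    w_fc_like -= lr * g_fc
    # "Conv-like" parameters: accumulate and step once with the big batch
    # (w_conv_like stays fixed inside the loop, so this equals one full-batch step).
    grad_accum += Xk.T @ (Xk @ w_conv_like - yk)
w_conv_like -= lr * grad_accum / (K * chunk)
```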

The experiments were conducted on a machine with eight NVIDIA K20 GPUs and show substantial reductions in training time with relatively minor accuracy compromises. For instance, training on 8 GPUs took approximately 15.7 hours versus 98.05 hours on a single GPU (roughly a 6.2x speedup), with top-1 error increasing marginally from 42.33% to 42.86%.

Comparison with Related Work

The paper benchmarks its method against existing works. Compared to Yadan et al. (2013) and Paine et al. (2013), the proposed hybrid scheme demonstrates better scaling with fewer communication bottlenecks and lower accuracy degradation. The approach stands out particularly in environments with constrained inter-GPU communication bandwidth.

Implications and Future Work

The results emphasize the importance of layer-specific parallelization strategies, suggesting that purely synchronous or asynchronous SGD methods can be enhanced by hybrid schemes tailored to the computational profile of different layers. Future development could focus on optimizing fully-connected layer architectures for parallelism, exploring hybrid schemes beyond Scheme (c), and adapting network architectures to better fit multi-GPU training environments.

Adopting such specialized parallelization strategies could significantly expedite deep learning research and development, particularly for large-scale models and datasets. The paper paves the way for more sophisticated and efficient multi-GPU training methodologies, which will be crucial as models and datasets continue to grow in complexity and size.
