Training Deep Nets with Sublinear Memory Cost
The paper by Chen et al. proposes a novel approach to significantly reduce memory consumption during the training of deep neural networks. The primary contribution is an algorithm that cuts the memory cost of training an n-layer network to O(√n), at the price of only one additional forward pass per mini-batch. This is particularly relevant as deep learning models grow in depth and complexity, often exceeding the memory limits of GPUs.
Key Contributions
- Reduction of Memory Consumption: The authors introduce an algorithm that minimizes the memory cost of storing intermediate feature maps and gradients during training. By analyzing the computation graph, the algorithm applies in-place operations and memory sharing wherever dependencies allow.
- Trade-off between Computation and Memory: The paper details a strategy for trading computation for memory savings. By recomputing selected intermediate results during the backpropagation phase, the memory requirement can be pushed as low as O(log n) in the extreme case, albeit with the forward computation cost rising to O(n log n) (see the checkpointing sketch after this list).
- Experimental Validation: Experiments demonstrate substantial memory savings. For instance, a 1,000-layer deep residual network saw memory usage drop from 48GB to 7GB on ImageNet tasks. Similar benefits were observed for complex recurrent neural networks handling very long sequences.
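To make the recompute-in-backward idea concrete, here is a minimal NumPy sketch of checkpointed training for a plain chain of affine + ReLU layers. This is not the paper's MXNet implementation: the layer functions, the segment length k, and the toy dimensions are all illustrative. Only every k-th activation is stored in the forward pass; the backward pass re-runs the forward computation inside each segment before backpropagating through it, so peak activation memory scales roughly as n/k + k rather than n.

```python
# Sketch of gradient checkpointing for a chain of n layers (illustrative only).
import numpy as np

def layer_forward(x, w):
    """One affine + ReLU layer; returns only the output, no cached values."""
    return np.maximum(x @ w, 0.0)

def layer_backward(x, w, grad_out):
    """Recompute the pre-activation locally, then backprop through the layer."""
    pre = x @ w
    grad_pre = grad_out * (pre > 0.0)
    return grad_pre @ w.T, x.T @ grad_pre   # grad wrt input, grad wrt weight

def forward_with_checkpoints(x, weights, k):
    """Run the chain, storing only the input of every segment of k layers."""
    checkpoints = [x]
    for i, w in enumerate(weights):
        x = layer_forward(x, w)
        if (i + 1) % k == 0 and i + 1 < len(weights):
            checkpoints.append(x)
    return x, checkpoints

def backward_with_recompute(checkpoints, weights, k, grad_out):
    """Backward pass that re-runs the forward computation inside each segment."""
    n = len(weights)
    grads_w = [None] * n
    for seg in reversed(range(len(checkpoints))):
        start, end = seg * k, min((seg + 1) * k, n)
        # Recompute the activations needed in this segment (the extra forward work).
        acts = [checkpoints[seg]]
        for i in range(start, end - 1):
            acts.append(layer_forward(acts[-1], weights[i]))
        # Ordinary backprop through the segment using the recomputed activations.
        for i in reversed(range(start, end)):
            grad_out, grads_w[i] = layer_backward(acts[i - start], weights[i], grad_out)
    return grad_out, grads_w

# Toy usage: 16 layers, checkpoint every 4 layers (roughly sqrt(n) kept at once).
rng = np.random.default_rng(0)
weights = [rng.standard_normal((32, 32)) * 0.1 for _ in range(16)]
x = rng.standard_normal((8, 32))
y, cps = forward_with_checkpoints(x, weights, k=4)
grad_x, grads_w = backward_with_recompute(cps, weights, k=4, grad_out=np.ones_like(y))
```

The same recompute-in-backward pattern is exposed by several frameworks (for example, PyTorch's torch.utils.checkpoint), but the sketch above is framework-free to keep the memory accounting visible.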
Detailed Insights
The authors delve into the use of computation graphs and liveness analysis, methodologies originally developed for compiler optimizations, to facilitate memory allocation improvements in deep learning frameworks. By employing techniques such as in-place operations and memory sharing, memory usage is minimized without compromising computational efficiency significantly.
Computation Graph Optimization
The computation graph encapsulates the operations and their dependencies within a neural network. By traversing this graph and judiciously applying memory optimizations, it is possible to share memory between nodes whose lifetimes do not overlap. This avoids unnecessary allocations by recycling memory where feasible.
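As a rough illustration of this idea (simplified well beyond the paper's allocator: it assumes equal-sized outputs, ignores activations that must be kept for the backward pass, and uses made-up node names), the following sketch assigns each node's output to a pool of reusable buffers, using reference counts over a topologically ordered graph as a stand-in for liveness analysis.

```python
# Illustrative buffer-sharing plan: outputs whose lifetimes do not overlap
# are mapped to the same buffer id.
from collections import defaultdict

def plan_memory(order, deps):
    """order: nodes in topological order; deps[node] = list of input nodes.
    Returns (node -> buffer id, number of buffers used)."""
    consumers = defaultdict(int)
    for node in order:
        for d in deps[node]:
            consumers[d] += 1                 # how many times each output is read

    free_buffers, next_buffer = [], 0
    assignment = {}
    for node in order:
        # Allocate a buffer for this node's output, reusing a freed one if possible.
        if free_buffers:
            assignment[node] = free_buffers.pop()
        else:
            assignment[node] = next_buffer
            next_buffer += 1
        # Release inputs whose last consumer was this node.
        for d in deps[node]:
            consumers[d] -= 1
            if consumers[d] == 0:
                free_buffers.append(assignment[d])
    return assignment, next_buffer

# A small chain a -> b -> c -> d needs only 2 buffers instead of 4.
order = ["a", "b", "c", "d"]
deps = {"a": [], "b": ["a"], "c": ["b"], "d": ["c"]}
plan, n_buffers = plan_memory(order, deps)
print(plan, n_buffers)   # e.g. {'a': 0, 'b': 1, 'c': 0, 'd': 1} 2
```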
Segment-based Memory Optimization
One pivotal idea presented is to divide the computation into segments, storing outputs of these segments temporarily and recomputing the intermediate results within each segment during backpropagation. This is articulated through an algorithm that dynamically plans memory allocation based on a user-specified budget, balancing the memory used for feature maps and computational steps. The optimal segment size found through this method enables sublinear memory usage with minimal computational overhead.
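A simplified version of the cost analysis behind that choice: splitting an n-layer chain into segments of length k means keeping the roughly n/k segment outputs plus at most k intermediates of the segment currently being recomputed, so

```latex
% Peak activation memory for an n-layer chain split into segments of length k:
% the ~n/k stored segment outputs plus the <= k intermediates recomputed at once.
\[
  \mathrm{memory}(k) \;\approx\; \frac{n}{k} + k,
  \qquad
  \frac{d}{dk}\left(\frac{n}{k} + k\right) = -\frac{n}{k^{2}} + 1 = 0
  \;\Longrightarrow\; k = \sqrt{n},
\]
\[
  \mathrm{memory}\bigl(\sqrt{n}\bigr) = O\bigl(\sqrt{n}\bigr),
\]
% with the recomputation amounting to roughly one extra forward pass per mini-batch.
```

Applying the same segmentation recursively within each segment is what drives the memory cost down to O(log n), at the expense of O(n log n) forward computation.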
The practical implementation of these ideas in a deep learning framework like MXNet illustrates clear memory reductions. The authors compare several memory allocation strategies, revealing that the proposed sublinear plan significantly outperforms traditional methods, even those employing in-place operations and memory sharing.
Implications and Future Directions
The paper's findings have profound implications for the scalability of deep learning models. By allowing much deeper networks to be trained within existing memory constraints, the algorithm opens up possibilities for more complex models capable of capturing intricate patterns in large datasets. From a practical perspective, this could improve the performance of models in domains such as computer vision, speech recognition, and natural language processing.
Moreover, the trade-off between computation and memory offers flexibility, enabling researchers to make informed choices based on their computational resources and the specific characteristics of their neural network architectures.
Speculation on Future Directions
Looking ahead, the methods proposed could be integrated into more deep learning frameworks, potentially becoming standard practice for training deep neural networks. Further optimization could focus on reducing the overheads associated with recomputation, perhaps leveraging advancements in hardware accelerators or more sophisticated parallelization techniques. Additionally, expanding the applicability of these methods to other types of neural networks and training paradigms could further amplify their impact.
In conclusion, this paper provides a systematic approach to the critical challenge of memory consumption in training deep neural networks. By introducing an algorithm with sublinear memory cost and only a modest increase in computation, the methods discussed pave the way for more efficient and scalable neural network training, thereby pushing the boundaries of what is feasible within deep learning research.