- The paper demonstrates that mixed precision training with an FP32 master copy of the weights matches the accuracy of FP32 training.
- The methodology employs loss scaling and FP16 arithmetic with FP32 accumulation to effectively mitigate underflow and precision loss.
- The approach reduces memory and computation costs, enabling scalable training across CNNs, RNNs, and GANs on large datasets.
Mixed Precision Training: An Overview
The paper "Mixed Precision Training" by Sharan Narang et al. presents a systematic approach to training deep neural networks (DNNs) using half-precision floating point numbers (FP16), while maintaining model accuracy on par with traditional single-precision (FP32) training. The primary drivers of this work are the substantial memory and computational advantages offered by reduced precision arithmetic, especially in the context of the rapid growth in model sizes and dataset scales. This essay provides an in-depth analysis of the methodologies proposed and their implications in contemporary deep learning research.
Techniques Implemented for Mixed Precision Training
The authors introduce three techniques that preserve accuracy while most of the training computation runs in FP16:
- Single-Precision Master Copy of Weights:
- Weights, activations, and gradients are stored in FP16 during the forward and backward passes, while an FP32 master copy of the weights is maintained and updated with the gradients. This is crucial because a weight update whose magnitude is much smaller than the weight itself rounds to zero if applied in FP16, which can stall convergence; the FP32 copy preserves these small updates (see the training-step sketch after this list).
- Loss Scaling:
- Because FP16 has a much narrower dynamic range than FP32, small gradient values are prone to underflow and become zero. Loss scaling multiplies the loss by a constant factor before back-propagation, shifting gradient values up into FP16’s representable range. After back-propagation, the gradients are divided by the same factor (unscaled) before the weight update; this step is also illustrated in the sketch after this list.
- FP16 Arithmetic with FP32 Accumulation:
- FP16 products, such as those inside matrix multiplications and other reductions, are accumulated into FP32 values to prevent precision loss during long summations. This ensures that operations critical to the network’s learning process are not compromised by FP16’s limited precision; a small numeric illustration follows the training-step sketch below.
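The first two techniques combine into a simple per-iteration recipe. The following is a minimal sketch in PyTorch-style Python, assuming a CUDA device; the model, tensor sizes, loss function, and the fixed `loss_scale` value are illustrative placeholders rather than the paper’s exact configuration.

```python
import torch

# Minimal mixed precision training step (sketch). Assumes a CUDA device.
# Model, sizes, loss, and loss_scale are illustrative placeholders.
model = torch.nn.Linear(1024, 1024).cuda().half()           # FP16 weights and activations
master_params = [p.detach().clone().float() for p in model.parameters()]  # FP32 master copy
optimizer = torch.optim.SGD(master_params, lr=0.1)           # optimizer updates the FP32 copy
loss_scale = 1024.0                                          # constant loss-scaling factor

def train_step(x, target):
    # 1) Forward and backward in FP16, with the loss scaled up so that
    #    small gradient values stay inside FP16's representable range.
    loss = torch.nn.functional.mse_loss(model(x.half()), target.half())
    (loss * loss_scale).backward()

    # 2) Copy the FP16 gradients onto the FP32 master params and unscale them.
    for master, p in zip(master_params, model.parameters()):
        master.grad = p.grad.detach().float() / loss_scale
        p.grad = None

    # 3) Apply the update in FP32, then refresh the FP16 working copy.
    optimizer.step()
    optimizer.zero_grad()
    with torch.no_grad():
        for master, p in zip(master_params, model.parameters()):
            p.copy_(master.half())
    return loss.item()
```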
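To see why FP32 accumulation matters, consider summing many small FP16 products. The toy NumPy example below (illustrative numbers, not taken from the paper) shows the FP16 accumulator stalling once the running sum grows large relative to each addend, while the FP32 accumulator stays close to the true value.

```python
import numpy as np

# Toy illustration: sum 10,000 products of 0.01 * 0.01 (true total ~1.0).
a = np.full(10_000, 0.01, dtype=np.float16)
b = np.full(10_000, 0.01, dtype=np.float16)

acc_fp16 = np.float16(0.0)   # accumulate in FP16
acc_fp32 = np.float32(0.0)   # accumulate in FP32
for x, y in zip(a, b):
    prod = x * y                               # product computed in FP16
    acc_fp16 = np.float16(acc_fp16 + prod)     # FP16 sum stalls once it dwarfs each addend
    acc_fp32 = acc_fp32 + np.float32(prod)     # FP32 sum stays close to the true value

print("FP16 accumulator:", acc_fp16)   # noticeably below 1.0
print("FP32 accumulator:", acc_fp32)   # approximately 1.0
```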
Evaluations and Results
The paper demonstrates the efficacy of mixed precision training across a diverse set of tasks and model architectures including convolutional neural networks (CNNs) for image classification and object detection, recurrent neural networks (RNNs) for speech recognition and machine translation, and generative adversarial networks (GANs) for image synthesis.
- Image Classification:
- Various CNN architectures such as AlexNet, VGG, and ResNet were trained on the ILSVRC (ImageNet) classification dataset. The results indicate that mixed precision training matches FP32 accuracy without any modification to hyper-parameters.
- Object Detection:
- For Faster R-CNN and Multibox SSD models, mixed precision training yielded mean average precision (mAP) metrics akin to FP32 training. Notably, SSD training required the loss-scaling technique to prevent divergence and maintain performance.
- Speech Recognition:
- Training the DeepSpeech 2 model on English and Mandarin datasets demonstrated that mixed precision training could effectively handle large models with extensive time-steps. Interestingly, mixed precision training sometimes achieved slightly better character error rates (CER) than FP32, suggesting a potential regularization effect.
- Machine Translation and Language Modeling:
- LSTM-based models for language translation and the bigLSTM model for language modeling also performed reliably under mixed precision training, provided that loss scaling was applied to counteract gradient underflow.
- Generative Adversarial Networks (GANs):
- The DCGAN model for face generation demonstrated that mixed precision training could produce high-quality images with comparable visual fidelity to those generated by FP32 training.
Practical and Theoretical Implications
The proposed mixed precision training methodology substantially reduces memory requirements and training time for deep learning models. The savings come from the halved memory footprint and bandwidth of FP16 tensors and from the higher throughput of FP16 arithmetic on contemporary GPUs. The practical upshot is that larger and more complex models become feasible to train, making the technique a significant enabler of further advances in AI research. Moreover, its applicability across diverse tasks underscores its robustness and generality.
Future Directions
Future research could extend these methodologies to other domains, such as generative models for text-to-speech and deep reinforcement learning. Automating the selection and adjustment of the loss-scaling factor during training could improve usability and robustness, eliminating the need for empirical tuning; one such dynamic scheme is sketched below.
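One common way to automate the scaling factor, used by modern frameworks (for example, PyTorch’s `torch.cuda.amp.GradScaler`), is dynamic loss scaling: start with a large scale, shrink it and skip the step whenever gradients overflow to Inf/NaN, and cautiously grow it again after a run of stable steps. The sketch below is a generic illustration; the class name, constants, and update rule are assumptions, not part of the paper.

```python
class DynamicLossScaler:
    """Minimal sketch of dynamic loss scaling (illustrative constants)."""

    def __init__(self, init_scale=2.0**15, growth_factor=2.0,
                 backoff_factor=0.5, growth_interval=2000):
        self.scale = init_scale
        self.growth_factor = growth_factor
        self.backoff_factor = backoff_factor
        self.growth_interval = growth_interval
        self.good_steps = 0

    def update(self, found_inf_or_nan: bool) -> bool:
        """Return True if the optimizer step should be applied this iteration."""
        if found_inf_or_nan:
            # Overflow detected: skip this step and shrink the scale.
            self.scale *= self.backoff_factor
            self.good_steps = 0
            return False
        # No overflow: after enough stable steps, try a larger scale.
        self.good_steps += 1
        if self.good_steps >= self.growth_interval:
            self.scale *= self.growth_factor
            self.good_steps = 0
        return True
```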
In conclusion, "Mixed Precision Training" sets a foundational precedent for efficient and scalable model training without compromising performance, fortifying its place in the evolving landscape of deep learning techniques.