
Training Deep Neural Networks with 8-bit Floating Point Numbers (1812.08011v1)

Published 19 Dec 2018 in cs.LG and stat.ML

Abstract: The state-of-the-art hardware platforms for training Deep Neural Networks (DNNs) are moving from traditional single precision (32-bit) computations towards 16 bits of precision -- in large part due to the high energy efficiency and smaller bit storage associated with using reduced-precision representations. However, unlike inference, training with numbers represented with less than 16 bits has been challenging due to the need to maintain fidelity of the gradient computations during back-propagation. Here we demonstrate, for the first time, the successful training of DNNs using 8-bit floating point numbers while fully maintaining the accuracy on a spectrum of Deep Learning models and datasets. In addition to reducing the data and computation precision to 8 bits, we also successfully reduce the arithmetic precision for additions (used in partial product accumulation and weight updates) from 32 bits to 16 bits through the introduction of a number of key ideas including chunk-based accumulation and floating point stochastic rounding. The use of these novel techniques lays the foundation for a new generation of hardware training platforms with the potential for 2-4x improved throughput over today's systems.

An Evaluation of 8-bit Floating Point Training for Deep Neural Networks

The paper under review presents an intriguing advancement in the area of deep neural network (DNN) training, specifically focusing on the potential for utilizing 8-bit floating point (FP8) numbers. Historically, the reduction of numerical precision has been successfully employed in DNN inference to enhance energy efficiency and reduce computational demands. However, achieving the same with training processes has been notably more challenging due to the strict precision required in gradient calculations during back-propagation.

Key Innovations and Methodologies

The authors introduce several pivotal techniques to enable FP8-based training without compromising on model accuracy:

  1. Design of the FP8 Format: A novel 8-bit floating point format (1 sign bit, 5 exponent bits, and 2 mantissa bits) was devised to represent weights, activations, and errors despite the aggressive reduction in precision. The format was chosen after an extensive analysis balancing representation accuracy against dynamic range (a quantization sketch follows this list).
  2. Chunk-based Accumulation: This technique divides long vector dot products into smaller segments, or "chunks." Each chunk is accumulated separately in higher-precision (FP16) arithmetic, mitigating the "swamping" error that arises when many small products are added into a much larger running sum at low precision. This hierarchical accumulation significantly curtails error propagation without incurring substantial hardware overhead (see the second sketch below).
  3. Floating Point Stochastic Rounding: Recognizing the information loss inherent in round-to-nearest during bit-width reduction, the paper adopts stochastic rounding, which rounds up with probability equal to the discarded fraction. This probabilistically retains information from the least significant bits, preserving small updates in expectation and improving numerical stability across accumulations (illustrated in the first sketch below).
  4. Reduction to FP16 Arithmetic: In addition to the 8-bit data representation, the arithmetic precision for the additions used in partial-product accumulation and weight updates is reduced from 32 to 16 bits, further economizing the power and area requirements of the corresponding hardware implementations.
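
To make the (1, 5, 2) format and the stochastic-rounding step concrete, the following is a minimal NumPy sketch of projecting values onto such a grid. The bias, the saturation behavior, and the `quantize_fp8` name are illustrative assumptions, not the authors' implementation.

```python
# Minimal sketch (assumption, not the authors' code): quantize a tensor onto
# an FP8 grid with 1 sign bit, 5 exponent bits, and 2 mantissa bits, using
# either round-to-nearest or stochastic rounding.
import numpy as np

EXP_BITS, MAN_BITS = 5, 2
BIAS = 2 ** (EXP_BITS - 1) - 1          # 15, an IEEE-style bias (assumed)
MAX_EXP = (2 ** EXP_BITS - 2) - BIAS    # largest normal exponent, +15
MIN_EXP = 1 - BIAS                      # smallest normal exponent, -14
MAX_VAL = (2 - 2.0 ** -MAN_BITS) * 2.0 ** MAX_EXP   # largest magnitude

def quantize_fp8(x, stochastic=True, rng=None):
    """Project an array onto the FP8 (1-5-2) grid."""
    rng = rng if rng is not None else np.random.default_rng()
    x = np.asarray(x, dtype=np.float32)
    sign, mag = np.sign(x), np.abs(x)

    # Per-element exponent, clamped so tiny values land in the lowest binade.
    exp = np.clip(np.floor(np.log2(np.where(mag > 0, mag, 1.0))), MIN_EXP, MAX_EXP)
    ulp = 2.0 ** (exp - MAN_BITS)        # spacing of representable values

    low = np.floor(mag / ulp) * ulp      # round-toward-zero neighbour
    if stochastic:
        # Round up with probability equal to the fractional distance, so
        # small contributions survive in expectation instead of vanishing.
        mag_q = low + ulp * (rng.random(mag.shape) < (mag - low) / ulp)
    else:
        mag_q = np.round(mag / ulp) * ulp   # conventional round-to-nearest

    return (sign * np.minimum(mag_q, MAX_VAL)).astype(np.float32)
```

Applied repeatedly to a tiny weight update, round-to-nearest would discard it entirely, whereas the stochastic branch applies it with the corresponding probability, keeping the accumulated result unbiased in expectation.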

These innovations collectively enable DNN training in which the inputs to general matrix multiplication and convolution operations are FP8 numbers while accumulations and weight updates use FP16 arithmetic. According to the authors, the combination of these precision reductions offers a 2-4x improvement in hardware throughput over contemporary systems.
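
As a rough illustration of the chunk-based accumulation in item 2, the sketch below accumulates a long dot product in two levels. The chunk size of 64 and the reuse of the `quantize_fp8` helper from the previous sketch are assumptions for illustration, not the paper's exact hardware datapath.

```python
# Minimal sketch (assumption): two-level, chunk-based accumulation of a long
# dot product. Inputs are first quantized to FP8 values (held in FP16
# storage); each chunk is summed in FP16, and the chunk sums are then
# combined in a second FP16 accumulation.
import numpy as np

def chunked_dot(a, b, chunk_size=64):
    a8 = quantize_fp8(a).astype(np.float16)   # FP8-valued operands
    b8 = quantize_fp8(b).astype(np.float16)

    chunk_sums = []
    for start in range(0, len(a8), chunk_size):
        seg = a8[start:start + chunk_size] * b8[start:start + chunk_size]
        # Each partial sum spans only `chunk_size` addends, so it stays
        # close in magnitude to the products and loses fewer low-order bits.
        chunk_sums.append(seg.sum(dtype=np.float16))

    # Second-level accumulation over the much shorter list of chunk sums.
    return np.float16(np.sum(np.asarray(chunk_sums, dtype=np.float16)))
```

For a 4096-element dot product with a chunk size of 64, each level adds only 64 terms, so no single accumulation grows large enough to swamp its addends, unlike a single flat FP16 accumulation over all 4096 products.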

Empirical Validation

The authors conduct extensive experimental evaluations across several standard deep learning models and datasets, including CIFAR10-CNN, CIFAR10-ResNet, ImageNet-based ResNet18, ResNet50, and AlexNet, and BN50-DNN. Notably, the experimental results show that FP8 training achieves accuracy comparable to the baseline FP32 models while significantly reducing storage and computation costs. For instance, on ImageNet, ResNet50 reaches a test error of 28.28% with FP8, compared with 27.86% in FP32, a minimal accuracy trade-off.

Implications and Future Directions

The successful demonstration of FP8 training has substantial implications both for software-based optimizations and hardware advancements. The proposed techniques open the possibility for designing more energy-efficient and high-throughput computing platforms specialized for deep learning tasks. Furthermore, given the increasing deployment of deep learning models in resource-constrained environments, such as edge devices, the transition to reduced precision operations carries potential for cost savings and broader accessibility.

The work serves as a foundational step towards revisiting data formats and computational precision in neural network training environments, encouraging further exploration into hybrid precision methods. Future research may explore the refinement of these techniques, particularly in scaling precision settings across different layers or operations within a neural network. Moreover, the exploration of these methods could extend beyond vision domains to encompass diverse applications such as language processing and speech recognition.

In summary, this paper lays convincing groundwork showing that DNN training can move below 16-bit precision while retaining model accuracy and significantly improving hardware efficiency. The demonstrated gains strengthen the paper's contribution to ongoing efforts toward effective model training under tighter computational and energy budgets.

Authors (5)
  1. Naigang Wang (15 papers)
  2. Jungwook Choi (28 papers)
  3. Daniel Brand (4 papers)
  4. Chia-Yu Chen (7 papers)
  5. Kailash Gopalakrishnan (12 papers)
Citations (466)