
Revisiting BFloat16 Training (2010.06192v2)

Published 13 Oct 2020 in cs.LG and stat.ML

Abstract: State-of-the-art generic low-precision training algorithms use a mix of 16-bit and 32-bit precision, creating the folklore that 16-bit hardware compute units alone are not enough to maximize model accuracy. As a result, deep learning accelerators are forced to support both 16-bit and 32-bit floating-point units (FPUs), which is more costly than only using 16-bit FPUs for hardware design. We ask: can we train deep learning models only with 16-bit floating-point units, while still matching the model accuracy attained by 32-bit training? Towards this end, we study 16-bit-FPU training on the widely adopted BFloat16 unit. While these units conventionally use nearest rounding to cast output to 16-bit precision, we show that nearest rounding for model weight updates often cancels small updates, which degrades the convergence and model accuracy. Motivated by this, we study two simple techniques well-established in numerical analysis, stochastic rounding and Kahan summation, to remedy the model accuracy degradation in 16-bit-FPU training. We demonstrate that these two techniques can enable up to 7% absolute validation accuracy gain in 16-bit-FPU training. This leads to 0.1% lower to 0.2% higher validation accuracy compared to 32-bit training across seven deep learning applications.

Revisiting BFloat16 Training

The paper "Revisiting BFloat16 Training" explores the feasibility of training deep learning models exclusively using 16-bit floating-point units (FPUs), specifically BFloat16, while achieving accuracy close to 32-bit precision. The focus lies in addressing the traditional notion that robust model training necessitates a mix of 16-bit and 32-bit precision.

Key Insights and Methodology

  1. Current Use of Mixed Precision: Typically, mixed-precision training utilizes 16-bit precision for activations and gradients and 32-bit for model weights and optimizer states. This dual necessity requires hardware to support both 16-bit and 32-bit FPUs, which increases design complexity and cost.
  2. Evaluation of Nearest Rounding: The authors demonstrate that nearest rounding in 16-bit FPUs can cancel small weight updates, notably degrading convergence and model accuracy. This cancellation primarily manifests in the late stages of training, when updates become much smaller than the weights they modify.
  3. Numerical Techniques: To counteract these issues, two established numerical techniques are explored (both are sketched in code after this list):
    • Stochastic Rounding: Weight updates are rounded up or down with probability proportional to the discarded low-order bits, so the rounded weight is correct in expectation and small updates survive on average instead of being systematically cancelled.
    • Kahan Summation: An auxiliary compensation value is kept alongside each weight; it tracks the error lost in every low-precision addition and feeds it back into subsequent updates.
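
To make the cancellation effect and the stochastic-rounding remedy concrete, here is a minimal NumPy sketch (not the paper's code) that emulates BFloat16 by keeping only the top 16 bits of a float32. The helper names, the 1e-4 update size, and the loop lengths are illustrative choices, and the nearest-rounding helper uses a simplified round-half-up rather than IEEE round-to-nearest-even.

```python
import numpy as np

def to_bf16_nearest(x):
    """Emulate float32 -> bfloat16 -> float32 with simplified nearest rounding:
    add half of the discarded 16-bit field, then truncate to the top 16 bits."""
    bits = np.asarray(x, dtype=np.float32).view(np.uint32)
    bits = (bits + np.uint32(0x8000)) & np.uint32(0xFFFF0000)
    return bits.view(np.float32)

def to_bf16_stochastic(x, rng):
    """Emulate the same cast with stochastic rounding: add uniform noise over
    the discarded 16 bits, then truncate, so P(round up) = discarded fraction."""
    bits = np.asarray(x, dtype=np.float32).view(np.uint32)
    noise = rng.integers(0, 1 << 16, size=bits.shape, dtype=np.uint32)
    bits = (bits + noise) & np.uint32(0xFFFF0000)
    return bits.view(np.float32)

# Just below 1.0 the bfloat16 grid spacing is 2**-8, so an update of 1e-4 is
# far below half a spacing and nearest rounding discards every single update.
update = np.float32(1e-4)

w = np.array([1.0], dtype=np.float32)
for _ in range(1000):
    w = to_bf16_nearest(w - update)
print(w)   # [1.0] -- all 1000 updates were cancelled

rng = np.random.default_rng(0)
w = np.array([1.0], dtype=np.float32)
for _ in range(1000):
    w = to_bf16_stochastic(w - update, rng)
print(w)   # roughly [0.9] -- the updates are preserved in expectation
```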
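
A corresponding sketch of Kahan-compensated weight updates, again with emulated BFloat16 state (the rounding helper is repeated so the snippet stands alone; the 1e-4 update size is the same illustrative choice):

```python
import numpy as np

def to_bf16_nearest(x):
    """Same simplified float32 -> bfloat16 -> float32 cast as in the
    stochastic-rounding sketch above."""
    bits = np.asarray(x, dtype=np.float32).view(np.uint32)
    bits = (bits + np.uint32(0x8000)) & np.uint32(0xFFFF0000)
    return bits.view(np.float32)

def kahan_update(w, c, update):
    """One Kahan-compensated weight update with all state held at bfloat16
    precision: c accumulates the part of each update lost to rounding."""
    y = to_bf16_nearest(update - c)   # fold the previously lost error back in
    t = to_bf16_nearest(w + y)        # low-precision add; may drop low bits of y
    c = to_bf16_nearest(to_bf16_nearest(t - w) - y)  # what was actually lost
    return t, c

w = np.array([1.0], dtype=np.float32)
c = np.zeros_like(w)
for _ in range(1000):
    w, c = kahan_update(w, c, np.float32(-1e-4))
print(w)   # roughly [0.9]: the updates accumulate even though each one alone
           # would be cancelled by plain nearest rounding
```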

Empirical Findings

  • Validation Accuracy: With stochastic rounding or Kahan summation, validation accuracy ranged from 0.1% lower to 0.2% higher than 32-bit training across the seven evaluated models, which spanned image classification, natural language processing, and recommendation applications.
  • Plain 16-bit Training: Standard 16-bit-FPU training without these techniques showed significant accuracy gaps (up to 7% absolute validation accuracy), confirming that the weight update is the accuracy bottleneck of 16-bit training.

Theoretical Contributions

The paper presents a theoretical analysis, carried out for objectives with Lipschitz-continuous gradients, showing how nearest rounding limits convergence because small model weight updates are cancelled, and why stochastic rounding avoids this precision-dependent floor.
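
A minimal formalization of the cancellation argument, in notation of my own choosing (η for the step size, g_t for the gradient, Q_N and Q_S for nearest and stochastic rounding, ulp(w) for the BFloat16 spacing at w; the paper's own statements may differ in form):

```latex
% Nearest rounding leaves the weight unchanged once the update is sub-half-ulp:
w_{t+1} \;=\; Q_N\!\bigl(w_t - \eta\, g_t\bigr) \;=\; w_t
\quad\text{whenever}\quad
|\eta\, g_t| \;<\; \tfrac{1}{2}\,\mathrm{ulp}(w_t),
% so plain 16-bit SGD stalls at a precision-dependent accuracy floor, whereas
% stochastic rounding is unbiased and preserves the update in expectation:
\mathbb{E}\bigl[Q_S(x)\bigr] \;=\; x .
```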

Implications for Hardware Design

The paper concludes that future deep learning accelerators could employ 16-bit FPUs exclusively if stochastic rounding and Kahan summation are supported efficiently in hardware and software. This could yield significant gains in power consumption, speed, and chip area.

Speculations on Future AI Developments

Future AI development could see a shift towards 16-bit-only FP hardware, pushing the boundaries of efficiency in large-scale model training. These findings imply that AI systems, especially those training LLMs or deep convolutional networks, could achieve reduced operational costs without sacrificing accuracy.

Overall, the research insights point toward evolving the landscape of model training practices and architectural design optimizations. The use of 16-bit FPUs, bolstered by robust numerical techniques, represents a significant step forward in practical AI implementations.

Authors (4)
  1. Pedram Zamirai (1 paper)
  2. Jian Zhang (542 papers)
  3. Christopher R. Aberger (8 papers)
  4. Christopher De Sa (77 papers)
Citations (18)