Stable and low-precision training for large-scale vision-language models (2304.13013v2)

Published 25 Apr 2023 in cs.LG and cs.CV

Abstract: We introduce new methods for 1) accelerating and 2) stabilizing training for large language-vision models. 1) For acceleration, we introduce SwitchBack, a linear layer for int8 quantized training which provides a speed-up of 13-25% while matching the performance of bfloat16 training within 0.1 percentage points for the 1B parameter CLIP ViT-Huge -- the largest int8 training to date. Our main focus is int8 as GPU support for float8 is rare, though we also analyze float8 training through simulation. While SwitchBack proves effective for float8, we show that standard techniques are also successful if the network is trained and initialized so that large feature magnitudes are discouraged, which we accomplish via layer-scale initialized with zeros. 2) For stability, we analyze loss spikes and find they consistently occur 1-8 iterations after the squared gradients become under-estimated by their AdamW second moment estimator. As a result, we recommend an AdamW-Adafactor hybrid which avoids loss spikes when training a CLIP ViT-Huge model and outperforms gradient clipping at the scales we test.


Summary

  • The paper introduces SwitchBack, a linear layer that enables int8 quantized training with a 13-25% speedup while matching bfloat16 accuracy within 0.1 percentage points.
  • The paper combines AdamW and Adafactor techniques in a hybrid optimizer that effectively mitigates loss spikes during training.
  • Experiments on CLIP models demonstrate improved stability and computational efficiency for large-scale vision-language training.

Stable and Low-Precision Training for Large-Scale Vision-Language Models

The paper introduces methods to improve the efficiency and robustness of training large vision-language models, with an emphasis on low-precision training. The primary goals are to provide computational speedups and to stabilize training against common failure modes such as loss spikes.

Key Contributions

The researchers propose two key innovations for accelerating training:

  1. SwitchBack Linear Layer for int8 Quantized Training:
    • The team introduces SwitchBack, a linear layer designed for int8 quantized training. It performs the forward and input-gradient matrix multiplications in int8 while switching back to 16-bit precision for the weight-gradient computation, yielding a 13% to 25% training speedup for large models such as CLIP ViT-Huge while matching bfloat16 accuracy within 0.1 percentage points (see the sketch after this list).
    • The design hinges on the observation that quantization noise grows with the inner dimension of a matrix multiplication. The weight-gradient matmul, whose inner dimension spans the batch and sequence, is the most noise-sensitive operation and is therefore kept in higher precision, while the remaining matmuls run in int8.
  2. Hybrid AdamW-Adafactor Optimizer for Training Stability:
    • The paper analyzes loss spikes during training and finds that they occur one to eight iterations after the squared gradients become under-estimated by the AdamW second-moment estimator. To counter this, the authors recommend a hybrid optimizer that combines AdamW with Adafactor's update-clipping mechanism. This hybrid mitigates loss spikes and outperforms gradient clipping at the scales tested.
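
The snippet below is a minimal PyTorch sketch of the SwitchBack idea, not the authors' released kernels: the forward and input-gradient matmuls use simulated int8 quantization, while the weight-gradient matmul switches back to full precision. The function and class names (`int8_quantize`, `SwitchBackLinearFn`, `SwitchBackLinear`) and the per-row scaling choice are illustrative assumptions; fake-quantization is used here only to reproduce the numerics.

```python
# Sketch of a SwitchBack-style linear layer: int8 for the forward and
# input-gradient matmuls, full precision for the weight gradient.
import torch


def int8_quantize(t, dim=-1):
    """Symmetric int8 fake-quantization along `dim`; returns a dequantized tensor."""
    scale = t.abs().amax(dim=dim, keepdim=True).clamp(min=1e-8) / 127.0
    return (t / scale).round().clamp(-127, 127) * scale


class SwitchBackLinearFn(torch.autograd.Function):
    @staticmethod
    def forward(ctx, x, w):
        ctx.save_for_backward(x, w)
        # Forward matmul with int8-quantized operands.
        return int8_quantize(x) @ int8_quantize(w).t()

    @staticmethod
    def backward(ctx, grad_out):
        x, w = ctx.saved_tensors
        # Input gradient: inner dimension is the output width, so int8 is still safe.
        grad_x = int8_quantize(grad_out) @ int8_quantize(w)
        # Weight gradient: inner dimension is batch x sequence, so "switch back"
        # to full precision to avoid amplified quantization noise.
        grad_w = grad_out.t() @ x
        return grad_x, grad_w


class SwitchBackLinear(torch.nn.Linear):
    def forward(self, x):
        out = SwitchBackLinearFn.apply(x, self.weight)
        return out + self.bias if self.bias is not None else out


if __name__ == "__main__":
    layer = SwitchBackLinear(512, 1024)
    x = torch.randn(64, 512, requires_grad=True)
    layer(x).sum().backward()  # exercises both the int8 and full-precision paths
    print(x.grad.shape, layer.weight.grad.shape)
```

The reported 13-25% end-to-end speedup comes from fused int8 matmul kernels in the large linear layers of the transformer; the fake-quantization above only reproduces the numerics, not the speed.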

Experimental Setup and Results

The paper presents comprehensive experiments demonstrating the effectiveness of the proposed methods. Training was conducted with CLIP-style models on large-scale image-text data under realistic compute constraints. Notable findings include:

  • Int8 Training: In extensive tests, including comparisons against alternatives such as LLM.int8(), SwitchBack preserved accuracy while significantly reducing computational cost, with the largest speed advantages at larger model dimensions and batch sizes.
  • Stability through Update Clipping: By varying batch sizes, learning rates, and model dimensions across several scales of CLIP models, the paper shows that clipping the update whenever the squared gradients exceed what the second-moment estimator predicts prevents loss spikes and yields smoother convergence (a minimal sketch of this clipping rule follows this list).
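
Below is a minimal sketch of the AdamW-Adafactor hybrid described above, assuming the clipping rule scales down the per-tensor learning rate by the RMS of g²/v̂ whenever that ratio exceeds a threshold. The class name `StableAdamWSketch`, the default hyperparameters, and the exact RMS formulation are assumptions for illustration, not the authors' reference implementation.

```python
# Sketch: AdamW with Adafactor-style update clipping.
import torch


class StableAdamWSketch(torch.optim.Optimizer):
    """AdamW plus Adafactor-style update clipping (illustrative sketch)."""

    def __init__(self, params, lr=1e-3, betas=(0.9, 0.999), eps=1e-8,
                 weight_decay=0.01, clip_threshold=1.0):
        defaults = dict(lr=lr, betas=betas, eps=eps,
                        weight_decay=weight_decay, clip_threshold=clip_threshold)
        super().__init__(params, defaults)

    @torch.no_grad()
    def step(self, closure=None):
        loss = None
        if closure is not None:
            with torch.enable_grad():
                loss = closure()
        for group in self.param_groups:
            beta1, beta2 = group["betas"]
            for p in group["params"]:
                if p.grad is None:
                    continue
                g = p.grad
                state = self.state[p]
                if len(state) == 0:
                    state["step"] = 0
                    state["m"] = torch.zeros_like(p)
                    state["v"] = torch.zeros_like(p)
                state["step"] += 1
                t, m, v = state["step"], state["m"], state["v"]

                # Standard Adam moment updates with bias correction.
                m.mul_(beta1).add_(g, alpha=1 - beta1)
                v.mul_(beta2).addcmul_(g, g, value=1 - beta2)
                m_hat = m / (1 - beta1 ** t)
                v_hat = v / (1 - beta2 ** t)

                # Adafactor-style update clipping: when the squared gradients are
                # larger, on average, than the second-moment estimate predicts
                # (the regime in which the paper observes loss spikes), shrink
                # the learning rate for this tensor at this step.
                rms = (g.pow(2) / v_hat.clamp(min=group["eps"] ** 2)).mean().sqrt()
                lr = group["lr"] / max(1.0, rms.item() / group["clip_threshold"])

                # Decoupled weight decay (AdamW), then the Adam step.
                p.mul_(1 - lr * group["weight_decay"])
                p.addcdiv_(m_hat, v_hat.sqrt().add_(group["eps"]), value=-lr)
        return loss
```

Unlike global gradient-norm clipping, this rule adapts per tensor and per step, which is what lets it suppress the spikes that follow an under-estimated second moment.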

Theoretical Insights and Future Implications

The insights from this research highlight important theoretical aspects of low-precision, large-scale training. In particular, controlling quantization noise and adapting the optimizer's step size when the gradient signal shifts offer robust solutions to challenges that commonly arise as model sizes grow.

Looking ahead, these results suggest pathways to further refine training protocols and improve hardware utilization in scalable AI systems. By equipping researchers with tools like SwitchBack and spike-resistant optimizers, the paper extends what is achievable with modern vision-language models, with potential applications to other architectures.

In summary, the paper makes substantial contributions to efficient model training, providing both theoretical analysis and practical innovations that address pressing computational challenges.