Stable and low-precision training for large-scale vision-language models (2304.13013v2)

Published 25 Apr 2023 in cs.LG and cs.CV

Abstract: We introduce new methods for 1) accelerating and 2) stabilizing training for large language-vision models. 1) For acceleration, we introduce SwitchBack, a linear layer for int8 quantized training which provides a speed-up of 13-25% while matching the performance of bfloat16 training within 0.1 percentage points for the 1B parameter CLIP ViT-Huge -- the largest int8 training to date. Our main focus is int8 as GPU support for float8 is rare, though we also analyze float8 training through simulation. While SwitchBack proves effective for float8, we show that standard techniques are also successful if the network is trained and initialized so that large feature magnitudes are discouraged, which we accomplish via layer-scale initialized with zeros. 2) For stability, we analyze loss spikes and find they consistently occur 1-8 iterations after the squared gradients become under-estimated by their AdamW second moment estimator. As a result, we recommend an AdamW-Adafactor hybrid which avoids loss spikes when training a CLIP ViT-Huge model and outperforms gradient clipping at the scales we test.


Summary

  • The paper introduces SwitchBack, a linear layer that enables int8 quantized training with a 13-25% speedup while matching bfloat16 accuracy within 0.1 percentage points.
  • The paper combines AdamW and Adafactor techniques in a hybrid optimizer that effectively mitigates loss spikes during training.
  • Experiments on CLIP models demonstrate improved stability and computational efficiency for large-scale vision-language training.

Stable and Low-Precision Training for Large-Scale Vision-Language Models

The paper introduces methods to improve the efficiency and robustness of training large vision-language models, with an emphasis on low-precision training. The primary goals are to provide computational speedups and to stabilize training against common failure modes such as loss spikes.

Key Contributions

The researchers propose two key innovations for accelerating training:

  1. SwitchBack Linear Layer for int8 Quantized Training:
    • The team introduces SwitchBack, a linear layer designed for int8 quantized training. It performs the forward and input-gradient matrix multiplications in int8 while switching back to 16-bit precision for the weight-gradient computation, yielding a 13% to 25% training speedup for large models such as CLIP ViT-Huge while matching bfloat16 accuracy within 0.1 percentage points (see the sketch after this list).
    • The design hinges on the observation that quantization noise grows with the inner dimension of a matrix multiplication. The weight-gradient matmul, whose inner dimension spans the batch and sequence, is the most noise-sensitive operation and is therefore kept in higher precision, while the remaining matmuls run in int8.
  2. Hybrid AdamW-Adafactor Optimizer for Training Stability:
    • The paper analyzes loss spikes during training and finds that they occur one to eight iterations after the squared gradients become under-estimated by the AdamW second-moment estimator. To counter this, the authors recommend a hybrid optimizer that combines AdamW with Adafactor's update-clipping mechanism. This hybrid mitigates loss spikes and outperforms gradient clipping at the scales tested.
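
The snippet below is a minimal PyTorch sketch of the SwitchBack idea, not the authors' released kernels: the forward and input-gradient matmuls use simulated int8 quantization, while the weight-gradient matmul switches back to full precision. The function and class names (`int8_quantize`, `SwitchBackLinearFn`, `SwitchBackLinear`) and the per-row scaling choice are illustrative assumptions; fake-quantization is used here only to reproduce the numerics.

```python
# Sketch of a SwitchBack-style linear layer: int8 for the forward and
# input-gradient matmuls, full precision for the weight gradient.
import torch


def int8_quantize(t, dim=-1):
    """Symmetric int8 fake-quantization along `dim`; returns a dequantized tensor."""
    scale = t.abs().amax(dim=dim, keepdim=True).clamp(min=1e-8) / 127.0
    return (t / scale).round().clamp(-127, 127) * scale


class SwitchBackLinearFn(torch.autograd.Function):
    @staticmethod
    def forward(ctx, x, w):
        ctx.save_for_backward(x, w)
        # Forward matmul with int8-quantized operands.
        return int8_quantize(x) @ int8_quantize(w).t()

    @staticmethod
    def backward(ctx, grad_out):
        x, w = ctx.saved_tensors
        # Input gradient: inner dimension is the output width, so int8 is still safe.
        grad_x = int8_quantize(grad_out) @ int8_quantize(w)
        # Weight gradient: inner dimension is batch x sequence, so "switch back"
        # to full precision to avoid amplified quantization noise.
        grad_w = grad_out.t() @ x
        return grad_x, grad_w


class SwitchBackLinear(torch.nn.Linear):
    def forward(self, x):
        out = SwitchBackLinearFn.apply(x, self.weight)
        return out + self.bias if self.bias is not None else out


if __name__ == "__main__":
    layer = SwitchBackLinear(512, 1024)
    x = torch.randn(64, 512, requires_grad=True)
    layer(x).sum().backward()  # exercises both the int8 and full-precision paths
    print(x.grad.shape, layer.weight.grad.shape)
```

The reported 13-25% end-to-end speedup comes from fused int8 matmul kernels in the large linear layers of the transformer; the fake-quantization above only reproduces the numerics, not the speed.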

Experimental Setup and Results

The paper presents comprehensive experiments demonstrating the effectiveness of the proposed methods. Training was conducted with CLIP-style models on large-scale image-text data under realistic compute constraints. Notable findings include:

  • Int8 Training: In extensive tests, including comparisons against alternatives such as LLM.int8(), SwitchBack preserved accuracy while significantly reducing computational cost, with the largest speed advantages at larger model dimensions and batch sizes.
  • Stability through Update Clipping: By varying batch sizes, learning rates, and model dimensions across several scales of CLIP models, the paper shows that clipping the update whenever the squared gradients exceed what the second-moment estimator predicts prevents loss spikes and yields smoother convergence (a minimal sketch of this clipping rule follows this list).
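
Below is a minimal sketch of the AdamW-Adafactor hybrid described above, assuming the clipping rule scales down the per-tensor learning rate by the RMS of g²/v̂ whenever that ratio exceeds a threshold. The class name `StableAdamWSketch`, the default hyperparameters, and the exact RMS formulation are assumptions for illustration, not the authors' reference implementation.

```python
# Sketch: AdamW with Adafactor-style update clipping.
import torch


class StableAdamWSketch(torch.optim.Optimizer):
    """AdamW plus Adafactor-style update clipping (illustrative sketch)."""

    def __init__(self, params, lr=1e-3, betas=(0.9, 0.999), eps=1e-8,
                 weight_decay=0.01, clip_threshold=1.0):
        defaults = dict(lr=lr, betas=betas, eps=eps,
                        weight_decay=weight_decay, clip_threshold=clip_threshold)
        super().__init__(params, defaults)

    @torch.no_grad()
    def step(self, closure=None):
        loss = None
        if closure is not None:
            with torch.enable_grad():
                loss = closure()
        for group in self.param_groups:
            beta1, beta2 = group["betas"]
            for p in group["params"]:
                if p.grad is None:
                    continue
                g = p.grad
                state = self.state[p]
                if len(state) == 0:
                    state["step"] = 0
                    state["m"] = torch.zeros_like(p)
                    state["v"] = torch.zeros_like(p)
                state["step"] += 1
                t, m, v = state["step"], state["m"], state["v"]

                # Standard Adam moment updates with bias correction.
                m.mul_(beta1).add_(g, alpha=1 - beta1)
                v.mul_(beta2).addcmul_(g, g, value=1 - beta2)
                m_hat = m / (1 - beta1 ** t)
                v_hat = v / (1 - beta2 ** t)

                # Adafactor-style update clipping: when the squared gradients are
                # larger, on average, than the second-moment estimate predicts
                # (the regime in which the paper observes loss spikes), shrink
                # the learning rate for this tensor at this step.
                rms = (g.pow(2) / v_hat.clamp(min=group["eps"] ** 2)).mean().sqrt()
                lr = group["lr"] / max(1.0, rms.item() / group["clip_threshold"])

                # Decoupled weight decay (AdamW), then the Adam step.
                p.mul_(1 - lr * group["weight_decay"])
                p.addcdiv_(m_hat, v_hat.sqrt().add_(group["eps"]), value=-lr)
        return loss
```

Unlike global gradient-norm clipping, this rule adapts per tensor and per step, which is what lets it suppress the spikes that follow an under-estimated second moment.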

Theoretical Insights and Future Implications

The insights from this research highlight important theoretical aspects of low-precision, large-scale training. In particular, controlling quantization noise and adapting the optimizer's step size when the gradient signal shifts offer robust solutions to challenges that commonly arise as model sizes grow.

Looking ahead, these results suggest pathways to further refine training protocols and improve hardware utilization in scalable AI systems. By equipping researchers with tools like SwitchBack and spike-resistant optimizers, the paper extends what is achievable with modern vision-language models, with potential applications to other architectures.

In summary, the paper makes substantial contributions to efficient model training, providing both theoretical analysis and practical innovations that address pressing computational challenges.