Stable and low-precision training for large-scale vision-language models (2304.13013v2)
Abstract: We introduce new methods for 1) accelerating and 2) stabilizing training for large vision-language models. 1) For acceleration, we introduce SwitchBack, a linear layer for int8 quantized training, which provides a speed-up of 13-25% while matching the performance of bfloat16 training within 0.1 percentage points for the 1B-parameter CLIP ViT-Huge -- the largest int8 training to date. Our main focus is int8, as GPU support for float8 is rare, though we also analyze float8 training through simulation. While SwitchBack proves effective for float8, we show that standard techniques are also successful if the network is trained and initialized so that large feature magnitudes are discouraged, which we accomplish via layer-scale initialized with zeros. 2) For stability, we analyze loss spikes and find that they consistently occur 1-8 iterations after the squared gradients become under-estimated by their AdamW second-moment estimator. As a result, we recommend an AdamW-Adafactor hybrid that avoids loss spikes when training a CLIP ViT-Huge model and outperforms gradient clipping at the scales we test.
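To make the acceleration idea concrete: a SwitchBack-style linear layer runs the forward and input-gradient matmuls in int8 but "switches back" to higher precision for the weight gradient, the matmul most sensitive to quantization error. The sketch below simulates int8 in float for clarity; the quantization granularity shown (per-row inputs and gradients, per-channel weights) and all names are illustrative assumptions, not the paper's exact kernels, which would run the int8 matmuls on tensor cores (e.g., via Triton) to realize the reported speed-up.

```python
import torch

def quantize_int8(x, dim):
    """Simulated symmetric int8 quantization along `dim`; returns int8 values and scales."""
    scale = x.abs().amax(dim=dim, keepdim=True).clamp(min=1e-8) / 127.0
    q = torch.clamp(torch.round(x / scale), -127, 127).to(torch.int8)
    return q, scale

class SwitchBackLinear(torch.autograd.Function):
    """Forward and input-gradient matmuls in (simulated) int8; the weight
    gradient is computed in full precision, since it is the most
    quantization-sensitive of the three matmuls."""

    @staticmethod
    def forward(ctx, x, w):
        ctx.save_for_backward(x, w)
        xq, sx = quantize_int8(x, dim=-1)   # per-row input scales
        wq, sw = quantize_int8(w, dim=-1)   # per-output-channel weight scales
        # a real kernel would run this matmul on int8 tensor cores
        out = (xq.float() @ wq.float().t()) * (sx * sw.t())
        return out.to(x.dtype)

    @staticmethod
    def backward(ctx, grad_out):
        x, w = ctx.saved_tensors
        # gradient w.r.t. input: still quantized to int8
        gq, sg = quantize_int8(grad_out, dim=-1)
        wq, sw = quantize_int8(w, dim=0)    # per-column scales for grad_out @ w
        grad_x = ((gq.float() @ wq.float()) * (sg * sw)).to(x.dtype)
        # gradient w.r.t. weight: "switch back" to full precision
        grad_w = (grad_out.float().t() @ x.float()).to(w.dtype)
        return grad_x, grad_w
```

Usage: `y = SwitchBackLinear.apply(x, weight)` for `x` of shape (batch, in_features) and `weight` of shape (out_features, in_features).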
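The "layer-scale initialized with zeros" technique mentioned for float8 is a learnable per-channel scale on each residual branch, set to zero at initialization so that every block starts as the identity and feature magnitudes stay small early in training. A minimal sketch follows; the residual-branch placement is the standard layer-scale wiring (as in CaiT), not code from the paper.

```python
import torch
import torch.nn as nn

class LayerScale(nn.Module):
    """Learnable per-channel scale on a residual branch. Zero initialization
    makes each block start as the identity, discouraging large feature
    magnitudes until training explicitly grows them."""

    def __init__(self, dim: int):
        super().__init__()
        self.gamma = nn.Parameter(torch.zeros(dim))

    def forward(self, x):
        return self.gamma * x

# typical placement in a transformer block:
#   x = x + layer_scale_attn(attn(norm1(x)))
#   x = x + layer_scale_mlp(mlp(norm2(x)))
```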
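For the stability recommendation, one concrete reading of the AdamW-Adafactor hybrid is AdamW augmented with Adafactor-style per-tensor update clipping: whenever the gradient outruns its second-moment estimate (the condition the abstract ties to loss spikes), the step size is shrunk. The single-tensor step below is our sketch of that idea; the function name, exact clipping statistic, and hyperparameters are illustrative assumptions.

```python
import torch

def adamw_adafactor_step(p, grad, state, lr=1e-3, betas=(0.9, 0.999),
                         eps=1e-8, weight_decay=0.01, clip_threshold=1.0):
    """One AdamW step with Adafactor-style update clipping (illustrative sketch).

    The learning rate is scaled down whenever the per-tensor RMS of
    grad / sqrt(v_hat) exceeds clip_threshold, damping exactly the steps
    where the second moment under-estimates the squared gradients.
    """
    state['step'] = state.get('step', 0) + 1
    m = state.setdefault('m', torch.zeros_like(p))
    v = state.setdefault('v', torch.zeros_like(p))
    b1, b2 = betas
    m.mul_(b1).add_(grad, alpha=1 - b1)
    v.mul_(b2).addcmul_(grad, grad, value=1 - b2)
    m_hat = m / (1 - b1 ** state['step'])      # bias correction
    v_hat = v / (1 - b2 ** state['step'])
    # Adafactor-style clipping: RMS of the gradient relative to sqrt(v_hat)
    rms = (grad.pow(2) / v_hat.clamp(min=eps ** 2)).mean().sqrt().item()
    scaled_lr = lr / max(1.0, rms / clip_threshold)
    p.mul_(1 - scaled_lr * weight_decay)       # decoupled weight decay
    p.addcdiv_(m_hat, v_hat.sqrt().add_(eps), value=-scaled_lr)
```

Call under `torch.no_grad()` once per parameter tensor per iteration, keeping a separate `state` dict for each tensor.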