Spike No More: Stabilizing the Pre-training of Large Language Models (2312.16903v3)
Abstract: Loss spikes often occur during the pre-training of large language models (LLMs). These spikes degrade model performance and can sometimes ruin the pre-training run entirely. Because pre-training requires a vast computational budget, such spikes should be avoided. Based on the assumption that loss spikes are caused by a sudden growth of the gradient norm, we explore factors that keep the gradient norm small through an analysis of the spectral norms of the Jacobian matrices of the sub-layers. Our findings suggest that stabilizing the pre-training process requires two conditions: small sub-layers and large shortcuts. We conduct various experiments to verify our theoretical analyses empirically. The results demonstrate that methods satisfying these conditions effectively prevent loss spikes during pre-training.
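The two conditions can be pictured concretely in a residual block: the identity shortcut carries the signal largely unchanged ("large shortcut"), while the sub-layer branch is kept small at initialization so its Jacobian contributes little on top of the identity ("small sub-layers"). The sketch below is only an illustration of that idea under assumed choices; the `1/sqrt(2 * num_layers)` down-scaling, the feed-forward-only block, and the `scaled_embedding` helper are illustrative assumptions, not the paper's exact recipe.

```python
import math
import torch
import torch.nn as nn


class PreLNBlock(nn.Module):
    """Illustrative Pre-LN residual block (feed-forward only; attention omitted).

    Hypothetical sketch: the depth-dependent scaling factor is an assumed
    choice used only to demonstrate 'small sub-layers, large shortcut'.
    """

    def __init__(self, d_model: int, d_ff: int, num_layers: int):
        super().__init__()
        self.norm = nn.LayerNorm(d_model)
        self.ff = nn.Sequential(
            nn.Linear(d_model, d_ff),
            nn.GELU(),
            nn.Linear(d_ff, d_model),
        )
        # "Small sub-layers": shrink the output projection at initialization so
        # the residual branch is small relative to the identity shortcut.
        with torch.no_grad():
            self.ff[-1].weight.mul_(1.0 / math.sqrt(2 * num_layers))

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        # "Large shortcut": the identity path passes x through unchanged, so the
        # block's Jacobian stays close to the identity and gradients stay bounded.
        return x + self.ff(self.norm(x))


def scaled_embedding(token_ids: torch.Tensor, embed: nn.Embedding) -> torch.Tensor:
    # Assumed illustration: scaling embeddings by sqrt(d_model) enlarges the
    # shortcut signal entering the first block relative to the sub-layer outputs.
    return embed(token_ids) * math.sqrt(embed.embedding_dim)
```

Keeping the shortcut dominant means the spectral norm of each block's Jacobian stays near 1, which is the mechanism the abstract appeals to for preventing sudden gradient-norm growth.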