Why do small language models underperform? Studying Language Model Saturation via the Softmax Bottleneck (2404.07647v1)
Abstract: Recent advances in language modeling consist of pretraining highly parameterized neural networks on extremely large web-mined text corpora. Training and inference with such models can be costly in practice, which incentivizes the use of smaller counterparts. However, it has been observed that smaller models can suffer from saturation, characterized as a drop in performance at some advanced point in training followed by a plateau. In this paper, we find that such saturation can be explained by a mismatch between the hidden dimension of smaller models and the high rank of the target contextual probability distribution. This mismatch affects the performance of the linear prediction head used in such models through the well-known softmax bottleneck phenomenon. We measure the effect of the softmax bottleneck in various settings and find that models with a hidden dimension smaller than 1000 tend to adopt degenerate latent representations in late pretraining, which leads to reduced evaluation performance.
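To make the bottleneck concrete, below is a minimal sketch (not taken from the paper) of how one might probe the spectrum of a small model's linear prediction head. The logits are computed as a product of hidden states with the output embedding matrix W of shape (vocab_size, hidden_dim), so their rank is bounded by the hidden dimension. The checkpoint name and the spectral-entropy effective-rank proxy are illustrative assumptions, not the paper's exact protocol.

```python
# Minimal sketch: inspect the singular value spectrum of a small causal LM's
# output (unembedding) matrix W in R^{V x d}. A sharply decaying spectrum with a
# near-zero tail suggests a degenerate, effectively low-rank prediction head.
# The model choice (EleutherAI/pythia-70m, d = 512) is an illustrative assumption.
import torch
from transformers import AutoModelForCausalLM

model = AutoModelForCausalLM.from_pretrained("EleutherAI/pythia-70m")
W = model.get_output_embeddings().weight.detach().float()  # (vocab_size, hidden_dim)

# Singular values of the LM head, normalized to a distribution.
s = torch.linalg.svdvals(W)
s_norm = s / s.sum()

# Effective rank via spectral entropy (one common, threshold-free proxy).
entropy = -(s_norm * torch.log(s_norm + 1e-12)).sum()
effective_rank = torch.exp(entropy).item()
print(f"hidden dim: {W.shape[1]}, effective rank of LM head: {effective_rank:.1f}")
```

Comparing the effective rank against the hidden dimension across checkpoints of different widths is one simple way to see whether narrower heads drift toward degenerate spectra during late pretraining.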
- Nathan Godey
- Éric de la Clergerie
- Benoît Sagot