Why do small language models underperform? Studying Language Model Saturation via the Softmax Bottleneck (2404.07647v1)

Published 11 Apr 2024 in cs.CL

Abstract: Recent advances in language modeling consist in pretraining highly parameterized neural networks on extremely large web-mined text corpora. Training and inference with such models can be costly in practice, which incentivizes the use of smaller counterparts. However, it has been observed that smaller models can suffer from saturation, characterized as a drop in performance at some advanced point in training followed by a plateau. In this paper, we find that such saturation can be explained by a mismatch between the hidden dimension of smaller models and the high rank of the target contextual probability distribution. This mismatch affects the performance of the linear prediction head used in such models through the well-known softmax bottleneck phenomenon. We measure the effect of the softmax bottleneck in various settings and find that models based on less than 1000 hidden dimensions tend to adopt degenerate latent representations in late pretraining, which leads to reduced evaluation performance.

Authors (3)
  1. Nathan Godey (8 papers)
  2. Éric de la Clergerie (13 papers)
  3. Benoît Sagot (60 papers)
Citations (5)

Summary

  • The paper demonstrates that small language models suffer performance saturation due to the softmax bottleneck limiting the expressiveness of their linear prediction heads.
  • Empirical evaluations show that models with fewer than 1000 hidden dimensions develop degenerate latent representations and increased anisotropy in their last layers.
  • Spectral analysis indicates that the saturation of singular values correlates with performance decline, suggesting a need for alternative scaling and optimization strategies.

Unraveling Performance Saturation in Small Language Models through a Spectral Lens

Overview of Saturation in Small LMs

Recent discussions in the NLP research community have noted a peculiar phenomenon known as "performance saturation" in small language models (LMs), specifically those trained on very large text corpora. The phenomenon is characterized by a notable decline in model performance at some advanced point in training, followed by stagnant or deteriorating evaluation metrics. The analysis in this paper links this saturation to a mismatch between the smaller models' hidden dimension and the inherently high rank of the target contextual probability distribution. It further argues that this mismatch manifests through the softmax bottleneck, a well-documented limitation on the expressiveness of LMs' linear prediction heads. Through rigorous examination, the authors demonstrate that models with fewer than 1000 hidden dimensions tend to develop degenerate latent representations as training progresses, which correlates with their diminished performance.
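
To make the bottleneck argument concrete, the standard softmax-bottleneck statement can be written compactly as below; the notation (N contexts, vocabulary V, hidden dimension d, context matrix H, head W) is illustrative and not necessarily the paper's own.

    % Standard softmax-bottleneck argument (illustrative notation).
    % The target next-token log-probabilities over N contexts form a matrix
    \[
      A \in \mathbb{R}^{N \times |V|}, \qquad A_{ij} = \log P^{*}(w_j \mid c_i).
    \]
    % A softmax LM scores tokens with logits $H W^{\top}$, where
    % $H \in \mathbb{R}^{N \times d}$ holds the context representations and
    % $W \in \mathbb{R}^{|V| \times d}$ is the linear prediction head.
    % Row-wise normalization only subtracts a rank-one term, so
    \[
      \operatorname{rank}\!\left(\log P_{\theta}\right) \le d + 1 .
    \]
    % If the true A has rank well above d, no choice of H and W reproduces it;
    % this is the mismatch the paper ties to hidden dimensions below roughly 1000.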

Investigating Saturation and Representation Degeneration

One of the key contributions of this research is a comprehensive characterization of performance saturation through empirical evaluation and the extrapolation of scaling laws. The paper analyzes the saturation trajectory in detail, showing how smaller LMs, particularly those in the Pythia model suite, degrade in performance once a certain point in training is reached. This decline is significantly correlated with an increase in the anisotropy of the models' last-layer representations, a sign of narrowing angular variability and a potential indicator of representational degeneration. Further, spectral analysis of the LMs' linear prediction heads reveals a saturation of the singular value distribution: a uniformization trend that precedes a rapid escalation towards degenerate states.
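
Both signals, last-layer anisotropy and the head's singular value spectrum, can be probed on a public checkpoint with a few lines of code. The following is a minimal sketch, assuming a Hugging Face Pythia checkpoint and two ad-hoc probe sentences; it illustrates the kind of measurement involved rather than the authors' evaluation pipeline.

    # Minimal sketch: last-layer anisotropy and LM-head singular values.
    # Model name and probe sentences are illustrative assumptions.
    import torch
    from transformers import AutoModelForCausalLM, AutoTokenizer

    model_name = "EleutherAI/pythia-160m"  # assumed small checkpoint
    tok = AutoTokenizer.from_pretrained(model_name)
    tok.pad_token = tok.eos_token
    model = AutoModelForCausalLM.from_pretrained(model_name).eval()

    texts = ["The cat sat on the mat.", "Scaling laws describe loss curves."]
    enc = tok(texts, return_tensors="pt", padding=True)
    with torch.no_grad():
        out = model(**enc, output_hidden_states=True)

    # Anisotropy proxy: mean pairwise cosine similarity of last-layer states.
    h = out.hidden_states[-1][enc["attention_mask"].bool()]   # (n_tokens, d)
    h = torch.nn.functional.normalize(h, dim=-1)
    cos = h @ h.T
    n = cos.shape[0]
    anisotropy = (cos.sum() - n) / (n * (n - 1))              # drop self-similarity
    print(f"mean cosine similarity: {anisotropy:.3f}")

    # Singular value spectrum of the linear head W (vocab_size x hidden_dim).
    W = model.get_output_embeddings().weight.detach().float()
    S = torch.linalg.svdvals(W)
    print("top-5 singular values:", S[:5].tolist())
    print("mass in top-10 directions:", (S[:10].sum() / S.sum()).item())

A rising mean cosine similarity late in training, together with a distorted singular value spectrum, are the degeneration signals the paper associates with saturation.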

The Softmax Bottleneck and the High Rank of Contextual Distributions

This work extends the conversation around the softmax bottleneck by quantitatively assessing its impact on smaller LMs and their ability to model high-rank contextual probability distributions. Experiments with rank-constrained heads attached to pre-trained models pinpoint a critical bottleneck dimension for the linear language modeling head: performance declines whenever the head's rank falls below roughly 1000, irrespective of how expressive the output representations are. This finding supports the theoretical argument that the inherent complexity of the contextual distribution often exceeds the representational capacity of smaller LMs, a challenge that becomes pronounced in the presence of a softmax bottleneck.
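
As a rough illustration of that probe, one can replace a pretrained model's output head with truncated-SVD approximations of decreasing rank and watch the language modeling loss respond. The sketch below assumes a Hugging Face Pythia checkpoint and a toy evaluation string; the model name, rank grid, and text are placeholders rather than the paper's experimental protocol.

    # Minimal sketch: rank-constrain a pretrained LM head via truncated SVD and
    # measure the loss penalty as the rank drops. Placeholder model and text.
    import torch
    from transformers import AutoModelForCausalLM, AutoTokenizer

    model_name = "EleutherAI/pythia-410m"  # assumed checkpoint (hidden dim 1024)
    tok = AutoTokenizer.from_pretrained(model_name)
    model = AutoModelForCausalLM.from_pretrained(model_name).eval()

    head = model.get_output_embeddings()
    W = head.weight.data.clone()                       # (vocab_size, hidden_dim)
    U, S, Vh = torch.linalg.svd(W.float(), full_matrices=False)

    text = "Language models trained on web text can saturate late in training."
    enc = tok(text, return_tensors="pt")

    for r in [W.shape[1], 1000, 256, 64]:
        # Keep only the top-r singular directions of the head.
        head.weight.data = (U[:, :r] * S[:r]) @ Vh[:r, :]
        with torch.no_grad():
            loss = model(**enc, labels=enc["input_ids"]).loss
        print(f"rank {r:>5}: cross-entropy {loss.item():.3f}")

    head.weight.data = W  # restore the original head

In the paper's corresponding experiment, degradation appears once the head's rank falls below roughly 1000, largely independently of the model producing the hidden states.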

Implications and Future Research Directions

The correlation between last-layer anisotropy, singular value saturation, and performance degradation opens several avenues for future research. Addressing the softmax bottleneck through alternative architectural or optimization strategies could mitigate the saturation phenomenon and improve the efficiency and efficacy of smaller LMs. The paper also prompts a reconsideration of model scaling strategies, in particular the balance between model size, depth, and hidden dimensionality needed to avoid the identified bottleneck without compromising performance.

Conclusion

In conclusion, this analysis presents a nuanced understanding of the performance saturation phenomenon in small language models, attributing it to a combination of representational degeneration and the limitations imposed by the softmax bottleneck. The findings not only highlight the challenges of training smaller LMs on large datasets but also point towards potential mitigation strategies that could refine future model development and training paradigms. By dissecting the spectral characteristics of language models and their implications for performance, the paper contributes to our theoretical and practical understanding of model scaling laws and optimization constraints in natural language processing.
