Why do small language models underperform? Studying Language Model Saturation via the Softmax Bottleneck (2404.07647v1)

Published 11 Apr 2024 in cs.CL

Abstract: Recent advances in language modeling consist in pretraining highly parameterized neural networks on extremely large web-mined text corpora. Training and inference with such models can be costly in practice, which incentivizes the use of smaller counterparts. However, it has been observed that smaller models can suffer from saturation, characterized as a drop in performance at some advanced point in training followed by a plateau. In this paper, we find that such saturation can be explained by a mismatch between the hidden dimension of smaller models and the high rank of the target contextual probability distribution. This mismatch affects the performance of the linear prediction head used in such models through the well-known softmax bottleneck phenomenon. We measure the effect of the softmax bottleneck in various settings and find that models based on less than 1000 hidden dimensions tend to adopt degenerate latent representations in late pretraining, which leads to reduced evaluation performance.

Authors (3)
  1. Nathan Godey (8 papers)
  2. Éric de la Clergerie (13 papers)
  3. Benoît Sagot (60 papers)
Citations (5)

Summary

  • The paper demonstrates that small language models suffer performance saturation due to the softmax bottleneck limiting the expressiveness of their linear prediction heads.
  • Empirical evaluations show that models with fewer than 1000 hidden dimensions develop degenerate latent representations and increased anisotropy in their last layers.
  • Spectral analysis indicates that the saturation of singular values correlates with performance decline, suggesting a need for alternative scaling and optimization strategies.

Unraveling Performance Saturation in Small Language Models through a Spectral Lens

Overview of Saturation in Small LMs

Recent discussions in the NLP research community have noted a peculiar phenomenon known as "performance saturation" in small language models (LMs), specifically those trained on very large text corpora. The phenomenon is characterized by a notable decline in model performance at some advanced point in training, followed by stagnant or deteriorating evaluation metrics. The analysis in this paper links this saturation to a mismatch between the smaller models' hidden dimension and the inherently high rank of the target contextual probability distribution. It further argues that this mismatch manifests through the softmax bottleneck, a well-documented limitation on the expressiveness of LMs' linear prediction heads. Through rigorous examination, the authors demonstrate that models with fewer than 1000 hidden dimensions tend to develop degenerate latent representations as training progresses, which correlates with their diminished performance.
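
To make the bottleneck argument concrete, the standard softmax-bottleneck statement can be written compactly as below; the notation (N contexts, vocabulary V, hidden dimension d, context matrix H, head W) is illustrative and not necessarily the paper's own.

    % Standard softmax-bottleneck argument (illustrative notation).
    % The target next-token log-probabilities over N contexts form a matrix
    \[
      A \in \mathbb{R}^{N \times |V|}, \qquad A_{ij} = \log P^{*}(w_j \mid c_i).
    \]
    % A softmax LM scores tokens with logits $H W^{\top}$, where
    % $H \in \mathbb{R}^{N \times d}$ holds the context representations and
    % $W \in \mathbb{R}^{|V| \times d}$ is the linear prediction head.
    % Row-wise normalization only subtracts a rank-one term, so
    \[
      \operatorname{rank}\!\left(\log P_{\theta}\right) \le d + 1 .
    \]
    % If the true A has rank well above d, no choice of H and W reproduces it;
    % this is the mismatch the paper ties to hidden dimensions below roughly 1000.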

Investigating Saturation and Representation Degeneration

One of the key contributions of this research is a comprehensive characterization of performance saturation through empirical evaluation and the extrapolation of scaling laws. The paper analyzes the saturation trajectory in detail, showing how smaller LMs, particularly those in the Pythia model suite, degrade in performance once a certain point in training is reached. This decline is significantly correlated with an increase in the anisotropy of the models' last-layer representations, a sign of narrowing angular variability and a potential indicator of representational degeneration. Further, spectral analysis of the LMs' linear prediction heads reveals a saturation of the singular value distribution: a uniformization trend that precedes a rapid escalation towards degenerate states.
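
Both signals, last-layer anisotropy and the head's singular value spectrum, can be probed on a public checkpoint with a few lines of code. The following is a minimal sketch, assuming a Hugging Face Pythia checkpoint and two ad-hoc probe sentences; it illustrates the kind of measurement involved rather than the authors' evaluation pipeline.

    # Minimal sketch: last-layer anisotropy and LM-head singular values.
    # Model name and probe sentences are illustrative assumptions.
    import torch
    from transformers import AutoModelForCausalLM, AutoTokenizer

    model_name = "EleutherAI/pythia-160m"  # assumed small checkpoint
    tok = AutoTokenizer.from_pretrained(model_name)
    tok.pad_token = tok.eos_token
    model = AutoModelForCausalLM.from_pretrained(model_name).eval()

    texts = ["The cat sat on the mat.", "Scaling laws describe loss curves."]
    enc = tok(texts, return_tensors="pt", padding=True)
    with torch.no_grad():
        out = model(**enc, output_hidden_states=True)

    # Anisotropy proxy: mean pairwise cosine similarity of last-layer states.
    h = out.hidden_states[-1][enc["attention_mask"].bool()]   # (n_tokens, d)
    h = torch.nn.functional.normalize(h, dim=-1)
    cos = h @ h.T
    n = cos.shape[0]
    anisotropy = (cos.sum() - n) / (n * (n - 1))              # drop self-similarity
    print(f"mean cosine similarity: {anisotropy:.3f}")

    # Singular value spectrum of the linear head W (vocab_size x hidden_dim).
    W = model.get_output_embeddings().weight.detach().float()
    S = torch.linalg.svdvals(W)
    print("top-5 singular values:", S[:5].tolist())
    print("mass in top-10 directions:", (S[:10].sum() / S.sum()).item())

A rising mean cosine similarity late in training, together with a distorted singular value spectrum, are the degeneration signals the paper associates with saturation.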

The Softmax Bottleneck and the High Rank of Contextual Distributions

This work extends the conversation around the softmax bottleneck by quantitatively assessing its impact on smaller LMs and their ability to model high-rank contextual probability distributions. Experiments with rank-constrained heads attached to pre-trained models pinpoint a critical bottleneck dimension for the linear language modeling head: performance declines whenever the head's rank falls below roughly 1000, irrespective of how expressive the output representations are. This finding supports the theoretical argument that the inherent complexity of the contextual distribution often exceeds the representational capacity of smaller LMs, a challenge that becomes pronounced in the presence of a softmax bottleneck.
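
As a rough illustration of that probe, one can replace a pretrained model's output head with truncated-SVD approximations of decreasing rank and watch the language modeling loss respond. The sketch below assumes a Hugging Face Pythia checkpoint and a toy evaluation string; the model name, rank grid, and text are placeholders rather than the paper's experimental protocol.

    # Minimal sketch: rank-constrain a pretrained LM head via truncated SVD and
    # measure the loss penalty as the rank drops. Placeholder model and text.
    import torch
    from transformers import AutoModelForCausalLM, AutoTokenizer

    model_name = "EleutherAI/pythia-410m"  # assumed checkpoint (hidden dim 1024)
    tok = AutoTokenizer.from_pretrained(model_name)
    model = AutoModelForCausalLM.from_pretrained(model_name).eval()

    head = model.get_output_embeddings()
    W = head.weight.data.clone()                       # (vocab_size, hidden_dim)
    U, S, Vh = torch.linalg.svd(W.float(), full_matrices=False)

    text = "Language models trained on web text can saturate late in training."
    enc = tok(text, return_tensors="pt")

    for r in [W.shape[1], 1000, 256, 64]:
        # Keep only the top-r singular directions of the head.
        head.weight.data = (U[:, :r] * S[:r]) @ Vh[:r, :]
        with torch.no_grad():
            loss = model(**enc, labels=enc["input_ids"]).loss
        print(f"rank {r:>5}: cross-entropy {loss.item():.3f}")

    head.weight.data = W  # restore the original head

In the paper's corresponding experiment, degradation appears once the head's rank falls below roughly 1000, largely independently of the model producing the hidden states.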

Implications and Future Research Directions

The correlation between last-layer anisotropy, singular value saturation, and performance degradation opens several avenues for future research. Addressing the softmax bottleneck through alternative architectural or optimization strategies could mitigate the saturation phenomenon and improve the efficiency and efficacy of smaller LMs. The paper also prompts a reconsideration of model scaling strategies, in particular the balance between model size, depth, and hidden dimensionality needed to avoid the identified bottleneck without compromising performance.

Conclusion

In conclusion, this analysis presents a nuanced understanding of the performance saturation phenomenon in small language models, attributing it to a combination of representational degeneration and the limitations imposed by the softmax bottleneck. The findings not only highlight the challenges of training smaller LMs on large datasets but also point towards potential mitigation strategies that could refine future model development and training paradigms. By dissecting the spectral characteristics of language models and their implications for performance, the paper contributes to our theoretical and practical understanding of model scaling laws and optimization constraints in natural language processing.
