- The paper demonstrates that key neural collapse properties emerge with model scaling, leading to reduced within-class variability and improved geometric uniformity.
- The methodology employs Transformer-based CLMs with varied widths, depths, and weight decay on the TinyStories dataset, using metrics like CDNV and cosine similarity.
- The study finds significant correlations between neural collapse metrics and generalization, highlighting their potential role in enhancing model performance independent of scaling.
This paper, "Linguistic Collapse: Neural Collapse in (Large) Language Models" (28 May 2024), investigates the phenomenon of Neural Collapse (NC) in causal language models (CLMs). Neural Collapse is a set of behaviors observed in deep neural networks trained for classification, emerging during the terminal phase of training towards zero loss on balanced, noise-free data where the number of classes does not significantly exceed the embedding dimension. The key properties of NC traditionally include:
- (NC1) Within-class variability collapse: Top-layer representations for inputs from the same class converge to their class mean.
- (NC2) Convergence to a simplex ETF: The class means, when centered, tend towards an equinorm and equiangular configuration (Simplex Equiangular Tight Frame).
- (NC3) Convergence to self-duality: Top-layer classifiers align with their corresponding class means.
- (NC4) Nearest decision rule: The standard linear classifier becomes equivalent to a Nearest-Class Center (NCC) classifier.
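For intuition, the ideal NC2 geometry can be built in closed form (this is the standard construction from the NC literature, not code from the paper): for C classes, the columns of sqrt(C/(C-1)) * (I - 11ᵀ/C) have equal norm and pairwise cosine -1/(C-1). A minimal NumPy sketch:

```python
import numpy as np

C = 5  # number of classes; a perfect simplex ETF needs C <= d + 1
# Canonical simplex ETF: the C columns of M are the class-mean directions.
M = np.sqrt(C / (C - 1)) * (np.eye(C) - np.ones((C, C)) / C)

norms = np.linalg.norm(M, axis=0)
G = (M / norms).T @ (M / norms)  # pairwise cosine similarities

assert np.allclose(norms, norms[0])           # equinorm (NC2)
off_diag = G[~np.eye(C, dtype=bool)]
assert np.allclose(off_diag, -1 / (C - 1))    # equiangular (NC2)
```

The assertions verify the two halves of NC2: all class means share one norm, and every pair meets at the same (maximally separated) angle. When C ≫ d + 1, as in language modeling, no such configuration exists, which is why the paper turns to generalized uniformity measures instead.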
The authors note that training language models via next-token prediction is essentially a classification task over the vocabulary. However, the conditions under which LMs are typically trained starkly contrast with those traditionally favoring NC:
- Many classes (¬C1): The vocabulary size (C, tens of thousands) is much larger than the embedding dimension (d). A perfect simplex ETF requires C≤d+1.
- Imbalanced classes (¬C2): Token distributions in natural language are highly imbalanced.
- Ambiguous contexts (¬C3): Similar contexts can lead to different valid next tokens (e.g., "Once upon a time" followed by "," or " in").
- Undertraining (¬C4): LMs, especially large ones, are often not trained to full convergence or past zero error/loss in practice.
Given these conflicting conditions, the paper empirically investigates whether NC properties emerge in CLMs despite these challenges and how they relate to model scaling and generalization.
Empirical Investigation and Methodology
The paper trains a suite of Transformer-based CLMs (similar to GPT-Neo) on the TinyStories dataset (TinyStories: How Small Can Language Models Be and Still Speak Coherent English?, 2023). They vary model width (d∈{64,128,…,1024}), depth (L∈{1,2,…,12}), and training epochs (1, 3, 10). They also experiment with different weight decay factors.
For each trained model, the authors collect top-layer context embeddings for validation data and the model's linear classifiers (the final output layer weights). They then compute metrics to quantify the degree of NC based on these embeddings and classifiers, adapting some metrics for the LM context:
- NC1 (Within-Class Variability): Measured using the average Class-Distance Normalized Variance (CDNV) across token classes. Lower CDNV indicates less within-class variability relative to between-class distance.
- GNC2 (Geometric Structure): Beyond traditional Equinormness (CoV of mean norms) and Equiangularity (CoV of pairwise interference), they specifically measure Hyperspherical Uniformity (GNC2) using variation in pairwise logarithmic distances between normalized class means. This is motivated by prior work on generalized NC when C>d+1.
- UNC3 (Duality): Instead of just measuring the difference between normalized class means and classifiers (NC3, self-duality), they calculate the cosine similarity between each normalized class mean and its corresponding classifier vector. They introduce Uniform Duality (UNC3) as the minimization of the Coefficient of Variation (CoV) of these similarities, indicating a more consistent alignment across classes.
- NC4 (Classifier Agreement): Calculated as the proportion of validation samples where the linear classifier's prediction matches that of an implicit Nearest-Class Center (NCC) classifier based on the learned class means.
Generalization is measured by validation loss (next-token prediction cross-entropy).
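The four metrics above can be sketched on raw arrays as follows. This is an illustrative NumPy implementation, not the paper's code: the function name and shapes are my own, and it makes simplifications such as ignoring classifier biases and restricting to classes present in the data.

```python
import numpy as np

def nc_metrics(H, y, W):
    """Illustrative NC metrics. H: (N, d) top-layer embeddings,
    y: (N,) integer token-class labels, W: (C, d) output-layer rows."""
    classes = np.unique(y)
    means = np.stack([H[y == c].mean(axis=0) for c in classes])
    variances = np.array([H[y == c].var(axis=0).sum() for c in classes])
    i, j = np.triu_indices(len(classes), k=1)  # all class pairs

    # NC1: average Class-Distance Normalized Variance (CDNV) over pairs.
    d2 = np.sum((means[i] - means[j]) ** 2, axis=1)
    cdnv = np.mean((variances[i] + variances[j]) / (2 * d2))

    # GNC2: spread of pairwise log-distances between normalized means
    # (hyperspherical uniformity; lower spread = more uniform).
    U = means / np.linalg.norm(means, axis=1, keepdims=True)
    gnc2 = np.std(np.log(np.linalg.norm(U[i] - U[j], axis=1)))

    # UNC3: CoV of cosine similarity between each class mean and its
    # classifier row (uniform duality).
    Wn = W[classes] / np.linalg.norm(W[classes], axis=1, keepdims=True)
    sims = np.sum(U * Wn, axis=1)
    unc3 = sims.std() / sims.mean()

    # NC4: agreement between the linear classifier and an implicit
    # nearest-class-center (NCC) classifier.
    linear_pred = classes[np.argmax(H @ W[classes].T, axis=1)]
    ncc_pred = classes[np.argmin(
        np.linalg.norm(H[:, None, :] - means[None, :, :], axis=2), axis=1)]
    nc4 = np.mean(linear_pred == ncc_pred)
    return cdnv, gnc2, unc3, nc4
```

On synthetic data with tight, well-separated clusters and classifiers equal to the class means, CDNV approaches zero, UNC3's CoV approaches zero, and NC4 agreement reaches 1.0.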
To investigate the relationship between NC and generalization independent of scale, they train multiple instances of a single architecture (2-layer, 768-wide) with different random seeds for data shuffling and initialization, then perform a permutation test on the correlation between NC metrics and validation loss.
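A permutation test of this kind can be sketched as follows; this is a generic implementation (the function name is hypothetical, not from the paper's code):

```python
import numpy as np

def perm_test(metric, loss, n_perm=10_000, seed=0):
    """Two-sided permutation test for the Pearson correlation between a
    per-seed NC metric and per-seed validation loss."""
    rng = np.random.default_rng(seed)
    observed = np.corrcoef(metric, loss)[0, 1]
    # Null distribution: correlations after shuffling the pairing.
    null = np.array([np.corrcoef(rng.permutation(metric), loss)[0, 1]
                     for _ in range(n_perm)])
    # p-value: fraction of shuffled correlations at least as extreme.
    p_value = np.mean(np.abs(null) >= np.abs(observed))
    return observed, p_value
```

Because each model shares the same architecture and training budget, a significant p-value here isolates the metric-generalization link from scale effects, which is exactly the point of the paper's seed-variation experiment.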
Key Findings and Practical Implications
The empirical results reveal several key insights:
- Emergence of NC with Scaling: Despite the challenging conditions, several NC properties emerge or strengthen as model size (width and depth) and training epochs increase.
- NC1 (CDNV) consistently decreases with scale and training, indicating reduced within-class variability.
- Mean embedding norms grow, and their variation (Equinormness, part of NC2) decreases with scale.
- Average interference decreases, but variation in interference (Equiangularity, traditional NC2) doesn't consistently decrease with scale, supporting the idea that a perfect simplex ETF is not formed when C≫d+1.
- Hyperspherical Uniformity (GNC2, variation in logarithmic distances) consistently improves with scale and training, confirming its relevance in this setting.
- Average similarity between class means and classifiers (NC3, self-duality) shows weak trends with scale, but variation in similarity (UNC3, uniform duality) decreases with width and training.
- NC4 (Classifier Agreement) improves significantly with scale and training.
- Correlation with Generalization: The observed developments in NC properties are strongly correlated with improved validation performance (lower validation loss).
- NC1, GNC2, UNC3, and NC4 show notable correlations with generalization.
- Traditional NC2 (Equiangularity) and NC3 (Self-Duality) show weaker correlations compared to their generalized/uniform counterparts in this LM setting.
- NC and Generalization Independent of Scale: The permutation test on models with identical architecture but different random seeds reveals that several NC properties (NC1, GNC2, NC3, NC4, and traditional NC2 Equiangularity) are statistically significantly correlated with generalization performance even when scale and training time are fixed. This suggests that NC is not merely a side effect of scaling and training, but potentially a more fundamental aspect of model performance and generalization in LMs.
- Weight Decay: Stronger weight decay appeared to promote the development of NC properties.
Implementation Considerations and Future Work
This research is primarily an empirical analysis rather than proposing a new implementation technique. However, the methodology suggests practical ways to analyze the feature space and classifiers of existing or newly trained LMs:
- Monitoring NC Metrics: Developers can implement the described metrics (CDNV for NC1, log-distance variation for GNC2, similarity CoV for UNC3, classifier agreement for NC4) during or after training to gain insights into the model's feature learning and its potential for generalization.
- Feature Space Analysis: The metrics provide low-level interpretability by quantifying aspects like class separability (NC1), the geometric arrangement of classes (GNC2), and the consistency of classifier alignment (UNC3). This could help diagnose issues like poor separation for specific tokens or groups of tokens.
- Potential for New Objectives: The findings could inspire research into training objectives that explicitly encourage certain NC properties in LMs, similar to how feature regularization is used in imbalanced image classification. For instance, adding terms to the loss that penalize high CDNV or high CoV of log-distances might promote better generalization.
- Understanding Ambiguity and Compression: The persistent noise (NC1) due to ambiguous contexts might relate to LLMs' ability to model aleatoric uncertainty or their function as data compression systems, as suggested by the authors.
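As a purely speculative sketch of such an objective (not proposed or evaluated in the paper), a regularizer could penalize the CoV of pairwise log-distances between normalized class means, directly targeting GNC2. Shown in NumPy for clarity; a real training loss would be written in an autodiff framework, and the function name is hypothetical.

```python
import numpy as np

def logdist_uniformity_penalty(means, eps=1e-8):
    """Hypothetical GNC2-style regularizer: coefficient of variation of
    pairwise log-distances between normalized class means. Lower values
    indicate a more hyperspherically uniform arrangement."""
    U = means / (np.linalg.norm(means, axis=1, keepdims=True) + eps)
    i, j = np.triu_indices(len(means), k=1)
    logd = np.log(np.linalg.norm(U[i] - U[j], axis=1) + eps)
    return logd.std() / (np.abs(logd.mean()) + eps)

# Hypothetical usage inside a training step (pseudocode):
# total_loss = cross_entropy + lam * logdist_uniformity_penalty(class_means)
```

On a simplex ETF, where all pairwise distances are equal, this penalty is (numerically) zero; on random class means it is strictly positive, so gradient pressure would push the geometry toward uniformity. Whether this actually improves generalization is an open question the paper only gestures at.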
Limitations
The authors acknowledge limitations, including that the chosen NC metrics may not capture every aspect of collapse in language modeling. The paper focuses on basic causal language modeling and does not include experiments on more complex settings such as encoder-decoder models, multi-modal models, or instruction-tuned models. The scale-independent correlation analysis was performed only on a single, relatively small architecture, and the results might not translate directly to much larger models.
In summary, the paper successfully adapts the Neural Collapse framework to the challenging domain of language modeling, providing empirical evidence that NC properties emerge with scale and training and correlate with generalization, even independent of scale. This work lays the groundwork for a deeper understanding of, and potentially improved architectures for, language models based on NC-related insights.