Neural Collapse in Deep Linear Networks: From Balanced to Imbalanced Data (2301.00437v5)

Published 1 Jan 2023 in cs.LG and stat.ML

Abstract: Modern deep neural networks have achieved impressive performance on tasks from image classification to natural language processing. Surprisingly, these complex systems with massive amounts of parameters exhibit the same structural properties in their last-layer features and classifiers across canonical datasets when training until convergence. In particular, it has been observed that the last-layer features collapse to their class-means, and those class-means are the vertices of a simplex Equiangular Tight Frame (ETF). This phenomenon is known as Neural Collapse (NC). Recent papers have theoretically shown that NC emerges in the global minimizers of training problems with the simplified "unconstrained feature model". In this context, we take a step further and prove the NC occurrences in deep linear networks for the popular mean squared error (MSE) and cross entropy (CE) losses, showing that global solutions exhibit NC properties across the linear layers. Furthermore, we extend our study to imbalanced data for MSE loss and present the first geometric analysis of NC under bias-free setting. Our results demonstrate the convergence of the last-layer features and classifiers to a geometry consisting of orthogonal vectors, whose lengths depend on the amount of data in their corresponding classes. Finally, we empirically validate our theoretical analyses on synthetic and practical network architectures with both balanced and imbalanced scenarios.

Authors (6)
  1. Hien Dang (4 papers)
  2. Tho Tran (5 papers)
  3. Stanley Osher (104 papers)
  4. Hung Tran-The (10 papers)
  5. Nhat Ho (126 papers)
  6. Tan Nguyen (26 papers)
Citations (26)

Summary

  • The paper extends theoretical analysis of Neural Collapse to deep linear networks, showing that global minimizers exhibit structured collapse (ETF, OF, GOF) for both balanced and imbalanced data.
  • Using the Unconstrained Features Model and SVD, the study links optimal singular values to network depth, architecture, and regularization parameters.
  • The findings provide practical insights for architecture design, optimization strategies, and mitigating minority collapse in imbalanced classification.

This paper, "Neural Collapse in Deep Linear Networks: From Balanced to Imbalanced Data" (Dang et al., 2023 ), provides a theoretical analysis of the Neural Collapse (NC) phenomenon in deep linear networks under the Unconstrained Features Model (UFM). The UFM is a simplified setting where the features from the penultimate layer of a deep neural network are treated as learnable parameters, allowing for a more tractable analysis of the optimization landscape and the properties of global minimizers.

The core empirical observation motivating this research is Neural Collapse ($\mathcal{NC}$), which describes several structural properties observed in the last-layer features and classifiers of modern deep neural networks trained to near-zero training error. These properties (illustrated by the diagnostic sketch following the list) include:

  • ($\mathcal{NC}1$) Variability collapse: Features belonging to the same class converge to a single vector (the class mean).
  • ($\mathcal{NC}2$) Convergence to a structured frame: The class means collapse to the vertices of a simplex Equiangular Tight Frame (ETF).
  • ($\mathcal{NC}3$) Convergence to self-duality: The last-layer classifiers become aligned with the class means.
  • ($\mathcal{NC}4$) Simplification to nearest class-center: The classification rule simplifies to choosing the class whose mean is closest to the input feature.
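
To make these properties concrete, here is a minimal diagnostic sketch, assuming the last-layer features, labels, and classifier weights are available as NumPy arrays; the array shapes and metric definitions are illustrative choices, not the authors' evaluation code:

```python
import numpy as np

def nc_diagnostics(H, labels, W):
    """Rough diagnostics for NC1-NC3 (illustrative, not the paper's evaluation code).

    H      : (d, N) last-layer features, one column per sample.
    labels : (N,) integer class labels in {0, ..., K-1}.
    W      : (K, d) last-layer classifier weights.
    """
    K = W.shape[0]
    global_mean = H.mean(axis=1, keepdims=True)
    class_means = np.stack([H[:, labels == k].mean(axis=1) for k in range(K)], axis=1)

    # NC1: within-class variability relative to between-class variability.
    within = np.mean([np.cov(H[:, labels == k]).trace() for k in range(K)])
    between = np.cov(class_means).trace()
    nc1 = within / max(between, 1e-12)

    # NC2: distance of the Gram matrix of centered, normalized class means
    # from the ideal simplex ETF Gram matrix.
    M = class_means - global_mean
    M = M / np.linalg.norm(M, axis=0, keepdims=True)
    etf_gram = (K * np.eye(K) - np.ones((K, K))) / (K - 1)
    nc2 = np.linalg.norm(M.T @ M - etf_gram)

    # NC3: misalignment between (normalized) classifiers and centered class means.
    nc3 = np.linalg.norm(W / np.linalg.norm(W) - M.T / np.linalg.norm(M))
    return nc1, nc2, nc3
```

Values of nc1, nc2, and nc3 near zero indicate variability collapse, an ETF geometry, and self-duality, respectively.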

While previous work has studied NC in shallow models or under specific assumptions, this paper extends the theoretical analysis to deep linear networks with multiple learnable layers after the unconstrained features, considering both balanced and imbalanced data settings for the Mean Squared Error (MSE) loss, and the balanced setting for the Cross-Entropy (CE) loss.

Problem Formulation:

The paper considers an extension of the standard UFM where the unconstrained features $\mathbf{H}_1 \in \mathbb{R}^{d_1 \times N}$ are followed by $M \ge 1$ linear layers with weight matrices $\mathbf{W}_1, \ldots, \mathbf{W}_M$, culminating in a final classification layer $\mathbf{W}_M \in \mathbb{R}^{K \times d_M}$. The network output for a sample $\mathbf{h}_{k,i}$ is $\mathbf{W}_M \cdots \mathbf{W}_1 \mathbf{h}_{k,i} + \mathbf{b}$. The optimization objective combines the classification loss (MSE or CE) with L2 regularization on all weight matrices $\mathbf{W}_j$, the features $\mathbf{H}_1$, and potentially the final bias $\mathbf{b}$.

The analysis focuses on characterizing the properties of global minimizers of this regularized objective function.
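
As a concrete illustration of this objective, the following sketch evaluates a regularized MSE loss for the deep linear UFM; the function signature, the $1/N$ and $1/2$ scaling factors, and the variable names are readability assumptions rather than the paper's exact formulation:

```python
import numpy as np

def ufm_deep_linear_mse(H1, Ws, b, Y, lam_H, lam_W, lam_b=0.0):
    """Regularized MSE objective of the deep linear UFM (illustrative sketch).

    H1    : (d1, N) unconstrained features, treated as free optimization variables.
    Ws    : list [W1, ..., WM] of weight matrices; the last one maps to the K outputs.
    b     : (K,) last-layer bias (set to zeros for the bias-free setting).
    Y     : (K, N) one-hot targets.
    lam_* : L2 regularization strengths for the features, the weights, and the bias.
    """
    N = H1.shape[1]
    out = H1
    for W in Ws:                      # apply W1, then W2, ..., finally WM
        out = W @ out
    out = out + b[:, None]

    fit = 0.5 / N * np.linalg.norm(out - Y) ** 2            # MSE fitting term
    reg = 0.5 * lam_H * np.linalg.norm(H1) ** 2              # feature regularization
    reg += 0.5 * lam_W * sum(np.linalg.norm(W) ** 2 for W in Ws)
    reg += 0.5 * lam_b * np.linalg.norm(b) ** 2
    return fit + reg
```

Both the features H1 and every weight matrix in Ws are optimized jointly; characterizing the global minimizers of this function is what yields the NC statements below.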

Neural Collapse in Deep Linear Networks (Balanced Data):

For balanced data ($n_k = n$ for all classes $k$) and the MSE loss, the paper proves that global minimizers exhibit NC properties across all layers.

  • ($\mathcal{NC}1$) is shown for the features $\mathbf{H}_1$, meaning $\mathbf{h}_{k,i}$ collapses to a class-specific mean $\mathbf{h}_k$.
  • ($\mathcal{NC}2$) and ($\mathcal{NC}3$) are generalized to hold for the product of any sequence of consecutive weight matrices, or the product of feature means and weights. Specifically, in the bias-free case, the product of weight matrices and the matrix of class means $\overline{\mathbf{H}} = [\mathbf{h}_1, \ldots, \mathbf{h}_K]$ converge to an Orthogonal Frame (OF) geometry (proportional to $\mathbf{I}_K$ when viewed in the appropriate subspace). If a last-layer bias is included (and left unregularized), they converge to a Simplex ETF geometry (proportional to $\mathbf{I}_K - \frac{1}{K}\mathbf{1}_K\mathbf{1}_K^\top$); both reference geometries are constructed explicitly in the sketch after this list.
  • The geometry depends on the bottleneck rank $R = \min(K, d_M, \ldots, d_1)$. If $R \ge K$ (or $R \ge K-1$ in the bias case), the full OF/ETF structure emerges. If $R < K$, the solution converges to a low-rank projection of the ideal structure.
  • The existence of non-trivial global minimizers depends on the regularization parameters $\lambda$ being below a certain threshold related to the number of layers $M$ and the number of data samples $N$.
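
For reference, both target geometries can be constructed explicitly. The helper below is an illustrative construction (the function names are ours, and for simplicity it requires $d \ge K$ even though a simplex ETF only needs $d \ge K-1$):

```python
import numpy as np

def orthogonal_frame(K, d):
    """K orthogonal, unit-norm vectors in R^d (columns); requires d >= K."""
    assert d >= K
    Q, _ = np.linalg.qr(np.random.randn(d, K))
    return Q                                          # Q.T @ Q = I_K

def simplex_etf(K, d):
    """K simplex-ETF vertices in R^d as columns (for simplicity we require d >= K here)."""
    P = orthogonal_frame(K, d)
    centering = np.eye(K) - np.ones((K, K)) / K       # remove the mean direction
    return np.sqrt(K / (K - 1)) * (P @ centering)     # unit-norm vertices, pairwise cosine -1/(K-1)

# Sanity check: the ETF Gram matrix has 1 on the diagonal and -1/(K-1) elsewhere.
K, d = 4, 10
E = simplex_etf(K, d)
print(np.round(E.T @ E, 3))
```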

The theoretical analysis involves characterizing critical points and leveraging the singular value decomposition (SVD) to simplify the loss function, showing that the optimal singular values are related across layers and are determined by minimizing a scalar function dependent on $M$ and the regularization.
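
One way to probe this relationship numerically is to compare the singular spectra of the trained features and weight matrices; the helper below is a hypothetical diagnostic (names and output format are ours), not part of the paper's proofs:

```python
import numpy as np

def layerwise_singular_values(H1, Ws, top=5):
    """Print the leading singular values of the feature matrix and of each linear layer,
    to inspect how the spectra relate across layers at (or near) convergence."""
    mats = {"H1": H1}
    mats.update({f"W{j + 1}": W for j, W in enumerate(Ws)})
    for name, A in mats.items():
        s = np.linalg.svd(A, compute_uv=False)
        print(f"{name}: {np.round(s[:top], 4)}")
```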

For balanced data and CE loss (discussed in Appendix A), similar NC properties are shown to hold for global minimizers of deep linear networks, with convergence towards the Simplex ETF geometry.

Neural Collapse in Deep Linear Networks (Imbalanced Data):

A significant contribution is the analysis of NC under imbalanced data for the MSE loss. This setting is more reflective of real-world data distributions. The paper provides the first geometric analysis of NC in this scenario for the unconstrained feature model.

  • ($\mathcal{NC}1$) is preserved: Features within the same class still collapse to their class mean.
  • ($\mathcal{NC}3$) is modified: The alignment between the class means and the last-layer classifier (or product of weights) is still present, but the scaling factor between $\mathbf{w}_k$ and $\mathbf{h}_k$ depends on the class size $n_k$.
  • ($\mathcal{NC}2$) geometry changes: The class means and weight-matrix products converge to a General Orthogonal Frame (GOF). This geometry consists of orthogonal vectors whose lengths are proportional to a function of the class size $n_k$ and the regularization parameters. Specifically, $\|\mathbf{w}_k\|^2$ is proportional to $\sqrt{\frac{n_k \lambda_H}{\lambda_W}} - N\lambda_H$ and $\|\mathbf{h}_k\|^2$ is inversely proportional to this term, which matches empirical observations that majority classes have larger classifier norms (see the numerical illustration below).
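
To see how these norms behave, the snippet below plugs hypothetical class sizes and regularization values into the proportionality quoted above; the constants are made up, and the expression is treated only up to its unspecified proportionality factor:

```python
import numpy as np

# Illustrative evaluation of the class-size dependence quoted above:
#   ||w_k||^2  ∝  sqrt(n_k * lam_H / lam_W) - N * lam_H   (up to constants).
# The values of lam_H, lam_W, and the class sizes are made up for illustration;
# clipping at zero corresponds to the Minority Collapse regime discussed next.
lam_H, lam_W = 1e-2, 5e-4
class_sizes = np.array([5000, 1000, 200, 50])     # hypothetical imbalanced classes
N = class_sizes.sum()

w_norm_sq = np.clip(np.sqrt(class_sizes * lam_H / lam_W) - N * lam_H, 0.0, None)
for n_k, w2 in zip(class_sizes, w_norm_sq):
    print(f"n_k = {n_k:5d}   ||w_k||^2 ∝ {w2:8.2f}")
```

With these made-up values the smallest class already hits zero, which is exactly the Minority Collapse regime characterized next.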

The analysis also identifies conditions under which "Minority Collapse" occurs, meaning the classifiers and features for minority classes collapse to the zero vector. This happens when the imbalance ratio and regularization are above a certain threshold relative to the class size.

For deep linear networks ($M \ge 2$) with imbalanced data and MSE loss, the NC1 and NC3 properties generalize similarly. The NC2 geometry is also a GOF structure, with the lengths of the orthogonal vectors determined by a more complex relationship involving the number of layers $M$ and the regularization. The bottleneck case ($R < K$) likewise leads to low-rank approximations of the GOF structure.

Practical Implications and Implementation:

  • Understanding Deep Networks: Analyzing deep linear networks provides valuable insights into the dynamics and terminal phase properties of deep non-linear networks, as suggested by related work. The observation that NC properties propagate through multiple layers implies a structured representation learned throughout the network depth in its final training stages.
  • Architecture Design: The results about bottleneck effects suggest that having sufficient width ($d_m \ge K$ or $K-1$) in the linear layers is important for achieving the full desired geometric collapse (ETF/OF/GOF). This aligns with empirical findings that larger network widths tend to promote NC.
  • Optimization: The analysis of critical points and their relation to global minima provides theoretical backing for why common optimization methods like SGD might converge to NC states in overparameterized settings.
  • Imbalanced Learning: The GOF structure and Minority Collapse phenomenon offer theoretical explanations for challenges in imbalanced classification. The dependence of classifier/feature norms on class size and the condition for minority classes collapsing to zero provide concrete targets for designing better loss functions or regularization strategies for imbalanced data. For example, methods aiming to equalize norms or prevent collapse for minority classes are supported by this analysis.
  • Regularization: The paper highlights the critical role of the regularization parameters $\lambda$ in determining whether non-trivial NC solutions exist and, in the imbalanced case, in influencing the severity of Minority Collapse. Tuning regularization correctly is crucial for achieving desirable performance, especially on minority classes.

Experimental Validation:

The paper empirically validates its theoretical findings using:

  • MLP, ResNet18, and VGG16 backbones followed by deep linear layers on image classification datasets (CIFAR10, EMNIST).
  • Direct optimization of the UFM objective with deep linear layers on synthetic data.
  • Experiments on text classification datasets.

These experiments confirm that NC properties, including convergence to OF/ETF for balanced data and GOF for imbalanced data, are observed in practice for various depths and widths of linear layers, even when coupled with non-linear backbones. The Minority Collapse phenomenon is also empirically observed and aligns with the theoretical predictions.

In summary, this paper significantly advances the theoretical understanding of Neural Collapse by extending the analysis to deep linear networks and providing the first comprehensive geometric characterization of NC under imbalanced data using the UFM. The results offer practical insights into network architecture, optimization, and the challenges of imbalanced learning in deep models.
