A Spectral Condition for Feature Learning (2310.17813v2)

Published 26 Oct 2023 in cs.LG

Abstract: The push to train ever larger neural networks has motivated the study of initialization and training at large network width. A key challenge is to scale training so that a network's internal representations evolve nontrivially at all widths, a process known as feature learning. Here, we show that feature learning is achieved by scaling the spectral norm of weight matrices and their updates like $\sqrt{\texttt{fan-out}/\texttt{fan-in}}$, in contrast to widely used but heuristic scalings based on Frobenius norm and entry size. Our spectral scaling analysis also leads to an elementary derivation of \emph{maximal update parametrization}. All in all, we aim to provide the reader with a solid conceptual understanding of feature learning in neural networks.

Citations (20)

Summary

  • The paper demonstrates that scaling weight matrices by their spectral norm enables effective feature learning in wide neural networks.
  • It introduces clear per-layer hyperparameter prescriptions that satisfy the spectral condition, simplifying hyperparameter transfer across model widths.
  • Empirical results on CIFAR-10 indicate that this spectral approach outperforms traditional schemes such as the neural tangent parametrization.

A Spectral Condition for Feature Learning: A Summary

The paper "A Spectral Condition for Feature Learning" by Greg Yang, James B. Simon, and Jeremy Bernstein aims to address the scaling of training in large neural networks to achieve feature learning. The authors assert that the spectral norm of weight matrices and their updates should be scaled according to fan-out/fan-in\sqrt{\text{fan-out}/\text{fan-in}}, a departure from typical heuristics based on Frobenius norm and entry size.

The Problem of Feature Learning in Large Neural Networks

Recent advancements in deep learning involve deploying increasingly parameter-rich models. Such models often exhibit emergent capabilities across various tasks. However, one critical challenge that arises is ensuring that feature learning, i.e., the evolution of internal representations, occurs effectively at all widths. Existing parameter scaling rules, such as the neural tangent parametrization (NTP), often fall short in maintaining feature learning at large network widths. The paper underscores the necessity of correct feature learning for optimal model performance and the significance of hyperparameter transfer made possible by maximal update parametrization (μP).

Spectral Norm Scaling and Maximal Update Parametrization

The crux of the paper is a spectral scaling condition that prescribes how the spectral norms of weight matrices and their updates should scale with layer widths. By deriving the maximal update parametrization using straightforward linear algebra, the authors simplify the understanding of μP. Essentially, the condition requires that any weight matrix $\mathbf{W}_\ell \in \mathbb{R}^{d_\ell \times d_{\ell-1}}$ and its update $\Delta \mathbf{W}_\ell$ satisfy

$$\|\mathbf{W}_\ell\|_* = \Theta\!\left(\sqrt{\frac{d_\ell}{d_{\ell-1}}}\right) \quad \text{and} \quad \|\Delta \mathbf{W}_\ell\|_* = \Theta\!\left(\sqrt{\frac{d_\ell}{d_{\ell-1}}}\right),$$

where $\|\cdot\|_*$ denotes the spectral norm. This condition ensures that the hidden features and their updates have the correct magnitude: since $\|\mathbf{h}_\ell\| \approx \|\mathbf{W}_\ell\|_* \, \|\mathbf{h}_{\ell-1}\|$ when weights and incoming features are well aligned, the condition keeps $\|\mathbf{h}_\ell\| = \Theta(\sqrt{d_\ell})$, i.e., feature entries of size $\Theta(1)$, thereby facilitating effective feature evolution at every width.
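
As an illustration of what the condition means operationally, below is a minimal PyTorch sketch (not from the paper; the helper name `spectral_rescale_` and the rescale-after-random-init approach are illustrative assumptions) that enforces the initialization half of the condition by rescaling a layer's weight to the target spectral norm $\sqrt{d_\ell/d_{\ell-1}}$:

```python
import torch
import torch.nn as nn

@torch.no_grad()
def spectral_rescale_(linear: nn.Linear) -> None:
    """Rescale an nn.Linear weight in place so that its spectral norm
    (largest singular value) equals sqrt(fan_out / fan_in)."""
    weight = linear.weight                              # shape: (fan_out, fan_in)
    fan_out, fan_in = weight.shape
    target = (fan_out / fan_in) ** 0.5                  # desired spectral norm
    current = torch.linalg.matrix_norm(weight, ord=2)   # largest singular value
    weight.mul_(target / current)

# Example: a square hidden layer, where the target spectral norm is 1.
layer = nn.Linear(1024, 1024, bias=False)
spectral_rescale_(layer)
print(torch.linalg.matrix_norm(layer.weight, ord=2))    # ~1.0 = sqrt(1024/1024)
```

In practice one rarely computes singular values explicitly; the hyperparameter prescriptions in the next section achieve the same scaling through the initialization variance and learning rate alone.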

Implementing the Spectral Scaling Condition

To implement the spectral scaling condition in practice, the paper provides hyperparameter prescriptions for the initialization scale and learning rate of each layer:

$$\sigma_\ell = \Theta\!\left(\frac{1}{\sqrt{d_{\ell-1}}}\,\min\!\left\{1,\; \sqrt{\frac{d_\ell}{d_{\ell-1}}}\right\}\right) \quad \text{and} \quad \eta_\ell = \Theta\!\left(\frac{d_\ell}{d_{\ell-1}}\right).$$

Here $\sigma_\ell$ is the standard deviation of the i.i.d. weight entries at initialization and $\eta_\ell$ is the per-layer learning rate. These prescriptions ensure that the spectral norms are appropriately scaled without requiring explicit spectral normalization of the weights during training.
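
As a concrete illustration, here is a minimal sketch assuming PyTorch and plain SGD, with the constants hidden by the $\Theta(\cdot)$ notation set to 1; the function name `spectral_param_groups`, the layer widths, and the base learning rate are illustrative choices, not values from the paper:

```python
import math
import torch
import torch.nn as nn

def spectral_param_groups(widths, base_lr=0.1):
    """Build an MLP with the given layer widths and return (model, param_groups),
    where each nn.Linear with weight shape (fan_out, fan_in) receives
      init std      sigma = (1 / sqrt(fan_in)) * min(1, sqrt(fan_out / fan_in))
      learning rate eta   = base_lr * fan_out / fan_in
    (Theta(.) constants set to 1)."""
    layers, groups = [], []
    for i, (fan_in, fan_out) in enumerate(zip(widths[:-1], widths[1:])):
        layer = nn.Linear(fan_in, fan_out, bias=False)
        sigma = (1.0 / math.sqrt(fan_in)) * min(1.0, math.sqrt(fan_out / fan_in))
        nn.init.normal_(layer.weight, std=sigma)
        groups.append({"params": layer.parameters(),
                       "lr": base_lr * fan_out / fan_in})
        layers.append(layer)
        if i < len(widths) - 2:                 # nonlinearity between hidden layers
            layers.append(nn.ReLU())
    return nn.Sequential(*layers), groups

# Example: flattened 3072-dim inputs, two width-4096 hidden layers, 10 classes.
model, groups = spectral_param_groups([3072, 4096, 4096, 10])
optimizer = torch.optim.SGD(groups, lr=0.1)     # per-group lr overrides this default
```

Under this sketch, a square hidden layer gets the familiar $1/\sqrt{\text{width}}$ initialization with a width-independent learning rate, while the output layer's initialization scale and learning rate both shrink like $1/\text{width}$, consistent with maximal update parametrization.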

Empirical Validation and Broader Applicability

The authors provide empirical evidence, through experiments with multilayer perceptrons on CIFAR-10, that the spectral scaling condition fosters proper feature learning. Moreover, they demonstrate deficiencies of other popular parametrizations, such as the neural tangent parametrization and the standard parametrization, which do not satisfy the spectral scaling condition.

Implications and Future Directions

Theoretical Implications: The spectral scaling condition offers a unified framework to understand and derive hyperparameter scaling rules for various architectures and optimizers. By casting these rules in terms of spectral norms, the authors bridge the gap between theoretical understanding and empirical performance.

Practical Implications: The spectral condition simplifies the implementation of hyperparameter scaling in code and provides clear guidance for tuning models of varying width. This can enhance the transferability of hyperparameters across models, leading to more efficient training processes.

Future Developments: Anticipated future work involves extending the spectral scaling framework to more complex architectures, such as transformers and convolutional neural networks. Another prospective direction is exploring the interaction of the spectral scaling condition with other optimization techniques beyond those currently studied.

Conclusion

The paper provides a compelling case for the utility of spectral norm scaling in ensuring feature learning in large neural networks. By formulating a clear and accessible spectral condition, the authors offer a practical tool to enhance training dynamics and hyperparameter transferability. This approach not only aligns with the maximal update parametrization but also charts a path towards more robust and theoretically grounded deep learning practices.
