Conditioning of Random Feature Matrices: Double Descent and Generalization Error (2110.11477v2)

Published 21 Oct 2021 in stat.ML, cs.LG, math.OC, and math.PR

Abstract: We provide (high probability) bounds on the condition number of random feature matrices. In particular, we show that if the complexity ratio $\frac{N}{m}$, where $N$ is the number of neurons and $m$ is the number of data samples, scales like $\log^{-1}(N)$ or $\log(m)$, then the random feature matrix is well-conditioned. This result holds without the need for regularization and relies on establishing various concentration bounds between dependent components of the random feature matrix. Additionally, we derive bounds on the restricted isometry constant of the random feature matrix. We prove that the risk associated with regression problems using a random feature matrix exhibits the double descent phenomenon and that this is an effect of the double descent behavior of the condition number. The risk bounds include the underparameterized setting using the least squares problem and the overparameterized setting using either the minimum norm interpolation problem or a sparse regression problem. For the least squares or sparse regression cases, we show that the risk decreases as $m$ and $N$ increase, even in the presence of bounded or random noise. The risk bound matches the optimal scaling in the literature and the constants in our results are explicit and independent of the dimension of the data.

Citations (12)

Summary

  • The paper derives tight bounds for the condition number of random feature matrices, identifying criteria for well-conditioning without regularization.
  • It demonstrates the double descent phenomenon, where risk peaks near the interpolation threshold and decreases in the overparameterized regime.
  • It establishes generalization error bounds for both under- and overparameterized models, offering insights for optimal neural network design.

Conditioning of Random Feature Matrices: Double Descent and Generalization Error

The paper explores the conditioning of random feature matrices, particularly focusing on the phenomenon of double descent and its implications for generalization error in machine learning models. The authors, Zhijun Chen and Hayden Schaeffer, provide a rigorous analysis of the condition number of random feature matrices and its impact on learning tasks such as regression and sparse regression.

Key Contributions

  1. Condition Number Analysis:
    • The authors derive bounds for the condition number of random feature matrices, showing that these matrices are well-conditioned with high probability when the complexity ratio $\frac{N}{m}$ (where $N$ is the number of neurons and $m$ is the number of data samples) scales like $\log^{-1}(N)$ or $\log(m)$. The analysis does not require regularization.
  2. Double Descent Phenomenon:
    • The research demonstrates that the condition number of random feature matrices exhibits the double descent phenomenon. This is characterized by a peak in the condition number, and hence in the risk, near the interpolation threshold $N = m$, followed by a decrease as the model moves further into the overparameterized regime ($N/m > 1$); a numerical sketch follows this list.
  3. Generalization Error Bounds:
    • The paper provides bounds on the generalization error for both underparameterized and overparameterized regimes. For least squares problems, the risk decreases as the number of samples and the number of features increase, matching the optimal scaling in the literature. In the overparameterized regime, the paper treats both the minimum-norm interpolation problem and sparse regression, showing that effective error bounds can be achieved.
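
The double descent of the condition number can be traced numerically. The following sketch is not taken from the paper: it assumes Gaussian data and weights and a cosine activation (the paper only requires a Lipschitz activation and suitable sampling distributions), sweeps the width $N$ across the interpolation threshold $N = m$, and reports the condition number of the resulting random feature matrix.

```python
# Illustrative sketch (not the authors' code): condition number of a random
# feature matrix A = phi(X^T W) as the width N crosses the threshold N = m.
# Gaussian data/weights and a cosine activation are assumptions made here.
import numpy as np

rng = np.random.default_rng(0)
m, d = 200, 10                               # number of samples, data dimension
X = rng.standard_normal((d, m))              # columns are data samples

for N in [50, 100, 190, 200, 210, 400, 800, 1600]:
    W = rng.standard_normal((d, N))          # random, untrained weights
    A = np.cos(X.T @ W)                      # m x N random feature matrix
    s = np.linalg.svd(A, compute_uv=False)   # singular values (descending)
    print(f"N/m = {N/m:5.2f}   cond(A) = {s[0] / s[-1]:10.2e}")
```

In this kind of experiment the condition number typically spikes near $N = m$ and improves again once $N/m$ grows, mirroring the behavior the paper establishes for well-separated complexity ratios.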

Methodology

  • Random Feature Matrix: Defined by $A = \phi(X^T \boldsymbol{W})$, where $X$ is the matrix of data samples and $\boldsymbol{W}$ is a random weight matrix; the Lipschitz activation function $\phi$ is applied entrywise. (A numerical sketch using this construction appears after this list.)
  • Theoretical Framework: Utilizes concentration bounds between dependent components of the random feature matrix to establish the behavior of singular values. These concentration bounds are crucial for understanding the conditioning and thus the generalization ability of the models.
  • Risk Analysis: Detailed risk analysis is conducted for various regression scenarios using the condition number as a critical parameter in assessing the error landscape.
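
To connect the conditioning results to the regression risk, the sketch below (again an illustration with assumed choices: a sine target, cosine features, Gaussian data and weights, small additive noise) fits random feature models on both sides of the interpolation threshold. The pseudoinverse yields the least squares solution when $N \le m$ and the minimum-norm interpolant when $N > m$, so one line of code covers both regimes discussed above.

```python
# Illustrative sketch (not the authors' experiments): test risk of random
# feature regression in the underparameterized (least squares) and
# overparameterized (minimum-norm interpolation) regimes.
import numpy as np

rng = np.random.default_rng(1)
m, m_test, d = 300, 2000, 5
target = lambda X: np.sin(X.sum(axis=0))       # assumed target function

X = rng.standard_normal((d, m))
X_test = rng.standard_normal((d, m_test))
y = target(X) + 0.05 * rng.standard_normal(m)  # noisy training labels
y_test = target(X_test)

for N in [50, 150, 300, 600, 2400]:
    W = rng.standard_normal((d, N))
    A, A_test = np.cos(X.T @ W), np.cos(X_test.T @ W)
    # pinv gives the least squares solution for N <= m and the
    # minimum l2-norm interpolant for N > m (full row rank assumed).
    c = np.linalg.pinv(A) @ y
    risk = np.mean((A_test @ c - y_test) ** 2)
    print(f"N = {N:5d}   test risk = {risk:.4f}")
```

The sparse regression setting analyzed in the paper (restricting $c$ to few nonzero coefficients) is omitted from this sketch.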

Implications and Future Directions

The findings have significant implications for the design and training of machine learning models. The understanding of double descent and its relationship with the condition number provides insights into parameter selection, model architecture, and training regimes.

  • Practical Applications: By understanding the conditioning properties, practitioners can better design neural networks that maximize generalization performance by avoiding poorly conditioned regions in the parameter space.
  • Extension to Neural Networks: Although the current analysis focuses on random feature models, similar principles can be extended to fully trained neural networks, providing a theoretical basis for weight initialization and normalization strategies.
  • Future Research: Further exploration of other feature maps and probability distributions, and extending the scope to deep neural networks, could yield richer insights into modern machine learning architectures.

This work represents a step towards a more nuanced understanding of the tradeoffs in model complexity, parameterization, and generalization, contributing to the optimization of artificial intelligence systems.