
High-dimensional ridge regression with random features for non-identically distributed data with a variance profile (2504.03035v1)

Published 3 Apr 2025 in stat.ML, cs.LG, math.PR, math.ST, stat.ME, and stat.TH

Abstract: The behavior of the random feature model in the high-dimensional regression framework has become a popular issue of interest in the machine learning literature. This model is generally considered for feature vectors $x_i = \Sigma^{1/2} x_i'$, where $x_i'$ is a random vector made of independent and identically distributed (iid) entries, and $\Sigma$ is a positive definite matrix representing the covariance of the features. In this paper, we move beyond this standard assumption by studying the performance of the random features model in the setting of non-iid feature vectors. Our approach is related to the analysis of the spectrum of large random matrices through random matrix theory (RMT) and free probability results. We turn to the analysis of non-iid data by using the notion of variance profile, which is well studied in RMT. Our main contribution is then the study of the limits of the training and prediction risks associated to the ridge estimator in the random features model when its dimensions grow. We provide asymptotic equivalents of these risks that capture the behavior of ridge regression with random features in a high-dimensional framework. These asymptotic equivalents, which prove to be sharp in numerical experiments, are retrieved by adapting, to our setting, established results from operator-valued free probability theory. Moreover, for various classes of random feature vectors that have not been considered so far in the literature, our approach allows us to show the appearance of the double descent phenomenon when the ridge regularization parameter is small enough.


Summary

  • The paper derives asymptotic equivalents for training and test risks in high-dimensional random features ridge regression applied to non-iid data with variable feature variances.
  • It introduces lozenge and square equivalents, where the square equivalent uses free probability to obtain deterministic risk formulas under specific variance profile assumptions.
  • The framework enables predicting double descent and practical performance in heterogeneous data settings, validated on mixture models built from MNIST digit classes.

This paper investigates the performance of Random Features (RF) ridge regression in a high-dimensional setting where the input data features are independent but not identically distributed (non-i.i.d.). This extends previous analyses that typically assume i.i.d. data. The non-i.i.d. structure is captured using the concept of a "variance profile": each feature vector $x_i$ can have components with different variances, modeled as $x_i = \Sigma_i^{1/2} x_i'$ with $\Sigma_i = \mathrm{diag}(\gamma_{ij}^2)$. The data matrix can be written as $X_n = \Upsilon_x \circ X_n'$, where $\Upsilon_x = (\gamma_{ij})$, $\circ$ denotes the entrywise (Hadamard) product, and $\Gamma_n = (\gamma_{ij}^2)$ is the variance profile matrix.
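For concreteness, here is a minimal NumPy sketch (not from the paper) of how data with such a variance profile could be generated. The block structure of the profile below is a purely illustrative placeholder, e.g. mimicking two subpopulations with different feature variances.

```python
import numpy as np

rng = np.random.default_rng(0)
n, p = 500, 300

# Hypothetical variance profile: row i holds the standard deviations gamma_ij of sample i.
# A simple two-block placeholder profile (two "subpopulations"):
Upsilon_x = np.ones((n, p))
Upsilon_x[: n // 2, : p // 2] = 2.0   # first subpopulation: larger variance on the first features
Upsilon_x[n // 2 :, p // 2 :] = 0.5   # second subpopulation: smaller variance on the last features

X_prime = rng.standard_normal((n, p))   # iid entries
X = Upsilon_x * X_prime                 # Hadamard product: X_n = Upsilon_x ∘ X_n'
Gamma = Upsilon_x ** 2                  # variance profile matrix Gamma_n = (gamma_ij^2)
```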

The core problem is standard RF ridge regression: finding weights $\hat{\theta}_\lambda$ to minimize $\frac{1}{n} \| Y_n - H^\top \theta \|^2 + \lambda \| \theta \|^2$, where $H = h(W X_n^\top / \sqrt{p})$ is the matrix of activated random features, $W$ is a matrix of random weights (typically with i.i.d. entries, although the paper also allows a variance profile on $W$, later specializing to a constant profile), and $h$ is an activation function. The analysis assumes the ground truth follows a linear model $y_i = x_i^\top \beta_* + \epsilon_i$ with random coefficients $\beta_*$ (average-case analysis).
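A minimal sketch of this setup in NumPy, under illustrative choices (cubic activation, Gaussian weights, a simple placeholder variance profile). The closed form $\hat{\theta}_\lambda = (HH^\top/n + \lambda I_m)^{-1} H Y_n / n$ follows from the quoted objective; the Monte Carlo training and prediction errors at the end are the kind of simulated risks the asymptotic equivalents below are meant to approximate, and their exact conventions may differ from the paper's $E_{train}$ and $E_{test}$.

```python
import numpy as np

rng = np.random.default_rng(1)
n, p, m = 500, 300, 400          # samples, features, random features
lam, sigma_noise = 1e-3, 0.5     # ridge parameter and noise level (illustrative)

# Data with a simple placeholder variance profile (any profile, e.g. the block one above, works).
Upsilon_x = np.ones((n, p))
Upsilon_x[: n // 2] *= 1.5
X = Upsilon_x * rng.standard_normal((n, p))

# Linear ground truth with random coefficients (average-case analysis).
beta_star = rng.standard_normal(p) / np.sqrt(p)
Y = X @ beta_star + sigma_noise * rng.standard_normal(n)

# Random features H = h(W X^T / sqrt(p)); h is an example odd polynomial activation,
# W has a constant variance profile (iid standard Gaussian entries).
h = lambda t: t ** 3
W = rng.standard_normal((m, p))
H = h(W @ X.T / np.sqrt(p))                     # m x n matrix of activated features

# Ridge estimator minimizing (1/n)||Y - H^T theta||^2 + lam*||theta||^2:
# theta_hat = (H H^T / n + lam I_m)^{-1} H Y / n.
theta_hat = np.linalg.solve(H @ H.T / n + lam * np.eye(m), H @ Y / n)

# Empirical training risk, and a Monte Carlo prediction risk on fresh data drawn
# with the same variance profile.
E_train = np.mean((Y - H.T @ theta_hat) ** 2)

X_new = Upsilon_x * rng.standard_normal((n, p))
Y_new = X_new @ beta_star + sigma_noise * rng.standard_normal(n)
H_new = h(W @ X_new.T / np.sqrt(p))
E_test = np.mean((Y_new - H_new.T @ theta_hat) ** 2)
```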

The main contribution is the derivation of asymptotic equivalents for the training risk $E_{train}(\lambda)$ and the prediction risk $E_{test}(\lambda)$ in the high-dimensional limit where the number of samples $n$, the number of features $p$, and the number of random features $m$ grow proportionally ($p/n \to c_p$, $m/n \to c_m$). Two types of equivalents are derived:

  1. Lozenge Equivalents ($E^\lozenge_{train}$, $E^\lozenge_{test}$): These are obtained by leveraging a generalization of the "linear-plus-chaos" approximation for the matrix $H$, denoted $H^\lozenge$. This approximation, detailed in [DaboMale], accounts for the variance profiles on both $X_n$ and $W$. In the specific case considered, where $W$ has a constant variance profile ($\Upsilon_w = \sigma_w \mathbf{1}$), $H^\lozenge$ simplifies (Equation 1.9) to an expression involving Gaussian matrices $\mathcal{W}, \mathcal{X}_n, Z^G$ and diagonal matrices $D_{lin}(h), D_{chaos}(h)$ capturing the effect of the activation function and the data variance profile $\Upsilon_x$. Proposition 1.1 ensures that $H$ and $H^\lozenge$ behave similarly asymptotically. The lozenge equivalents $E^\lozenge$ are derived by replacing $H, W, X_n$ with $H^\lozenge, \mathcal{W}, \mathcal{X}_n$ in the expressions for the risks (Theorem 3.1). These equivalents still involve taking expectations over the Gaussian matrices.
  2. Square Equivalents ($E^\square_{train}$, $E^\square_{test}$): These are deterministic equivalents that depend only on the model parameters ($\lambda, \alpha^2, \sigma^2$), the dimensions ($n, p, m$), the activation function $h$, and the variance profiles ($\Gamma_n, \tilde{\Gamma}_n$). They are derived using techniques from free probability, specifically the linearization trick and results on random matrices with variance profiles (adapting [bigotmale]). The linearization involves constructing a larger matrix $L$ (Equation 2.6) whose resolvent's blocks relate to the terms needed for the risks. The deterministic equivalent $\mathfrak{Q}^\square(\Lambda)$ of the resolvent $\mathfrak{Q}(\Lambda) = (L - \Lambda)^{-1}$ is found by solving a fixed-point equation (Equation 3.3, Theorem 3.2). Theorem 3.3 shows that $|E_{train} - E^\square_{train}| \to 0$ and $|E_{test} - E^\square_{test}| \to 0$ under certain assumptions.
    • Explicit Case: Under Assumption 1.1 (row sums of variance profiles are constant: $\frac{1}{p} \sum_j \gamma_{ij}^2 = s^2$) and if $\theta_{lin}(h) = \mathbb{E}[h'(\xi s)] = 0$ (e.g., for odd activation functions like $h(x) = x^3$ if $s$ is fixed), the square equivalents have explicit formulas (Equations 1.12, 1.13 / 3.5, 3.6) involving the Stieltjes transform $m_n(-\lambda)$ of the Marchenko-Pastur distribution and its derivative (see the sketch after this list for computing these quantities). The training risk matches results from [adlam2020neural] for the i.i.d. case.
    • General Case: If $\theta_{lin}(h) \neq 0$ (or Assumption 1.1 does not hold), $E^\square$ involves traces of blocks of the matrix $\mathfrak{Q}^\square$, which must be computed by numerically solving the fixed-point equation (Equation 3.3).
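In the explicit case, the main non-trivial numerical ingredients are $m_n(-\lambda)$ and its derivative. Below is a small helper (a sketch, not from the paper) computing the Stieltjes transform of the standard Marchenko-Pastur law at a negative real argument via its defining quadratic equation and implicit differentiation; the ratio and sign conventions used here (ratio $\gamma$, e.g. $p/n$) are an assumption and may differ from the paper's definition of $m_n$. The explicit risk formulas (Eqs 3.5, 3.6) themselves are not reproduced here.

```python
import numpy as np

def mp_stieltjes(lam, gamma):
    """Stieltjes transform m(z) of the Marchenko-Pastur law with ratio gamma,
    evaluated at z = -lam < 0, and its derivative m'(z) at that point.

    m solves the quadratic  gamma*z*m^2 - (1 - gamma - z)*m + 1 = 0;
    for z < 0 the root chosen below is the positive one, which is the
    Stieltjes transform m(z) = ∫ dμ(x)/(x - z).  Conventions may differ
    from the paper's m_n."""
    z = -lam
    disc = np.sqrt((1.0 - gamma + lam) ** 2 + 4.0 * gamma * lam)
    m = (disc - (1.0 - gamma + lam)) / (2.0 * gamma * lam)
    # Implicit differentiation of the quadratic gives dm/dz at z = -lam.
    m_prime = (1.0 + gamma * m) * m ** 2 / (1.0 - gamma * z * m ** 2)
    return m, m_prime
```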

Practical Applications and Implementation:

  • Modeling Heterogeneous Data: The variance profile framework is useful for datasets where different samples or features exhibit different variability, such as in mixture models where data comes from multiple subpopulations with distinct statistical properties.
  • Mixture Model Example (MNIST): The paper demonstrates this by deriving variance profiles for each digit class in MNIST. Simulating data using these profiles, the authors show that RF ridge regression exhibits double descent, and the numerically computed square equivalent $E^\square_{test}$ accurately predicts the simulated test error $E_{test}$.
  • Understanding Double Descent: The framework allows studying the double descent phenomenon in these more realistic non-i.i.d. settings. Numerical experiments confirm that the location of the "interpolation peak" depends on the non-linearity of the activation function $h$, similar to the i.i.d. case: near-linear $h$ tends to peak at $m \approx p$, while highly non-linear $h$ peaks at $m \approx n$.
  • Calculating Asymptotic Risks: Practitioners can use the derived square equivalents $E^\square$ to predict the performance of RF ridge regression without extensive simulations.
    • If Assumption 1.1 holds and $\theta_{lin}(h) = 0$, use the explicit formulas (Eqs 3.5, 3.6).
    • Otherwise, implement a fixed-point iteration algorithm to solve Equation 3.3 for $\mathfrak{Q}^\square$ and then compute $E^\square$ using the formulas in Theorem 3.3; a generic iteration skeleton is sketched after this list.
  • Limitations: The analysis relies on several technical assumptions, notably that the activation function $h$ is an odd polynomial (inherited from [DaboMale]). The ground truth model is assumed to be linear.
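Since Equation 3.3 is not reproduced in this summary, the following is only a generic damped fixed-point skeleton of the kind one could wrap around that equation once its map has been implemented. The argument `F` is a hypothetical placeholder standing in for the map defined by Equation 3.3, not the paper's actual formula.

```python
import numpy as np

def solve_fixed_point(F, Q0, damping=0.5, tol=1e-9, max_iter=5000):
    """Generic damped fixed-point iteration Q <- (1 - damping)*Q + damping*F(Q).

    F : placeholder for the map defined by Equation 3.3 of the paper
        (not reproduced here), taking and returning a block matrix Q.
    Q0: initial guess for the deterministic equivalent."""
    Q = Q0
    for _ in range(max_iter):
        Q_new = (1.0 - damping) * Q + damping * F(Q)
        if np.max(np.abs(Q_new - Q)) < tol:
            return Q_new
        Q = Q_new
    raise RuntimeError("Fixed-point iteration did not converge; try stronger damping.")
```

Once the iteration has converged, the traces of the relevant blocks of $\mathfrak{Q}^\square$ are plugged into the risk formulas of Theorem 3.3.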

Key Assumptions for Implementation:

  • High-dimensional regime ($n, p, m$ large and proportional).
  • Sub-exponential tails for underlying randomness (Assumption 2.1).
  • Activation function $h$ is an odd polynomial (Assumption 2.2).
  • Variance profiles are bounded (Assumption 2.3).
  • Spectral norms of key matrices are bounded (Assumption 2.4).
  • For explicit $E^\square$ formulas: constant row sums of the variance profiles (Assumption 1.1) and $\theta_{lin}(h) = 0$ (see the check sketched below).
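A small helper (an assumption-laden sketch, not from the paper) for checking whether the explicit-formula regime plausibly applies to a given profile: constant row sums of $\Gamma_n$ and a vanishing $\theta_{lin}(h)$, the latter estimated by Monte Carlo from the expression $\mathbb{E}[h'(\xi s)]$ quoted above, whose exact convention may differ in the paper.

```python
import numpy as np

def explicit_case_applies(Gamma, h_prime, tol=1e-8, n_mc=200_000, seed=0):
    """Check the two conditions for the explicit square-equivalent formulas:
    (i)  Assumption 1.1: all row sums (1/p) * sum_j gamma_ij^2 equal a common s^2;
    (ii) theta_lin(h) = E[h'(xi * s)] ~ 0 for xi ~ N(0, 1), estimated by Monte Carlo.

    Gamma   : (n, p) variance profile matrix (gamma_ij^2 entries).
    h_prime : derivative of the activation function, applied elementwise."""
    n, p = Gamma.shape
    row_means = Gamma.sum(axis=1) / p
    s2 = row_means.mean()
    rows_constant = np.allclose(row_means, s2, atol=tol * max(1.0, s2))

    rng = np.random.default_rng(seed)
    xi = rng.standard_normal(n_mc)
    theta_lin = np.mean(h_prime(xi * np.sqrt(s2)))
    return rows_constant, theta_lin
```

The explicit formulas would only be used when `rows_constant` is `True` and `theta_lin` is numerically close to zero; otherwise the general fixed-point route applies.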

In summary, this paper provides a significant theoretical extension of the analysis of RF regression to non-i.i.d. data via variance profiles. It offers practical asymptotic formulas ($E^\square$) to predict training and test performance, demonstrated to be accurate in numerical experiments, including capturing double descent behavior in settings such as mixture models. Computing these equivalents generally requires solving a matrix fixed-point equation numerically, except in a specific simplified case.
