
High-dimensional ridge regression with random features for non-identically distributed data with a variance profile (2504.03035v1)

Published 3 Apr 2025 in stat.ML, cs.LG, math.PR, math.ST, stat.ME, and stat.TH

Abstract: The behavior of the random feature model in the high-dimensional regression framework has become a popular issue of interest in the machine learning literature. This model is generally considered for feature vectors $x_i = \Sigma^{1/2} x_i'$, where $x_i'$ is a random vector made of independent and identically distributed (iid) entries, and $\Sigma$ is a positive definite matrix representing the covariance of the features. In this paper, we move beyond this standard assumption by studying the performance of the random features model in the setting of non-iid feature vectors. Our approach is related to the analysis of the spectrum of large random matrices through random matrix theory (RMT) and free probability results. We turn to the analysis of non-iid data by using the notion of variance profile, which is well studied in RMT. Our main contribution is then the study of the limits of the training and prediction risks associated to the ridge estimator in the random features model when its dimensions grow. We provide asymptotic equivalents of these risks that capture the behavior of ridge regression with random features in a high-dimensional framework. These asymptotic equivalents, which prove to be sharp in numerical experiments, are retrieved by adapting, to our setting, established results from operator-valued free probability theory. Moreover, for various classes of random feature vectors that have not been considered so far in the literature, our approach allows us to show the appearance of the double descent phenomenon when the ridge regularization parameter is small enough.


Summary

  • The paper derives asymptotic equivalents for training and test risks in high-dimensional random features ridge regression applied to non-iid data with variable feature variances.
  • It introduces lozenge and square equivalents, where the square equivalent uses free probability to obtain deterministic risk formulas under specific variance profile assumptions.
  • The framework enables predicting double descent and practical performance in heterogeneous data settings, validated on mixture models built from MNIST digit classes.

This paper investigates the performance of Random Features (RF) ridge regression in a high-dimensional setting where the input data features are independent but not identically distributed (non-i.i.d.). This extends previous analyses that typically assume i.i.d. data. The non-i.i.d. structure is captured using the concept of a "variance profile": each feature vector $x_i$ can have components with different variances, modeled as $x_i = \Sigma_i^{1/2} x_i'$ with $\Sigma_i = \mathrm{diag}(\gamma_{ij}^2)$. The data matrix can be written as $X_n = \Upsilon_x \circ X_n'$, where $\Upsilon_x = (\gamma_{ij})$, $\circ$ denotes the entrywise (Hadamard) product, and $\Gamma_n = (\gamma_{ij}^2)$ is the variance profile matrix.
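For concreteness, here is a minimal NumPy sketch (not from the paper) of how data with such a variance profile could be generated. The block structure of the profile below is a purely illustrative placeholder, e.g. mimicking two subpopulations with different feature variances.

```python
import numpy as np

rng = np.random.default_rng(0)
n, p = 500, 300

# Hypothetical variance profile: row i holds the standard deviations gamma_ij of sample i.
# A simple two-block placeholder profile (two "subpopulations"):
Upsilon_x = np.ones((n, p))
Upsilon_x[: n // 2, : p // 2] = 2.0   # first subpopulation: larger variance on the first features
Upsilon_x[n // 2 :, p // 2 :] = 0.5   # second subpopulation: smaller variance on the last features

X_prime = rng.standard_normal((n, p))   # iid entries
X = Upsilon_x * X_prime                 # Hadamard product: X_n = Upsilon_x ∘ X_n'
Gamma = Upsilon_x ** 2                  # variance profile matrix Gamma_n = (gamma_ij^2)
```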

The core problem is standard RF ridge regression: finding weights $\hat{\theta}_\lambda$ to minimize $\frac{1}{n} \| Y_n - H^\top \theta \|^2 + \lambda \| \theta \|^2$, where $H = h(W X_n^\top / \sqrt{p})$ is the matrix of activated random features, $W$ is a matrix of random weights (typically with i.i.d. entries, although the paper also allows a variance profile on $W$, later specializing to a constant profile), and $h$ is an activation function. The analysis assumes the ground truth follows a linear model $y_i = x_i^\top \beta_* + \epsilon_i$ with random coefficients $\beta_*$ (average-case analysis).
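A minimal sketch of this setup in NumPy, under illustrative choices (cubic activation, Gaussian weights, a simple placeholder variance profile). The closed form $\hat{\theta}_\lambda = (HH^\top/n + \lambda I_m)^{-1} H Y_n / n$ follows from the quoted objective; the Monte Carlo training and prediction errors at the end are the kind of simulated risks the asymptotic equivalents below are meant to approximate, and their exact conventions may differ from the paper's $E_{train}$ and $E_{test}$.

```python
import numpy as np

rng = np.random.default_rng(1)
n, p, m = 500, 300, 400          # samples, features, random features
lam, sigma_noise = 1e-3, 0.5     # ridge parameter and noise level (illustrative)

# Data with a simple placeholder variance profile (any profile, e.g. the block one above, works).
Upsilon_x = np.ones((n, p))
Upsilon_x[: n // 2] *= 1.5
X = Upsilon_x * rng.standard_normal((n, p))

# Linear ground truth with random coefficients (average-case analysis).
beta_star = rng.standard_normal(p) / np.sqrt(p)
Y = X @ beta_star + sigma_noise * rng.standard_normal(n)

# Random features H = h(W X^T / sqrt(p)); h is an example odd polynomial activation,
# W has a constant variance profile (iid standard Gaussian entries).
h = lambda t: t ** 3
W = rng.standard_normal((m, p))
H = h(W @ X.T / np.sqrt(p))                     # m x n matrix of activated features

# Ridge estimator minimizing (1/n)||Y - H^T theta||^2 + lam*||theta||^2:
# theta_hat = (H H^T / n + lam I_m)^{-1} H Y / n.
theta_hat = np.linalg.solve(H @ H.T / n + lam * np.eye(m), H @ Y / n)

# Empirical training risk, and a Monte Carlo prediction risk on fresh data drawn
# with the same variance profile.
E_train = np.mean((Y - H.T @ theta_hat) ** 2)

X_new = Upsilon_x * rng.standard_normal((n, p))
Y_new = X_new @ beta_star + sigma_noise * rng.standard_normal(n)
H_new = h(W @ X_new.T / np.sqrt(p))
E_test = np.mean((Y_new - H_new.T @ theta_hat) ** 2)
```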

The main contribution is the derivation of asymptotic equivalents for the training risk $E_{train}(\lambda)$ and the prediction risk $E_{test}(\lambda)$ in the high-dimensional limit where the number of samples $n$, the number of features $p$, and the number of random features $m$ grow proportionally ($p/n \to c_p$, $m/n \to c_m$). Two types of equivalents are derived:

  1. Lozenge Equivalents ($E^\lozenge_{train}$, $E^\lozenge_{test}$): These are obtained by leveraging a generalization of the "linear-plus-chaos" approximation for the matrix $H$, denoted $H^\lozenge$. This approximation, detailed in [DaboMale], accounts for the variance profiles on both $X_n$ and $W$. In the specific case considered, where $W$ has a constant variance profile ($\Upsilon_w = \sigma_w \mathbf{1}$), $H^\lozenge$ simplifies (Equation 1.9) to an expression involving Gaussian matrices $\mathcal{W}, \mathcal{X}_n, Z^G$ and diagonal matrices $D_{lin}(h), D_{chaos}(h)$ capturing the effect of the activation function and the data variance profile $\Upsilon_x$. Proposition 1.1 ensures that $H$ and $H^\lozenge$ behave similarly asymptotically. The lozenge equivalents $E^\lozenge$ are derived by replacing $H, W, X_n$ with $H^\lozenge, \mathcal{W}, \mathcal{X}_n$ in the expressions for the risks (Theorem 3.1). These equivalents still involve taking expectations over the Gaussian matrices.
  2. Square Equivalents ($E^\square_{train}$, $E^\square_{test}$): These are deterministic equivalents that depend only on the model parameters ($\lambda, \alpha^2, \sigma^2$), the dimensions ($n, p, m$), the activation function $h$, and the variance profiles ($\Gamma_n, \tilde{\Gamma}_n$). They are derived using techniques from free probability, specifically the linearization trick and results on random matrices with variance profiles (adapting [bigotmale]). The linearization involves constructing a larger matrix $L$ (Equation 2.6) whose resolvent's blocks relate to the terms needed for the risks. The deterministic equivalent $\mathfrak{Q}^\square(\Lambda)$ of the resolvent $\mathfrak{Q}(\Lambda) = (L - \Lambda)^{-1}$ is found by solving a fixed-point equation (Equation 3.3, Theorem 3.2). Theorem 3.3 shows that $|E_{train} - E^\square_{train}| \to 0$ and $|E_{test} - E^\square_{test}| \to 0$ under certain assumptions.
    • Explicit Case: Under Assumption 1.1 (row sums of variance profiles are constant: $\frac{1}{p} \sum_j \gamma_{ij}^2 = s^2$) and if $\theta_{lin}(h) = \mathbb{E}[h'(\xi s)] = 0$ (e.g., for odd activation functions like $h(x) = x^3$ if $s$ is fixed), the square equivalents have explicit formulas (Equations 1.12, 1.13 / 3.5, 3.6) involving the Stieltjes transform $m_n(-\lambda)$ of the Marchenko-Pastur distribution and its derivative (see the sketch after this list for computing these quantities). The training risk matches results from [adlam2020neural] for the i.i.d. case.
    • General Case: If $\theta_{lin}(h) \neq 0$ (or Assumption 1.1 does not hold), $E^\square$ involves traces of blocks of the matrix $\mathfrak{Q}^\square$, which must be computed by numerically solving the fixed-point equation (Equation 3.3).
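In the explicit case, the main non-trivial numerical ingredients are $m_n(-\lambda)$ and its derivative. Below is a small helper (a sketch, not from the paper) computing the Stieltjes transform of the standard Marchenko-Pastur law at a negative real argument via its defining quadratic equation and implicit differentiation; the ratio and sign conventions used here (ratio $\gamma$, e.g. $p/n$) are an assumption and may differ from the paper's definition of $m_n$. The explicit risk formulas (Eqs 3.5, 3.6) themselves are not reproduced here.

```python
import numpy as np

def mp_stieltjes(lam, gamma):
    """Stieltjes transform m(z) of the Marchenko-Pastur law with ratio gamma,
    evaluated at z = -lam < 0, and its derivative m'(z) at that point.

    m solves the quadratic  gamma*z*m^2 - (1 - gamma - z)*m + 1 = 0;
    for z < 0 the root chosen below is the positive one, which is the
    Stieltjes transform m(z) = ∫ dμ(x)/(x - z).  Conventions may differ
    from the paper's m_n."""
    z = -lam
    disc = np.sqrt((1.0 - gamma + lam) ** 2 + 4.0 * gamma * lam)
    m = (disc - (1.0 - gamma + lam)) / (2.0 * gamma * lam)
    # Implicit differentiation of the quadratic gives dm/dz at z = -lam.
    m_prime = (1.0 + gamma * m) * m ** 2 / (1.0 - gamma * z * m ** 2)
    return m, m_prime
```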

Practical Applications and Implementation:

  • Modeling Heterogeneous Data: The variance profile framework is useful for datasets where different samples or features exhibit different variability, such as in mixture models where data comes from multiple subpopulations with distinct statistical properties.
  • Mixture Model Example (MNIST): The paper demonstrates this by deriving variance profiles for each digit class in MNIST. Simulating data using these profiles, the authors show that RF ridge regression exhibits double descent, and the numerically computed square equivalent $E^\square_{test}$ accurately predicts the simulated test error $E_{test}$.
  • Understanding Double Descent: The framework allows studying the double descent phenomenon in these more realistic non-i.i.d. settings. Numerical experiments confirm that the location of the "interpolation peak" depends on the non-linearity of the activation function $h$, similar to the i.i.d. case: near-linear $h$ tends to peak at $m \approx p$, while highly non-linear $h$ peaks at $m \approx n$.
  • Calculating Asymptotic Risks: Practitioners can use the derived square equivalents $E^\square$ to predict the performance of RF ridge regression without extensive simulations.
    • If Assumption 1.1 holds and $\theta_{lin}(h) = 0$, use the explicit formulas (Eqs 3.5, 3.6).
    • Otherwise, implement a fixed-point iteration algorithm to solve Equation 3.3 for $\mathfrak{Q}^\square$ and then compute $E^\square$ using the formulas in Theorem 3.3; a generic iteration skeleton is sketched after this list.
  • Limitations: The analysis relies on several technical assumptions, notably that the activation function $h$ is an odd polynomial (inherited from [DaboMale]). The ground truth model is assumed to be linear.
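Since Equation 3.3 is not reproduced in this summary, the following is only a generic damped fixed-point skeleton of the kind one could wrap around that equation once its map has been implemented. The argument `F` is a hypothetical placeholder standing in for the map defined by Equation 3.3, not the paper's actual formula.

```python
import numpy as np

def solve_fixed_point(F, Q0, damping=0.5, tol=1e-9, max_iter=5000):
    """Generic damped fixed-point iteration Q <- (1 - damping)*Q + damping*F(Q).

    F : placeholder for the map defined by Equation 3.3 of the paper
        (not reproduced here), taking and returning a block matrix Q.
    Q0: initial guess for the deterministic equivalent."""
    Q = Q0
    for _ in range(max_iter):
        Q_new = (1.0 - damping) * Q + damping * F(Q)
        if np.max(np.abs(Q_new - Q)) < tol:
            return Q_new
        Q = Q_new
    raise RuntimeError("Fixed-point iteration did not converge; try stronger damping.")
```

Once the iteration has converged, the traces of the relevant blocks of $\mathfrak{Q}^\square$ are plugged into the risk formulas of Theorem 3.3.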

Key Assumptions for Implementation:

  • High-dimensional regime ($n, p, m$ large and proportional).
  • Sub-exponential tails for underlying randomness (Assumption 2.1).
  • Activation function $h$ is an odd polynomial (Assumption 2.2).
  • Variance profiles are bounded (Assumption 2.3).
  • Spectral norms of key matrices are bounded (Assumption 2.4).
  • For explicit $E^\square$ formulas: constant row sums of the variance profiles (Assumption 1.1) and $\theta_{lin}(h) = 0$ (see the check sketched below).
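A small helper (an assumption-laden sketch, not from the paper) for checking whether the explicit-formula regime plausibly applies to a given profile: constant row sums of $\Gamma_n$ and a vanishing $\theta_{lin}(h)$, the latter estimated by Monte Carlo from the expression $\mathbb{E}[h'(\xi s)]$ quoted above, whose exact convention may differ in the paper.

```python
import numpy as np

def explicit_case_applies(Gamma, h_prime, tol=1e-8, n_mc=200_000, seed=0):
    """Check the two conditions for the explicit square-equivalent formulas:
    (i)  Assumption 1.1: all row sums (1/p) * sum_j gamma_ij^2 equal a common s^2;
    (ii) theta_lin(h) = E[h'(xi * s)] ~ 0 for xi ~ N(0, 1), estimated by Monte Carlo.

    Gamma   : (n, p) variance profile matrix (gamma_ij^2 entries).
    h_prime : derivative of the activation function, applied elementwise."""
    n, p = Gamma.shape
    row_means = Gamma.sum(axis=1) / p
    s2 = row_means.mean()
    rows_constant = np.allclose(row_means, s2, atol=tol * max(1.0, s2))

    rng = np.random.default_rng(seed)
    xi = rng.standard_normal(n_mc)
    theta_lin = np.mean(h_prime(xi * np.sqrt(s2)))
    return rows_constant, theta_lin
```

The explicit formulas would only be used when `rows_constant` is `True` and `theta_lin` is numerically close to zero; otherwise the general fixed-point route applies.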

In summary, this paper provides a significant theoretical extension of the analysis of RF regression to non-i.i.d. data via variance profiles. It offers practical asymptotic formulas ($E^\square$) to predict training and test performance, demonstrated to be accurate in numerical experiments, including capturing double descent behavior in settings such as mixture models. Computing these equivalents generally requires solving a matrix fixed-point equation numerically, except in a specific simplified case.
