On the Optimal Weighted $\ell_2$ Regularization in Overparameterized Linear Regression

Published 10 Jun 2020 in stat.ML, cs.LG, math.ST, and stat.TH | (2006.05800v4)

Abstract: We consider the linear model $\mathbf{y} = \mathbf{X} \mathbf{\beta}_\star + \mathbf{\epsilon}$ with $\mathbf{X}\in \mathbb{R}^{n\times p}$ in the overparameterized regime $p>n$. We estimate $\mathbf{\beta}_\star$ via generalized (weighted) ridge regression: $\hat{\mathbf{\beta}}_\lambda = \left(\mathbf{X}^T\mathbf{X} + \lambda \mathbf{\Sigma}_w\right)^\dagger \mathbf{X}^T\mathbf{y}$, where $\mathbf{\Sigma}_w$ is the weighting matrix. Under a random design setting with general data covariance $\mathbf{\Sigma}_x$ and anisotropic prior on the true coefficients $\mathbb{E}\mathbf{\beta}_\star\mathbf{\beta}_\star^T = \mathbf{\Sigma}_\beta$, we provide an exact characterization of the prediction risk $\mathbb{E}(y-\mathbf{x}^T\hat{\mathbf{\beta}}_\lambda)^2$ in the proportional asymptotic limit $p/n\rightarrow \gamma \in (1,\infty)$. Our general setup leads to a number of interesting findings. We outline precise conditions that decide the sign of the optimal setting $\lambda_{\rm opt}$ for the ridge parameter $\lambda$ and confirm the implicit $\ell_2$ regularization effect of overparameterization, which theoretically justifies the surprising empirical observation that $\lambda_{\rm opt}$ can be negative in the overparameterized regime. We also characterize the double descent phenomenon for principal component regression (PCR) when both $\mathbf{X}$ and $\mathbf{\beta}_\star$ are anisotropic. Finally, we determine the optimal weighting matrix $\mathbf{\Sigma}_w$ for both the ridgeless ($\lambda\to 0$) and optimally regularized ($\lambda = \lambda_{\rm opt}$) cases, and demonstrate the advantage of the weighted objective over standard ridge regression and PCR.
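
The generalized ridge estimator in the abstract can be illustrated with a minimal NumPy sketch. This is not the authors' code: the function name `weighted_ridge`, the synthetic Gaussian data, the noise level, and the diagonal choice of $\mathbf{\Sigma}_w$ are all assumptions made purely for illustration, and the sketch only evaluates the estimator at a positive $\lambda$ (it does not compute the asymptotic risk or $\lambda_{\rm opt}$).

```python
import numpy as np


def weighted_ridge(X, y, Sigma_w, lam):
    """Generalized ridge estimate (X^T X + lam * Sigma_w)^+ X^T y.

    Hypothetical helper: uses the Moore-Penrose pseudoinverse, matching
    the dagger in the abstract's formula.
    """
    gram = X.T @ X + lam * Sigma_w
    return np.linalg.pinv(gram) @ (X.T @ y)


# Synthetic overparameterized problem (p > n); all values are illustrative.
rng = np.random.default_rng(0)
n, p = 50, 100
X = rng.standard_normal((n, p))
beta_star = rng.standard_normal(p) / np.sqrt(p)
y = X @ beta_star + 0.1 * rng.standard_normal(n)

# Assumed weighting matrix: an arbitrary positive diagonal, not the
# optimal Sigma_w derived in the paper.
Sigma_w = np.diag(np.linspace(0.5, 2.0, p))
beta_hat = weighted_ridge(X, y, Sigma_w, lam=1.0)

# With Sigma_w = I the estimator reduces to standard ridge regression.
beta_ridge = weighted_ridge(X, y, np.eye(p), lam=1.0)
```

Setting `Sigma_w` to the identity recovers standard ridge regression, which is the baseline the weighted objective is compared against.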