
High-Dimensional Regression

Updated 11 October 2025
  • High-dimensional regression is a statistical framework for cases when the number of predictors exceeds the sample size, relying on sparsity to reduce complexity.
  • Non-asymptotic oracle bounds and minimax theory characterize the achievable prediction risk, which methods such as the square-root Lasso and LinSelect attain adaptively.
  • Adaptive tuning and data-driven estimator selection enhance variable recovery and computational efficiency in modern applications.

High-dimensional regression refers to statistical modeling and inference in regression settings where the number of covariates (p) is comparable to or exceeds the number of observations (n), with particular emphasis on scenarios where p ≫ n. Such regimes arise in genomics, image processing, economics, and many modern experimental sciences. Key distinguishing features of high-dimensional regression include the breakdown of classical consistency guarantees, the necessity of sparsity or low-dimensional structure assumptions, and the centrality of non-asymptotic (finite-sample) analysis and robust, data-driven tuning procedures.

1. Statistical Framework and Notions of Sparsity

The canonical problem is linear regression, $Y = X\beta_0 + \varepsilon$, where $Y \in \mathbb{R}^n$, $X \in \mathbb{R}^{n \times p}$, $\beta_0 \in \mathbb{R}^p$ is the unknown signal, and $\varepsilon \sim N(0, \sigma^2 I_n)$ with unknown noise variance $\sigma^2$. The emphasis is on achieving low prediction risk

\mathbb{E}\left[\|X(\hat{\beta} - \beta_0)\|_2^2\right]

even in the case of unknown $\sigma^2$, which precludes the use of standard plug-in penalty levels in regularization.

To overcome the curse of dimensionality, structural assumptions are imposed:

  • Coordinate sparsity: Only $k \ll p$ entries of $\beta_0$ are nonzero. Risk bounds then scale as $C\, k \log p \, \sigma^2$, reflecting both sparsity and model-selection complexity.
  • Group sparsity: The covariates are partitioned into groups, and entire groups are either active or inactive. For a group structure $G_1, \ldots, G_M$, estimation often involves a group-Lasso penalty:

\min_\beta \|Y - X\beta\|_2^2 + \sum_k \lambda_k \|\beta^{(G_k)}\|_2

  • Variation sparsity: The difference vector $v_j = \beta_{0,j+1} - \beta_{0,j}$ is sparse. Problems such as signal segmentation (when $X = I$) are included here.

Each sparsity type requires distinct estimation and regularization approaches.
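
As a concrete illustration of the coordinate-sparse case, the following minimal sketch simulates a $p \gg n$ Gaussian design with a $k$-sparse signal and fits a Lasso with scikit-learn. The dimensions, signal strength, and penalty level are illustrative choices, not values taken from the source.

```python
# Minimal sketch of the coordinate-sparse setting: a p >> n Gaussian design, a
# k-sparse signal, and a Lasso fit via scikit-learn.
import numpy as np
from sklearn.linear_model import Lasso

rng = np.random.default_rng(0)
n, p, k, sigma = 100, 500, 5, 1.0

X = rng.standard_normal((n, p))
beta0 = np.zeros(p)
beta0[:k] = 3.0                                # k nonzero coordinates
y = X @ beta0 + sigma * rng.standard_normal(n)

# Penalty of order sigma * sqrt(2 log p / n) for sklearn's (1/2n)-scaled objective;
# sigma is used here only because it is known in simulation.
lam = sigma * np.sqrt(2 * np.log(p) / n)
lasso = Lasso(alpha=lam).fit(X, y)

pred_risk = np.sum((X @ (lasso.coef_ - beta0)) ** 2)
print(f"prediction risk ||X(beta_hat - beta0)||^2 = {pred_risk:.2f}")
print(f"selected support size = {np.count_nonzero(lasso.coef_)}")
```

The pivotal and selection-based procedures of Section 3 remove the dependence on the known noise level used in this sketch.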

2. Non-Asymptotic Oracle Bounds and Minimax Theory

In non-asymptotic analysis, risk bounds and optimality must hold for finite $n$, $p$, and $k$. The minimax prediction risk for a coordinate-sparse signal with $k$ nonzero entries is

R_{\text{minimax}} \sim [k \log(p/k) \wedge n] \, \sigma^2

imposing the classical tradeoff that high-dimensional adaptation is feasible when $k \log p / n \ll 1$ (the "non-ultra-high-dimensional" setting). In the regime $k \log p \gtrsim n$ ("ultra-high-dimensional"), adaptation to both unknown variance and sparsity incurs additional risk.
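
A quick way to see which regime a given instance falls into is to evaluate the ratio $k \log p / n$; the hypothetical helper below applies the boundary loosely at 1.

```python
# Quick regime check: the boundary between the non-ultra-high-dimensional and
# ultra-high-dimensional regimes is k * log(p) / n of order 1.
import numpy as np

def regime(n: int, p: int, k: int) -> str:
    ratio = k * np.log(p) / n
    label = "ultra-high-dimensional" if ratio >= 1 else "non-ultra-high-dimensional"
    return f"k log p / n = {ratio:.2f} -> {label}"

print(regime(n=100, p=500, k=5))    # comfortably below the boundary
print(regime(n=100, p=500, k=20))   # at or above the boundary
```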

Oracle inequalities of the type

\mathbb{E}\left[\|X(\hat{\beta} - \beta_0)\|_2^2\right] \leq C_1 \inf_{\beta \neq 0} \left\{\|X(\beta - \beta_0)\|_2^2 + \|\beta\|_0 \log p \, \sigma^2\right\}

quantify estimator performance relative to an oracle knowing the true active set. Group and variation sparsity structures yield analogous minimax and oracle forms, with the complexity terms reflecting group cardinalities or jump counts, respectively.
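
To connect these bounds to something computable, the sketch below estimates the Lasso's prediction risk by Monte Carlo and compares it with the $k \log p \, \sigma^2$ benchmark. Constants and dimensions are illustrative, and the comparison does not verify the inequality's constants.

```python
# Monte Carlo sketch: estimate the Lasso's prediction risk E||X(beta_hat - beta0)||^2
# and compare it with the k * log(p) * sigma^2 benchmark from the bounds above.
import numpy as np
from sklearn.linear_model import Lasso

rng = np.random.default_rng(1)
n, p, k, sigma, reps = 100, 300, 5, 1.0, 50

X = rng.standard_normal((n, p))
beta0 = np.zeros(p)
beta0[:k] = 2.0
lam = sigma * np.sqrt(2 * np.log(p) / n)

risks = []
for _ in range(reps):
    y = X @ beta0 + sigma * rng.standard_normal(n)
    beta_hat = Lasso(alpha=lam).fit(X, y).coef_
    risks.append(np.sum((X @ (beta_hat - beta0)) ** 2))

print(f"Monte Carlo prediction risk : {np.mean(risks):.1f}")
print(f"k log p * sigma^2 benchmark : {k * np.log(p) * sigma ** 2:.1f}")
```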

3. Pivotal Estimators and Adaptation Strategies: Tuning without Known Variance

In high-dimensional regimes, penalty levels (e.g., in the Lasso or group-Lasso) canonically depend on the unknown $\sigma$. Approaches that bypass the unknown variance include:

Ad-hoc pivotalization: Modify estimators so that their tuning parameter is independent of $\sigma$.

  • Square-root Lasso (a.k.a. scaled Lasso) replaces the penalized least-squares objective with:

\hat{\beta}_\lambda^{\mathrm{SR}} = \arg\min_{\beta \in \mathbb{R}^p} \left\{ \sqrt{\|Y - X\beta\|_2^2} + \frac{\lambda}{\sqrt{n}}\|\beta\|_1 \right\}

For $\lambda = c \sqrt{2 \log p}$, this estimator achieves, under compatibility conditions such as $\kappa[\xi, T]$, nearly optimal oracle bounds with high probability:

\|X(\hat{\beta}^{\mathrm{SR}} - \beta_0)\|_2^2 \leq \inf_{\beta \neq 0} \left\{\|X(\beta - \beta_0)\|_2^2 + C \frac{\|\beta\|_0 \log p}{\kappa^2[4, \mathrm{supp}(\beta)]} \sigma^2\right\}

  • Generalization to group penalties is achieved through analogous square-root (pivotal) forms; a solver-based sketch of the square-root Lasso is given below.
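
A minimal sketch of the square-root Lasso objective, assuming the cvxpy convex-optimization library; the solver choice and the constant c are illustrative, not prescribed by the source. Note that the penalty level involves no estimate of $\sigma$.

```python
# Minimal square-root Lasso sketch: ||y - X beta||_2 + (lambda / sqrt(n)) ||beta||_1
# with lambda = c * sqrt(2 log p), which does not depend on the noise level.
import cvxpy as cp
import numpy as np

def sqrt_lasso(X: np.ndarray, y: np.ndarray, c: float = 1.1) -> np.ndarray:
    n, p = X.shape
    lam = c * np.sqrt(2 * np.log(p))
    beta = cp.Variable(p)
    objective = cp.norm(y - X @ beta, 2) + (lam / np.sqrt(n)) * cp.norm(beta, 1)
    cp.Problem(cp.Minimize(objective)).solve()
    return beta.value
```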

Data-driven estimator selection: Build a collection of candidate estimators over a grid of tuning parameters and select among them using a non-asymptotic, data-adaptive criterion.

  • Cross-validation (e.g., 10-fold) remains a standard choice, especially when computational resources are not the bottleneck.
  • LinSelect introduces a criterion

\operatorname{Crit}(\lambda) = \inf_{S \in \mathbb{S}} \left\{ \|Y - \Pi_S(X\hat{\beta}_\lambda)\|_2^2 + \frac{1}{2}\|X\hat{\beta}_\lambda - \Pi_S(X\hat{\beta}_\lambda)\|_2^2 + \operatorname{pen}(S)\hat{\sigma}_S^2 \right\}

with $\mathbb{S}$ a suitable family of subspaces and $\operatorname{pen}(S)$ reflecting model complexity (typically involving log-binomial terms in the dimension). LinSelect's theoretical guarantee is that the risk of the selected estimator is close to the oracle risk within the candidate family, and the procedure is computationally efficient; a simplified sketch of the selection step follows below.
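
The following is a highly simplified, illustrative sketch of a LinSelect-style selection step. Two simplifying assumptions are made: the infimum over $\mathbb{S}$ is restricted to the span of each candidate's own support, and $\operatorname{pen}(S)$ is replaced by an assumed penalty of order $\dim(S)\log p$. The actual LinSelect procedure (available as the R package LINselect) uses a carefully calibrated log-binomial penalty.

```python
# Simplified LinSelect-style criterion over a Lasso path (illustrative only).
import numpy as np
from sklearn.linear_model import lasso_path

def linselect_style(X: np.ndarray, y: np.ndarray, n_lambdas: int = 20):
    n, p = X.shape
    _, coefs, _ = lasso_path(X, y, n_alphas=n_lambdas)   # candidates beta_hat_lambda
    best_crit, best_beta = np.inf, None
    for j in range(coefs.shape[1]):
        beta = coefs[:, j]
        support = np.flatnonzero(beta)
        if support.size == 0 or support.size >= n // 2:
            continue                                      # skip degenerate models
        XS = X[:, support]
        P = XS @ np.linalg.pinv(XS)                       # projection onto span(X_S)
        fitted = X @ beta
        sigma2_S = np.sum((y - P @ y) ** 2) / (n - support.size)  # residual variance
        pen_S = 2.0 * support.size * np.log(p)                    # assumed penalty shape
        crit = (np.sum((y - P @ fitted) ** 2)
                + 0.5 * np.sum((fitted - P @ fitted) ** 2)
                + pen_S * sigma2_S)
        if crit < best_crit:
            best_crit, best_beta = crit, beta
    return best_beta
```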

4. Empirical Assessments of Tuning Procedures

Simulation studies (n = p = 100, 165 synthetic regression settings) enable direct risk ratio and support recovery comparisons:

  • Prediction tasks: Both 10-fold CV and LinSelect produce risk ratios close to 1 (median risk not exceeding the oracle); the square-root Lasso exhibits generally higher, sometimes substantially higher, risk ratios and greater variance.
  • Variable selection: The Gauss-Lasso (least squares refitted on the Lasso support) with LinSelect tuning yields lower false discovery rates than CV; the square-root Lasso gives low FDR but can also reduce power. This illustrates a nuanced tradeoff between power and error control that is sensitive to the choice of tuning algorithm (the metric computations are sketched after the table below).
  • Computational efficiency: LinSelect and the square-root Lasso are substantially faster than cross-validation, which matters as n increases or when models must be tuned repeatedly.
| Tuning procedure | Prediction risk ratio (median) | Variable selection FDR | Computational time |
|---|---|---|---|
| LinSelect | ~1 (oracle-level) | Low | Fast |
| 10-fold CV | ~1 (oracle-level) | Moderate | Slow (especially for large n) |
| Square-root Lasso | Higher median, higher variance | Low | Fast |
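
For reference, the sketch below shows how the quantities compared above can be computed in a simulation: a Gauss-Lasso refit (least squares on the Lasso support), the false discovery rate of a selected support, and a prediction-risk ratio relative to a user-supplied oracle fit (e.g., OLS on the true support). Function names and the choice of oracle are illustrative.

```python
# Evaluation metrics for simulation studies: Gauss-Lasso refit, FDR, and risk ratio.
import numpy as np
from sklearn.linear_model import Lasso, LinearRegression

def gauss_lasso(X, y, alpha):
    """Least-squares refit on the support selected by a Lasso with penalty `alpha`."""
    support = np.flatnonzero(Lasso(alpha=alpha).fit(X, y).coef_)
    beta = np.zeros(X.shape[1])
    if support.size:
        beta[support] = LinearRegression(fit_intercept=False).fit(X[:, support], y).coef_
    return beta, support

def fdr(selected, true_support):
    """Fraction of selected variables that are false discoveries."""
    if selected.size == 0:
        return 0.0
    return np.setdiff1d(selected, true_support).size / selected.size

def risk_ratio(X, beta_hat, beta_oracle, beta0):
    """Prediction risk of beta_hat relative to an oracle fit supplied by the user."""
    return (np.sum((X @ (beta_hat - beta0)) ** 2)
            / np.sum((X @ (beta_oracle - beta0)) ** 2))
```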

5. Extensions: Multivariate and Nonparametric High-Dimensional Regression

The key issues and techniques extend beyond univariate linear models:

  • Gaussian graphical models: Methods designed for fixed-design regression (e.g., square-root Lasso, LinSelect) can be applied conditionally on X, but risk should then be measured in an integrated sense over the random design (e.g., weighting errors by $\Sigma^{1/2}$).
  • Multivariate regression: The parameter is now a matrix $B_0$, with structural assumptions such as row sparsity (group-sparse rows) or low rank. Analogous pivotalization (e.g., square-root group-Lasso, nuclear-norm penalties) and non-asymptotic risk bounds can be achieved; see the sketch after this list.
  • Nonparametric regression: Bandwidth or smoothing-parameter selection (the analogue of tuning the regularization level) is central. Non-asymptotic selection procedures such as the slope heuristic or LinSelect are adapted to linear estimators, including kernel and spline smoothers, ensuring proper adaptation to the unknown variance.
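
For the row-sparse multivariate case, a minimal sketch using scikit-learn's MultiTaskLasso, which applies an l2/l1 (group) penalty across the rows of the coefficient matrix; dimensions, noise level, and the penalty value are illustrative.

```python
# Row-sparse multivariate regression via MultiTaskLasso (illustrative setup).
import numpy as np
from sklearn.linear_model import MultiTaskLasso

rng = np.random.default_rng(2)
n, p, q, k = 80, 200, 5, 4                    # q responses, k active rows of B0

X = rng.standard_normal((n, p))
B0 = np.zeros((p, q))
B0[:k, :] = rng.standard_normal((k, q))       # row-sparse coefficient matrix
Y = X @ B0 + 0.5 * rng.standard_normal((n, q))

model = MultiTaskLasso(alpha=0.1).fit(X, Y)   # coef_ has shape (q, p)
active_rows = np.flatnonzero(np.linalg.norm(model.coef_.T, axis=1))
print("estimated active rows:", active_rows)
```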

This illustrates a broader principle: the challenge of simultaneous adaptation to unknown sparsity and variance in high-dimensional regimes is not specific to linear models but is ubiquitous across modern statistics.

6. Fundamental Limits and Mathematical Expressions

Central mathematical constructs include:

  • Prediction risk:

\mathcal{R}[\hat{\beta}; \beta_0] = \mathbb{E}_{\beta_0}\left[\|X(\hat{\beta} - \beta_0)\|_2^2\right]

  • Oracle inequalities:

\mathcal{R}[\hat{\beta};\beta_0] \leq C \|\beta_0\|_0 \log p \ \sigma^2

or

\mathcal{R}[\hat{\beta};\beta_0] \leq C_1 \inf_{\beta\neq 0}\left\{\|X(\beta-\beta_0)\|_2^2 + \|\beta\|_0 \log p \ \sigma^2\right\}

  • Key estimator definitions:

    • Square-root Lasso:

    \hat{\beta}^{(\mathrm{SR})}_\lambda = \underset{\beta \in \mathbb{R}^p}{\arg\min}\ \sqrt{\|Y - X\beta\|_2^2} + \frac{\lambda}{\sqrt{n}}\|\beta\|_1

    • Group-Lasso (a solver-based sketch follows after this list):

    \hat{\beta}_{\lambda} = \underset{\beta}{\arg\min}\ \|Y-X\beta\|_2^2 + \sum_k \lambda_k \|\beta^{(G_k)}\|_2

    • LinSelect criterion:

    \operatorname{Crit}(\lambda) = \inf_{S\in \mathbb{S}} \left\{\|Y - \Pi_S(X\hat{\beta}_\lambda)\|_2^2 + \frac{1}{2}\|X\hat{\beta}_\lambda - \Pi_S(X\hat{\beta}_\lambda)\|_2^2 + \operatorname{pen}(S)\hat{\sigma}_S^2\right\}
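
A minimal solver-based sketch of the group-Lasso objective above, again assuming cvxpy; the per-group weights $\lambda_k = \lambda\sqrt{|G_k|}$ are a common convention, not prescribed by the source.

```python
# Group-Lasso sketch: squared loss plus a sum of weighted l2 norms over groups.
import cvxpy as cp
import numpy as np

def group_lasso(X: np.ndarray, y: np.ndarray, groups, lam: float) -> np.ndarray:
    """`groups` is a list of index arrays partitioning {0, ..., p-1}."""
    p = X.shape[1]
    beta = cp.Variable(p)
    penalty = sum(np.sqrt(len(g)) * cp.norm(beta[g], 2) for g in groups)
    objective = cp.sum_squares(y - X @ beta) + lam * penalty
    cp.Problem(cp.Minimize(objective)).solve()
    return beta.value

# Example grouping: 50 contiguous groups of equal size for a 200-dimensional beta.
# groups = np.array_split(np.arange(200), 50)
```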

7. Significance and Outlook

High-dimensional regression with unknown variance integrates non-asymptotic statistical theory, pivotalization of tuning, and modern selection procedures. The analysis reveals that while powerful methods such as Lasso and group-Lasso facilitate sparse estimation, their effectiveness in real-world high-dimensional settings depends critically on adaptive and computationally efficient tuning algorithms that do not require knowledge of the noise level. The square-root Lasso and LinSelect exemplify feasible, theoretically justified strategies. Extensive empirical studies confirm that these procedures achieve near-oracle prediction risk, robust variable selection, and scalability. The principles and estimator construction generalize to various complex settings, ensuring that adaptive non-asymptotic methodology remains at the forefront of high-dimensional inference (Giraud et al., 2011).

References (1)