
Unregularized Least Squares Method

Updated 4 October 2025
  • Unregularized Least Squares Method is a linear regression approach that estimates parameters by minimizing the sum of squared residuals without adding penalty terms.
  • It employs algorithmic innovations like LU factorization and simplified Gram–Schmidt orthogonalization to compute coefficients efficiently and enhance numerical stability.
  • In high-dimensional settings, generalized estimators such as LAT and RAT enable reliable variable screening and support recovery under mild conditions without inducing shrinkage bias.

The unregularized least squares method, often referred to as ordinary least squares (OLS) or linear least squares (LLS), represents a foundational approach in statistical inference and data fitting, particularly for linear models. This method seeks parameter estimates that minimize the residual sum of squares without introducing explicit regularization (penalty) terms. Across classical and high-dimensional regimes, as well as in algorithmic innovations, OLS and its unregularized generalizations remain central to statistical theory and practice.

1. Mathematical Formulation and Classical Properties

Given a response vector $y \in \mathbb{R}^n$ and predictor matrix $X \in \mathbb{R}^{n \times p}$, the unregularized least squares estimator solves:

$$\hat{\beta}^{(OLS)} = \arg\min_{\beta \in \mathbb{R}^p} \|y - X\beta\|_2^2$$

When $X^\top X$ is invertible (typically $n \gg p$ and $X$ is full rank), the unique minimizer is:

$$\hat{\beta}^{(OLS)} = (X^\top X)^{-1} X^\top y$$

This estimator is unbiased, achieves minimum variance among linear unbiased estimators (the Gauss–Markov theorem), and has an explicitly computable covariance when errors are homoskedastic.
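
Under homoskedastic errors with variance $\sigma^2$, this covariance is the standard expression

$$\operatorname{Cov}(\hat{\beta}^{(OLS)} \mid X) = \sigma^2 (X^\top X)^{-1}.$$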

In the classical regime, properties such as support recovery and estimator error bounds are well understood. The method, however, encounters challenges when $p > n$, because $X^\top X$ becomes singular.
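
As a concrete illustration, the closed-form solution can be evaluated directly on simulated data. The following minimal sketch (numpy; the design, coefficients, and noise level are arbitrary illustrative choices) contrasts the normal-equations solution with a factorization-based least-squares solver; the two coincide whenever $X$ has full column rank.

```python
import numpy as np

# Simulated full-rank design with n >> p (illustrative values).
rng = np.random.default_rng(0)
n, p = 200, 5
X = rng.normal(size=(n, p))
beta_true = np.array([1.5, -2.0, 0.0, 0.5, 3.0])
y = X @ beta_true + rng.normal(scale=0.1, size=n)

# Closed form via the normal equations: solve (X^T X) beta = X^T y.
beta_normal_eq = np.linalg.solve(X.T @ X, X.T @ y)

# Numerically preferable in practice: an orthogonal-factorization-based solver.
beta_lstsq, *_ = np.linalg.lstsq(X, y, rcond=None)

print(np.allclose(beta_normal_eq, beta_lstsq))  # True up to floating-point error
```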

2. Algorithmic and Structural Innovations

Recent advances provide alternative characterizations and algorithmic solutions that avoid matrix inversion and explicit normalization, expanding the toolkit for unregularized least squares estimation.

LU Factorization without Inversion

A constructive approach decomposes the Gram matrix via LU factorization:

$$X^\top X = L U$$

If $U$ is the upper-triangular factor, individual coefficients can be calculated iteratively through back substitution:

  • For $i = p$,

$$\hat{\beta}_p = u_{p,y} / u_{pp}$$

  • For $i = p-1, \ldots, 1$ (proceeding in decreasing order),

$$\hat{\beta}_i = \frac{u_{i,y}}{u_{ii}} - \sum_{j=i+1}^{p} \hat{\beta}_j \frac{u_{ij}}{u_{ii}}$$

where the $u_{i,y}$ are computed using inner products with the dependent variable $y$. This approach circumvents matrix inversion, increases numerical stability for ill-conditioned problems, and allows selective coefficient computation (Madar et al., 2023).
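
The sketch below illustrates the inversion-free computation, assuming the Gram matrix admits an LU factorization without pivoting (true, in particular, whenever $X^\top X$ is positive definite); it is a minimal illustration of the idea rather than the reference implementation of Madar et al. (2023).

```python
import numpy as np

def ols_via_lu(X, y):
    """OLS coefficients from an LU factorization of the Gram matrix, using
    only forward and back substitution (no matrix inversion). Minimal sketch;
    assumes X^T X admits LU factorization without pivoting."""
    G = X.T @ X              # Gram matrix
    b = X.T @ y              # right-hand side of the normal equations
    p = G.shape[0]

    # Doolittle LU: L unit lower triangular, U upper triangular, G = L U.
    L, U = np.eye(p), np.zeros((p, p))
    for i in range(p):
        U[i, i:] = G[i, i:] - L[i, :i] @ U[:i, i:]
        L[i + 1:, i] = (G[i + 1:, i] - L[i + 1:, :i] @ U[:i, i]) / U[i, i]

    # Forward substitution L c = b; c_i plays the role of u_{i,y} above.
    c = np.zeros(p)
    for i in range(p):
        c[i] = b[i] - L[i, :i] @ c[:i]

    # Back substitution on U, mirroring the iterative formulas above.
    beta = np.zeros(p)
    for i in range(p - 1, -1, -1):
        beta[i] = (c[i] - U[i, i + 1:] @ beta[i + 1:]) / U[i, i]
    return beta

rng = np.random.default_rng(1)
X = rng.normal(size=(100, 4))
y = X @ np.array([2.0, -1.0, 0.5, 3.0]) + rng.normal(scale=0.1, size=100)
print(np.allclose(ols_via_lu(X, y), np.linalg.lstsq(X, y, rcond=None)[0]))  # True
```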

Simplified Gram–Schmidt Orthogonalization

A related, normalization-free Gram–Schmidt procedure (“SGSO”) yields orthogonal vectors $\{q_i\}$:

$$q_1 = x_1, \qquad q_i = x_i - \sum_{j=1}^{i-1} \frac{\langle q_j, x_i \rangle}{\langle q_j, q_j \rangle}\, q_j$$

This yields a triangular matrix $U = Q^\top X$, and the OLS coefficients can be recovered as:

$$\hat{\beta}_i = (q_i^\circ)^\top \left[ \prod_{k=i+1}^{p} \left( I - x_k (q_k^\circ)^\top \right) \right] y, \qquad q_i^\circ = q_i / \langle q_i, q_i \rangle$$

This iterative, projection-based procedure avoids explicit normalization and enables efficient algorithmic implementation (Madar et al., 2023).
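
A minimal sketch of this recursion is given below, assuming a full-rank design whose columns are processed in their given order; the coefficients are recovered from the last column backwards by successively peeling off fitted columns, which is an equivalent way of applying the product formula above.

```python
import numpy as np

def ols_via_sgso(X, y):
    """OLS via normalization-free (simplified) Gram-Schmidt orthogonalization.
    Minimal sketch; assumes X has full column rank."""
    n, p = X.shape
    Q = np.zeros((n, p))
    for i in range(p):
        q = X[:, i].astype(float)
        for j in range(i):  # subtract unnormalized projections onto earlier q_j
            q = q - (Q[:, j] @ X[:, i]) / (Q[:, j] @ Q[:, j]) * Q[:, j]
        Q[:, i] = q

    # Recover coefficients from the last column backwards:
    # beta_i = <q_i, r> / <q_i, q_i>, then remove beta_i * x_i from the working vector.
    beta = np.zeros(p)
    r = np.asarray(y, dtype=float).copy()
    for i in range(p - 1, -1, -1):
        beta[i] = (Q[:, i] @ r) / (Q[:, i] @ Q[:, i])
        r -= beta[i] * X[:, i]
    return beta

rng = np.random.default_rng(2)
X = rng.normal(size=(80, 4))
y = X @ np.array([1.0, 0.0, -2.0, 0.5]) + rng.normal(scale=0.1, size=80)
print(np.allclose(ols_via_sgso(X, y), np.linalg.lstsq(X, y, rcond=None)[0]))  # True
```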

3. High-Dimensional Generalizations without Penalization

In high-dimensional settings ($p > n$), classical OLS breaks down due to singularity of $X^\top X$. A generalized estimator, motivated by ridge regression, is constructed as follows:

$$\hat{\beta}^{(HD)} = \lim_{r \to 0} X^\top (X X^\top + r I_n)^{-1} y = X^\top (X X^\top)^{-1} y$$

where inversion is performed on the $n \times n$ matrix $X X^\top$, which can be full rank even for $p > n$.

This “in-projection” estimator captures the component of $y$ in the row space of $X$, multiplied back into predictor space. Notably, the entries of $\hat{\beta}^{(HD)}$ have proven effective for variable screening and selection:

  • In sparse models, the magnitudes $\{|\hat{\beta}_i^{(HD)}|\}$ separate strong from weak predictors with high probability, as formalized in thresholding inequalities.
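
As a minimal sketch (simulated sparse model; the dimensions, signal strengths, and screening size $d$ are arbitrary illustrative choices), $\hat{\beta}^{(HD)}$ can be computed by solving only an $n \times n$ linear system and then used to rank predictors by magnitude:

```python
import numpy as np

# Illustrative sparse high-dimensional setup (p > n).
rng = np.random.default_rng(3)
n, p, s = 100, 200, 3
X = rng.normal(size=(n, p))
beta_true = np.zeros(p)
beta_true[:s] = [6.0, -8.0, 10.0]          # strong signals on the first s predictors
y = X @ beta_true + rng.normal(size=n)

# beta_hd = X^T (X X^T)^{-1} y : only the n x n matrix X X^T is factored.
beta_hd = X.T @ np.linalg.solve(X @ X.T, y)

# Screen by coefficient magnitude (illustrative screening size d).
d = 10
top = np.argsort(np.abs(beta_hd))[::-1][:d]
print(sorted(top.tolist()))  # with high probability the true support {0, 1, 2} appears here
```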

Two three-stage, non-iterative algorithms are built on this estimator [Editor’s term: “generalized unregularized OLS”]:

  • LAT (Least-squares Adaptive Thresholding): (i) Standardize data, compute $\hat{\beta}^{(HD)}$, select top $d$ variables; (ii) fit OLS in selected submodel, hard threshold small coefficients; (iii) refit and finalize model.
  • RAT (Ridge Adaptive Thresholding): Similar, but uses ridge regression in stages 2–3 to improve conditioning (Wang et al., 2015).
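
A schematic sketch of the three-stage idea follows; the screening size, noise-variance estimate, and threshold form are illustrative assumptions, not the specific tuning rules of Wang et al. (2015).

```python
import numpy as np

def lat_style_fit(X, y, d=None, tau=None):
    """Schematic three-stage pipeline in the spirit of LAT: (i) screen using the
    generalized estimator, (ii) OLS refit on the screened submodel with hard
    thresholding, (iii) final refit on the surviving variables.
    The default d and tau below are illustrative assumptions."""
    n, p = X.shape
    d = d if d is not None else n // 2
    # Stage (i): screening via beta_hd = X^T (X X^T)^{-1} y.
    beta_hd = X.T @ np.linalg.solve(X @ X.T, y)
    top = np.argsort(np.abs(beta_hd))[::-1][:d]
    # Stage (ii): OLS on the screened submodel, then hard-threshold small coefficients.
    b_sub, *_ = np.linalg.lstsq(X[:, top], y, rcond=None)
    resid = y - X[:, top] @ b_sub
    sigma_hat = np.sqrt(resid @ resid / max(n - d, 1))
    tau = tau if tau is not None else sigma_hat * np.sqrt(2.0 * np.log(p) / n)
    keep = top[np.abs(b_sub) > tau]
    # Stage (iii): refit on the retained variables only.
    beta = np.zeros(p)
    if keep.size:
        beta[keep], *_ = np.linalg.lstsq(X[:, keep], y, rcond=None)
    return beta
```

Replacing the two least-squares refits with ridge regressions would give a RAT-style variant; in either case only linear solves are involved, with no iterative penalty tuning.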

Compared to $\ell_1$-based methods, these approaches:

  • Do not require penalty tuning.
  • Avoid shrinkage-induced bias.
  • Rely on mild conditions for support recovery (finite noise variance, manageable condition number).
  • Offer non-iterative, parallelizable algorithms.

4. Alternative Geometric Interpretations: Least Squares as Random Walks

A geometric and statistical reinterpretation frames unregularized LLS in terms of the net area annihilation of a “data walk.” For $N$ equispaced points $(x_k, y_k)$, define mean-adjusted values $y_k' = y_k - \bar{y}$ and their cumulative sum (the data walk) $z_j = \sum_{k=1}^{j} y_k'$, with $z_0 = 0$.

The trend in the data is identified as the slope $\alpha$ that zeroes the signed area under the data walk:

$$A(y) = -\sum_{j=1}^{N} z_j$$

For a linear trend $y_k = \alpha x_k$ with unit-spaced $x_k = 1, \ldots, N$, one has $A(y) = \alpha A(x)$, where $A(x) = N(N^2 - 1)/12$. Thus the area-balancing slope estimate is:

$$\alpha = -\frac{12}{N(N^2 - 1)} \sum_{j=1}^{N} z_j$$

It is shown that this expression for $\alpha$ is algebraically identical to the conventional LLS slope for uniformly spaced $x_k$. This equivalence reveals least squares as the “detrending” operation that balances a cumulative walk, akin to setting the net area of a Brownian bridge to zero (Kostinski et al., 26 Mar 2025).
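
This identity is straightforward to verify numerically; the sketch below (unit-spaced $x_k = 1, \ldots, N$, with an arbitrary illustrative slope and noise level) compares the area-balancing estimate with the conventional LLS slope.

```python
import numpy as np

rng = np.random.default_rng(4)
N = 100
x = np.arange(1, N + 1, dtype=float)
y = 0.7 * x + rng.normal(scale=2.0, size=N)    # noisy linear trend (illustrative)

z = np.cumsum(y - y.mean())                    # the "data walk"
alpha_walk = -12.0 / (N * (N**2 - 1)) * z.sum()
alpha_lls = np.polyfit(x, y, 1)[0]             # conventional LLS slope

print(np.isclose(alpha_walk, alpha_lls))       # True: the two estimates coincide
```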

This geometric perspective:

  • Provides an intuitive, visual framework for understanding LLS in terms of random walks.
  • Is robust to noise distribution (Gaussian or otherwise), being purely based on summations.
  • Admits reinterpretation of standard error and statistical significance in terms of random walk theory.
  • Invites the use of stochastic process tools for further statistical analysis.

5. Computational and Theoretical Properties

The unregularized least squares estimators retain several computational and theoretical advantages:

  • Non-iterative Computation: LU factorization and SGSO avoid inversion and enable efficient back or forward substitution.
  • Parallelization: Matrix multiplications and screening steps in LAT/RAT are suitable for parallel computing architectures (Wang et al., 2015).
  • Numerical Stability: Avoiding explicit inversion mitigates the effects of ill-conditioning in design matrices (Madar et al., 2023).
  • Support Recovery and Error Bounds: In high-dimensional settings, under suitable conditions (mild assumptions on condition number and noise variance), the LAT and RAT algorithms achieve:
    • Reliable strong/weak signal separation in screening.
    • Rate-optimal recovery of true support.
    • Error bounds such as $\ell_\infty$-error scaling as $\sigma \sqrt{\log p / n^{\alpha}}$.

Numerical experiments in diverse settings (independent predictors, compound symmetry, group structures, real data) confirm that generalized OLS-based three-stage methods yield competitive RMSE and shorter computation times than penalization-based estimators.

6. Practical Applications and Model Selection

Unregularized least squares methodologies, both classical and high-dimensional, are applied extensively in regression analysis, signal recovery, and exploratory screening when explicit interpretability and unbiasedness are desired.

Key practical features include:

  • Flexibility in high-dimensional settings via generalized estimators (Wang et al., 2015).
  • The ability to compute coefficients directly, or selectively, using LU or SGSO approaches (Madar et al., 2023).
  • Geometric and visualization-aiding interpretations for trend removal in time series and sequential data analysis (Kostinski et al., 26 Mar 2025).
  • Model selection via ranking and thresholding, augmented by data-adaptive thresholds (e.g., using estimated noise variance and log-factors reflecting multiple testing corrections).

These advances enable efficient, interpretable model fitting in analytic, computational, and applied contexts without the need for regularization.

7. Comparison with Penalized and Regularized Approaches

Unregularized least squares differs fundamentally from penalized frameworks (such as lasso, SCAD, and ridge regression):

  • No penalty is imposed; coefficient bias due to shrinkage is thus avoided.
  • LAT/RAT algorithms sidestep the need for careful penalty parameter tuning.
  • Fewer probabilistic assumptions are required (finite noise variance suffices; sub-Gaussianity and strong irrepresentability are not necessary).
  • Theoretical support recovery and consistency are delivered under milder conditions.
  • Computation is rooted in linear algebraic primitives suited for high-throughput environments.

A plausible implication is that in problems where explicit sparsity penalization may introduce an undesirable bias or where tuning is impractical, these unregularized frameworks offer efficiency, clarity, and strong support recovery, as reflected in their empirical and theoretical performance (Wang et al., 2015, Madar et al., 2023, Kostinski et al., 26 Mar 2025).
