
LOCO Estimation: Feature Importance & Inference

Updated 21 August 2025
  • LOCO Estimation is a model-agnostic method for quantifying feature importance by measuring the change in predictive performance when a feature is removed.
  • It provides robust predictive inference in high-dimensional settings through techniques like leave-one-out residuals and penalized estimator comparisons.
  • Extensions such as interaction LOCO and conditional tests enhance variable selection and control error rates in complex modeling scenarios.

Leave-One-Covariate-Out (LOCO) Estimation is a statistical and machine learning methodology that quantifies the importance of a covariate (feature) by assessing the change in a predictive model’s performance when that covariate is removed. This wrapper-type approach is notable for its model-agnostic nature, robust inferential properties in high-dimensional contexts, and extensibility to interaction and conditional inference. The LOCO principle underlies recent developments in predictive interval construction, causal inference, randomized experiments, conditional independence testing, and variable selection in both low- and high-dimensional regimes.

1. Mathematical Definition and Core Principles

LOCO estimation procedures compare predictive performance metrics between a model fit on the full covariate set and models fit after excluding one or more covariates. For regression problems, a canonical LOCO importance parameter for feature $X_j$ is defined as

$$\psi_{0,j}^{\text{loco}} = V(f_0, P_0) - V(f_{0,-j}, P_{0,-j})$$

where $f_0$ is the full model predictor, $f_{0,-j}$ is the model learned without $X_j$, and $V(\cdot, \cdot)$ denotes a performance metric (e.g., mean squared error) under distributions $P_0$ and $P_{0,-j}$ (Zheng et al., 19 Aug 2025). The empirical estimator is

$$\hat{\psi}_{0,j}^{\text{loco}} = \frac{1}{n} \sum_{i=1}^n \left\{[Y_i - f_n(X_i)]^2 - [Y_i - f_{n,-j}(X_{i,-j})]^2\right\}$$

where $f_n$ and $f_{n,-j}$ are the fitted models on all features and the subset excluding $j$, respectively.
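A minimal sketch of this empirical estimator is given below: it refits a model with the $j$th column removed and averages the difference in squared errors on a held-out split, following the sign convention of the display above. The regressor choice, the train/test split, and the name `loco_importance` are illustrative assumptions rather than part of any cited implementation.

```python
# Sketch of the empirical LOCO importance: refit without column j and compare
# held-out squared errors. Sign convention follows the display above
# (full-model error minus reduced-model error).
import numpy as np
from sklearn.base import clone
from sklearn.ensemble import RandomForestRegressor
from sklearn.model_selection import train_test_split

def loco_importance(model, X, y, j, test_size=0.3, random_state=0):
    X_tr, X_te, y_tr, y_te = train_test_split(
        X, y, test_size=test_size, random_state=random_state)
    full = clone(model).fit(X_tr, y_tr)                            # f_n
    reduced = clone(model).fit(np.delete(X_tr, j, axis=1), y_tr)   # f_{n,-j}
    err_full = (y_te - full.predict(X_te)) ** 2
    err_reduced = (y_te - reduced.predict(np.delete(X_te, j, axis=1))) ** 2
    return np.mean(err_full - err_reduced)

# Toy usage: feature 0 is strongly predictive, so its LOCO value is large in magnitude.
rng = np.random.default_rng(0)
X = rng.normal(size=(500, 5))
y = 2.0 * X[:, 0] + X[:, 1] + rng.normal(size=500)
print(loco_importance(RandomForestRegressor(n_estimators=200, random_state=0), X, y, j=0))
```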

In high-dimensional regularization contexts (e.g., LASSO), LOCO estimation quantifies the difference in penalized coefficient paths when a feature is omitted. If $\widehat{\beta}(\lambda)$ is the LASSO solution at regularization parameter $\lambda$, the LOCO path statistic for the $j$th feature is

$$T_j(s,t) = \|\widehat{\beta} - \widehat{\beta}^{(-j)}\|_{(s,t)}$$

aggregating the discrepancy across the path (Cao et al., 2020).
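As an illustration of the path-based statistic, the sketch below computes the LASSO coefficient path with and without feature $j$ and aggregates the discrepancy along the path; the particular aggregation used here (the supremum over the path of the $\ell_1$ gap) is an illustrative stand-in for the $(s,t)$ norm of Cao et al. (2020), not the paper's exact statistic.

```python
# Sketch of a LOCO path statistic: compare LASSO coefficient paths fit with
# and without feature j, then aggregate the gap along the path.
import numpy as np
from sklearn.linear_model import lasso_path

def loco_path_statistic(X, y, j, alphas):
    _, coefs_full, _ = lasso_path(X, y, alphas=alphas)          # shape (p, n_alphas)
    _, coefs_red, _ = lasso_path(np.delete(X, j, axis=1), y, alphas=alphas)
    coefs_red = np.insert(coefs_red, j, 0.0, axis=0)            # align indexing with the full path
    # Illustrative aggregation: sup over lambda of the L1 difference.
    return np.max(np.sum(np.abs(coefs_full - coefs_red), axis=0))

# Toy usage
rng = np.random.default_rng(1)
X = rng.normal(size=(200, 10))
y = X[:, 0] - 0.5 * X[:, 3] + rng.normal(size=200)
print(loco_path_statistic(X, y, j=0, alphas=np.logspace(-2, 0, 25)))
```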

2. LOCO in High-Dimensional Regression and Prediction Interval Construction

Prediction intervals with uniform asymptotic validity in high-dimensional settings are constructed via LOCO leave-one-out residuals. For observations $y_k = \beta'x_k + u_k$, a point prediction for a new $y_0$ is $x_0'\hat{\beta}$. The conditional distribution of $y_0 - x_0'\hat{\beta}$ is approximated using leave-one-out residuals:

$$\tilde{u}_i = y_i - x_i'\hat{\beta}_{(i)}$$

with $\hat{\beta}_{(i)}$ the estimator excluding the $i$th data point. Empirical quantiles of $\{\tilde{u}_i\}$ define the prediction interval:

$$PI_{\alpha}^{\text{L1O}}(Y, X, x_0) = \left[x_0'\hat{\beta} + \tilde{q}_{n,\alpha/2},\ x_0'\hat{\beta} + \tilde{q}_{n,1-\alpha/2}\right]$$

This interval is shown to achieve asymptotic nominal coverage for a broad class of estimators, including least squares, robust M-estimators, shrinkage methods, and penalized procedures like LASSO, even when $p \sim n$ or $p > n$ (Steinberger et al., 2016).

Key conditions:

  • Invariance to data ordering, ensuring residual exchangeability
  • Scaled estimation error converges to a constant limit
  • Influence of any observation on $\hat{\beta}$ is asymptotically negligible

The LOCO interval adapts its length to estimator performance via a parameter $\tau$ quantifying error magnitude; more accurate predictors yield shorter intervals.
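For ordinary least squares the leave-one-out residuals have a closed form via the hat matrix, so the interval above can be sketched without refitting $n$ times; the shortcut below is specific to OLS, and other estimators covered by the theory would require actual leave-one-out refits.

```python
# Sketch of the leave-one-out residual prediction interval for OLS.
# u_tilde_i = (y_i - x_i' beta_hat) / (1 - h_ii) is the OLS-specific shortcut
# for the residual obtained by refitting without observation i.
import numpy as np

def loo_prediction_interval(X, y, x0, alpha=0.1):
    XtX_inv = np.linalg.inv(X.T @ X)
    beta_hat = XtX_inv @ X.T @ y
    leverages = np.einsum("ij,jk,ik->i", X, XtX_inv, X)        # h_ii
    u_tilde = (y - X @ beta_hat) / (1.0 - leverages)           # leave-one-out residuals
    lo, hi = np.quantile(u_tilde, [alpha / 2, 1 - alpha / 2])  # empirical quantiles
    point = float(x0 @ beta_hat)
    return point + lo, point + hi

# Toy usage
rng = np.random.default_rng(2)
X = rng.normal(size=(300, 8))
y = X @ rng.normal(size=8) + rng.normal(size=300)
print(loo_prediction_interval(X, y, x0=rng.normal(size=8)))
```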

3. LOCO in Covariate Adjustment for Experiments

LOCO motivates estimator designs for randomized trials aiming to adjust for baseline covariate imbalances:

  • LOOP Estimator (Leave-One-Out Potential Outcomes): For each unit, exclude it from the imputation dataset and predict its treated and control outcomes using flexible algorithms (e.g., random forests). The individual treatment effect estimator is

$$\hat{\tau}_i = (Y_i - \hat{m}_i)\, U_i$$

with $\hat{m}_i$ constructed via leave-one-out prediction, and $U_i$ the signed inverse probability weight. The average effect is $\hat{\tau} = \frac{1}{N}\sum_i \hat{\tau}_i$. This design-based estimator is exactly unbiased under the Neyman–Rubin potential outcomes framework and enhances precision via automatic variable selection when using machine learning methods. Variance formulas quantify gains over standard difference estimators (Wu et al., 2017); a code sketch of this leave-one-out construction follows the list below.

  • P-LOOP Estimator: Extends LOCO to paired experiments using leave-one-pair-out imputations, balancing precision with respect to pair assignments—avoiding overadjustment and guarding against variance increases from modeling the pair structure unnecessarily (Wu et al., 2019).
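The sketch below illustrates the LOOP construction for a Bernoulli-randomized experiment with known treatment probability $p$: each unit's treated and control outcomes are imputed from all other units with random forests and combined into $\hat{m}_i$. The $(1-p, p)$ combination weights and the per-unit refitting loop are presentational choices that should be checked against Wu et al. (2017); a production implementation would typically use out-of-bag predictions rather than $n$ separate fits.

```python
# Hedged sketch of a LOOP-style estimator: leave each unit out, impute its
# treated and control outcomes from the remaining units, and combine with the
# signed inverse probability weight U_i.
import numpy as np
from sklearn.ensemble import RandomForestRegressor

def loop_estimate(X, T, Y, p=0.5):
    n = len(Y)
    tau_i = np.empty(n)
    for i in range(n):
        keep = np.arange(n) != i
        Xk, Tk, Yk = X[keep], T[keep], Y[keep]
        # Leave-one-out imputations of unit i's treated and control outcomes.
        rf_t = RandomForestRegressor(n_estimators=50, random_state=0)
        rf_c = RandomForestRegressor(n_estimators=50, random_state=0)
        t_hat = rf_t.fit(Xk[Tk == 1], Yk[Tk == 1]).predict(X[i:i + 1])[0]
        c_hat = rf_c.fit(Xk[Tk == 0], Yk[Tk == 0]).predict(X[i:i + 1])[0]
        m_hat = (1 - p) * t_hat + p * c_hat          # assumed combination weights
        U_i = T[i] / p - (1 - T[i]) / (1 - p)        # signed inverse probability weight
        tau_i[i] = (Y[i] - m_hat) * U_i
    return tau_i.mean()
```

The design-based unbiasedness noted above rests on the fact that unit $i$'s own outcome never enters $\hat{m}_i$.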

4. LOCO in Conditional Independence and Hypothesis Testing

Conditional randomization tests and variable importance assessment are generalized via leave-one-covariate-out approaches:

  • LOCO Conditional Randomization Test (LOCO CRT): For each variable, construct a test statistic by comparing predictive performance with and without the variable, and generate its reference distribution by resampling the left-out covariate from its conditional distribution given the remaining covariates. Valid p-values for individual features are computed from the proportion of resampled statistics exceeding the observed one:

$$p_j = \frac{1 + \sum_{b=1}^B I\{T_j^{(b)} \geq T_{\text{obs}}\}}{B+1}$$

Familywise error rate is controlled directly. For L1-regularized M-estimators, the L1ME CRT variant leverages the stability of cross-validated lasso selections for computational speed. In multivariate Gaussian designs, closed-form p-values eliminate resampling (Katsevich et al., 2020). A resampling-based sketch of this p-value appears after this list.

  • LOCO in High-Dimensional Penalized Inference: LOCO measures the impact of individual variables on the whole regularization path, allowing for simultaneous hypothesis testing and variable screening with robust bootstrap calibrations (Cao et al., 2020).
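The resampling-based p-value referenced above can be sketched as follows. The sampler `sample_xj_given_rest`, which draws from the known or modelled conditional law of $X_j$ given the other covariates, is an assumed user-supplied ingredient, as is the use of in-sample LOCO-style error differences as the test statistic.

```python
# Hedged sketch of a LOCO conditional randomization test for feature j:
# compare the observed LOCO statistic against B copies computed after
# resampling X_j from its conditional distribution given X_{-j}.
import numpy as np
from sklearn.base import clone

def loco_crt_pvalue(model, X, y, j, sample_xj_given_rest, B=200, seed=0):
    rng = np.random.default_rng(seed)

    def loco_stat(X_cur):
        X_red = np.delete(X_cur, j, axis=1)
        err_full = np.mean((y - clone(model).fit(X_cur, y).predict(X_cur)) ** 2)
        err_red = np.mean((y - clone(model).fit(X_red, y).predict(X_red)) ** 2)
        return err_red - err_full      # oriented so larger values indicate importance

    t_obs = loco_stat(X)
    exceed = 0
    for _ in range(B):
        X_b = X.copy()
        X_b[:, j] = sample_xj_given_rest(np.delete(X, j, axis=1), rng)  # draw X_j | X_{-j}
        exceed += loco_stat(X_b) >= t_obs
    return (1 + exceed) / (B + 1)
```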

5. LOCO for Feature Importance, Interaction, and Efficiency Comparisons

Feature importance estimation in model-agnostic and black-box models relies on LOCO metrics:

  • LOCO is defined as the performance change (e.g., increase in MSE) when a feature is omitted, a nonparametric analog of $R^2$ (Verdinelli et al., 2021). Decorrelated LOCO variants address interpretation difficulties due to covariate correlation by targeting parameters where covariates are rendered independent either by reweighting or semiparametric projection (e.g., $\psi_2 = \beta^\top \Sigma_x \beta$).
  • Interaction LOCO (iLOCO): Quantifies the effect of pairwise (or higher-order) feature interactions via the difference in LOCO statistics:

$$\text{iLOCO}_{j,k} = \Delta_{j} + \Delta_{k} - \Delta_{j,k}$$

where $\Delta_j$ measures the error increase when $j$ is removed and $\Delta_{j,k}$ when both are removed. Ensemble minipatch methods efficiently compute iLOCO and corresponding confidence intervals in large datasets (Little et al., 10 Feb 2025); a direct-refit sketch appears after this list.

  • Comparisons with Regression-Based Measures (Generalized Covariance Measure [GCM]): LOCO requires retraining for each feature omitted and produces easily interpretable importance metrics. GCM uses residual covariance between features and outcomes given the rest, enjoying efficiency advantages (lower coefficient of variation) in linear, additive, and single-index models. For linear regression:

$$\text{LOCO cv} \sim \frac{1}{\sqrt{n}\,\beta_j^2}\sqrt{\operatorname{Var}(\tilde{X}_j^2) + 4\sigma^2/\beta_j^2\, E(\tilde{X}_j^2)}$$

$$\text{GCM cv} \sim \frac{1}{\sqrt{n}\,\beta_j}\sqrt{\operatorname{Var}(\tilde{X}_j^2) + \sigma^2/\beta_j^2\, E(\tilde{X}_j^2)}$$

where $\tilde{X}_j = X_j - E[X_j \mid X_{-j}]$ (Zheng et al., 19 Aug 2025).
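A direct-refit sketch of the iLOCO score for a pair $(j,k)$ is given below; it estimates each $\Delta$ as the held-out MSE increase after dropping the corresponding column(s), in contrast with the minipatch ensembles that Little et al. (2025) use for scalability and confidence intervals.

```python
# Sketch of iLOCO_{j,k} = Delta_j + Delta_k - Delta_{j,k} via direct refitting.
import numpy as np
from sklearn.base import clone
from sklearn.model_selection import train_test_split

def _delta(model, X_tr, X_te, y_tr, y_te, drop):
    """Held-out MSE increase after dropping the columns listed in `drop`."""
    full = np.mean((y_te - clone(model).fit(X_tr, y_tr).predict(X_te)) ** 2)
    Xtr_d, Xte_d = np.delete(X_tr, drop, axis=1), np.delete(X_te, drop, axis=1)
    reduced = np.mean((y_te - clone(model).fit(Xtr_d, y_tr).predict(Xte_d)) ** 2)
    return reduced - full

def iloco(model, X, y, j, k, test_size=0.3, random_state=0):
    X_tr, X_te, y_tr, y_te = train_test_split(
        X, y, test_size=test_size, random_state=random_state)
    return (_delta(model, X_tr, X_te, y_tr, y_te, [j])
            + _delta(model, X_tr, X_te, y_tr, y_te, [k])
            - _delta(model, X_tr, X_te, y_tr, y_te, [j, k]))
```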

6. LOCO in Information-Theoretic Generalization Bounds

LOCO estimation links to generalization theory via mutual information measures. The leave-one-out conditional mutual information (CMI) between the loss vector $L$ and the held-out sample index $U$, given the data $Z$, controls the generalization error:

$$I(L; U \mid Z)$$

For interpolating algorithms under 0–1 loss, risk is bounded by

$$R \leq \frac{I(L; U \mid Z)}{\log(n+1)}$$

and connections are made between leave-one-out error, conditional entropy, and risk (Haghifam et al., 2022). Applications include optimal generalization bounds for VC classes using the one-inclusion graph algorithm.
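To see why the bound is meaningful, note that if the held-out index $U$ is uniform over the $n+1$ sample positions, as in the leave-one-out CMI setup, then

$$I(L; U \mid Z) \;\le\; H(U) \;=\; \log(n+1) \quad\Longrightarrow\quad R \;\le\; \frac{I(L; U \mid Z)}{\log(n+1)} \;\le\; 1,$$

so the bound is informative precisely when the losses reveal little about which sample was held out.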

7. Practical Considerations, Limitations, and Extensions

LOCO methodologies are widespread in modern statistical analysis, applicable to regression, causal inference, experimental design, machine learning model interpretation, and feature selection. They are computationally intensive when retraining is needed for every feature omitted, motivating efficient variants—e.g., ensemble minipatch approaches, out-of-bag predictions, dropout approximations, and Lazy–VI in neural networks. Limitations include susceptibility to covariance structure (e.g., masking of correlated features) and efficiency losses relative to regression-based measures under certain regularity conditions.

Extensions include distribution-free inference for feature interactions (iLOCO), decorrelation techniques for variable importance, familywise error rate control in conditional independence testing, and robust predictive interval estimation in high-dimensional settings.

LOCO remains a central approach for model-agnostic assessment of variable contribution, predictive uncertainty quantification, and interpretable machine learning—continually advanced by theoretical results and scalable algorithms in contemporary research.
