Random Lasso: An Advancement in Variable Selection for Linear Models
The paper "Random Lasso" by Wang et al. proposes an innovative enhancement to the traditional lasso method for variable selection in linear regression models. Unlike standard lasso techniques, which may falter in situations involving highly correlated variables or datasets with more predictors than observations, the random lasso method addresses these challenges through a computationally intensive, two-step bootstrapping approach.
Methodology and Implementation
The random lasso method builds on the foundational principles of the lasso by applying it repeatedly across numerous bootstrap samples, each time with a randomized subset of covariates. The two-step procedure, sketched in code after this list, proceeds as follows:
- Generating Importance Measures: In the first step, bootstrap samples are drawn from the dataset. For each sample, a fixed number of covariates is selected uniformly at random, and the lasso is applied to estimate regression coefficients. An importance measure is then computed for each covariate as the absolute value of its estimated coefficient averaged across samples.
- Variable Selection: The second step draws a new set of bootstrap samples. Covariates are again selected for each sample, this time with selection probabilities proportional to the importance measures from the first step. The lasso, or optionally the adaptive lasso, is then applied, and the final coefficient for each covariate is obtained by averaging its estimates across samples.
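A minimal sketch of the two-step procedure appears below. It follows the structure described above, but the helper names, the use of scikit-learn's LassoCV for the inner fits, and the default values for the number of bootstrap draws and candidate-set sizes are illustrative assumptions, not the authors' reference implementation.

```python
import numpy as np
from sklearn.linear_model import LassoCV

def random_lasso(X, y, B=200, q1=None, q2=None, random_state=0):
    """Illustrative two-step random lasso; q1/q2 are the candidate-set
    sizes for steps 1 and 2 (tuning parameters in the paper)."""
    n, p = X.shape
    rng = np.random.default_rng(random_state)
    q1 = q1 or max(1, p // 2)
    q2 = q2 or max(1, p // 2)

    def averaged_fit(q, weights=None):
        """Average lasso coefficients over B bootstrap samples, drawing
        q candidate covariates per sample (uniformly, or by `weights`)."""
        if weights is not None:
            q = min(q, np.count_nonzero(weights))  # cannot draw more covariates than have nonzero weight
        coef_sum = np.zeros(p)
        for _ in range(B):
            rows = rng.integers(0, n, size=n)                # bootstrap sample of observations
            cols = rng.choice(p, size=q, replace=False, p=weights)
            fit = LassoCV(cv=5).fit(X[np.ix_(rows, cols)], y[rows])
            coef_sum[cols] += fit.coef_                      # unsampled covariates contribute 0
        return coef_sum / B

    # Step 1: uniform candidate sampling -> importance measures
    importance = np.abs(averaged_fit(q1))
    if importance.sum() == 0:
        return np.zeros(p)                                   # degenerate case: nothing selected
    # Step 2: importance-weighted candidate sampling -> final averaged coefficients
    return averaged_fit(q2, weights=importance / importance.sum())
```

Because unsampled covariates contribute zeros to the average, the final coefficients reflect both how often a covariate is useful and how large its effect is when included.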
The value of this methodology lies in its systematic exploration of each predictor's contribution, which addresses a known limitation of the traditional lasso: from a group of highly correlated variables, it tends to select a single member and discard the rest, or to exclude the group entirely.
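As a toy illustration of this behavior (our own example, not from the paper), the snippet below creates two nearly identical predictors that jointly drive the response; a single lasso fit typically credits one of them and zeroes out the other.

```python
import numpy as np
from sklearn.linear_model import Lasso

rng = np.random.default_rng(1)
n = 200
z = rng.normal(size=n)
x1 = z + 0.01 * rng.normal(size=n)                 # x1 and x2 are almost perfectly correlated
x2 = z + 0.01 * rng.normal(size=n)
X = np.column_stack([x1, x2, rng.normal(size=n)])  # third column is pure noise
y = 3 * z + rng.normal(size=n)                     # the shared signal "belongs" to both x1 and x2

print(Lasso(alpha=0.1).fit(X, y).coef_)
# Typically prints something like [2.8, 0.0, 0.0]: the lasso attributes
# the entire shared effect to a single member of the correlated pair.
```

Averaging over many randomized fits, as random lasso does, lets each member of the pair receive part of the shared signal instead.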
Empirical Validation and Results
Through a series of simulation studies, the paper demonstrates the efficacy of random lasso across various scenarios, including those with correlated predictors and those where the number of predictors exceeds the number of observations. The empirical evaluations show:
- Superior prediction accuracy and more reliable variable selection for random lasso compared with established methods such as the elastic net, adaptive lasso, and relaxed lasso.
- Effective handling of highly correlated predictors, with sound selection and coefficient estimation even when correlated predictors affect the response in opposite directions.
The analysis of a real-world glioblastoma microarray dataset underscores the practical utility of random lasso. By selecting and evaluating a sizable set of genes, the method demonstrates its robustness on complex biomedical data, revealing gene expression patterns potentially associated with patient survival.
Implications and Future Directions
The random lasso method has substantial implications for high-dimensional data analysis, especially in domains like genomics where predictors are numerous and their relationships intricate. By aggregating models fit to bootstrap samples, the method adds flexibility and robustness to covariate selection, improving generalizability and interpretability in predictive modeling.
Looking ahead, further exploration into optimizing the computational demands of the random lasso could bolster its applicability to even larger datasets. There is potential to refine the selection and weighting schemes for covariates to further minimize bias and variance in high-dimensional contexts. Moreover, integrating random lasso with other machine learning methodologies could yield hybrid models that capitalize on its superior variable selection capabilities.
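On the computational point, one concrete avenue (our own suggestion, not a proposal from the paper) follows from the fact that the bootstrap fits in both steps are mutually independent, so they parallelize trivially. The sketch below, reusing the assumptions of the earlier sketch, distributes the step-1 fits across cores with joblib; the same pattern applies to step 2.

```python
import numpy as np
from joblib import Parallel, delayed
from sklearn.linear_model import LassoCV

def one_fit(X, y, q, seed):
    """A single bootstrap draw plus lasso fit; returns a length-p coefficient vector."""
    rng = np.random.default_rng(seed)
    n, p = X.shape
    rows = rng.integers(0, n, size=n)
    cols = rng.choice(p, size=q, replace=False)
    coefs = np.zeros(p)
    coefs[cols] = LassoCV(cv=5).fit(X[np.ix_(rows, cols)], y[rows]).coef_
    return coefs

def parallel_importance(X, y, B=200, q=None, n_jobs=-1):
    """Step-1 importance measures, with the B independent fits run in parallel."""
    p = X.shape[1]
    q = q or max(1, p // 2)
    fits = Parallel(n_jobs=n_jobs)(delayed(one_fit)(X, y, q, seed) for seed in range(B))
    return np.abs(np.mean(fits, axis=0))
```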
In summary, the random lasso provides an enriched framework for variable selection that addresses some inherent limitations of existing lasso-based methods, with promising applications across diverse fields characterized by complex data structures.