Nonparametric Bayesian Knockoff Generators for Feature Selection Under Complex Data Structure (2111.06985v2)
Abstract: The recent proliferation of high-dimensional data, such as electronic health records and genetics data, offers new opportunities to find novel predictors of outcomes. Given a large set of candidate features, interest often lies in selecting those most likely to be predictive of an outcome for further study, and it is often desirable to control the false discovery rate (FDR) at a specified level when evaluating these variables. Knockoff filtering is an innovative strategy for conducting FDR-controlled feature selection. This paper proposes a nonparametric Bayesian model for generating high-quality knockoff copies that can improve the accuracy of predictive feature identification for variables arising from complex distributions, which can be skewed, highly dispersed and/or a mixture of distributions. We give a detailed description of how to generate knockoff copies from a Gaussian Dirichlet process mixture (GDPM) model via Markov chain Monte Carlo (MCMC) posterior sampling. Additionally, we provide a theoretical guarantee on the robustness of the knockoff procedure. In simulations, the method identifies important features with accurate FDR control and improved power over the popular second-order Gaussian knockoff generator. The model is also compared with a finite Gaussian mixture knockoff generator in terms of FDR and power. The proposed technique is applied to detect genes predictive of survival in ovarian cancer patients using data from The Cancer Genome Atlas (TCGA).
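
The abstract does not spell out the selection step, so the following is only a minimal sketch of the standard knockoff+ threshold from the knockoff filtering literature, not the paper's own implementation. It assumes that knockoff copies have already been generated (by any generator) and that a feature statistic W_j has been computed for each feature, with large positive values indicating that the original feature looks more important than its knockoff. The function names `knockoff_plus_threshold` and `knockoff_select` and the example statistics are hypothetical.

```python
import math

def knockoff_plus_threshold(W, q):
    """Return the knockoff+ threshold for statistics W at target FDR level q.

    The threshold is the smallest t among the nonzero |W_j| such that
    (1 + #{j : W_j <= -t}) / max(1, #{j : W_j >= t}) <= q,
    or infinity if no such t exists.
    """
    candidates = sorted({abs(w) for w in W if w != 0})
    for t in candidates:
        n_neg = sum(1 for w in W if w <= -t)  # proxy count of false discoveries
        n_pos = sum(1 for w in W if w >= t)   # features that would be selected
        if (1 + n_neg) / max(1, n_pos) <= q:
            return t
    return math.inf

def knockoff_select(W, q):
    """Return the indices of features whose statistic reaches the threshold."""
    tau = knockoff_plus_threshold(W, q)
    return [j for j, w in enumerate(W) if w >= tau]

# Illustrative (made-up) statistics for ten features.
W_example = [5, 4, 3, 2.5, 2, 1.5, -1, 0.5, -0.3, 0.2]
selected = knockoff_select(W_example, q=0.2)
```

With these made-up statistics the threshold is 1.5 and the first six features are selected. The quality of the knockoff generator matters because the FDR guarantee behind this threshold relies on the knockoff copies being valid stand-ins for the original features.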