Data-Driven Logistic Regression Ensembles With Applications in Genomics
Abstract: Advances in data collecting technologies in genomics have significantly increased the need for tools designed to study the genetic basis of many diseases. Effective statistical methods should excel in both prediction accuracy and biomarker identification. We introduce a novel approach to high-dimensional binary classification that integrates regularization with ensembling techniques. Our method constructs compact ensembles of interpretable models derived by optimizing a global objective function. In medical genomics applications, our approach identifies critical biomarkers overlooked by competing methods. We develop a variable importance ranking system to help researchers prioritize promising genes. The method's asymptotic properties are established, and an efficient computational algorithm is provided. Through extensive simulations across complex scenarios and analysis of genomics datasets for cancer, multiple sclerosis, and psoriasis, we demonstrate strong predictive performance. Based on our numerical experiments, we offer practical guidelines for determining optimal ensemble size.
Paper Prompts
Sign up for free to create and run prompts on this paper using GPT-5.
Top Community Prompts
Collections
Sign up for free to add this paper to one or more collections.