Distributed variable screening for generalized linear models (2405.04254v2)
Abstract: In this article, we develop a distributed variable screening method for generalized linear models. The method is designed for settings where both the sample size and the number of covariates are large. Specifically, it selects relevant covariates using a sparsity-restricted surrogate likelihood estimator, which accounts for the joint effects of the covariates rather than only their marginal effects; this characteristic enhances the reliability of the screening results. We establish the sure screening property of the proposed method, which ensures that, with high probability, the true model is included in the selected model. Simulation studies are conducted to evaluate the finite-sample performance of the proposed method, and an application to a real dataset showcases its practical utility.
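To make the idea concrete, below is a minimal sketch of distributed, joint-effect screening for a logistic GLM. It is an illustration under stated assumptions, not the paper's algorithm: it assumes a logistic link, approximates the surrogate likelihood step by averaging local gradients across data shards (one communication round per iteration), and imposes the sparsity restriction by iterative hard thresholding. The function names (`surrogate_screen`, `local_gradient`) and all tuning constants are hypothetical.

```python
# Illustrative sketch only: distributed screening for a logistic GLM.
# Each worker holds one data shard; every iteration averages the local
# gradients (one communication round), takes a gradient step, and keeps
# only the k largest coefficients (hard thresholding). The indices of the
# nonzero coefficients are the screened covariates.
import numpy as np

def local_gradient(X, y, beta):
    """Gradient of the negative logistic log-likelihood on one shard."""
    p = 1.0 / (1.0 + np.exp(-X @ beta))
    return X.T @ (p - y) / len(y)

def surrogate_screen(shards, k, n_iter=200, lr=0.1):
    """Keep k covariates via gradient averaging + hard thresholding (sketch)."""
    d = shards[0][0].shape[1]
    beta = np.zeros(d)
    for _ in range(n_iter):
        # one communication round: average the shard-level gradients
        grad = np.mean([local_gradient(X, y, beta) for X, y in shards], axis=0)
        beta -= lr * grad
        # sparsity restriction: zero out all but the k largest entries
        keep = np.argsort(np.abs(beta))[-k:]
        mask = np.zeros(d, dtype=bool)
        mask[keep] = True
        beta[~mask] = 0.0
    return np.flatnonzero(beta)  # indices of the screened covariates

# Toy usage: 4 shards of 500 observations, 50 covariates, 3 of them active.
rng = np.random.default_rng(0)
beta_true = np.zeros(50)
beta_true[:3] = [2.0, -1.5, 1.0]
shards = []
for _ in range(4):
    X = rng.standard_normal((500, 50))
    y = rng.binomial(1, 1.0 / (1.0 + np.exp(-X @ beta_true)))
    shards.append((X, y))
print(surrogate_screen(shards, k=10))  # should include indices 0, 1, 2
```

Because the update uses gradients pooled across all shards, the retained set reflects joint effects of the covariates, in contrast to marginal (one-covariate-at-a-time) screening.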