A model-free subdata selection method for classification (2404.19127v1)
Abstract: Subdata selection refers to methods that select a small, representative sample of a big dataset so that analyzing the sample is fast and statistically efficient. Existing subdata selection methods assume that the big data can be reasonably described by an underlying model, such as a (multinomial) logistic regression for classification problems. These methods work extremely well when the modeling assumption is correct but often yield poor results otherwise. In this paper, we propose a model-free subdata selection method for classification problems, and we call the resulting subdata PED subdata. The PED method uses decision trees to find a partition of the data and then selects an appropriate sample from each component of the partition. Random forests are used to analyze the selected subdata. Our method accommodates any number of response classes and both categorical and continuous predictors. We show analytically that the PED subdata has a smaller Gini impurity than uniform subdata, and we demonstrate through extensive simulated and real datasets that it achieves higher classification accuracy than competing methods.
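To make the partition-then-sample pipeline concrete: the Gini impurity of a set with class proportions $p_1, \ldots, p_K$ is $G = 1 - \sum_{k=1}^{K} p_k^2$, the splitting criterion that decision trees reduce; sampling within tree leaves therefore tends to yield subdata whose components are purer in class composition than a uniform sample. Below is a minimal sketch of this general idea in Python with scikit-learn. The number of leaves, the subdata size, and the equal per-leaf allocation are illustrative assumptions, not the paper's PED algorithm, whose tree construction and allocation rule may differ.

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.tree import DecisionTreeClassifier
from sklearn.ensemble import RandomForestClassifier

rng = np.random.default_rng(0)

# Stand-in "big" dataset: 100,000 points, 10 predictors, 3 classes.
X, y = make_classification(n_samples=100_000, n_features=10,
                           n_informative=6, n_classes=3, random_state=0)

# Step 1: fit a decision tree to partition the data; each leaf is one
# component of the partition.
tree = DecisionTreeClassifier(max_leaf_nodes=50, random_state=0).fit(X, y)
leaf_ids = tree.apply(X)  # leaf index for every observation

# Step 2: select a sample from each component (equal allocation per leaf
# here; this is an illustrative choice, not the paper's allocation rule).
n_sub = 2_000
leaves = np.unique(leaf_ids)
per_leaf = max(1, n_sub // len(leaves))
idx = np.concatenate([
    rng.choice(np.flatnonzero(leaf_ids == leaf),
               size=min(per_leaf, int(np.sum(leaf_ids == leaf))),
               replace=False)
    for leaf in leaves
])

# Step 3: analyze the selected subdata with a random forest.
rf = RandomForestClassifier(n_estimators=500, random_state=0)
rf.fit(X[idx], y[idx])
print("subdata size:", len(idx))
print("accuracy on full data:", rf.score(X, y))
```

Intuitively, sampling within each leaf preserves the locally homogeneous class structure the tree has discovered, which is the mechanism behind subdata with lower Gini impurity than a uniform sample of the same size.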