Histogram-Based Federated XGBoost using Minimal Variance Sampling for Federated Tabular Data (2405.02067v1)
Abstract: Federated Learning (FL) has gained considerable traction, yet FL for tabular data has received comparatively little attention. Most FL research has focused on Neural Networks, while Tree-Based Models (TBMs) such as XGBoost have historically performed better on tabular data. Subsampling the training data when building trees has been shown to improve performance, but whether such subsampling also helps in FL remains an open problem. In this paper, we evaluate a histogram-based federated XGBoost that uses Minimal Variance Sampling (MVS). We describe the underlying algorithm and show that our model using MVS can improve accuracy and regression error in a federated setting. In our evaluation, the model using MVS outperforms both uniform (random) sampling and no sampling at all, and achieves strong local and global performance on a new set of federated tabular datasets. Federated XGBoost using MVS also outperforms centralized XGBoost in half of the studied cases.
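The sampling step the abstract refers to can be sketched as follows. This is a minimal, illustrative NumPy implementation of MVS as defined by Ibragimov and Gusev (keep example i with probability p_i = min(1, r_i / mu), r_i = sqrt(g_i^2 + lambda * h_i^2), with mu tuned so the expected sample size matches the budget) — not the authors' actual code; the function name `mvs_sample`, the `lam` parameter, and the binary-search choice of threshold are assumptions for illustration. In the federated setting, each client would run this locally on its own gradients and hessians before accumulating histograms.

```python
import numpy as np

def mvs_sample(grad, hess, sample_rate, lam=1.0, rng=None):
    """Sketch of Minimal Variance Sampling (Ibragimov & Gusev, 2019).

    Keeps example i with probability p_i = min(1, r_i / mu), where
    r_i = sqrt(g_i^2 + lam * h_i^2) is the regularized gradient score
    and mu is chosen so the expected sample size is sample_rate * n.
    Kept examples are reweighted by 1 / p_i so weighted gradient sums
    remain unbiased estimates of the full-data sums.
    """
    rng = rng if rng is not None else np.random.default_rng(0)
    r = np.sqrt(np.asarray(grad) ** 2 + lam * np.asarray(hess) ** 2)
    n = len(r)
    target = sample_rate * n
    # Binary search for mu. At mu = sum(r)/target the expected sample
    # size sum(min(1, r/mu)) is already <= target, so it is a valid
    # upper bound; at mu -> 0 it is n >= target.
    lo, hi = 0.0, r.sum() / target
    for _ in range(60):
        mu = 0.5 * (lo + hi)
        expected = np.minimum(1.0, r / mu).sum()
        if expected > target:
            lo = mu  # threshold too small: too many rows kept
        else:
            hi = mu
    p = np.minimum(1.0, r / mu)
    keep = rng.random(n) < p       # one Bernoulli draw per example
    weights = 1.0 / p[keep]        # importance weights for kept rows
    return np.flatnonzero(keep), weights
```

The returned row indices and importance weights would then feed the local histogram construction, e.g. as instance weights when accumulating per-bin gradient and hessian sums that the server aggregates across clients.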