LLpowershap: Logistic Loss-based Automated Shapley Values Feature Selection Method (2401.12683v1)

Published 23 Jan 2024 in cs.LG

Abstract: Shapley values have been used extensively in machine learning, not only to explain black-box models but also, among other tasks, to conduct model debugging, sensitivity and fairness analyses, and to select important features for robust modelling and further follow-up analyses. Shapley values satisfy certain axioms that promote fairness in distributing contributions of features toward prediction or error reduction, after accounting for non-linear relationships and interactions when complex machine learning models are employed. Recently, a number of feature selection methods utilising Shapley values have been introduced. Here, we present a novel feature selection method, LLpowershap, which makes use of loss-based Shapley values to identify informative features with minimal noise among the selected sets of features. Our simulation results show that LLpowershap not only identifies a higher number of informative features but also outputs fewer noise features than other state-of-the-art feature selection methods. Benchmarking results on four real-world datasets demonstrate that LLpowershap achieves higher or on-par predictive performance compared to other Shapley-based wrapper methods and filter methods.
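The abstract describes a powershap-style wrapper that attributes the model's loss to features via Shapley values and keeps only features whose attributions stand out from noise. Below is a minimal illustrative sketch of that idea, assuming an XGBoost classifier and the shap package's TreeExplainer with model_output="log_loss". The random probe column, the quantile threshold, and all function names are illustrative assumptions, not the authors' reference implementation.

```python
# Illustrative sketch of loss-based Shapley feature selection in the spirit of
# LLpowershap/powershap. Assumptions, not the published algorithm: a single
# uniform random "probe" feature and a simple quantile threshold stand in for
# the paper's actual selection criterion.
import numpy as np
import shap
import xgboost as xgb
from sklearn.model_selection import train_test_split

def loss_shap_selection(X, y, n_iterations=10, probe_quantile=0.95, seed=0):
    """Return indices of features whose mean |loss-SHAP| beats a random probe."""
    rng = np.random.default_rng(seed)
    n, p = X.shape
    feature_scores = np.zeros((n_iterations, p))
    probe_scores = np.zeros(n_iterations)

    for it in range(n_iterations):
        # Append a random probe column; informative features should outscore it.
        probe = rng.uniform(size=(n, 1))
        X_aug = np.hstack([X, probe])
        X_tr, X_val, y_tr, y_val = train_test_split(
            X_aug, y, test_size=0.2, random_state=it, stratify=y)

        model = xgb.XGBClassifier(n_estimators=200, max_depth=4,
                                  eval_metric="logloss", random_state=it)
        model.fit(X_tr, y_tr)

        # Loss-based SHAP: attribute the held-out log loss to the features.
        explainer = shap.TreeExplainer(
            model, data=X_tr, model_output="log_loss",
            feature_perturbation="interventional")
        sv = explainer.shap_values(X_val, y=y_val)   # shape: (n_val, p + 1)

        mean_abs = np.abs(sv).mean(axis=0)
        feature_scores[it] = mean_abs[:p]
        probe_scores[it] = mean_abs[p]

    # Toy criterion: keep features whose average loss attribution exceeds a
    # high quantile of the probe's scores across iterations.
    threshold = np.quantile(probe_scores, probe_quantile)
    return np.where(feature_scores.mean(axis=0) > threshold)[0]
```

In this sketch a feature is kept only if its average loss attribution exceeds the 95th percentile of the probe feature's scores; the paper's actual selection rule, number of iterations, and choice of estimator may differ.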
