Breast Cancer Classification Using Gradient Boosting Algorithms Focusing on Reducing the False Negative and SHAP for Explainability (2403.09548v2)
Abstract: Cancer is among the diseases that kill the most women worldwide, and breast cancer accounts for the highest number of cancer cases and, consequently, deaths. It can, however, be prevented through early detection and, consequently, early treatment. Any advance in the detection or prediction of this kind of cancer is therefore important for a healthier life. Many studies aim for a model with high accuracy in cancer prediction, but accuracy alone may not always be a reliable metric. This study takes an investigative approach to the performance of different boosting-based machine learning algorithms for predicting breast cancer, focusing on the recall metric. Boosting machine learning algorithms have proven to be an effective tool for detecting medical diseases. A breast cancer dataset from the University of California, Irvine (UCI) repository, which contains the tumor attributes, was used to train and test the classifiers. The main objective of this study is to use state-of-the-art boosting algorithms such as AdaBoost, XGBoost, CatBoost, and LightGBM to predict and diagnose breast cancer and to evaluate them in terms of recall, ROC-AUC, and the confusion matrix. Furthermore, our study is the first to use these four boosting algorithms together with Optuna, a library for hyperparameter optimization, and the SHAP method to improve the interpretability of our model, which can be used as a support tool to identify and predict breast cancer. We were able to improve AUC or recall for all the models and to reduce false negatives for AdaBoost and LightGBM; the final AUC was higher than 99.41\% for all models.
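As a concrete illustration of the pipeline the abstract describes (a boosting classifier, Optuna tuning aimed at recall, and SHAP attributions), the sketch below is a minimal example and not the authors' code: it assumes LightGBM as the boosting model, scikit-learn's bundled copy of the Wisconsin Diagnostic data, an 80/20 stratified split, and an illustrative hyperparameter search space.

```python
import lightgbm as lgb
import numpy as np
import optuna
import shap
from sklearn.datasets import load_breast_cancer
from sklearn.metrics import recall_score, roc_auc_score
from sklearn.model_selection import cross_val_score, train_test_split

# scikit-learn bundles the same Wisconsin Diagnostic data as the UCI entry;
# its target encodes malignant as 0, so remap so that 1 = malignant and
# recall measures how many cancers are caught (false negatives minimized).
X, y = load_breast_cancer(return_X_y=True, as_frame=True)
y = (y == 0).astype(int)
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, stratify=y, random_state=42
)

def objective(trial):
    # Illustrative search space; the paper's actual ranges are not given here.
    params = {
        "n_estimators": trial.suggest_int("n_estimators", 100, 500),
        "learning_rate": trial.suggest_float("learning_rate", 1e-3, 0.3, log=True),
        "num_leaves": trial.suggest_int("num_leaves", 15, 63),
        "min_child_samples": trial.suggest_int("min_child_samples", 5, 50),
    }
    model = lgb.LGBMClassifier(**params, random_state=42, verbose=-1)
    # Score recall directly, since the study prioritizes reducing false negatives.
    return cross_val_score(model, X_train, y_train, cv=5, scoring="recall").mean()

study = optuna.create_study(direction="maximize")
study.optimize(objective, n_trials=50)

best = lgb.LGBMClassifier(**study.best_params, random_state=42, verbose=-1)
best.fit(X_train, y_train)
print("recall :", recall_score(y_test, best.predict(X_test)))
print("ROC-AUC:", roc_auc_score(y_test, best.predict_proba(X_test)[:, 1]))

# SHAP attributions for the fitted tree model.
explainer = shap.TreeExplainer(best)
shap_values = explainer.shap_values(X_test)
if isinstance(shap_values, list):       # older SHAP returns one array per class
    shap_values = shap_values[1]
top = np.abs(shap_values).mean(axis=0)  # mean |SHAP| as global feature importance
for name, score in sorted(zip(X.columns, top), key=lambda t: -t[1])[:5]:
    print(f"{name}: {score:.3f}")
```

The same Optuna objective can be pointed at AdaBoost, XGBoost, or CatBoost by swapping the estimator and its search space, and the scoring argument can be set to `roc_auc` when AUC rather than recall is the tuning target.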
- W. H. Organization. (2023) Breast cancer. [Online]. Available: https://www.who.int/news-room/fact-sheets/detail/breast-cancer
- M. Arnold, E. Morgan, H. Rumgay, A. Mafra, D. Singh, M. Laversanne, J. Vignat, J. R. Gralow, F. Cardoso, S. Siesling, and I. Soerjomataram, “Current and future burden of breast cancer: Global statistics for 2020 and 2040,” The Breast, vol. 66, pp. 15–23, 2022. [Online]. Available: https://www.sciencedirect.com/science/article/pii/S0960977622001448
- M. d. O. Santos, F. C. d. S. d. Lima, L. F. L. Martins, J. F. P. Oliveira, L. M. d. Almeida, and M. d. C. Cancela, “Estimativa de incidência de câncer no Brasil, 2023-2025,” Revista Brasileira de Cancerologia, vol. 69, no. 1, p. e-213700, Feb. 2023. [Online]. Available: https://rbc.inca.gov.br/index.php/revista/article/view/3700
- E. Orrantia-Borunda, P. Anchondo-Nuñez, L. E. Acuña-Aguilar, F. O. Gómez-Valles, and C. A. Ramírez-Valdespino, “Subtypes of breast cancer,” in Breast Cancer. Exon Publications, Aug. 2022, pp. 31–42.
- K. S. Johnson, E. F. Conant, and M. S. Soo, “Molecular Subtypes of Breast Cancer: A Review for Breast Radiologists,” Journal of Breast Imaging, vol. 3, no. 1, pp. 12–24, 12 2020. [Online]. Available: https://doi.org/10.1093/jbi/wbaa110
- H. Sung, J. Ferlay, R. L. Siegel, M. Laversanne, I. Soerjomataram, A. Jemal, and F. Bray, “Global cancer statistics 2020: Globocan estimates of incidence and mortality worldwide for 36 cancers in 185 countries,” CA: A Cancer Journal for Clinicians, vol. 71, no. 3, pp. 209–249, 2021. [Online]. Available: https://acsjournals.onlinelibrary.wiley.com/doi/abs/10.3322/caac.21660
- A. Burguin, C. Diorio, and F. Durocher, “Breast cancer treatments: Updates and new challenges,” Journal of Personalized Medicine, vol. 11, no. 8, 2021. [Online]. Available: https://www.mdpi.com/2075-4426/11/8/808
- S. Ivanov and L. Prokhorenkova, “Boost then convolve: Gradient boosting meets graph neural networks,” CoRR, vol. abs/2101.08543, 2021. [Online]. Available: https://arxiv.org/abs/2101.08543
- S. Badirli, X. Liu, Z. Xing, A. Bhowmik, and S. S. Keerthi, “Gradient boosting neural networks: Grownet,” CoRR, vol. abs/2002.07971, 2020. [Online]. Available: https://arxiv.org/abs/2002.07971
- M. T. Ribeiro, S. Singh, and C. Guestrin, “‘Why should I trust you?’: Explaining the predictions of any classifier,” in Proceedings of the 22nd ACM SIGKDD International Conference on Knowledge Discovery and Data Mining, ser. KDD ’16. New York, NY, USA: Association for Computing Machinery, 2016, pp. 1135–1144. [Online]. Available: https://doi.org/10.1145/2939672.2939778
- J. Sun, C.-K. Sun, Y.-X. Tang, T.-C. Liu, and C.-J. Lu, “Application of shap for explainable machine learning on age-based subgrouping mammography questionnaire data for positive mammography prediction and risk factor identification,” Healthcare, vol. 11, no. 14, 2023. [Online]. Available: https://www.mdpi.com/2227-9032/11/14/2000
- L. Antwarg, C. Galed, N. Shimoni, L. Rokach, and B. Shapira, “Shapley-based feature augmentation,” Information Fusion, vol. 96, pp. 92–102, 2023. [Online]. Available: https://www.sciencedirect.com/science/article/pii/S156625352300091X
- S. Kaufman, S. Rosset, and C. Perlich, “Leakage in data mining: Formulation, detection, and avoidance,” vol. 6, 01 2011, pp. 556–563.
- G. M. M and S. P, “A survey on machine learning approaches used in breast cancer detection,” in 2022 4th International Conference on Inventive Research in Computing Applications (ICIRCA), 2022, pp. 786–792.
- J. M. H. Pinheiro and M. Becker, “Um estudo sobre algoritmos de boosting e a otimização de hiperparâmetros utilizando optuna,” 2023. [Online]. Available: https://bdta.abcd.usp.br/item/003122385
- R. Rabiei, S. M. Ayyoubzadeh, S. Sohrabei, M. Esmaeili, and A. Atashi, “Prediction of breast cancer using machine learning approaches,” Journal of Biomedical Physics and Engineering, vol. 12, no. 3, pp. 297–308, 2022. [Online]. Available: https://jbpe.sums.ac.ir/article_48331.html
- T. Pang, J. H. D. Wong, W. L. Ng, and C. S. Chan, “Deep learning radiomics in breast cancer with different modalities: Overview and future,” Expert Systems with Applications, vol. 158, p. 113501, 2020. [Online]. Available: https://www.sciencedirect.com/science/article/pii/S0957417420303250
- T. Mahmood, J. Li, Y. Pei, F. Akhtar, A. Imran, and K. U. Rehman, “A brief survey on breast cancer diagnostic with deep learning schemes using multi-image modalities,” IEEE Access, vol. 8, pp. 165779–165809, 2020.
- A. U. Haq, J. P. Li, A. Saboor, J. Khan, S. Wali, S. Ahmad, A. Ali, G. A. Khan, and W. Zhou, “Detection of breast cancer through clinical data using supervised and unsupervised feature selection techniques,” IEEE Access, vol. 9, pp. 22090–22105, 2021.
- W. Wolberg, O. Mangasarian, N. Street, and W. Street, “Breast Cancer Wisconsin (Diagnostic),” UCI Machine Learning Repository, 1995, DOI: https://doi.org/10.24432/C5DW2B.
- S. Ara, A. Das, and A. Dey, “Malignant and benign breast cancer classification using machine learning algorithms,” in 2021 International Conference on Artificial Intelligence (ICAI), 2021, pp. 97–101.
- I. Ozcan, H. Aydin, and A. Çetinkaya, “Comparison of classification success rates of different machine learning algorithms in the diagnosis of breast cancer,” Asian Pacific Journal of Cancer Prevention (APJCP), vol. 23, pp. 3287–3297, 10 2022.
- A. Khalid, A. Mehmood, A. Alabrah, B. F. Alkhamees, F. Amin, H. AlSalman, and G. S. Choi, “Breast cancer detection and prevention using machine learning,” Diagnostics, vol. 13, no. 19, 2023. [Online]. Available: https://www.mdpi.com/2075-4418/13/19/3113
- M. A. Naji, S. E. Filali, K. Aarika, E. H. Benlahmar, R. A. Abdelouhahid, and O. Debauche, “Machine learning algorithms for breast cancer prediction and diagnosis,” Procedia Computer Science, vol. 191, pp. 487–492, 2021, the 18th International Conference on Mobile Systems and Pervasive Computing (MobiSPC), The 16th International Conference on Future Networks and Communications (FNC), The 11th International Conference on Sustainable Energy Information Technology. [Online]. Available: https://www.sciencedirect.com/science/article/pii/S1877050921014629
- M. M. Islam, H. Iqbal, M. R. Haque, and M. K. Hasan, “Prediction of breast cancer using support vector machine and k-nearest neighbors,” in 2017 IEEE Region 10 Humanitarian Technology Conference (R10-HTC), 2017, pp. 226–229.
- T. Thomas, N. Pradhan, and V. S. Dhaka, “Comparative analysis to predict breast cancer using machine learning algorithms: A survey,” in 2020 International Conference on Inventive Computation Technologies (ICICT), 2020, pp. 192–196.
- Irmawati, F. Ernawan, M. Fakhreldin, and A. Saryoko, “Deep learning method based for breast cancer classification,” in 2023 International Conference on Information Technology Research and Innovation (ICITRI), 2023, pp. 13–16.
- P. S. Kohli and S. Arora, “Application of machine learning in disease prediction,” in 2018 4th International Conference on Computing Communication and Automation (ICCCA), 2018, pp. 1–4.
- S. Kabiraj, M. Raihan, N. Alvi, M. Afrin, L. Akter, S. A. Sohagi, and E. Podder, “Breast cancer risk prediction using xgboost and random forest algorithm,” in 2020 11th International Conference on Computing, Communication and Networking Technologies (ICCCNT), 2020, pp. 1–4.
- U. Ojha and S. Goel, “A study on prediction of breast cancer recurrence using data mining techniques,” in 2017 7th International Conference on Cloud Computing, Data Science & Engineering - Confluence, 2017, pp. 527–530.
- A. Sharma, S. Kulshrestha, and S. Daniel, “Machine learning approaches for breast cancer diagnosis and prognosis,” pp. 1–5, 12 2017.
- W. Wolberg, “Breast Cancer Wisconsin (Original),” UCI Machine Learning Repository, 1992, DOI: https://doi.org/10.24432/C5HP4Z.
- A. Bharat, N. Pooja, and R. A. Reddy, “Using machine learning algorithms for breast cancer risk prediction and diagnosis,” in 2018 3rd International Conference on Circuits, Control, Communication and Computing (I4C), 2018, pp. 1–4.
- S. Das and D. Biswas, “Prediction of breast cancer using ensemble learning,” in 2019 5th International Conference on Advances in Electrical Engineering (ICAEE), 2019, pp. 804–808.
- T. Islam, A. Kundu, N. Islam Khan, C. Chandra Bonik, F. Akter, and M. Jihadul Islam, “Machine learning approaches to predict breast cancer: Bangladesh perspective,” in Ubiquitous Intelligent Systems, P. Karuppusamy, F. P. García Márquez, and T. N. Nguyen, Eds. Singapore: Springer Nature Singapore, 2022, pp. 291–305.
- B. M. Abed, K. Shaker, H. A. Jalab, H. Shaker, A. M. Mansoor, A. F. Alwan, and I. S. Al-Gburi, “A hybrid classification algorithm approach for breast cancer diagnosis,” in 2016 IEEE Industrial Electronics and Applications Conference (IEACon), 2016, pp. 269–274.
- D. Borkin, A. Nemethova, G. Michalconok, and K. Maiorov, “Impact of data normalization on classification model accuracy,” Research Papers Faculty of Materials Science and Technology Slovak University of Technology, vol. 27, pp. 79–84, 09 2019.
- D. Singh and B. Singh, “Investigating the impact of data normalization on classification performance,” Applied Soft Computing, vol. 97, p. 105524, 2020. [Online]. Available: https://www.sciencedirect.com/science/article/pii/S1568494619302947
- K. Cabello-Solorzano, I. Ortigosa de Araujo, M. Peña, L. Correia, and A. J. Tallón-Ballesteros, “The impact of data normalization on the accuracy of machine learning algorithms: A comparative analysis,” in 18th International Conference on Soft Computing Models in Industrial and Environmental Applications (SOCO 2023), P. García Bringas, H. Pérez García, F. J. Martínez de Pisón, F. Martínez Álvarez, A. Troncoso Lora, Á. Herrero, J. L. Calvo Rolle, H. Quintián, and E. Corchado, Eds. Cham: Springer Nature Switzerland, 2023, pp. 344–353.
- R. E. Schapire, “A brief introduction to boosting,” in Proceedings of the 16th International Joint Conference on Artificial Intelligence - Volume 2, ser. IJCAI’99. San Francisco, CA, USA: Morgan Kaufmann Publishers Inc., 1999, p. 1401–1406.
- J. H. Friedman, “Greedy function approximation: A gradient boosting machine,” The Annals of Statistics, vol. 29, no. 5, pp. 1189–1232, 2001. [Online]. Available: https://doi.org/10.1214/aos/1013203451
- T. Chen and C. Guestrin, “Xgboost: A scalable tree boosting system,” CoRR, vol. abs/1603.02754, 2016. [Online]. Available: http://arxiv.org/abs/1603.02754
- A. V. Dorogush, A. Gulin, G. Gusev, N. Kazeev, L. O. Prokhorenkova, and A. Vorobev, “Fighting biases with dynamic boosting,” CoRR, vol. abs/1706.09516, 2017. [Online]. Available: http://arxiv.org/abs/1706.09516
- G. Ke, Q. Meng, T. Finley, T. Wang, W. Chen, W. Ma, Q. Ye, and T.-Y. Liu, “Lightgbm: A highly efficient gradient boosting decision tree,” in Advances in Neural Information Processing Systems, I. Guyon, U. V. Luxburg, S. Bengio, H. Wallach, R. Fergus, S. Vishwanathan, and R. Garnett, Eds., vol. 30. Curran Associates, Inc., 2017. [Online]. Available: https://proceedings.neurips.cc/paper/2017/file/6449f44a102fde848669bdd9eb6b76fa-Paper.pdf
- T. Fawcett, “An introduction to ROC analysis,” Pattern Recognition Letters, vol. 27, no. 8, pp. 861–874, 2006, ROC Analysis in Pattern Recognition. [Online]. Available: https://www.sciencedirect.com/science/article/pii/S016786550500303X
- C. D. Brown and H. T. Davis, “Receiver operating characteristics curves and related decision measures: A tutorial,” Chemometrics and Intelligent Laboratory Systems, vol. 80, no. 1, pp. 24–38, 2006. [Online]. Available: https://www.sciencedirect.com/science/article/pii/S0169743905000766
- J. Hayes, A. Dekhtyar, and S. Sundaram, “Advancing candidate link generation for requirements tracing: the study of methods,” IEEE Transactions on Software Engineering, vol. 32, no. 1, pp. 4–19, 2006.
- T. Merten, D. Krämer, B. Mager, P. Schell, S. Bürsner, and B. Paech, “Do information retrieval algorithms for automated traceability perform effectively on issue tracking system data?” in Requirements Engineering: Foundation for Software Quality, M. Daneva and O. Pastor, Eds. Cham: Springer International Publishing, 2016, pp. 45–62.
- D. M. Berry, “Evaluation of tools for hairy requirements and software engineering tasks,” in 2017 IEEE 25th International Requirements Engineering Conference Workshops (REW), 2017, pp. 284–291.
- J. P. Winkler, J. Grönberg, and A. Vogelsang, “Optimizing for recall in automatic requirements classification: An empirical study,” in 2019 IEEE 27th International Requirements Engineering Conference (RE), 2019, pp. 40–50.
- T. Akiba, S. Sano, T. Yanase, T. Ohta, and M. Koyama, “Optuna: A next-generation hyperparameter optimization framework,” CoRR, vol. abs/1907.10902, 2019. [Online]. Available: http://arxiv.org/abs/1907.10902
- S. M. Lundberg, G. Erion, H. Chen, A. DeGrave, J. M. Prutkin, B. Nair, R. Katz, J. Himmelfarb, N. Bansal, and S.-I. Lee, “From local explanations to global understanding with explainable ai for trees,” Nature Machine Intelligence, vol. 2, no. 1, pp. 56–67, 2020.
- S. M. Lundberg and S.-I. Lee, “SHAP documentation.” [Online]. Available: https://shap.readthedocs.io/en/latest/
- S. M. Lundberg and S.-I. Lee, “A unified approach to interpreting model predictions,” CoRR, vol. abs/1705.07874, 2017. [Online]. Available: http://arxiv.org/abs/1705.07874
- João Manoel Herrera Pinheiro
- Marcelo Becker