The Effect of Balancing Methods on Model Behavior in Imbalanced Classification Problems (2307.00157v1)

Published 30 Jun 2023 in cs.LG and stat.ML

Abstract: Imbalanced data poses a significant challenge in classification, as model performance suffers from insufficient learning of the minority classes. Balancing methods are often used to address this problem, but such techniques can cause overfitting or loss of information. This study addresses a more challenging aspect of balancing methods: their impact on model behavior. To capture these changes, Explainable Artificial Intelligence tools are used to compare models trained on datasets before and after balancing. In addition to the variable importance method, this study uses the partial dependence profile and accumulated local effects techniques. Real and simulated datasets are tested, and an open-source Python package, edgaro, is developed to facilitate this analysis. The results show significant changes in model behavior due to balancing methods, which can bias models toward the balanced distribution. These findings confirm that the analysis of balancing methods should go beyond model performance comparisons to achieve more reliable machine learning models. We therefore propose a new method, the performance gain plot, which supports an informed data balancing strategy: the optimal balancing method is selected by analyzing the measured change in model behavior against the gain in performance.
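
The comparison the abstract describes can be illustrated with a minimal sketch using scikit-learn and imbalanced-learn rather than the paper's edgaro package. The behavior-change measure used here (the L1 shift in permutation importances) and all variable names are illustrative assumptions, not the paper's exact methodology.

import numpy as np
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.inspection import permutation_importance
from sklearn.metrics import balanced_accuracy_score
from sklearn.model_selection import train_test_split
from imblearn.over_sampling import SMOTE  # requires the imbalanced-learn package

# Simulated binary problem with roughly a 9:1 class imbalance.
X, y = make_classification(n_samples=5000, n_features=10, n_informative=5,
                           weights=[0.9, 0.1], random_state=42)
X_tr, X_te, y_tr, y_te = train_test_split(X, y, stratify=y, random_state=42)

# Same model class trained on the raw and on the SMOTE-balanced training set;
# the held-out test set is never resampled.
raw_model = RandomForestClassifier(random_state=42).fit(X_tr, y_tr)
X_bal, y_bal = SMOTE(random_state=42).fit_resample(X_tr, y_tr)
bal_model = RandomForestClassifier(random_state=42).fit(X_bal, y_bal)

# Variable importance for both models on the same test data; large per-feature
# differences indicate the kind of behavior change the paper studies.
imp_raw = permutation_importance(raw_model, X_te, y_te, n_repeats=10, random_state=0)
imp_bal = permutation_importance(bal_model, X_te, y_te, n_repeats=10, random_state=0)
behavior_change = np.abs(imp_bal.importances_mean - imp_raw.importances_mean).sum()

# One (performance gain, behavior change) pair per balancing method is the kind
# of point a performance gain plot would display.
gain = (balanced_accuracy_score(y_te, bal_model.predict(X_te))
        - balanced_accuracy_score(y_te, raw_model.predict(X_te)))
print(f"performance gain (balanced accuracy): {gain:.4f}")
print(f"behavior change (L1 importance shift): {behavior_change:.4f}")

Repeating this for several balancing methods (e.g., random undersampling, ADASYN) and plotting behavior change against performance gain gives one plausible reading of the proposed performance gain plot: a method in the high-gain, low-change region would be a natural choice.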
