Efficient Hybrid Oversampling and Intelligent Undersampling for Imbalanced Big Data Classification (2310.05789v1)

Published 9 Oct 2023 in cs.LG

Abstract: Imbalanced classification is a well-known challenge faced by many real-world applications. This issue occurs when the distribution of the target variable is skewed, leading to a prediction bias toward the majority class. With the arrival of the Big Data era, there is a pressing need for efficient solutions to solve this problem. In this work, we present a novel resampling method called SMOTENN that combines intelligent undersampling and oversampling using a MapReduce framework. Both procedures are performed on the same pass over the data, conferring efficiency to the technique. The SMOTENN method is complemented with an efficient implementation of the neighborhoods related to the minority samples. Our experimental results show the virtues of this approach, outperforming alternative resampling techniques for small- and medium-sized datasets while achieving positive results on large datasets with reduced running times.
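The paper's own MapReduce implementation is not reproduced on this page. As an illustration only, the general idea the abstract describes, a single neighborhood pass over the data that drives both SMOTE-style oversampling and ENN-style undersampling, can be sketched on a single machine as follows. The function `hybrid_resample`, its parameters, and the majority-removal rule are assumptions for illustration, not the paper's SMOTENN algorithm:

```python
import numpy as np

def hybrid_resample(X, y, minority=1, k=5, rng=None):
    """Illustrative SMOTE-plus-ENN-style hybrid resampler (not the paper's SMOTENN).

    Each minority sample's k nearest neighbors are computed once; that one
    neighborhood pass is reused both to synthesize new minority points
    (SMOTE-style interpolation) and to flag majority points lying inside
    minority-dominated neighborhoods for removal (ENN-style undersampling).
    """
    rng = np.random.default_rng(rng)
    min_idx = np.flatnonzero(y == minority)
    maj_idx = np.flatnonzero(y != minority)

    drop = set()   # indices of majority samples to remove
    synth = []     # synthetic minority samples
    for i in min_idx:
        d = np.linalg.norm(X - X[i], axis=1)
        d[i] = np.inf                      # exclude the sample itself
        nn = np.argsort(d)[:k]             # one neighborhood pass per sample
        # Undersampling: drop majority neighbors in minority-dominated regions
        if (y[nn] == minority).sum() > k // 2:
            drop.update(nn[y[nn] != minority].tolist())
        # Oversampling: interpolate toward a random minority neighbor
        min_nn = nn[y[nn] == minority]
        if len(min_nn):
            j = rng.choice(min_nn)
            synth.append(X[i] + rng.random() * (X[j] - X[i]))

    keep_maj = np.array([m for m in maj_idx if m not in drop], dtype=int)
    parts = [X[min_idx], X[keep_maj]] + ([np.array(synth)] if synth else [])
    X_new = np.vstack(parts)
    y_new = np.concatenate([y[min_idx], y[keep_maj],
                            np.full(len(synth), minority)])
    return X_new, y_new
```

In the paper, this combined pass is what makes the method efficient at scale: the undersampling step needs no extra scan beyond the neighborhood computation the oversampling step already requires.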
