Synthetic Information towards Maximum Posterior Ratio for deep learning on Imbalanced Data (2401.02591v1)
Abstract: This study examines the impact of class-imbalanced data on deep learning models and proposes a technique for balancing the data by generating synthetic samples for the minority class. Unlike random oversampling, our method prioritizes balancing the informative regions by identifying high-entropy samples. Well-placed synthetic data can enhance the accuracy and efficiency of machine learning algorithms, whereas poorly placed samples may lead to higher misclassification rates. We introduce an algorithm that maximizes the probability of generating a synthetic sample in the correct region of its class by optimizing the class posterior ratio. Additionally, to preserve the data topology, synthetic samples are generated within each minority sample's neighborhood. Experimental results on forty-one datasets demonstrate the superior performance of our technique in enhancing deep learning models.
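As a rough illustration of the idea sketched in the abstract (not the paper's exact algorithm), the snippet below combines the three ingredients it describes: an entropy-style informativeness score for minority samples, candidate generation restricted to each minority sample's neighborhood, and selection of the candidate that maximizes a class posterior ratio. The kernel-density estimates standing in for the posterior model, the function name `oversample_by_posterior_ratio`, and all parameter values are illustrative assumptions, not the authors' implementation.

```python
import numpy as np
from sklearn.neighbors import KernelDensity, NearestNeighbors

def oversample_by_posterior_ratio(X_min, X_maj, n_new, k=5, n_cand=20,
                                  bandwidth=0.5, random_state=0):
    """Sketch: generate synthetic minority samples near informative minority
    points, keeping the candidate with the highest minority/majority
    posterior ratio (KDE-based stand-in for the paper's posterior model)."""
    rng = np.random.default_rng(random_state)

    # Class-conditional densities via KDE (illustrative assumption).
    kde_min = KernelDensity(bandwidth=bandwidth).fit(X_min)
    kde_maj = KernelDensity(bandwidth=bandwidth).fit(X_maj)

    # Log posterior ratio: log p(min|x) - log p(maj|x)
    # = log p(x|min) - log p(x|maj) + log(pi_min / pi_maj)  (Bayes' rule).
    log_prior_ratio = np.log(len(X_min) / len(X_maj))

    def log_ratio(X):
        return (kde_min.score_samples(X) - kde_maj.score_samples(X)
                + log_prior_ratio)

    # "Informative" minority points: posterior is most uncertain, i.e. the
    # binary entropy of p(min|x) is high, which happens when |log-ratio| ~ 0.
    entropy_proxy = -np.abs(log_ratio(X_min))
    order = np.argsort(entropy_proxy)[::-1]        # most uncertain first

    # Neighborhood radius of each minority point from its k-NN distance
    # (keeps synthetic samples close to existing minority data).
    nn = NearestNeighbors(n_neighbors=k + 1).fit(X_min)
    radii = nn.kneighbors(X_min)[0][:, -1]

    synthetic = []
    for i in np.resize(order, n_new):              # cycle over informative points
        # Draw candidates inside the point's neighborhood ball
        # (random directions, random radii up to the k-NN radius).
        noise = rng.normal(size=(n_cand, X_min.shape[1]))
        noise /= np.linalg.norm(noise, axis=1, keepdims=True)
        cand = X_min[i] + noise * rng.uniform(0, radii[i], size=(n_cand, 1))
        # Keep the candidate with the maximum posterior ratio for the minority class.
        synthetic.append(cand[np.argmax(log_ratio(cand))])
    return np.vstack(synthetic)
```

A typical call would be `X_syn = oversample_by_posterior_ratio(X_minority, X_majority, n_new=len(X_majority) - len(X_minority))`, after which `X_syn` is appended to the minority class before training; how the real method estimates the posterior ratio and defines informativeness is detailed in the paper itself.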
Authors: Hung Nguyen, Morris Chang