
Synthetic Information towards Maximum Posterior Ratio for deep learning on Imbalanced Data (2401.02591v1)

Published 5 Jan 2024 in cs.LG and cs.AI

Abstract: This study examines the impact of class-imbalanced data on deep learning models and proposes a technique for data balancing by generating synthetic data for the minority class. Unlike random oversampling, our method prioritizes balancing the informative regions by identifying high-entropy samples. Well-placed synthetic data can enhance the accuracy and efficiency of machine learning algorithms, whereas poorly placed data may lead to higher misclassification rates. We introduce an algorithm that maximizes the probability of generating a synthetic sample in the correct region of its class by optimizing the class posterior ratio. Additionally, to preserve data topology, synthetic data are generated within each minority sample's neighborhood. Our experimental results on forty-one datasets demonstrate the superior performance of our technique in enhancing deep-learning models.
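The pipeline the abstract describes (score minority samples by entropy, generate candidates inside each sample's neighborhood, keep the candidate that maximizes the class posterior ratio) can be sketched roughly as follows. This is an illustrative approximation only, not the paper's exact optimization: the posteriors come from two kernel density estimates with equal priors, and the bandwidth, candidate count, and neighborhood-interpolation scheme are all assumptions made for the sketch.

```python
import numpy as np
from sklearn.neighbors import KernelDensity, NearestNeighbors

def oversample_high_entropy(X, y, minority=1, k=5, n_candidates=20, rng=None):
    """Sketch: oversample the minority class by placing synthetic points
    near high-entropy minority samples, keeping, for each, the candidate
    with the largest minority/majority posterior ratio."""
    rng = np.random.default_rng(rng)
    X_min, X_maj = X[y == minority], X[y != minority]

    # Class-conditional densities via KDE (bandwidth is an assumption).
    kde_min = KernelDensity(bandwidth=0.5).fit(X_min)
    kde_maj = KernelDensity(bandwidth=0.5).fit(X_maj)

    def posteriors(pts):
        # Class posteriors from the two KDEs, assuming equal priors.
        p_min = np.exp(kde_min.score_samples(pts))
        p_maj = np.exp(kde_maj.score_samples(pts))
        total = p_min + p_maj + 1e-12
        return p_min / total, p_maj / total

    # Shannon entropy of each minority sample's posterior = informativeness.
    p1, p0 = posteriors(X_min)
    entropy = -(p1 * np.log(p1 + 1e-12) + p0 * np.log(p0 + 1e-12))
    order = np.argsort(entropy)[::-1]  # most informative first

    nn = NearestNeighbors(n_neighbors=min(k + 1, len(X_min))).fit(X_min)
    n_needed = len(X_maj) - len(X_min)
    synthetic = []
    for i in order[: max(n_needed, 0)]:
        # Candidates interpolated inside the sample's minority-class
        # neighborhood, which keeps new points close to the data topology.
        _, idx = nn.kneighbors(X_min[i:i + 1])
        neighbors = X_min[idx[0][1:]]
        lam = rng.uniform(0, 1, size=(n_candidates, 1))
        picks = neighbors[rng.integers(len(neighbors), size=n_candidates)]
        cands = X_min[i] + lam * (picks - X_min[i])
        # Keep the candidate maximizing the minority/majority posterior ratio.
        c1, c0 = posteriors(cands)
        synthetic.append(cands[np.argmax(c1 / (c0 + 1e-12))])

    if not synthetic:
        return X, y
    X_new = np.vstack([X, synthetic])
    y_new = np.concatenate([y, np.full(len(synthetic), minority)])
    return X_new, y_new
```

Compared with SMOTE-style interpolation, the only extra step here is the posterior-ratio filter over the candidate pool, which biases each synthetic point toward the region where the minority class dominates.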

Authors (2)
  1. Hung Nguyen
  2. Morris Chang