Skew-Probabilistic Neural Networks for Learning from Imbalanced Data (2312.05878v2)
Abstract: Real-world datasets often have imbalanced class distributions, in which some classes are severely underrepresented. In such cases, traditional pattern classifiers are biased towards the majority class, which impedes accurate prediction of the minority class. This paper introduces a classifier designed for imbalanced data that uses probabilistic neural networks (PNN) with a skew-normal kernel function. PNNs provide probabilistic outputs, which allow prediction confidence to be quantified, and they are interpretable and able to handle limited data. Because the skew-normal distribution offers increased flexibility, particularly for imbalanced and non-symmetric data, the proposed Skew-Probabilistic Neural Networks (SkewPNN) can better represent the underlying class densities. Hyperparameter tuning is essential to the performance of the proposed approach on imbalanced datasets. To this end, we employ a population-based heuristic, the Bat optimization algorithm (BA), to explore the hyperparameter space effectively, yielding BA-SkewPNN. We also prove the statistical consistency of the density estimates, which indicates that they approach the true distribution smoothly as the sample size increases. A theoretical analysis of the computational complexity of SkewPNN and BA-SkewPNN is also provided. Numerical simulations on several synthetic datasets compare the proposed methods with various benchmark learners for imbalanced data. Real-data analysis on several datasets shows that SkewPNN and BA-SkewPNN substantially outperform most state-of-the-art machine-learning methods on both balanced and imbalanced datasets (binary and multi-class) in most experimental settings.
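To make the idea of a PNN with a skew-normal kernel concrete, the following is a minimal sketch and not the authors' implementation. It assumes the standard Azzalini skew-normal density, 2φ(z)Φ(αz), applied as a product kernel over features, with one shared bandwidth `sigma` and one shared skewness `alpha`; the abstract does not specify the exact kernel form. The class name `SkewPNNSketch` and its methods are hypothetical. Hyperparameters such as `sigma` and `alpha` are fixed by hand here, and the Bat algorithm search is not shown.

```python
import numpy as np
from scipy.stats import norm


def skew_normal_kernel(z, alpha):
    """Standard skew-normal density 2*phi(z)*Phi(alpha*z), elementwise.

    With alpha = 0 this reduces to the Gaussian kernel of a classical PNN.
    """
    return 2.0 * norm.pdf(z) * norm.cdf(alpha * z)


class SkewPNNSketch:
    """Hypothetical PNN-style classifier with a skew-normal product kernel."""

    def __init__(self, sigma=1.0, alpha=0.0):
        self.sigma = sigma  # assumed: one bandwidth shared by all features
        self.alpha = alpha  # assumed: one skewness shared by all features

    def fit(self, X, y):
        X = np.asarray(X, dtype=float)
        y = np.asarray(y)
        self.classes_ = np.unique(y)
        # Pattern layer: store the training points of each class.
        self.patterns_ = [X[y == c] for c in self.classes_]
        return self

    def class_densities(self, X):
        X = np.asarray(X, dtype=float)
        n_features = X.shape[1]
        dens = np.empty((X.shape[0], len(self.classes_)))
        for k, patterns in enumerate(self.patterns_):
            # Scaled differences, shape (n_query, n_patterns, n_features).
            z = (X[:, None, :] - patterns[None, :, :]) / self.sigma
            kern = skew_normal_kernel(z, self.alpha).prod(axis=2)
            # Summation layer: per-class average, so each class density is
            # normalised by its own sample size rather than the total.
            dens[:, k] = kern.mean(axis=1) / self.sigma ** n_features
        return dens

    def predict_proba(self, X):
        dens = self.class_densities(X)
        total = dens.sum(axis=1, keepdims=True)
        uniform = np.full_like(dens, 1.0 / dens.shape[1])
        # Fall back to uniform probabilities if every density underflows.
        return np.divide(dens, total, out=uniform, where=total > 0)

    def predict(self, X):
        # Decision layer: pick the class with the largest estimated density
        # (equal class priors are assumed in this sketch).
        return self.classes_[np.argmax(self.class_densities(X), axis=1)]
```

Averaging the kernel responses within each class, rather than summing over all training points, is the design choice that keeps a small minority class from being swamped by the majority in this sketch. A hyperparameter search would wrap `fit` and `predict` and score candidate `(sigma, alpha)` pairs on held-out data.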