Marginal Laplacian Score (2311.17795v2)
Abstract: High-dimensional imbalanced data poses a machine learning challenge. In the absence of sufficient or high-quality labels, unsupervised feature selection methods are crucial for the success of subsequent algorithms. We therefore introduce the Marginal Laplacian Score (MLS), a modification of the well-known Laplacian Score (LS) tailored to better handle imbalanced data. MLS rests on the assumption that minority-class or anomalous samples appear more frequently in the margins of the features; accordingly, it aims to preserve the local structure of the dataset's margin. We propose integrating MLS into modern feature selection methods that utilize the Laplacian score; in particular, integrating it into Differentiable Unsupervised Feature Selection (DUFS) yields DUFS-MLS. The proposed methods demonstrate robust and improved performance on synthetic and public datasets.
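For context, a minimal sketch of the classic Laplacian Score that MLS modifies (He et al., 2005, cited below) may help: each feature is scored by how well it preserves the local structure of a kNN graph, and lower scores are better. This is the baseline LS, not the paper's MLS; the function name, the heat-kernel bandwidth `t`, and the kNN construction details are illustrative assumptions.

```python
import numpy as np

def laplacian_score(X, k=5, t=1.0):
    """Classic Laplacian Score (He et al., 2005) sketch, not the paper's MLS.

    Lower scores mark features that better respect the kNN-graph structure.
    """
    n = X.shape[0]
    # Pairwise squared Euclidean distances.
    d2 = ((X[:, None, :] - X[None, :, :]) ** 2).sum(-1)
    # kNN adjacency with heat-kernel weights, symmetrized.
    idx = np.argsort(d2, axis=1)[:, 1:k + 1]       # skip self (distance 0)
    rows = np.repeat(np.arange(n), k)
    W = np.zeros((n, n))
    W[rows, idx.ravel()] = np.exp(-d2[rows, idx.ravel()] / t)
    W = np.maximum(W, W.T)
    D = W.sum(axis=1)                              # weighted degrees
    L = np.diag(D) - W                             # graph Laplacian
    scores = []
    for f in X.T:
        f_t = f - (f @ D) / D.sum()                # remove D-weighted mean
        # L_r = (f_t' L f_t) / (f_t' D f_t); small when f varies little
        # across strongly connected (nearby) points.
        scores.append((f_t @ L @ f_t) / max((f_t ** 2 * D).sum(), 1e-12))
    return np.array(scores)
```

On data where one feature carries cluster structure and another is pure noise, the structured feature receives the lower score; MLS, per the abstract, reweights this criterion toward the margins of the features, where minority or anomalous samples are assumed to concentrate.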
- Imbalance class problems in data mining: A review. Indonesian Journal of Electrical Engineering and Computer Science, 14(3):1560–1571, 2019.
- Laplacian eigenmaps and spectral techniques for embedding and clustering. Advances in Neural Information Processing Systems, 14, 2001.
- LOF: identifying density-based local outliers. In Proceedings of the 2000 ACM SIGMOD International Conference on Management of Data, pp. 93–104, 2000.
- Cox, D. R. The regression analysis of binary sequences. Journal of the Royal Statistical Society: Series B (Methodological), 20(2):215–232, 1958.
- Extremely randomized trees. Machine learning, 63:3–42, 2006.
- Feature selection for intrusion detection using random forest. Journal of Information Security, 7(3):129–140, 2016.
- Severely imbalanced big data challenges: investigating data sampling approaches. Journal of Big Data, 6(1):1–25, 2019.
- SORTAD: Self-supervised optimized random transformations for anomaly detection in tabular data. arXiv preprint arXiv:2311.11018, 2023.
- Locality preserving projections. Advances in Neural Information Processing Systems, 16, 2003.
- Laplacian score for feature selection. Advances in Neural Information Processing Systems, 18, 2005.
- A systematic review on imbalanced data challenges in machine learning: Applications and solutions. ACM Comput. Surv., 52(4), aug 2019. ISSN 0360-0300. doi: 10.1145/3343440. URL https://doi.org/10.1145/3343440.
- Kornbrot, D. Point biserial correlation. Wiley StatsRef: Statistics Reference Online, 2014.
- IDVP (intra-die variation probe) for system-on-chip (SoC) infant mortality screen. In 2011 IEEE International Symposium on Circuits and Systems (ISCAS), pp. 2055–2058. IEEE, 2011.
- Differentiable unsupervised feature selection based on a gated laplacian. Advances in Neural Information Processing Systems, 34:1530–1542, 2021.
- Isolation forest. In 2008 Eighth IEEE International Conference on Data Mining, pp. 413–422. IEEE, 2008.
- Dealing with high-dimensional class-imbalanced datasets: Embedded feature selection for svm classification. Applied Soft Computing, 67:94–105, 2018.
- Massey Jr, F. J. The Kolmogorov-Smirnov test for goodness of fit. Journal of the American Statistical Association, pp. 68–78, 1951.
- Lasso: A feature selection technique in predictive modeling for machine learning. In 2016 IEEE international conference on advances in computer applications (ICACA), pp. 18–20. IEEE, 2016.
- Small data challenges in big data era: A survey of recent progress on unsupervised and semi-supervised methods. IEEE Transactions on Pattern Analysis and Machine Intelligence, 44(4):2168–2187, 2022. doi: 10.1109/TPAMI.2020.3031898.
- Feature selection based on artificial bee colony and gradient boosting decision tree. Applied Soft Computing, 74:634–642, 2019.
- Theoretical and empirical analysis of ReliefF and RReliefF. Machine Learning, 53:23–69, 2003.
- Rayana, S. ODDS Library, 2016. URL http://odds.cs.stonybrook.edu.
- An adaptive cost-sensitive learning approach in neural networks to minimize local training–test class distributions mismatch. Intelligent Systems with Applications, 21:200316, 2024. ISSN 2667-3053. doi: https://doi.org/10.1016/j.iswa.2023.200316. URL https://www.sciencedirect.com/science/article/pii/S2667305323001412.
- A parameter-free cleaning method for SMOTE in imbalanced classification. IEEE Access, 7:23537–23548, 2019.