Oversampling for Imbalanced Learning: K-Means and SMOTE-Based Approach
The paper "Oversampling for Imbalanced Learning Based on K-Means and SMOTE" addresses a critical issue in machine learning—class imbalance—in which certain classes of data are underrepresented, impairing the effectiveness of standard classification algorithms. The authors propose a novel method that combines k-means clustering with the Synthetic Minority Over-sampling Technique (SMOTE) to effectively address both between-class and within-class imbalances.
Methodology
The proposed method leverages the k-means clustering algorithm in conjunction with SMOTE to enhance the process of data oversampling. By clustering the dataset, the method identifies and focuses on safe areas—clusters with a high proportion of minority class instances—for generating synthetic data. This approach targets two primary concerns in imbalanced datasets: avoiding the generation of noisy samples and addressing imbalances within the minority class itself.
Key steps of the proposed method include:
- Clustering: The entire input space is clustered using k-means. The number of clusters, k, is a hyperparameter that significantly influences the method's efficacy.
- Filtering: Clusters with a high minority-to-majority class ratio are selected for oversampling, based on a tunable imbalance ratio threshold (IRT).
- Oversampling: SMOTE is applied within the selected clusters to generate new samples, with more samples allocated to sparsely populated clusters so that within-class distributions are balanced as well (a simplified sketch of all three steps follows this list).
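A minimal Python sketch of these three steps, assuming NumPy arrays and a binary target whose minority class is labeled 1. The even split of new samples across safe clusters is a simplification; the paper weights sparse clusters more heavily via a density measure. Treat this as an illustration, not the authors' implementation.

```python
import numpy as np
from sklearn.cluster import KMeans
from sklearn.neighbors import NearestNeighbors

def kmeans_smote_sketch(X, y, k=10, irt=0.5, minority=1, random_state=0):
    """Simplified k-means SMOTE: cluster, filter, then oversample per cluster."""
    rng = np.random.default_rng(random_state)

    # Step 1: cluster the entire input space.
    labels = KMeans(n_clusters=k, n_init=10, random_state=random_state).fit_predict(X)

    # Step 2: keep "safe" clusters whose minority share exceeds the threshold.
    safe = [c for c in range(k)
            if (labels == c).any() and np.mean(y[labels == c] == minority) > irt]

    # Step 3: SMOTE-style interpolation between minority neighbors inside each
    # safe cluster. The total deficit is split evenly here for simplicity.
    deficit = int(np.sum(y != minority) - np.sum(y == minority))
    synthetic = []
    for c in safe:
        pts = X[(labels == c) & (y == minority)]
        if len(pts) < 2:
            continue  # need at least one neighbor to interpolate with
        nn = NearestNeighbors(n_neighbors=min(6, len(pts))).fit(pts)
        _, nbrs = nn.kneighbors(pts)
        for _ in range(max(deficit // len(safe), 0)):
            i = rng.integers(len(pts))
            j = nbrs[i, rng.integers(1, nbrs.shape[1])]  # column 0 is the point itself
            synthetic.append(pts[i] + rng.random() * (pts[j] - pts[i]))

    if not synthetic:
        return X, y
    return (np.vstack([X, np.asarray(synthetic)]),
            np.concatenate([y, np.full(len(synthetic), minority)]))
```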
Experimental Evaluation
The proposed method was evaluated on 71 datasets, showing consistent improvements in classification performance over standard oversampling techniques such as random oversampling and vanilla SMOTE. Using stratified k-fold cross-validation, the method was paired with several classifiers, including logistic regression (LR), k-nearest neighbors (KNN), and gradient boosting machines (GBM), and assessed with performance metrics suited to imbalanced data: the F1-score, g-mean, and area under the precision-recall curve (AUPRC).
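A hedged sketch of this protocol using scikit-learn and imbalanced-learn, whose `KMeansSMOTE` class implements the paper's method. The synthetic dataset stands in for the benchmark data, the default hyperparameters are illustrative, and on some datasets the defaults find no suitable clusters, in which case parameters such as `cluster_balance_threshold` need tuning:

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import average_precision_score, f1_score
from sklearn.model_selection import StratifiedKFold
from imblearn.metrics import geometric_mean_score
from imblearn.over_sampling import KMeansSMOTE

# Synthetic imbalanced data standing in for the paper's benchmark datasets.
X, y = make_classification(n_samples=2000, n_features=10,
                           weights=[0.95, 0.05], random_state=0)

scores = {"f1": [], "g-mean": [], "auprc": []}
cv = StratifiedKFold(n_splits=5, shuffle=True, random_state=0)
for train, test in cv.split(X, y):
    # Resample the training fold only; the test fold stays untouched.
    X_res, y_res = KMeansSMOTE(random_state=0).fit_resample(X[train], y[train])
    clf = LogisticRegression(max_iter=1000).fit(X_res, y_res)
    pred = clf.predict(X[test])
    scores["f1"].append(f1_score(y[test], pred))
    scores["g-mean"].append(geometric_mean_score(y[test], pred))
    scores["auprc"].append(
        average_precision_score(y[test], clf.predict_proba(X[test])[:, 1]))

print({metric: round(np.mean(vals), 3) for metric, vals in scores.items()})
```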
The paper reports that k-means SMOTE generally outperformed the other methods on these metrics, with the largest gains observed in moderately difficult tasks, those that are neither trivial nor exceedingly challenging.
Theoretical and Practical Implications
Theoretically, integrating clustering with SMOTE introduces informed decision-making into the oversampling process, improving the ability to generate meaningful, noise-resilient synthetic data. By also addressing imbalance within the minority class, the method mitigates the small disjuncts problem (rare subconcepts within a class that classifiers tend to dismiss as noise), enabling classifiers to learn less prevalent yet significant subsets of the minority class.
Practically, because k-means and SMOTE are widely implemented across programming environments, the method is accessible and easily incorporated into existing workflows, as the pipeline sketch below illustrates. This adaptability makes it potentially impactful across domains such as fraud detection, medical diagnosis, and environmental anomaly detection.
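As one concrete example, imbalanced-learn's pipeline applies resamplers during fit only, never at prediction time, so the oversampler slots directly into a standard scikit-learn workflow (a sketch; the classifier choice is illustrative):

```python
from imblearn.over_sampling import KMeansSMOTE
from imblearn.pipeline import Pipeline
from sklearn.linear_model import LogisticRegression

# The Pipeline resamples only when fitting, so evaluation data is never altered.
model = Pipeline([
    ("oversample", KMeansSMOTE(random_state=0)),
    ("classify", LogisticRegression(max_iter=1000)),
])
# Usage: model.fit(X_train, y_train); model.predict(X_test)
```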
Future Directions
Potential future avenues include adaptive clustering techniques to refine cluster formation and general heuristics for setting the hyperparameters. Applying the method to further diverse real-world datasets would help expose its limitations and prompt enhancements.
In conclusion, this paper contributes an effective technique for addressing class imbalance in machine learning, with empirical evidence supporting its performance across multiple domains. Its simplicity, coupled with robust performance improvements, makes it a valuable tool for any researcher or practitioner dealing with imbalanced data.