- The paper introduces SMOTE, a technique that generates synthetic minority-class examples by interpolating between existing minority samples and their nearest neighbors to improve classifier performance on imbalanced datasets.
- It addresses overfitting issues in traditional oversampling by creating new samples that form broader decision regions for the minority class.
- Empirical results demonstrated improved AUC and ROC convex hull metrics across various datasets using classifiers such as C4.5 and Naive Bayes.
SMOTE: Synthetic Minority Over-sampling Technique
The paper "SMOTE: Synthetic Minority Over-sampling Technique" by Nitesh V. Chawla, Kevin W. Bowyer, Lawrence O. Hall, and W. Philip Kegelmeyer presents a novel approach to the construction of classifiers from imbalanced datasets. Imbalanced datasets are common in real-world applications, such as fraud detection and medical diagnosis, where the minority class often represents the cases of interest.
Key Contributions
The core contribution of this paper is the introduction of SMOTE (Synthetic Minority Over-sampling Technique). Unlike traditional approaches such as under-sampling the majority class or over-sampling the minority class through replication, SMOTE improves classifier performance by generating synthetic examples for the minority class. These synthetic samples are created by interpolating between existing minority-class samples and their nearest neighbors, thereby augmenting the minority class in a non-redundant manner.
Methodology
The SMOTE technique works as follows: for each minority-class sample, SMOTE identifies its k nearest minority-class neighbors. To create a synthetic sample, one of these neighbors is selected at random, the difference between the two feature vectors is multiplied by a random number between 0 and 1, and the result is added to the original sample. The new point therefore lies somewhere along the line segment joining the sample and its chosen neighbor, and the process is repeated until the desired amount of over-sampling is reached.
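A minimal sketch of this generation step in NumPy is shown below; the function name, parameters, and the use of Euclidean distance for the neighbor search are illustrative assumptions rather than the paper's reference implementation, and it handles continuous features only (the SMOTE-NC variant for nominal features is discussed later).

```python
import numpy as np

def smote(X_minority, n_per_sample=1, k=5, rng=None):
    """Generate synthetic minority samples by interpolating toward nearest neighbors.

    X_minority   : (n, d) array of minority-class feature vectors (continuous features).
    n_per_sample : synthetic points created per original sample (the over-sampling amount).
    k            : number of nearest neighbors considered.
    """
    rng = np.random.default_rng(rng)
    X = np.asarray(X_minority, dtype=float)
    n = len(X)

    # Pairwise Euclidean distances among minority samples only.
    dists = np.linalg.norm(X[:, None, :] - X[None, :, :], axis=-1)
    np.fill_diagonal(dists, np.inf)               # a point is not its own neighbor
    neighbors = np.argsort(dists, axis=1)[:, :k]  # indices of the k nearest neighbors

    synthetic = []
    for i in range(n):
        for _ in range(n_per_sample):
            j = rng.choice(neighbors[i])          # pick one neighbor at random
            gap = rng.random()                    # random scalar in [0, 1)
            # The new point lies on the segment between X[i] and its neighbor.
            synthetic.append(X[i] + gap * (X[j] - X[i]))
    return np.vstack(synthetic)
```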
Experimental Setup and Results
The efficacy of SMOTE was evaluated using several machine learning algorithms, namely C4.5, Ripper, and Naive Bayes, on nine different datasets. These datasets were Pima Indian Diabetes, Phoneme, Adult, E-state, Satimage, Forest Cover, Oil, Mammography, and Can. The evaluation metric used was the area under the ROC curve (AUC), alongside the ROC convex hull strategy.
The empirical results demonstrated that SMOTE, combined with under-sampling the majority class, outperformed other methods, including plain under-sampling and approaches that vary the loss ratio in Ripper or the class priors in Naive Bayes. For instance, on the Phoneme dataset, SMOTE combined with C4.5 improved both the AUC and the position on the ROC convex hull relative to plain under-sampling with C4.5 and to Naive Bayes.
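To make this combination concrete, here is a hedged sketch using the present-day imbalanced-learn and scikit-learn libraries rather than the paper's original C4.5/Ripper setup; the decision tree stands in for C4.5, and the sampling ratios and synthetic dataset are illustrative assumptions.

```python
# SMOTE on the minority class followed by random under-sampling of the
# majority class, evaluated with AUC (an approximation of the paper's
# best-performing configuration, not its original code).
from sklearn.datasets import make_classification
from sklearn.model_selection import train_test_split
from sklearn.tree import DecisionTreeClassifier
from sklearn.metrics import roc_auc_score
from imblearn.over_sampling import SMOTE
from imblearn.under_sampling import RandomUnderSampler
from imblearn.pipeline import Pipeline

# Synthetic imbalanced data (roughly 5% minority class) for illustration.
X, y = make_classification(n_samples=5000, n_features=10,
                           weights=[0.95, 0.05], random_state=0)
X_train, X_test, y_train, y_test = train_test_split(
    X, y, stratify=y, test_size=0.3, random_state=0)

pipeline = Pipeline([
    # Over-sample the minority class to 50% of the majority class size.
    ("smote", SMOTE(sampling_strategy=0.5, k_neighbors=5, random_state=0)),
    # Then under-sample the majority class down to a 1:1 ratio.
    ("under", RandomUnderSampler(sampling_strategy=1.0, random_state=0)),
    ("tree", DecisionTreeClassifier(random_state=0)),
])
pipeline.fit(X_train, y_train)
scores = pipeline.predict_proba(X_test)[:, 1]
print("AUC:", roc_auc_score(y_test, scores))
```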
Practical and Theoretical Implications
The practical implications of SMOTE are substantial. By generating synthetic samples, SMOTE alleviates the overfitting associated with over-sampling through replication. It also allows classifiers to create broader and more general decision regions for the minority class, thereby improving minority class detection without significantly compromising accuracy on the majority class. This is particularly important in applications where the cost of misclassifying minority instances is significantly higher than that of misclassifying majority instances.
Theoretically, SMOTE introduces a novel perspective on data augmentation in the context of class imbalance. The paper explores the impact of synthetic sample generation on various commonly used classifiers, providing a robust validation across diverse datasets. Moreover, it sets the stage for further research into adaptive selection of nearest neighbors and handling mixed nominal and continuous datasets.
Future Directions
Future developments in this area could explore several extensions:
- Automated Selection of Nearest Neighbors: Developing an algorithm to adaptively select the number of nearest neighbors for synthetic sample generation could further enhance SMOTE's effectiveness.
- Handling Mixed Datasets: Extending SMOTE to handle datasets with a mix of nominal and continuous features more effectively, for example through SMOTE-NC, which the paper briefly explores (see the sketch after this list).
- Iterative Improvements: Introducing iterations where synthetic samples that lead to misclassifications are given more attention could help in refining the decision boundary further.
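As a rough illustration of the SMOTE-NC idea described in the paper, the sketch below penalizes nominal mismatches in the neighbor distance by the median of the continuous features' standard deviations and assigns nominal values of the synthetic sample by majority vote among the neighbors; the function name, signature, and the choice of one synthetic point per sample are assumptions made for illustration, not the paper's exact procedure.

```python
import numpy as np
from collections import Counter

def smote_nc(X_cont, X_nom, k=5, rng=None):
    """Sketch of SMOTE-NC-style generation for mixed continuous/nominal features.

    X_cont : (n, d_c) continuous features of minority samples.
    X_nom  : (n, d_n) nominal features of minority samples (hashable values).
    Returns one synthetic sample per original sample.
    """
    rng = np.random.default_rng(rng)
    X_cont = np.asarray(X_cont, dtype=float)
    X_nom = np.asarray(X_nom, dtype=object)
    n = len(X_cont)

    # Penalty for a nominal mismatch: the median of the continuous features'
    # standard deviations, as described for SMOTE-NC in the paper.
    med_std = np.median(X_cont.std(axis=0))

    # Squared distance = squared Euclidean over continuous features
    # plus med_std^2 for every differing nominal feature (ranking only,
    # so the square root is unnecessary).
    d_cont = ((X_cont[:, None, :] - X_cont[None, :, :]) ** 2).sum(-1)
    d_nom = (X_nom[:, None, :] != X_nom[None, :, :]).sum(-1) * med_std ** 2
    dists = d_cont + d_nom
    np.fill_diagonal(dists, np.inf)
    neighbors = np.argsort(dists, axis=1)[:, :k]

    new_cont, new_nom = [], []
    for i in range(n):
        j = rng.choice(neighbors[i])
        gap = rng.random()
        # Continuous part: interpolate as in plain SMOTE.
        new_cont.append(X_cont[i] + gap * (X_cont[j] - X_cont[i]))
        # Nominal part: most frequent value among the k nearest neighbors.
        new_nom.append([Counter(X_nom[neighbors[i], f]).most_common(1)[0][0]
                        for f in range(X_nom.shape[1])])
    return np.vstack(new_cont), np.array(new_nom, dtype=object)
```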
In summary, the SMOTE technique presents a significant advancement in dealing with imbalanced datasets. By generating synthetic samples, it not only improves the sensitivity of classifiers to the minority class but also addresses the common pitfalls of traditional over-sampling methods. With its robust validation across multiple datasets and classifiers, SMOTE stands as a compelling tool for researchers and practitioners dealing with imbalanced data scenarios.