
Conditional Wasserstein GAN-based Oversampling of Tabular Data for Imbalanced Learning (2008.09202v1)

Published 20 Aug 2020 in cs.LG

Abstract: Class imbalance is a common problem in supervised learning and impedes the predictive performance of classification models. Popular countermeasures include oversampling the minority class. Standard methods like SMOTE rely on finding nearest neighbours and linear interpolations which are problematic in case of high-dimensional, complex data distributions. Generative Adversarial Networks (GANs) have been proposed as an alternative method for generating artificial minority examples as they can model complex distributions. However, prior research on GAN-based oversampling does not incorporate recent advancements from the literature on generating realistic tabular data with GANs. Previous studies also focus on numerical variables whereas categorical features are common in many business applications of classification methods such as credit scoring. The paper proposes an oversampling method based on a conditional Wasserstein GAN that can effectively model tabular datasets with numerical and categorical variables and pays special attention to the downstream classification task through an auxiliary classifier loss. We benchmark our method against standard oversampling methods and the imbalanced baseline on seven real-world datasets. Empirical results evidence the competitiveness of GAN-based oversampling.

Authors (2)
  1. Justin Engelmann (12 papers)
  2. Stefan Lessmann (34 papers)
Citations (191)

Summary

Overview of Conditional Wasserstein GAN-based Oversampling of Tabular Data for Imbalanced Learning

The paper presents a novel approach to address class imbalance in supervised learning using a Conditional Wasserstein GAN (cWGAN) for oversampling of tabular data. The primary focus is on enhancing classification models by generating synthetic minority class samples, particularly in the context of credit scoring datasets that typically exhibit imbalanced distributions.

Key Contributions and Methodology

The research introduces a cWGAN designed for tabular datasets encompassing both numerical and categorical variables. Unlike traditional oversampling techniques such as SMOTE, which rely on nearest-neighbour interpolation between minority examples, the cWGAN leverages the generator-discriminator adversarial framework of GANs to model complex, high-dimensional distributions. This is particularly beneficial because tabular datasets often contain heterogeneous variables and complex interrelations between attributes.
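For contrast with the GAN approach, SMOTE's nearest-neighbour interpolation can be sketched in a few lines of NumPy. This is an illustrative toy, not the reference implementation; the function name and parameters are invented for this sketch:

```python
import numpy as np

def smote_sample(X_min, k=5, n_new=100, seed=None):
    """Minimal SMOTE-style sketch: interpolate between a random minority
    point and one of its k nearest minority neighbours."""
    rng = np.random.default_rng(seed)
    n = len(X_min)
    # pairwise Euclidean distances within the minority class
    d = np.linalg.norm(X_min[:, None, :] - X_min[None, :, :], axis=-1)
    np.fill_diagonal(d, np.inf)                # exclude self-matches
    nn = np.argsort(d, axis=1)[:, :k]          # k nearest neighbours per point
    base = rng.integers(0, n, size=n_new)      # random minority anchor points
    neigh = nn[base, rng.integers(0, k, size=n_new)]
    lam = rng.random((n_new, 1))               # interpolation weights in [0, 1)
    return X_min[base] + lam * (X_min[neigh] - X_min[base])

X_min = np.random.default_rng(0).normal(size=(20, 3))
X_new = smote_sample(X_min, k=3, n_new=50, seed=1)
print(X_new.shape)  # (50, 3)
```

Because every synthetic point lies on a line segment between two existing minority points, SMOTE cannot place samples outside the minority class's convex hull, which is part of why it struggles with complex, high-dimensional distributions.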

The generator within the cWGAN is conditioned on the class label, so synthetic samples can be drawn for a specified class, directly addressing class imbalance. Additionally, the loss function includes an Auxiliary Classifier (AC) component that encourages generated samples to be recognizable as members of their conditioning class, which benefits the downstream classification task.

The paper benchmarks the cWGAN approach against several standard oversampling techniques, including multiple SMOTE variants, across seven real-world credit scoring datasets. The GAN-based method outperforms the traditional approaches on datasets with complex, non-linear relationships, underscoring its capability to improve classification performance in scenarios where oversampling yields tangible benefits.
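The benchmarking protocol reduces to one shared step: oversample the minority class to parity with the majority, then train and evaluate a classifier on the balanced set. A hedged sketch of that step, with a random-duplication baseline standing in for SMOTE or a trained cWGAN generator (all names here are invented for illustration):

```python
import numpy as np

def balance_with_sampler(X, y, sampler, minority=1):
    """Oversample the minority class to parity with the majority.
    `sampler(X_min, n_new)` returns n_new synthetic minority rows,
    e.g. SMOTE interpolations or draws from a trained cWGAN generator."""
    X_min = X[y == minority]
    n_new = int((y != minority).sum() - (y == minority).sum())
    X_syn = sampler(X_min, n_new)
    X_bal = np.vstack([X, X_syn])
    y_bal = np.concatenate([y, np.full(n_new, minority)])
    return X_bal, y_bal

rng = np.random.default_rng(0)
X = rng.normal(size=(100, 4))
y = (rng.random(100) < 0.2).astype(int)   # roughly 1:4 imbalanced labels

# random-duplication baseline; swap in any generative sampler here
sampler = lambda Xm, n: Xm[rng.integers(0, len(Xm), n)]
X_bal, y_bal = balance_with_sampler(X, y, sampler)
```

Keeping the sampler pluggable is what makes the comparison fair: the classifier, evaluation splits, and balancing target stay fixed while only the synthetic-sample generator changes.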

Numerical Results and Implications

Empirical evaluations demonstrate that cWGAN achieves competitive results, surpassing traditional oversampling methods in multiple datasets and scenarios. These findings suggest the importance of considering data complexity when selecting oversampling methods. For tabular data with intricate feature interdependencies and categorical attributes, the generative capabilities of GANs, coupled with the conditional approach, offer a robust solution for mitigating class imbalance.

The performance distinction between cWGAN and traditional techniques becomes more pronounced in datasets characterized by non-linear separability, indicating that cWGAN is particularly effective in complex data environments. This suggests a practical implication where cWGAN should be preferred for datasets with intricate variable interactions and class overlap in variable space.

Future Directions

The paper opens avenues for further exploration into the applicability of GAN-based oversampling across diverse domains plagued by class imbalance. Investigating datasets with even more pronounced class skew or conducting detailed hyperparameter tuning for GAN architectures may yield deeper insights. Additionally, exploring other forms of GAN architectures or loss function modifications could refine the generation of synthetic samples, enhancing performance further.

In sum, the research contributes a significant enhancement to oversampling techniques by incorporating advanced generative models tailored for complex, high-dimensional tabular datasets, setting a precedent for future work in imbalanced learning with deep generative methods.