
Effectiveness of data augmentation on predictive performance

Determine whether class-imbalance data augmentation techniques (such as SMOTE, ADASYN, SVMSMOTE, Random Over-Sampling, Random Under-Sampling, and Cluster Centroids) truly improve the predictive performance of machine-learning classifiers on real-world datasets, rather than merely appearing to help due to evaluation biases.


Background

Data augmentation methods are widely used to address class imbalance by synthesizing or resampling minority-class instances, and many studies claim performance gains from techniques like SMOTE and ADASYN. However, concerns exist that standard evaluation pipelines may allow synthetic samples to leak into testing, yielding artificially inflated metrics.

The paper proposes an evaluation framework that excludes synthetic data from testing and reports that augmentation often does not significantly improve AUC across multiple healthcare and biological datasets. This motivates resolving the broader uncertainty about whether augmentation genuinely enhances predictive performance in practice.
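The leakage concern can be made concrete with a small sketch. The snippet below is a hypothetical illustration (not the paper's EFIDL implementation): it splits the data before any augmentation, applies simple random over-sampling of the minority class (a stand-in for SMOTE/ADASYN) to the training split only, and computes AUC on a test set that contains no synthetic or duplicated points. The dataset, model, and sampling scheme are assumptions for demonstration.

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.model_selection import train_test_split
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import roc_auc_score

# Hypothetical imbalanced dataset: ~90% majority class, ~10% minority.
X, y = make_classification(n_samples=1000, weights=[0.9, 0.1], random_state=0)

# 1. Split BEFORE any augmentation, so resampled/synthetic points
#    can never leak into the test set.
X_tr, X_te, y_tr, y_te = train_test_split(
    X, y, test_size=0.3, stratify=y, random_state=0
)

# 2. Random over-sampling of the minority class, applied to the
#    training split only (a simple stand-in for SMOTE/ADASYN).
rng = np.random.default_rng(0)
minority = np.flatnonzero(y_tr == 1)
majority = np.flatnonzero(y_tr == 0)
extra = rng.choice(minority, size=len(majority) - len(minority), replace=True)
idx = np.concatenate([majority, minority, extra])
X_aug, y_aug = X_tr[idx], y_tr[idx]

# 3. Train on the augmented training data, evaluate AUC on the
#    untouched, real-data-only test set.
clf = LogisticRegression(max_iter=1000).fit(X_aug, y_aug)
auc = roc_auc_score(y_te, clf.predict_proba(X_te)[:, 1])
```

Evaluating this way, any AUC gain (or lack thereof) reflects generalization to real data rather than the classifier re-recognizing synthetic neighbors of test points.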

References

Therefore, there still exists uncertainty as to whether data augmentation can truly help improve prediction performance.

Experimenting with an Evaluation Framework for Imbalanced Data Learning (EFIDL) (2301.10888 - Li et al., 2023) in Introduction