Random Over-Sampling Examples (ROSE) Methods
- ROSE (Random Over-Sampling Examples) is a technique that generates synthetic minority examples via localized kernel density estimation to mitigate severe class imbalance.
- Extensions like robROSE apply robust covariance estimation and outlier filtering to preserve data integrity and enhance model stability.
- Empirical studies demonstrate that robust variants such as robROSE can significantly improve AUC and AUPRC in fraud detection, churn prediction, and other critical applications.
Random Over-Sampling Examples (ROSE) and its robust extension robROSE are synthetic data generation techniques designed to alleviate class imbalance in supervised learning, particularly in high-stakes settings such as fraud detection, where minority-class prevalence may fall below 0.5%. These approaches augment the minority class by generating synthetic observations that aim to preserve the feature distribution, thereby improving classifier sensitivity without excessively distorting the underlying data manifold (Baesens et al., 2020).
1. Foundations of ROSE (Random Over-Sampling Examples)
ROSE addresses severe class imbalance by creating synthetic minority examples through localized kernel density estimation. Let $\{x_i\}_{i=1}^{n_1}$ denote the minority-class training data in $\mathbb{R}^d$ ($n_0$ being the number of majority points). The ROSE mechanism proceeds as follows:
- Kernel Density Estimation: For a point $x$, the minority-class density is estimated as
$$\hat{f}(x) = \frac{1}{n_1} \sum_{i=1}^{n_1} K_H(x - x_i),$$
where $K_H$ is a $d$-variate Gaussian kernel with a symmetric positive-definite smoothing matrix $H$. The standard choice for the Gaussian kernel is
$$K_H(u) = (2\pi)^{-d/2}\, |H|^{-1/2} \exp\!\left(-\tfrac{1}{2}\, u^{\top} H^{-1} u\right).$$
- Bandwidth Selection: The smoothing matrix $H$ is critical. A diagonal $H = \operatorname{diag}(h_1^2, \ldots, h_d^2)$ is built as
$$h_q = c\, \hat{\sigma}_q\, n_1^{-1/(d+4)}, \qquad q = 1, \ldots, d,$$
with $\hat{\sigma}_q$ the sample standard deviation of feature $q$, $c$ a tuning parameter (e.g., $0.7$), and $n_1$ the number of minority-class observations.
- Synthetic Sample Generation: Each synthetic minority-class point is drawn by selecting an index $i \in \{1, \ldots, n_1\}$ uniformly at random and sampling $\varepsilon \sim N(0, H)$, then setting $x^{\text{new}} = x_i + \varepsilon$, i.e., $x^{\text{new}} \sim N(x_i, H)$. This process is repeated until the minority class reaches the desired cardinality.
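A minimal NumPy sketch of this generation loop is given below, assuming the diagonal bandwidth matrix described above; the function name `rose_oversample` and the default tuning constant `c=0.7` are illustrative choices, not a reference implementation.

```python
import numpy as np

def rose_oversample(X_min, n_new, c=0.7, rng=None):
    """ROSE-style sampling: perturb random minority seeds with Gaussian noise.

    X_min : (n1, d) array of minority-class observations
    n_new : number of synthetic points to generate
    c     : bandwidth tuning constant (illustrative default)
    """
    rng = np.random.default_rng(rng)
    n1, d = X_min.shape
    # Featurewise bandwidths: h_q = c * sigma_q * n1^(-1/(d+4))
    h = c * X_min.std(axis=0, ddof=1) * n1 ** (-1.0 / (d + 4))
    # Select seed points uniformly at random from the minority class
    idx = rng.integers(0, n1, size=n_new)
    # x_new = x_i + eps, with eps ~ N(0, diag(h^2))
    return X_min[idx] + rng.normal(size=(n_new, d)) * h
```

In use, the returned array is simply stacked onto the training features with minority labels until the desired class ratio is reached.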
2. The robROSE Methodology for Robust Random Over-Sampling
robROSE extends ROSE by enhancing robustness to outliers, targeting the pervasive risk that synthetic sampling from contaminated or anomalous minority points distorts the classifier’s ability to generalize (Baesens et al., 2020). The method involves:
- Robust Estimation: The Fast Minimum Covariance Determinant (FastMCD) estimator provides a robust estimate of the minority-class center $\hat{\mu}_{\text{MCD}}$ and scatter $\hat{\Sigma}_{\text{MCD}}$, tolerating up to nearly 50% outliers.
- Outlier Detection: For each minority observation $x_i$, the robust Mahalanobis distance is computed:
$$d_i = \sqrt{(x_i - \hat{\mu}_{\text{MCD}})^{\top}\, \hat{\Sigma}_{\text{MCD}}^{-1}\, (x_i - \hat{\mu}_{\text{MCD}})}.$$
Points with $d_i^2 > \chi^2_{d,\,0.999}$ are excluded (99.9% quantile cutoff), yielding a filtered index set $\mathcal{I}$.
- Covariance for Sampling: The sampling covariance is obtained by uniformly (isotropically) shrinking the robust scatter estimate:
$$H = h\, \hat{\Sigma}_{\text{MCD}},$$
where $h$ is a smoothing (shrink) parameter.
- robROSE Sampling Protocol: For a desired number of synthetic points $N_{\text{syn}}$ (set by the oversampling factor):
  - Robust fit: $(\hat{\mu}_{\text{MCD}}, \hat{\Sigma}_{\text{MCD}}) \leftarrow$ FastMCD on the minority class.
  - Outlier filter: $\mathcal{I} = \{\, i : d_i^2 \le \chi^2_{d,\,0.999} \,\}$.
  - While fewer than $N_{\text{syn}}$ synthetic points have been generated:
    - Sample an index $i$ uniformly from $\mathcal{I}$.
    - Draw $x^{\text{new}} \sim N(x_i,\, h\, \hat{\Sigma}_{\text{MCD}})$ and add it to the minority class.
This exclusion of outliers prior to sampling and use of robust global covariance distinguish robROSE from standard ROSE and other kernel-based oversamplers.
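The following sketch implements this protocol with scikit-learn's `MinCovDet` (FastMCD) and a chi-squared cutoff; the function name `rob_rose`, its defaults, and the scalar shrink of the MCD scatter are assumptions made for illustration rather than a reference implementation.

```python
import numpy as np
from scipy.stats import chi2
from sklearn.covariance import MinCovDet

def rob_rose(X_min, n_new, h=0.7, quantile=0.999, support_fraction=None, rng=None):
    """robROSE-style oversampling: robust fit, outlier filter, Gaussian sampling."""
    rng = np.random.default_rng(rng)
    n1, d = X_min.shape
    # 1. Robust fit: FastMCD estimate of the minority-class center and scatter
    mcd = MinCovDet(support_fraction=support_fraction, random_state=0).fit(X_min)
    sigma = mcd.covariance_
    # 2. Outlier filter: squared robust Mahalanobis distances vs. chi2 cutoff
    d2 = mcd.mahalanobis(X_min)                     # squared distances to the MCD fit
    keep = np.flatnonzero(d2 <= chi2.ppf(quantile, df=d))
    # 3. Sampling: seeds drawn uniformly from the filtered set,
    #    each perturbed with N(0, h * Sigma_MCD)
    idx = rng.choice(keep, size=n_new, replace=True)
    noise = rng.multivariate_normal(np.zeros(d), h * sigma, size=n_new)
    return X_min[idx] + noise
```

Raising `support_fraction` toward 1 lowers the effective MCD breakdown point, mirroring the parameter discussion in Section 4.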
3. Empirical Performance in Simulated and Real-World Experiments
Empirical evaluations employ AUC and AUPRC on held-out splits to quantify classifier performance across various contamination settings:
- Simulated Gaussian Data: For clean data, ROSE, SMOTE, and robROSE yield marginal AUC gains with little AUPRC improvement. With 10% minority-class contamination (outliers drawn from a distant Gaussian), SMOTE and ROSE may increase AUC but can degrade AUPRC. robROSE prevents this degradation, achieving the largest improvements in both metrics (e.g., AUC lift of +0.15, AUPRC for logistic regression from ~0.16 to 0.23).
- Credit Card Fraud Detection (Kaggle): With 284,807 transactions (0.17% fraud), after oversampling to a 10% minority ratio:
- For CART: AUC increases from 0.899 to 0.910 after robROSE augmentation; AUPRC remains stable.
- For LR: AUC rises from 0.972 to 0.976 (the highest, achieved with ROSE), with robROSE on par; AUPRC rises from 0.755 to 0.762 after robROSE or SMOTE.
- Churn Prediction: For Korean corporate churn (original rate 22.6%, downsampled to 5%/1%), LR AUPRC at 5% increases from 0.130 (imbalanced) to 0.206 with robROSE, outperforming SMOTE/ROSE.
A plausible implication is that robROSE’s selective filtering and covariance-aware sampling specifically counteract the negative impacts of outlier-driven synthetic data proliferation, especially as class imbalance or contamination levels increase (Baesens et al., 2020).
4. Algorithmic Parameters and Implementation Considerations
- Bandwidth Hyperparameter $h$: Moderate values (e.g., up to $0.7$) are effective; the sampling covariance is typically set as $H = h\, \hat{\Sigma}_{\text{MCD}}$, or by robust featurewise scaling.
- Outlier Cutoff: The default 99.9% $\chi^2_d$ quantile (conservative) can be relaxed to a lower quantile if the contamination structure is uncertain.
- MCD Breakdown Point: The standard setting tolerates up to 50% contamination; it can be reduced (e.g., to 25%) when lower anomaly rates are expected.
- Categorical Handling: For mixed data, a synthetic point $x^{\text{new}}$ inherits its categorical levels from the seed observation $x_i$, while continuous components are sampled from the Gaussian kernel (see the sketch after this list).
- Model-Agnostic Utility: robROSE is not constrained to a specific classifier—applicable with trees, SVMs, neural networks, etc.
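As a sketch of the categorical-handling rule, the helper below (hypothetical name `sample_mixed`, pandas-based, with `H` the sampling covariance for the continuous features) perturbs only the continuous columns with the Gaussian kernel and copies categorical levels verbatim from each seed row.

```python
import numpy as np
import pandas as pd

def sample_mixed(df_min, cont_cols, H, n_new, rng=None):
    """Mixed-type synthesis: Gaussian noise on the continuous columns,
    all other (categorical) columns inherited from the seed row."""
    rng = np.random.default_rng(rng)
    idx = rng.integers(0, len(df_min), size=n_new)
    seeds = df_min.iloc[idx].reset_index(drop=True)     # seed rows, categories included
    noise = rng.multivariate_normal(np.zeros(len(cont_cols)), H, size=n_new)
    out = seeds.copy()
    out[cont_cols] = seeds[cont_cols].to_numpy() + noise   # perturb continuous parts only
    return out
```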
Best practices include reserving an untouched test set and avoiding synthetic point contamination during evaluation.
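A short end-to-end sketch of this evaluation hygiene is shown below: the data are split first, oversampling (here the `rob_rose` sketch from Section 2) is applied to the training portion only, and AUC/AUPRC are computed on the untouched test split. The toy dataset, the rough 10% target ratio, and the classifier choice are illustrative assumptions.

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import average_precision_score, roc_auc_score
from sklearn.model_selection import train_test_split

# Toy imbalanced data standing in for a fraud-style problem (~0.5% positives)
X, y = make_classification(n_samples=20000, n_features=10, weights=[0.995],
                           flip_y=0, random_state=0)
X_tr, X_te, y_tr, y_te = train_test_split(X, y, test_size=0.3,
                                          stratify=y, random_state=0)

# Oversample ONLY the training minority class; the test split stays untouched
X_min = X_tr[y_tr == 1]
n_new = int(0.10 * len(y_tr)) - len(X_min)       # rough 10% minority target
X_syn = rob_rose(X_min, n_new=max(n_new, 0))     # sketch from Section 2

X_aug = np.vstack([X_tr, X_syn])
y_aug = np.concatenate([y_tr, np.ones(len(X_syn), dtype=int)])

clf = LogisticRegression(max_iter=1000).fit(X_aug, y_aug)
scores = clf.predict_proba(X_te)[:, 1]
print("AUC  :", roc_auc_score(y_te, scores))
print("AUPRC:", average_precision_score(y_te, scores))
```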
5. Comparative Analysis with Related Sampling Techniques
| Method | Outlier Robustness | Covariance / Generation Mechanism |
|---|---|---|
| ROSE | None | Local, featurewise diagonal bandwidth |
| SMOTE | None | No covariance model (linear interpolation between neighbors) |
| robROSE | MCD-based filter | Robust, global (MCD) covariance |
All three methods—ROSE, SMOTE, robROSE—improve class balance, but only robROSE explicitly resists outlier-driven artifacts and retains global covariance features, thereby yielding more stable and interpretable improvements in both AUC and AUPRC in contaminated and uncontaminated regimes (Baesens et al., 2020).
6. Practical Guidelines and Limitations
- Parameter Tuning: The bandwidth/shrink hyperparameter $h$ and the outlier-cutoff quantile can be optimized via cross-validation.
- Robustness to Anomalies: robROSE’s pre-filtering effectively safeguards against minority-class contamination.
- Final Evaluation: Only original (non-synthetic) test data should be used for model evaluation, preserving the integrity of generalization assessments.
These protocols ensure that random over-sampling mechanisms, especially in their robust forms, contribute to improved minority-class learnability without sacrificing diagnostic specificity or model interpretability.