Random Over-Sampling (ROS) Methods

Updated 4 February 2026
  • Random Over-Sampling Examples (ROSE) generates synthetic minority-class observations via localized kernel density estimation to mitigate severe class imbalance.
  • Extensions like robROSE apply robust covariance estimation and outlier filtering to preserve data integrity and enhance model stability.
  • Empirical studies demonstrate that robust ROS variants can significantly improve AUC and AUPRC in fraud detection, churn prediction, and other critical applications.

Random Over-Sampling (ROS) and its robust extensions are synthetic data generation techniques designed to alleviate class imbalance in supervised learning, particularly in high-stakes settings such as fraud detection, where minority class prevalence may fall below 0.5%. These approaches augment the minority class by generating synthetic observations that aim to preserve the feature distribution, thereby improving classifier sensitivity without excessively distorting the underlying data manifold (Baesens et al., 2020).

1. Foundations of ROSE (Random Over-Sampling Examples)

ROSE addresses severe class imbalance by creating synthetic minority examples through localized kernel density estimation. Let $X_1 = \{x_1, \ldots, x_{n_1}\} \subset \mathbb{R}^p$ denote the minority-class training data, with $n_1 \ll n_0$, where $n_0$ is the number of majority points. The ROSE mechanism proceeds as follows:

  • Kernel Density Estimation: For a point $x$, the minority-class distribution is estimated as

$$\hat{f}(x) = \frac{1}{n_1} \sum_{i=1}^{n_1} K_H(x - x_i)$$

where $K_H$ is a $p$-variate Gaussian kernel with a symmetric positive-definite smoothing matrix $H \in \mathbb{R}^{p \times p}$. The standard choice for the Gaussian kernel is

$$K_H(u) = (2\pi)^{-p/2}\, |H|^{-1/2} \exp\!\left[-\tfrac{1}{2} u^{T} H^{-1} u\right]$$

  • Bandwidth Selection: The smoothing matrix $H$ is critical. A diagonal $H$ is built as

$$H = \operatorname{diag}\!\left((h\, c\, \sigma_j)^2\right)_{j=1}^{p}$$

with $\sigma_j$ the sample standard deviation of feature $j$, $h > 0$ a tuning parameter (typically $h \approx 0.3$–$0.7$), and $c = \left(4 / ((p+2)\, n_1)\right)^{1/(p+4)}$.

  • Synthetic Sample Generation: Each synthetic minority-class point $z$ is drawn by selecting $x_i$ uniformly at random, sampling $\varepsilon \sim N_p(0, H)$, and setting $z = x_i + \varepsilon$. This process is repeated until the minority class reaches the desired cardinality (a code sketch of this generator follows the list).
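
The generator described above can be prototyped in a few lines of NumPy. The snippet below is a minimal sketch under the assumptions stated in this section (purely numeric features, diagonal bandwidth with the rule-of-thumb constant $c$); the function name rose_sample and its defaults are illustrative rather than part of any published implementation.

```python
import numpy as np

def rose_sample(X1, n_new, h=0.5, rng=None):
    """Minimal ROSE-style sketch: jitter randomly chosen minority seeds with Gaussian noise."""
    rng = np.random.default_rng(rng)
    n1, p = X1.shape
    # Rule-of-thumb constant c = (4 / ((p + 2) * n1))^(1 / (p + 4))
    c = (4.0 / ((p + 2) * n1)) ** (1.0 / (p + 4))
    # Diagonal smoothing matrix H = diag((h * c * sigma_j)^2), stored as per-feature std devs
    std_vec = h * c * X1.std(axis=0, ddof=1)
    # Pick seed points uniformly at random and add eps ~ N_p(0, H)
    seeds = X1[rng.integers(0, n1, size=n_new)]
    eps = rng.normal(size=(n_new, p)) * std_vec
    return seeds + eps

# Example: grow a 50-point minority class by 450 synthetic observations
X1 = np.random.default_rng(0).normal(size=(50, 4))
Z = rose_sample(X1, n_new=450, h=0.5, rng=1)
print(Z.shape)  # (450, 4)
```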

2. The robROSE Methodology for Robust Random Over-Sampling

robROSE extends ROSE by enhancing robustness to outliers, targeting the pervasive risk that synthetic sampling from contaminated or anomalous minority points distorts the classifier’s ability to generalize (Baesens et al., 2020). The method involves:

  • Robust Estimation: The Fast Minimum Covariance Determinant (MCD) estimator provides a robust estimate of the minority-class center $\hat{\mu}_1$ and scatter $\hat{\Sigma}_1$, tolerating up to nearly 50% outliers.
  • Outlier Detection: For each $x_i \in X_1$, the robust Mahalanobis distance is computed:

$$\mathrm{MD}_i^2 = (x_i - \hat{\mu}_1)^{T} \hat{\Sigma}_1^{-1} (x_i - \hat{\mu}_1)$$

Points with $\mathrm{MD}_i^2 > \chi^2_{p,\,0.999}$ are excluded (99.9% quantile cutoff), yielding a filtered index set $I$.

  • Covariance for Sampling: The smoothing matrix is a scalar multiple of the identity, so the sampling covariance is a scaled version of the robust scatter:

$$H = h\, c\, I_p, \qquad \Sigma_x = H \hat{\Sigma}_1 H = (h c)^2\, \hat{\Sigma}_1$$

  • robROSE Sampling Protocol: For a desired oversampling factor $T$ (a code sketch follows this list):
  1. Robust fit: $(\hat{\mu}_1, \hat{\Sigma}_1) \leftarrow \mathrm{FastMCD}(X_1)$
  2. Outlier filter: $I = \{\, i : \mathrm{MD}_i^2 < \chi^2_{p,\,0.999} \,\}$
  3. While $|Z| < T n_1 - n_1$:
    • Sample $j$ uniformly from $I$
    • Draw $z \sim N_p(x_j, \Sigma_x)$
    • $Z \leftarrow Z \cup \{z\}$
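
As a rough illustration of this protocol, the sketch below uses scikit-learn's MinCovDet (a Fast MCD implementation) and SciPy's chi-squared quantile function. It follows the steps listed above under stated assumptions (numeric features, default 0.999 cutoff); the helper name robrose_sample and its defaults are hypothetical, and this is not the authors' reference implementation.

```python
import numpy as np
from scipy.stats import chi2
from sklearn.covariance import MinCovDet

def robrose_sample(X1, T=5.0, h=0.5, quantile=0.999, rng=None):
    """Sketch of robROSE: robust MCD fit, outlier filter, then Gaussian jitter around kept seeds."""
    rng = np.random.default_rng(rng)
    n1, p = X1.shape
    # 1. Robust center and scatter of the minority class via Fast MCD
    mcd = MinCovDet().fit(X1)
    md2 = mcd.mahalanobis(X1)                        # squared robust Mahalanobis distances
    # 2. Outlier filter: keep indices inside the chi-squared cutoff
    keep = np.flatnonzero(md2 < chi2.ppf(quantile, df=p))
    # 3. Sampling covariance Sigma_x = H * Sigma_hat * H with H = h * c * I_p
    c = (4.0 / ((p + 2) * n1)) ** (1.0 / (p + 4))
    Sigma_x = (h * c) ** 2 * mcd.covariance_
    # Generate T*n1 - n1 synthetic points around uniformly chosen non-outlying seeds
    n_new = int(T * n1) - n1
    seeds = X1[rng.choice(keep, size=n_new)]
    noise = rng.multivariate_normal(np.zeros(p), Sigma_x, size=n_new)
    return seeds + noise
```

In practice the synthetic points produced by such a sampler are appended to the training data only, leaving validation and test splits untouched (see the practical guidelines and evaluation sketch below).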

This exclusion of outliers prior to sampling and use of robust global covariance distinguish robROSE from standard ROSE and other kernel-based oversamplers.

3. Empirical Performance in Simulated and Real-World Experiments

Empirical evaluations employ AUC and AUPRC on held-out splits to quantify classifier performance across various contamination settings:

  • Simulated Gaussian Data: For clean data, ROSE, SMOTE, and robROSE yield marginal AUC gains with little AUPRC improvement. With 10% minority-class contamination (outliers drawn from a distant Gaussian), SMOTE and ROSE may increase AUC but can degrade AUPRC. robROSE prevents this degradation, achieving the largest improvements in both metrics (e.g., AUC lift of +0.15, AUPRC for logistic regression from ~0.16 to 0.23).
  • Credit Card Fraud Detection (Kaggle): With 284,807 samples (0.17% fraud), after oversampling to 10% minority ratio:
    • For CART: AUC increases from 0.899 to 0.910 post-robROSE augmentation; AUPRC remains stable.
    • For LR: AUC rises from 0.972 to 0.976 (the highest value, attained with ROSE), with robROSE on par; AUPRC rises from 0.755 to 0.762 after robROSE or SMOTE.
  • Churn Prediction: For Korean corporate churn (original rate 22.6%, downsampled to 5%/1%), LR AUPRC at 5% increases from 0.130 (imbalanced) to 0.206 with robROSE, outperforming SMOTE/ROSE.

A plausible implication is that robROSE’s selective filtering and covariance-aware sampling specifically counteract the negative impacts of outlier-driven synthetic data proliferation, especially as class imbalance or contamination levels increase (Baesens et al., 2020).

4. Algorithmic Parameters and Implementation Considerations

  • Bandwidth Hyperparameter $h$: Values in $[0.3, 0.7]$ are effective; $H$ is typically set as $h\, c\, I_p$, or by robust featurewise scaling.
  • Outlier Cutoff: The default $\chi^2_{p,\,0.999}$ (conservative) can be relaxed to $\chi^2_{p,\,0.975}$ if the contamination structure is uncertain.
  • MCD Breakdown Point: The standard value is 50%; it can be reduced (e.g., to 25%) when lower anomaly rates are expected.
  • Categorical Handling: For mixed data, a synthetic point $z$ inherits its categorical levels from the seed $x_j$, while only the continuous components are perturbed (see the sketch after this list).
  • Model-Agnostic Utility: robROSE is not tied to a specific classifier and can be paired with trees, SVMs, neural networks, and other models.
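
The categorical-inheritance rule can be expressed as a thin wrapper around the continuous sampler. The snippet below is an illustrative sketch only: the column split into X_num/X_cat and the helper name mixed_rose_sample are assumptions, not part of any published API.

```python
import numpy as np

def mixed_rose_sample(X_num, X_cat, n_new, h=0.5, rng=None):
    """ROSE-style sketch for mixed data: continuous columns get kernel noise,
    categorical columns are copied unchanged from the same seed row."""
    rng = np.random.default_rng(rng)
    n1, p = X_num.shape
    c = (4.0 / ((p + 2) * n1)) ** (1.0 / (p + 4))
    std_vec = h * c * X_num.std(axis=0, ddof=1)
    idx = rng.integers(0, n1, size=n_new)            # one seed row drives both blocks
    Z_num = X_num[idx] + rng.normal(size=(n_new, p)) * std_vec
    Z_cat = X_cat[idx]                               # categorical levels inherited verbatim
    return Z_num, Z_cat
```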

Best practices include reserving an untouched test set and avoiding synthetic point contamination during evaluation.
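
To make this concrete, the sketch below applies oversampling to the training split only and computes AUC and AUPRC on the untouched test set. The dataset generated by make_classification is purely illustrative, and the call to robrose_sample assumes the hypothetical helper defined in the earlier sketch.

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.model_selection import train_test_split
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import roc_auc_score, average_precision_score

# Illustrative imbalanced dataset (about 2% positives); any real dataset can be substituted
X, y = make_classification(n_samples=20000, n_features=10, weights=[0.98, 0.02], random_state=0)

X_tr, X_te, y_tr, y_te = train_test_split(X, y, test_size=0.3, stratify=y, random_state=0)

# Oversample the minority class in the training split only
X1 = X_tr[y_tr == 1]
Z = robrose_sample(X1, T=5.0)                      # hypothetical helper from the earlier sketch
X_aug = np.vstack([X_tr, Z])
y_aug = np.concatenate([y_tr, np.ones(len(Z))])

# Fit on augmented training data; evaluate only on the untouched test split
clf = LogisticRegression(max_iter=1000).fit(X_aug, y_aug)
scores = clf.predict_proba(X_te)[:, 1]
print("AUC:  ", roc_auc_score(y_te, scores))
print("AUPRC:", average_precision_score(y_te, scores))
```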

5. Comparative Summary of Oversampling Methods

Method     Outlier Robustness    Covariance Structure
ROSE       None                  Local / featurewise
SMOTE      None                  Linear interpolation
robROSE    MCD-based filter      Robust, global

All three methods—ROSE, SMOTE, robROSE—improve class balance, but only robROSE explicitly resists outlier-driven artifacts and retains global covariance features, thereby yielding more stable and interpretable improvements in both AUC and AUPRC in contaminated and uncontaminated regimes (Baesens et al., 2020).

6. Practical Guidelines and Limitations

  • Parameter Tuning: Hyperparameters hh and cutoff quantile can be optimized via cross-validation.
  • Robustness to Anomalies: robROSE’s pre-filtering effectively safeguards against minority-class contamination.
  • Final Evaluation: Only original (non-synthetic) test data should be used for model evaluation, preserving the integrity of generalization assessments.

These protocols ensure that random over-sampling mechanisms, especially in their robust forms, contribute to improved minority-class learnability without sacrificing diagnostic specificity or model interpretability.

References (1)
