Random Over-Sampling Examples (ROSE) Methods
- ROSE (Random Over-Sampling Examples) is a technique that generates synthetic minority examples via localized kernel density estimation to mitigate severe class imbalance.
- Extensions like robROSE apply robust covariance estimation and outlier filtering to preserve data integrity and enhance model stability.
- Empirical studies demonstrate that robust variants such as robROSE can significantly improve AUC and AUPRC in fraud detection, churn prediction, and other critical applications.
Random Over-Sampling Examples (ROSE) and its robust extension robROSE are synthetic data generation techniques designed to alleviate class imbalance in supervised learning, particularly in high-stakes settings such as fraud detection, where minority-class prevalence may fall below 0.5%. These approaches augment the minority class by generating synthetic observations that aim to preserve the feature distribution, thereby improving classifier sensitivity without excessively distorting the underlying data manifold (Baesens et al., 2020).
1. Foundations of ROSE (Random Over-Sampling Examples)
ROSE addresses severe class imbalance by creating synthetic minority examples through localized kernel density estimation. Let $\{x_i\}_{i=1}^{n_1}$ denote the minority-class training data in $\mathbb{R}^d$ ($n_0$ being the number of majority points). The ROSE mechanism proceeds as follows:
- Kernel Density Estimation: For a point $x$, the minority-class density is estimated as
$$\hat{f}(x) = \frac{1}{n_1} \sum_{i=1}^{n_1} K_H(x - x_i),$$
where $K_H$ is a $d$-variate Gaussian kernel with a symmetric positive-definite smoothing matrix $H$. The standard choice for the Gaussian kernel is
$$K_H(u) = (2\pi)^{-d/2}\, |H|^{-1/2} \exp\!\left(-\tfrac{1}{2}\, u^{\top} H^{-1} u\right).$$
- Bandwidth Selection: The smoothing matrix $H$ is critical. A diagonal $H = \operatorname{diag}(h_1^2, \ldots, h_d^2)$ is built as
$$h_q = c\, \hat{\sigma}_q\, n_1^{-1/(d+4)}, \qquad q = 1, \ldots, d,$$
with $\hat{\sigma}_q$ the sample standard deviation of feature $q$, $c$ a tuning parameter (e.g., $0.7$), and $n_1$ the number of minority-class observations.
- Synthetic Sample Generation: Each synthetic minority-class point is drawn by selecting an index $i \in \{1, \ldots, n_1\}$ uniformly at random and sampling $\varepsilon \sim N(0, H)$, then setting $x^{\text{new}} = x_i + \varepsilon$, i.e., $x^{\text{new}} \sim N(x_i, H)$. This process is repeated until the minority class reaches the desired cardinality.
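A minimal NumPy sketch of this generation loop is given below, assuming the diagonal bandwidth matrix described above; the function name `rose_oversample` and the default tuning constant `c=0.7` are illustrative choices, not a reference implementation.

```python
import numpy as np

def rose_oversample(X_min, n_new, c=0.7, rng=None):
    """ROSE-style sampling: perturb random minority seeds with Gaussian noise.

    X_min : (n1, d) array of minority-class observations
    n_new : number of synthetic points to generate
    c     : bandwidth tuning constant (illustrative default)
    """
    rng = np.random.default_rng(rng)
    n1, d = X_min.shape
    # Featurewise bandwidths: h_q = c * sigma_q * n1^(-1/(d+4))
    h = c * X_min.std(axis=0, ddof=1) * n1 ** (-1.0 / (d + 4))
    # Select seed points uniformly at random from the minority class
    idx = rng.integers(0, n1, size=n_new)
    # x_new = x_i + eps, with eps ~ N(0, diag(h^2))
    return X_min[idx] + rng.normal(size=(n_new, d)) * h
```

In use, the returned array is simply stacked onto the training features with minority labels until the desired class ratio is reached.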
2. The robROSE Methodology for Robust Random Over-Sampling
robROSE extends ROSE by enhancing robustness to outliers, targeting the pervasive risk that synthetic sampling from contaminated or anomalous minority points distorts the classifier’s ability to generalize (Baesens et al., 2020). The method involves:
- Robust Estimation: The Fast Minimum Covariance Determinant (FastMCD) estimator provides a robust estimate of the minority-class center $\hat{\mu}_{\text{MCD}}$ and scatter $\hat{\Sigma}_{\text{MCD}}$, tolerating up to nearly 50% outliers.
- Outlier Detection: For each minority observation $x_i$, the robust Mahalanobis distance is computed:
$$d_i = \sqrt{(x_i - \hat{\mu}_{\text{MCD}})^{\top}\, \hat{\Sigma}_{\text{MCD}}^{-1}\, (x_i - \hat{\mu}_{\text{MCD}})}.$$
Points with $d_i^2 > \chi^2_{d,\,0.999}$ are excluded (99.9% quantile cutoff), yielding a filtered index set $\mathcal{I}$.
- Covariance for Sampling: The sampling covariance is obtained by uniformly (isotropically) shrinking the robust scatter estimate:
$$H = h\, \hat{\Sigma}_{\text{MCD}},$$
where $h$ is a smoothing (shrink) parameter.
- robROSE Sampling Protocol: For a desired number of synthetic points $N_{\text{syn}}$ (set by the oversampling factor):
  - Robust fit: $(\hat{\mu}_{\text{MCD}}, \hat{\Sigma}_{\text{MCD}}) \leftarrow$ FastMCD on the minority class.
  - Outlier filter: $\mathcal{I} = \{\, i : d_i^2 \le \chi^2_{d,\,0.999} \,\}$.
  - While fewer than $N_{\text{syn}}$ synthetic points have been generated:
    - Sample an index $i$ uniformly from $\mathcal{I}$.
    - Draw $x^{\text{new}} \sim N(x_i,\, h\, \hat{\Sigma}_{\text{MCD}})$ and add it to the minority class.
This exclusion of outliers prior to sampling and use of robust global covariance distinguish robROSE from standard ROSE and other kernel-based oversamplers.
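The following sketch implements this protocol with scikit-learn's `MinCovDet` (FastMCD) and a chi-squared cutoff; the function name `rob_rose`, its defaults, and the scalar shrink of the MCD scatter are assumptions made for illustration rather than a reference implementation.

```python
import numpy as np
from scipy.stats import chi2
from sklearn.covariance import MinCovDet

def rob_rose(X_min, n_new, h=0.7, quantile=0.999, support_fraction=None, rng=None):
    """robROSE-style oversampling: robust fit, outlier filter, Gaussian sampling."""
    rng = np.random.default_rng(rng)
    n1, d = X_min.shape
    # 1. Robust fit: FastMCD estimate of the minority-class center and scatter
    mcd = MinCovDet(support_fraction=support_fraction, random_state=0).fit(X_min)
    sigma = mcd.covariance_
    # 2. Outlier filter: squared robust Mahalanobis distances vs. chi2 cutoff
    d2 = mcd.mahalanobis(X_min)                     # squared distances to the MCD fit
    keep = np.flatnonzero(d2 <= chi2.ppf(quantile, df=d))
    # 3. Sampling: seeds drawn uniformly from the filtered set,
    #    each perturbed with N(0, h * Sigma_MCD)
    idx = rng.choice(keep, size=n_new, replace=True)
    noise = rng.multivariate_normal(np.zeros(d), h * sigma, size=n_new)
    return X_min[idx] + noise
```

Raising `support_fraction` toward 1 lowers the effective MCD breakdown point, mirroring the parameter discussion in Section 4.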
3. Empirical Performance in Simulated and Real-World Experiments
Empirical evaluations employ AUC and AUPRC on held-out splits to quantify classifier performance across various contamination settings:
- Simulated Gaussian Data: For clean data, ROSE, SMOTE, and robROSE yield marginal AUC gains with little AUPRC improvement. With 10% minority-class contamination (outliers drawn from a distant Gaussian), SMOTE and ROSE may increase AUC but can degrade AUPRC. robROSE prevents this degradation, achieving the largest improvements in both metrics (e.g., AUC lift of +0.15, AUPRC for logistic regression from ~0.16 to 0.23).
- Credit Card Fraud Detection (Kaggle): With 284,807 transactions (0.17% fraud), after oversampling to a 10% minority ratio:
- For CART: AUC increases from 0.899 to 0.910 after robROSE augmentation; AUPRC remains stable.
- For LR: AUC rises from 0.972 to 0.976 (the highest, achieved with ROSE), with robROSE on par; AUPRC rises from 0.755 to 0.762 after robROSE or SMOTE.
- Churn Prediction: For Korean corporate churn (original rate 22.6%, downsampled to 5%/1%), LR AUPRC at 5% increases from 0.130 (imbalanced) to 0.206 with robROSE, outperforming SMOTE/ROSE.
A plausible implication is that robROSE’s selective filtering and covariance-aware sampling specifically counteract the negative impacts of outlier-driven synthetic data proliferation, especially as class imbalance or contamination levels increase (Baesens et al., 2020).
4. Algorithmic Parameters and Implementation Considerations
- Bandwidth Hyperparameter $h$: Moderate values (e.g., up to $0.7$) are effective; the sampling covariance is typically set as $H = h\, \hat{\Sigma}_{\text{MCD}}$, or by robust featurewise scaling.
- Outlier Cutoff: The default 99.9% $\chi^2_d$ quantile (conservative) can be relaxed to a lower quantile if the contamination structure is uncertain.
- MCD Breakdown Point: The standard setting tolerates up to 50% contamination; it can be reduced (e.g., to 25%) when lower anomaly rates are expected.
- Categorical Handling: For mixed data, a synthetic point $x^{\text{new}}$ inherits its categorical levels from the seed observation $x_i$, while continuous components are sampled from the Gaussian kernel (see the sketch after this list).
- Model-Agnostic Utility: robROSE is not constrained to a specific classifier—applicable with trees, SVMs, neural networks, etc.
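As a sketch of the categorical-handling rule, the helper below (hypothetical name `sample_mixed`, pandas-based, with `H` the sampling covariance for the continuous features) perturbs only the continuous columns with the Gaussian kernel and copies categorical levels verbatim from each seed row.

```python
import numpy as np
import pandas as pd

def sample_mixed(df_min, cont_cols, H, n_new, rng=None):
    """Mixed-type synthesis: Gaussian noise on the continuous columns,
    all other (categorical) columns inherited from the seed row."""
    rng = np.random.default_rng(rng)
    idx = rng.integers(0, len(df_min), size=n_new)
    seeds = df_min.iloc[idx].reset_index(drop=True)     # seed rows, categories included
    noise = rng.multivariate_normal(np.zeros(len(cont_cols)), H, size=n_new)
    out = seeds.copy()
    out[cont_cols] = seeds[cont_cols].to_numpy() + noise   # perturb continuous parts only
    return out
```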
Best practices include reserving an untouched test set and avoiding synthetic point contamination during evaluation.
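A short end-to-end sketch of this evaluation hygiene is shown below: the data are split first, oversampling (here the `rob_rose` sketch from Section 2) is applied to the training portion only, and AUC/AUPRC are computed on the untouched test split. The toy dataset, the rough 10% target ratio, and the classifier choice are illustrative assumptions.

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import average_precision_score, roc_auc_score
from sklearn.model_selection import train_test_split

# Toy imbalanced data standing in for a fraud-style problem (~0.5% positives)
X, y = make_classification(n_samples=20000, n_features=10, weights=[0.995],
                           flip_y=0, random_state=0)
X_tr, X_te, y_tr, y_te = train_test_split(X, y, test_size=0.3,
                                          stratify=y, random_state=0)

# Oversample ONLY the training minority class; the test split stays untouched
X_min = X_tr[y_tr == 1]
n_new = int(0.10 * len(y_tr)) - len(X_min)       # rough 10% minority target
X_syn = rob_rose(X_min, n_new=max(n_new, 0))     # sketch from Section 2

X_aug = np.vstack([X_tr, X_syn])
y_aug = np.concatenate([y_tr, np.ones(len(X_syn), dtype=int)])

clf = LogisticRegression(max_iter=1000).fit(X_aug, y_aug)
scores = clf.predict_proba(X_te)[:, 1]
print("AUC  :", roc_auc_score(y_te, scores))
print("AUPRC:", average_precision_score(y_te, scores))
```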
5. Comparative Analysis with Related Sampling Techniques
| Method | Outlier Robustness | Covariance / Generation Mechanism |
|---|---|---|
| ROSE | None | Local, featurewise diagonal bandwidth |
| SMOTE | None | No covariance model (linear interpolation between neighbors) |
| robROSE | MCD-based filter | Robust, global (MCD) covariance |
All three methods—ROSE, SMOTE, robROSE—improve class balance, but only robROSE explicitly resists outlier-driven artifacts and retains global covariance features, thereby yielding more stable and interpretable improvements in both AUC and AUPRC in contaminated and uncontaminated regimes (Baesens et al., 2020).
6. Practical Guidelines and Limitations
- Parameter Tuning: The bandwidth/shrink hyperparameter $h$ and the outlier-cutoff quantile can be optimized via cross-validation.
- Robustness to Anomalies: robROSE’s pre-filtering effectively safeguards against minority-class contamination.
- Final Evaluation: Only original (non-synthetic) test data should be used for model evaluation, preserving the integrity of generalization assessments.
These protocols ensure that random over-sampling mechanisms, especially in their robust forms, contribute to improved minority-class learnability without sacrificing diagnostic specificity or model interpretability.