Hybrid SMOTETomek Resampling Strategy
- Hybrid SMOTETomek resampling is a sequential method that integrates SMOTE for generating synthetic minority samples with Tomek link removal to eliminate ambiguous majority instances.
- The approach improves classifier performance by balancing class distributions and reducing noise near decision boundaries, leading to enhanced minority sensitivity.
- It is widely applied in domains such as clinical risk stratification, intrusion detection, and federated learning, consistently boosting recall and accuracy.
The hybrid SMOTETomek resampling strategy is a sequential data preprocessing method that directly targets the twin challenges of class imbalance and noise near class boundaries in supervised learning. By combining the Synthetic Minority Over-sampling Technique (SMOTE) and Tomek link under-sampling, this approach enables major improvements in classifier sensitivity to minority classes while reducing the impact of ambiguous, hard-to-classify points in the training set. Hybrid SMOTETomek resampling has become integral to high-performance pipelines for imbalanced clinical data, intrusion detection in resource-constrained networks, and multiclass medical risk stratification, and is used extensively in privacy-preserving federated learning and robust ensemble methods (Tertulino, 6 Aug 2025, Talukder et al., 2024, Ovi et al., 9 Jan 2026).
1. Theoretical Underpinnings: SMOTE and Tomek Link Removal
SMOTE generates synthetic minority-class samples using linear interpolation. For a set of minority-class samples {x_1, …, x_n}, and for each x_i with k nearest minority-class neighbors {x_i(1), …, x_i(k)}, new samples are formed as

x_new = x_i + λ · (x_i(j) − x_i),   λ ~ U(0, 1),

where x_i(j) is one of the k neighbors chosen at random. This procedure is repeated until the target minority:majority ratio is achieved. Default settings are k = 5 and a sampling strategy that fully balances minority and majority classes (Ovi et al., 9 Jan 2026, Tertulino, 6 Aug 2025).
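The interpolation step above can be sketched in a few lines of NumPy. This is an illustrative toy, not the imblearn implementation; the function name `smote_sample` and its signature are hypothetical:

```python
import numpy as np

def smote_sample(X_min, k=5, n_new=1, rng=None):
    """Generate synthetic minority samples by linear interpolation
    between a minority point and one of its k nearest minority neighbors.
    Illustrative sketch only (hypothetical helper, not imblearn's SMOTE)."""
    rng = np.random.default_rng(rng)
    n = len(X_min)
    # Pairwise Euclidean distances within the minority class.
    d = np.linalg.norm(X_min[:, None, :] - X_min[None, :, :], axis=-1)
    np.fill_diagonal(d, np.inf)          # a point is not its own neighbor
    k = min(k, n - 1)
    nn = np.argsort(d, axis=1)[:, :k]    # indices of each point's k neighbors
    synthetic = []
    for _ in range(n_new):
        i = rng.integers(n)              # random minority point
        j = nn[i, rng.integers(k)]       # one of its k neighbors, at random
        lam = rng.random()               # interpolation factor lambda in [0, 1)
        synthetic.append(X_min[i] + lam * (X_min[j] - X_min[i]))
    return np.asarray(synthetic)
```

Every synthetic point lies on the line segment between a real minority sample and one of its neighbors, which is why SMOTE never extrapolates outside the minority region.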
After oversampling, Tomek link removal identifies ambiguous samples near the class boundary. In a dataset D, a pair of samples (x_i, x_j) with y_i ≠ y_j forms a Tomek link if the two are mutual nearest neighbors with respect to Euclidean distance d, i.e., there is no third sample x_l with

d(x_i, x_l) < d(x_i, x_j)   or   d(x_j, x_l) < d(x_i, x_j).
All majority-class members participating in Tomek links are removed; the minority partners are retained (Tertulino, 6 Aug 2025, Ovi et al., 9 Jan 2026, Talukder et al., 2024). This step sharpens class separation by eliminating overlapping majority samples.
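The mutual-nearest-neighbor test and the majority-only removal rule can be sketched directly. This is a minimal NumPy illustration under the definitions above (the helper name `tomek_majority_mask` is hypothetical, not imblearn's `TomekLinks`):

```python
import numpy as np

def tomek_majority_mask(X, y, majority_label):
    """Return a boolean mask that drops majority-class points in Tomek links.
    A pair (i, j) with different labels forms a Tomek link when each point
    is the other's single nearest neighbor under Euclidean distance.
    Illustrative sketch only."""
    d = np.linalg.norm(X[:, None, :] - X[None, :, :], axis=-1)
    np.fill_diagonal(d, np.inf)
    nn = d.argmin(axis=1)                   # nearest neighbor of each point
    keep = np.ones(len(X), dtype=bool)
    for i, j in enumerate(nn):
        if nn[j] == i and y[i] != y[j]:     # mutual NNs with opposite labels
            if y[i] == majority_label:
                keep[i] = False             # remove only the majority partner
    return keep
```

Note that only the majority-class member of each link is flagged, matching the `sampling_strategy='majority'` convention used throughout this section.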
2. Algorithmic Pipeline and Implementation
The standard hybrid pipeline applies SMOTE followed by Tomek link removal. Representative pseudocode mirrors the protocol adopted in leading studies:
```python
from imblearn.over_sampling import SMOTE
from imblearn.under_sampling import TomekLinks

def HybridSMOTETomek(X_train, y_train):
    # SMOTE oversampling (k=5, balance all minority classes to target)
    X_sm, y_sm = SMOTE(k_neighbors=5, sampling_strategy='auto').fit_resample(X_train, y_train)
    # Tomek link undersampling (remove only majority-class points)
    X_res, y_res = TomekLinks(sampling_strategy='majority').fit_resample(X_sm, y_sm)
    return X_res, y_res
```
- k = 5 (SMOTE neighbors)
- Oversampling until all minority classes reach the target count (usually matching the majority class)
- Tomek removal limited to majority-class participants (to avoid minority-class shrinkage)
Computational complexity is O(n log n) for both the k-nearest-neighbor search (via KD-tree) and the mutual nearest neighbor search for Tomek links (Tertulino, 6 Aug 2025).
3. Integration into Machine Learning and Federated Workflows
In federated learning frameworks, hybrid SMOTETomek is applied at the client level as a preprocessing step before gradient computation and privacy mechanisms, such as DP-SGD (per-sample gradient clipping and noise addition), are invoked. This ordering ensures augmentation affects the underlying dataset but not the privacy accounting (Tertulino, 6 Aug 2025).
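The ordering described above, resample first, then clip and noise per-sample gradients, can be sketched as a single client step. All names here (`client_update`, `model_grad_fn`, `resample_fn`) are hypothetical, and the DP-SGD step is a bare-bones illustration, not a production privacy mechanism:

```python
import numpy as np

def client_update(X, y, model_grad_fn, resample_fn,
                  clip_norm=1.0, noise_mult=1.0, rng=None):
    """Sketch of one DP federated client step (hypothetical names).
    Resampling runs FIRST, so privacy accounting only ever sees
    clipped, noised gradients of the augmented dataset."""
    rng = np.random.default_rng(rng)
    # 1. Local SMOTETomek-style resampling (any callable of this shape).
    X_res, y_res = resample_fn(X, y)
    # 2. Per-sample gradients, clipped to bound each sample's influence.
    grads = np.stack([model_grad_fn(x, t) for x, t in zip(X_res, y_res)])
    norms = np.linalg.norm(grads, axis=1, keepdims=True)
    grads = grads / np.maximum(1.0, norms / clip_norm)
    # 3. Sum, add Gaussian noise scaled to the clip norm, then average.
    noisy = grads.sum(axis=0) + rng.normal(0.0, noise_mult * clip_norm, grads.shape[1])
    return noisy / len(X_res)
```

The key design point is step order: because augmentation precedes gradient computation, the synthetic samples change the data distribution but do not require any change to the per-sample clipping and noise calibration.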
In ensemble and dual-pipeline architectures, SMOTETomek is inserted after initial normalization and feature filtering but before dimensionality reduction or model fitting. Placement is model-dependent: for tree/nearest-neighbor learners, SMOTETomek is most effective after feature selection; for linear models, it may require tailored insertion to avoid diminishing returns (Ovi et al., 9 Jan 2026).
4. Empirical Impact on Class Distribution and Downstream Performance
The hybrid SMOTETomek strategy consistently equalizes class priors and sharpens the class boundary. Empirical case studies illustrate its impact:
| Domain | Majority:Minority Distribution (Before) | After SMOTETomek | Test Accuracy | Minority Recall |
|---|---|---|---|---|
| Cardiovascular FL (Tertulino, 6 Aug 2025) | Highly imbalanced | Nearly balanced | 72.8–77.0% | 74–77% |
| WSN Intrusion (Talukder et al., 2024) | 340,066:34,595 (binary) | 340,056:339,610 | 99.78–99.92% | 99.78% (bin.) |
| Sleep Disorder (Ovi et al., 9 Jan 2026) | 175:62:62 (majority:minorities) | ~173:171:175 | 98.67% | 97.78–97.92% |
In federated clinical prediction, naïve learning on imbalanced data yielded recall of 0.0%, whereas hybrid SMOTETomek alone increased recall to 74% (with some loss in overall accuracy), and further algorithmic enhancement (FedProx + DP) delivered recall of 77% under reasonable privacy guarantees (Tertulino, 6 Aug 2025).
Intrusion detection for WSNs demonstrated AUC improvements to nearly 1.0, major reductions in false negatives/positives, and elimination of underfitting/overfitting artifacts (Talukder et al., 2024).
In multiclass biomedical prediction, accuracy increased by 2.7 to >6 percentage points, with minority recall (sensitivity) improving sharply (e.g., KNN: from 91.2% to 97.92%) (Ovi et al., 9 Jan 2026).
5. Mitigating Overfitting, Underfitting, and Class Overlap
SMOTE’s synthetic sample generation mitigates under-representation of the minority class while reducing overfitting compared to simple duplication. Tomek link removal eliminates ambiguous, overlapped majority samples, sharpening the decision boundary and reducing noise (Talukder et al., 2024, Ovi et al., 9 Jan 2026).
Best practices include:
- Not removing minority-class members during Tomek processing (set sampling_strategy='majority').
- Applying the hybrid procedure after feature filtering but before projection/dimensionality reduction in high-dimensional or deep pipelines, to minimize the risk of synthetic points falling outside the relevant subspace (Ovi et al., 9 Jan 2026).
- Retuning k and rebalancing targets per dataset.
- Strict cross-validation with resampling confined to the training set only, especially on small sample sizes.
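The last point, resampling confined to the training folds, is a common source of leakage. A minimal sketch of a leak-free loop, with hypothetical callables `resample_fn`, `fit_fn`, and `score_fn` standing in for any resampler, learner, and metric:

```python
import numpy as np

def cv_with_resampling(X, y, n_splits, resample_fn, fit_fn, score_fn, rng=None):
    """Cross-validation where resampling is applied to the TRAINING fold
    only; validation folds keep the original class distribution.
    Illustrative sketch with hypothetical callables."""
    rng = np.random.default_rng(rng)
    idx = rng.permutation(len(X))
    folds = np.array_split(idx, n_splits)
    scores = []
    for k in range(n_splits):
        val = folds[k]
        trn = np.concatenate([folds[j] for j in range(n_splits) if j != k])
        # Resample the training fold only -- never the held-out fold.
        X_res, y_res = resample_fn(X[trn], y[trn])
        model = fit_fn(X_res, y_res)
        scores.append(score_fn(model, X[val], y[val]))
    return float(np.mean(scores))
```

Resampling before the split would let synthetic points interpolated from validation samples leak into training, inflating the reported minority recall.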
A plausible implication is that optimal performance gains are model- and pipeline-dependent, and misapplication (e.g., “over-cleaning” the boundary or inappropriate resampling placement) may degrade performance, particularly for linear classifiers.
6. Limitations and Recommendations
Authors caution that model gains with SMOTETomek are largest for classifiers with local decision boundaries (e.g., KNN, tree-based models), but can be neutral or negative for pure linear models without complementary feature filtering. Small sample regimes are particularly susceptible to overfitting from excessive oversampling; robust cross-validation is necessary (Ovi et al., 9 Jan 2026).
Placement within the pipeline should be informed by empirical validation, and Tomek link removal should be restricted to majority-class points unless addressing extreme imbalances. Parameter choices, especially k and the oversampling strategy, should be dataset-specific.
In summary, the hybrid SMOTETomek resampling strategy is empirically validated across domains as an essential step in constructing robust, sensitive classifiers for imbalanced data, sharply reducing class bias while controlling for noise at class boundaries (Tertulino, 6 Aug 2025, Talukder et al., 2024, Ovi et al., 9 Jan 2026).