RF-LGBM: Ensemble for Repurchase Prediction
- RF-LGBM is an ensemble framework that integrates Random Forest and LightGBM via soft-voting to effectively predict customer repurchase behavior in imbalanced datasets.
- It leverages an enhanced five-feature RFM model and a SMOTE-ENN pipeline to improve data balance and model robustness.
- Hyperparameter optimization using TPE accelerates training by over 450% while achieving superior F1 scores compared to individual models.
Random Forest with LightGBM (RF-LGBM) is an ensemble learning framework specifically developed to predict customer repurchase behavior in community e-commerce platforms, integrating Random Forest (RF) and LightGBM models via a soft-voting strategy. It includes novel advances in feature engineering, sample balancing, and hyperparameter optimization, yielding improved classification performance on highly imbalanced behavioral data (Yang et al., 2021).
1. Composition and Architecture of RF-LGBM
RF-LGBM is composed of two base learners:
- Random Forest (RF): Trained on binary labels $y_i \in \{0, 1\}$, RF generates bootstrapped samples from the training set and constructs a binary decision tree for each. At each split, RF selects a random subset of features and minimizes an impurity criterion (e.g., Gini). The class probability estimate is
$$\hat{p}_{\mathrm{RF}}(x) = \frac{1}{T} \sum_{t=1}^{T} \mathbb{1}\left[h_t(x) = 1\right],$$
where $h_t$ is the prediction of the $t$-th tree and $T$ is the number of trees.
- LightGBM: A gradient-boosted ensemble that fits trees sequentially by optimizing a regularized loss. The raw score is $F(x) = \sum_{m=1}^{M} f_m(x)$, with probability output
$$\hat{p}_{\mathrm{LGBM}}(x) = \sigma\left(F(x)\right) = \frac{1}{1 + e^{-F(x)}}.$$
Training uses a second-order Taylor expansion of the loss, as well as GOSS (Gradient-based One-Side Sampling) and EFB (Exclusive Feature Bundling) for computational efficiency.
The two outputs are combined by a weighted average, yielding an ensemble probability
$$\hat{p}(x) = w_1\,\hat{p}_{\mathrm{RF}}(x) + w_2\,\hat{p}_{\mathrm{LGBM}}(x),$$
with $w_1 = w_2 = 0.5$ (equal weights) in the reported experiments.
Final classification is by thresholding at 0.5: label 1 if $\hat{p}(x) \ge 0.5$, else 0 (Yang et al., 2021).
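A minimal sketch of this soft-voting step, using scikit-learn's VotingClassifier as a stand-in for the paper's combination logic; the dataset, estimator settings, and random seeds below are illustrative assumptions, not the configuration reported by Yang et al. (2021).

```python
from lightgbm import LGBMClassifier
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier, VotingClassifier
from sklearn.model_selection import train_test_split

# Synthetic stand-in for the behavioral dataset: ~9% positive (repurchase) class.
X, y = make_classification(n_samples=2000, n_features=5, n_informative=4,
                           n_redundant=1, weights=[0.91], random_state=0)
X_train, X_test, y_train, y_test = train_test_split(X, y, stratify=y, random_state=0)

rf = RandomForestClassifier(n_estimators=200, criterion="gini", random_state=0)
lgbm = LGBMClassifier(n_estimators=200, learning_rate=0.1, random_state=0)

# Soft voting averages the per-class probabilities with equal weights w1 = w2 = 0.5.
ensemble = VotingClassifier(estimators=[("rf", rf), ("lgbm", lgbm)],
                            voting="soft", weights=[0.5, 0.5])
ensemble.fit(X_train, y_train)

p_hat = ensemble.predict_proba(X_test)[:, 1]   # ensemble probability for class 1
y_pred = (p_hat >= 0.5).astype(int)            # threshold at 0.5
```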
2. Enhanced Feature Engineering: Improved RFM Model
RF-LGBM leverages an improved RFM (Recency, Frequency, Monetary) model expanded to five indicators for fine-grained description of purchase behavior:
- $R$ (Recency): $R = T_{\text{now}} - T_{\text{last}}$, time since the last purchase.
- $F$ (Frequency): count of purchases within the reference window.
- $M$ (Monetary): $M = \sum_j m_j$, total purchase value.
- $S$ (Span): $S = T_{\text{last}} - T_{\text{first}}$, duration of the customer-product relationship.
- $\bar{I}$ (Average Inter-purchase Interval): $\bar{I} = S / (F - 1)$ for $F > 1$, the mean gap between consecutive purchases.
The conceptual rationale is that lower $R$ and $\bar{I}$ denote more recent or habitual buying, while higher $F$, $M$, and $S$ indicate loyal, high-value customers.
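As an illustration, the five indicators can be derived from a raw transaction log roughly as follows; the column names, toy records, and reference date are assumptions for this sketch, not the paper's schema.

```python
import pandas as pd

# Hypothetical transaction log: one row per purchase event.
tx = pd.DataFrame({
    "customer_id": [1, 1, 1, 2, 2],
    "order_date": pd.to_datetime(
        ["2021-01-05", "2021-02-10", "2021-03-01", "2021-01-20", "2021-03-15"]),
    "amount": [30.0, 45.0, 25.0, 80.0, 60.0],
})
now = tx["order_date"].max() + pd.Timedelta(days=1)   # assumed reference time

dates = tx.groupby("customer_id")["order_date"]
features = pd.DataFrame({
    "R": (now - dates.max()).dt.days,                  # recency: days since last purchase
    "F": dates.count(),                                # frequency: number of purchases
    "M": tx.groupby("customer_id")["amount"].sum(),    # monetary: total spend
    "S": (dates.max() - dates.min()).dt.days,          # span: first-to-last purchase duration
})
# average inter-purchase interval; clip avoids division by zero for single-purchase customers
features["I"] = features["S"] / (features["F"] - 1).clip(lower=1)
print(features)
```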
3. Class Imbalance Handling: SMOTE-ENN Pipeline
The native prevalence of positive (repurchase) instances is approximately 9%. To address this, RF-LGBM applies a two-stage balancing technique:
- SMOTE (Synthetic Minority Over-sampling Technique): For each minority sample $x_i$, its $k$ nearest minority neighbors are selected, and synthetic samples are generated by
$$x_{\text{new}} = x_i + \lambda \left(x_{\text{nn}} - x_i\right), \qquad \lambda \sim U(0, 1),$$
where $x_{\text{nn}}$ is a randomly chosen neighbor.
- ENN (Edited Nearest Neighbor): Cleansing stage that applies a 5-NN classifier to the augmented data; samples with disagreement between observed and predicted class are removed.
Following SMOTE-ENN, the class distribution is approximately balanced, with roughly equal numbers of positives and negatives (1:1).
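A sketch of this two-stage balancing with imbalanced-learn, assuming the 5-neighbor settings described above; the synthetic dataset size and random seeds are illustrative.

```python
from collections import Counter
from imblearn.combine import SMOTEENN
from imblearn.over_sampling import SMOTE
from imblearn.under_sampling import EditedNearestNeighbours
from sklearn.datasets import make_classification

# Synthetic stand-in with ~9% positives, mirroring the reported prevalence.
X, y = make_classification(n_samples=5000, weights=[0.91], random_state=0)

# SMOTE over-samples the minority class, then ENN removes samples whose
# 5-NN prediction disagrees with their observed label.
resampler = SMOTEENN(
    smote=SMOTE(k_neighbors=5, random_state=0),
    enn=EditedNearestNeighbours(n_neighbors=5),
)
X_res, y_res = resampler.fit_resample(X, y)
print("before:", Counter(y), "after:", Counter(y_res))
```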
4. Hyperparameter Optimization via TPE
Hyperparameter search is automated via the Tree-structured Parzen Estimator (TPE). For each learner, the search space includes:
- RF: number of trees $\in$ UniformInteger[50, 500], maximum tree depth $\in$ UniformInteger[5, 30], split criterion $\in$ {gini, entropy}, features considered per split $\in$ UniformInteger[1, #features].
- Objective: maximize the F1 score on a held-out validation set.
TPE models the hyperparameter density with two distributions, $\ell(x)$ for good trials ($y < y^*$) and $g(x)$ for bad trials ($y \ge y^*$), and selects the next candidate by maximizing an expected-improvement criterion proportional to
$$\frac{\ell(x)}{g(x)},$$
where $y^*$ is a quantile of the observed objective values.
Empirical comparison shows TPE tuning completes in 31.25 s, a more than 450% speed-up over random search and grid search, while also achieving a higher F1 score.
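The search itself can be sketched with hyperopt's TPE implementation; the space below covers only the RF ranges listed above, and the dataset, split, and evaluation budget are illustrative assumptions rather than the authors' exact setup.

```python
from hyperopt import Trials, fmin, hp, tpe
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.metrics import f1_score
from sklearn.model_selection import train_test_split

X, y = make_classification(n_samples=3000, weights=[0.91], random_state=0)
X_tr, X_val, y_tr, y_val = train_test_split(X, y, stratify=y, random_state=0)

# Search space mirroring the RF ranges quoted above.
space = {
    "n_estimators": hp.uniformint("n_estimators", 50, 500),
    "max_depth": hp.uniformint("max_depth", 5, 30),
    "criterion": hp.choice("criterion", ["gini", "entropy"]),
}

def objective(params):
    model = RandomForestClassifier(
        n_estimators=int(params["n_estimators"]),
        max_depth=int(params["max_depth"]),
        criterion=params["criterion"],
        random_state=0,
    )
    model.fit(X_tr, y_tr)
    # hyperopt minimizes, so return the negative F1 score on the validation split
    return -f1_score(y_val, model.predict(X_val))

best = fmin(fn=objective, space=space, algo=tpe.suggest, max_evals=30, trials=Trials())
print("best hyperparameters:", best)
```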
5. Predictive Performance and Experimental Comparison
The following summarizes performance (mean test set metrics, 10 random splits):
| Model | Accuracy | Recall | F1 | Source |
|---|---|---|---|---|
| RF-LGBM | 0.871 | 0.952 | 0.859 | (Yang et al., 2021) |
| RF | 0.862 | 0.911 | 0.853 | (Yang et al., 2021) |
| LightGBM | 0.858 | 0.951 | 0.842 | (Yang et al., 2021) |
| XGBoost | -- | -- | ~0.83 | (Yang et al., 2021) |
| CNN-LSTM | 0.856 | 0.839 | 0.847 | H. Xiaoli et al. |
| LSTM | 0.802 | 0.842 | 0.822 | H. Xiaoli et al. |
| Teacher-Student | 0.9198 | -- | -- | Shen et al. |
RF-LGBM outperforms both its constituent single models and the published CNN-LSTM and LSTM baselines in terms of the F1 metric. This suggests that probabilistic soft-voting between random forests and gradient-boosted trees, when rigorously optimized and balanced, produces robust results on imbalanced, structured behavioral data.
6. Methodological Significance and Application Context
By combining: (1) an improved five-feature RFM profile, (2) SMOTE-ENN balancing, (3) TPE-driven hyperparameter tuning, and (4) a soft-voting ensemble of RF and LightGBM, RF-LGBM delivers a repurchase predictor with both an improved F1 score and a more than fourfold reduction in training time compared to traditional search strategies. The methodology is particularly suited to community e-commerce, characterized by volatile customer loyalty and severe class imbalance (Yang et al., 2021). A plausible implication is that this ensemble strategy can generalize to other domains exhibiting similar heterogeneity in behavioral data and label imbalance.