
RF-LGBM: Ensemble for Repurchase Prediction

Updated 1 January 2026
  • RF-LGBM is an ensemble framework that integrates Random Forest and LightGBM via soft-voting to effectively predict customer repurchase behavior in imbalanced datasets.
  • It leverages an enhanced five-feature RFM model and a SMOTE-ENN pipeline to improve data balance and model robustness.
  • Hyperparameter optimization using TPE accelerates training by over 450% while achieving superior F1 scores compared to individual models.

Random Forest with LightGBM (RF-LGBM) is an ensemble learning framework specifically developed to predict customer repurchase behavior in community e-commerce platforms, integrating Random Forest (RF) and LightGBM models via a soft-voting strategy. It includes novel advances in feature engineering, sample balancing, and hyperparameter optimization, yielding improved classification performance on highly imbalanced behavioral data (Yang et al., 2021).

1. Composition and Architecture of RF-LGBM

RF-LGBM is composed of two base learners:

  • Random Forest (RF): Trained on binary labels $\{y_i\}_{i=1}^n$, RF generates $M$ bootstrapped samples from the training set and constructs a binary decision tree $h_m(x)$ for each. At each split, RF selects a random subset of features and minimizes an impurity criterion (e.g., Gini). The class probability estimate is

P_{\text{RF}}(y=1\mid x) = \frac{1}{M} \sum_{m=1}^{M} h_m(x)

where $h_m(x) \in \{0,1\}$.

  • LightGBM: A gradient-boosted ensemble that fits $T$ trees sequentially by optimizing a regularized loss. The raw score is $f(x) = \sum_{t=1}^{T} f_t(x)$, with probability output

P_{\text{LGBM}}(y=1\mid x) = \sigma(f(x)) = \frac{1}{1 + e^{-f(x)}}

Training uses a second-order Taylor expansion of the loss $L$, together with GOSS (Gradient-based One-Side Sampling) and EFB (Exclusive Feature Bundling) for computational efficiency.

The two outputs are combined by a weighted average, yielding an ensemble probability:

P_{\text{ensemble}}(y=1\mid x) = \omega\, P_{\text{RF}}(y=1\mid x) + (1-\omega)\, P_{\text{LGBM}}(y=1\mid x)

with $\omega = 0.5$ (equal weights) in the reported experiments.

Final classification is by thresholding at $\tau = 0.5$: label 1 if $P_{\text{ensemble}} \geq 0.5$, else 0 (Yang et al., 2021).
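
For concreteness, the following is a minimal sketch of this soft-voting combination using scikit-learn and LightGBM's Python API; the hyperparameters shown are illustrative placeholders, not the settings reported by Yang et al. (2021).

```python
# Sketch of the RF + LightGBM soft-voting ensemble described above.
# Hyperparameters are placeholders, not the paper's tuned values.
from sklearn.ensemble import RandomForestClassifier
from lightgbm import LGBMClassifier

def fit_rf_lgbm(X_train, y_train):
    rf = RandomForestClassifier(n_estimators=200, criterion="gini", random_state=0)
    lgbm = LGBMClassifier(n_estimators=200, learning_rate=0.1, random_state=0)
    rf.fit(X_train, y_train)
    lgbm.fit(X_train, y_train)
    return rf, lgbm

def predict_rf_lgbm(rf, lgbm, X, weight=0.5, threshold=0.5):
    # Soft voting: weighted average of the two positive-class probabilities,
    # then thresholding at tau = 0.5 to produce the final label.
    p_rf = rf.predict_proba(X)[:, 1]
    p_lgbm = lgbm.predict_proba(X)[:, 1]
    p_ensemble = weight * p_rf + (1.0 - weight) * p_lgbm
    return (p_ensemble >= threshold).astype(int), p_ensemble
```

Equivalently, scikit-learn's VotingClassifier with voting="soft" and equal weights yields the same averaged probability.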

2. Enhanced Feature Engineering: Improved RFM Model

RF-LGBM leverages an improved RFM (Recency, Frequency, Monetary) model expanded to five indicators for fine-grained description of purchase behavior:

  • $R$ (Recency): $T_{\text{last\_time}} - T_{\text{p\_last\_time}}$, time since last purchase.
  • $F$ (Frequency): Count of purchases within the reference window.
  • $M$ (Monetary): $\sum_{i=1}^{F} M_i$, total purchase value.
  • $S$ (Span): $T_{\text{p\_last\_time}} - T_{\text{p\_first\_time}}$, duration of the customer-product relationship.
  • $T$ (Average inter-purchase interval): $S / F$.

The conceptual rationale is that lower $R$ and $T$ denote more recent or habitual buying, while higher $F$, $M$, and $S$ indicate loyal, high-value customers.
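
A plausible implementation of these five indicators from a raw transaction log is sketched below with pandas; the column names (user_id, item_id, purchase_time, amount) and the reference timestamp are illustrative assumptions, not the paper's schema.

```python
# Illustrative computation of the five RFM-style features per customer-product pair.
# Column names and the reference timestamp are assumptions, not from the paper.
import pandas as pd

def rfm_features(transactions: pd.DataFrame, reference_time: pd.Timestamp) -> pd.DataFrame:
    grouped = transactions.groupby(["user_id", "item_id"])
    feats = grouped.agg(
        last_purchase=("purchase_time", "max"),
        first_purchase=("purchase_time", "min"),
        F=("purchase_time", "count"),   # Frequency: purchases in the window
        M=("amount", "sum"),            # Monetary: total purchase value
    )
    feats["R"] = (reference_time - feats["last_purchase"]).dt.days          # Recency
    feats["S"] = (feats["last_purchase"] - feats["first_purchase"]).dt.days  # Span
    feats["T"] = feats["S"] / feats["F"]  # Average inter-purchase interval
    return feats[["R", "F", "M", "S", "T"]].reset_index()
```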

3. Class Imbalance Handling: SMOTE-ENN Pipeline

The native prevalence of positive (repurchase) instances is approximately 9%. To address this, RF-LGBM applies a two-stage balancing technique:

  1. SMOTE (Synthetic Minority Over-sampling Technique): For each minority sample $X_i$, its $k=5$ nearest neighbors $\{X_{ij}\}$ are selected, and synthetic samples $Y_j$ are generated by

Y_j = X_i + \mathrm{rand}(0,1) \cdot (X_{ij} - X_i)

  2. ENN (Edited Nearest Neighbor): A cleansing stage that applies a 5-NN classifier to the augmented data; samples with disagreement between observed and predicted class are removed.

Following SMOTE-ENN, the class distribution is approximately balanced: $\sim$52k positives to $\sim$52k negatives (1:1).
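
This two-stage procedure corresponds to the combined SMOTE-ENN sampler in imbalanced-learn; a minimal sketch, assuming the k=5 neighborhood sizes described above and library defaults otherwise:

```python
# Sketch of the two-stage SMOTE-ENN balancing using imbalanced-learn.
# Neighborhood sizes follow the description above; other settings are default assumptions.
from imblearn.combine import SMOTEENN
from imblearn.over_sampling import SMOTE
from imblearn.under_sampling import EditedNearestNeighbours

def balance(X, y, seed=0):
    sampler = SMOTEENN(
        smote=SMOTE(k_neighbors=5, random_state=seed),   # oversample the minority class
        enn=EditedNearestNeighbours(n_neighbors=5),      # remove samples misclassified by 5-NN
        random_state=seed,
    )
    return sampler.fit_resample(X, y)
```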

4. Hyperparameter Optimization via TPE

Hyperparameter search is automated via the Tree-structured Parzen Estimator (TPE). For each learner, the search space includes:

  • RF: n_estimators ~ UniformInteger[50, 500], max_depth ~ UniformInteger[5, 30], criterion ∈ {gini, entropy}, max_features ~ UniformInteger[1, #features].
  • Objective: maximize F1 on a held-out validation set.

TPE models $p(x\mid y)$ with two densities, $l(x)$ (good trials, $y < y^*$) and $g(x)$ (bad trials, $y \geq y^*$), and maximizes the expected improvement:

EI_{y^*}(x) \propto \left(\gamma + \frac{g(x)}{l(x)}(1-\gamma)\right)^{-1}

where $\gamma = p(y < y^*)$.

Empirical comparison shows TPE tuning completes in 31.25 s versus $\sim$150 s (random search) and $\sim$200 s (grid search), a speed-up of more than 450% together with a higher F1 score.
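
A sketch of this tuning loop for the RF base learner using hyperopt's TPE implementation is given below; the trial budget, data splits, and the quniform/choice encodings of the search space are assumptions for illustration.

```python
# Sketch of TPE-driven tuning for the RF base learner with hyperopt,
# mirroring the search space listed above. Trial budget and splits are assumptions.
from hyperopt import fmin, tpe, hp, Trials, space_eval
from sklearn.ensemble import RandomForestClassifier
from sklearn.metrics import f1_score

def tune_rf(X_train, y_train, X_val, y_val, n_features, max_evals=50):
    space = {
        "n_estimators": hp.quniform("n_estimators", 50, 500, 1),
        "max_depth": hp.quniform("max_depth", 5, 30, 1),
        "criterion": hp.choice("criterion", ["gini", "entropy"]),
        "max_features": hp.quniform("max_features", 1, n_features, 1),
    }

    def objective(params):
        model = RandomForestClassifier(
            n_estimators=int(params["n_estimators"]),
            max_depth=int(params["max_depth"]),
            criterion=params["criterion"],
            max_features=int(params["max_features"]),
            random_state=0,
        ).fit(X_train, y_train)
        # fmin minimizes, so return 1 - F1 to maximize F1 on the validation set.
        return 1.0 - f1_score(y_val, model.predict(X_val))

    trials = Trials()
    best = fmin(objective, space, algo=tpe.suggest, max_evals=max_evals, trials=trials)
    return space_eval(space, best)  # map choice indices back to parameter values
```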

5. Predictive Performance and Experimental Comparison

The following summarizes performance (mean test set metrics, 10 random splits):

| Model | Accuracy | Recall | F1 | Source |
|---|---|---|---|---|
| RF-LGBM | 0.871 | 0.952 | 0.859 | (Yang et al., 2021) |
| RF | 0.862 | 0.911 | 0.853 | (Yang et al., 2021) |
| LightGBM | 0.858 | 0.951 | 0.842 | (Yang et al., 2021) |
| XGBoost | -- | -- | ~0.83 | (Yang et al., 2021) |
| CNN-LSTM | 0.856 | 0.839 | 0.847 | H. Xiaoli et al. |
| LSTM | 0.802 | 0.842 | 0.822 | H. Xiaoli et al. |
| Teacher-Student | 0.9198 | -- | -- | Shen et al. |

RF-LGBM outperforms both its constituent single models and the published CNN-LSTM and LSTM baselines on the F1 metric. This suggests that probabilistic soft-voting between random forests and gradient-boosted trees, when rigorously optimized and balanced, produces robust results on imbalanced structured behavioral data.
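
The evaluation protocol behind the table can be sketched as repeated train/test splits with averaged metrics; the 80/20 split ratio, stratification, and seeds below are assumptions rather than reported settings.

```python
# Sketch of evaluating a classifier over 10 random train/test splits, reporting
# mean accuracy, recall, and F1 as in the table above. Split ratio is an assumption.
import numpy as np
from sklearn.model_selection import train_test_split
from sklearn.metrics import accuracy_score, recall_score, f1_score

def evaluate_over_splits(model_factory, X, y, n_splits=10, test_size=0.2):
    accs, recs, f1s = [], [], []
    for seed in range(n_splits):
        X_tr, X_te, y_tr, y_te = train_test_split(
            X, y, test_size=test_size, stratify=y, random_state=seed
        )
        model = model_factory().fit(X_tr, y_tr)
        y_pred = model.predict(X_te)
        accs.append(accuracy_score(y_te, y_pred))
        recs.append(recall_score(y_te, y_pred))
        f1s.append(f1_score(y_te, y_pred))
    return np.mean(accs), np.mean(recs), np.mean(f1s)
```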

6. Methodological Significance and Application Context

By combining (1) an improved five-feature RFM profile, (2) SMOTE-ENN balancing, (3) TPE-driven hyperparameter tuning, and (4) a soft-voting ensemble of RF and LightGBM, RF-LGBM delivers a repurchase predictor with both improved F1 and an over 4× reduction in tuning time compared to traditional search strategies. The methodology is particularly suited to community e-commerce, characterized by volatile customer loyalty and severe class imbalance (Yang et al., 2021). A plausible implication is that this ensemble strategy can generalize to other domains exhibiting similar heterogeneity in behavioral data and label imbalance.
