RF-LGBM: Ensemble for Repurchase Prediction
- RF-LGBM is an ensemble framework that integrates Random Forest and LightGBM via soft-voting to effectively predict customer repurchase behavior in imbalanced datasets.
- It leverages an enhanced five-feature RFM model and a SMOTE-ENN pipeline to improve data balance and model robustness.
- Hyperparameter optimization using TPE accelerates training by over 450% while achieving superior F1 scores compared to individual models.
Random Forest with LightGBM (RF-LGBM) is an ensemble learning framework specifically developed to predict customer repurchase behavior in community e-commerce platforms, integrating Random Forest (RF) and LightGBM models via a soft-voting strategy. It includes novel advances in feature engineering, sample balancing, and hyperparameter optimization, yielding improved classification performance on highly imbalanced behavioral data (Yang et al., 2021).
1. Composition and Architecture of RF-LGBM
RF-LGBM is composed of two base learners:
- Random Forest (RF): Trained on binary labels $y_i \in \{0, 1\}$, RF generates bootstrapped samples from the training set and constructs a binary decision tree for each. At each split, RF selects a random subset of features and minimizes an impurity criterion (e.g., Gini). The class probability estimate is
$$\hat{p}_{\mathrm{RF}}(x) = \frac{1}{T} \sum_{t=1}^{T} \mathbb{1}\left[h_t(x) = 1\right],$$
where $h_t$ is the prediction of the $t$-th tree and $T$ is the number of trees.
- LightGBM: A gradient-boosted ensemble that fits trees sequentially by optimizing a regularized loss. The raw score is $F(x) = \sum_{m=1}^{M} f_m(x)$, with probability output
$$\hat{p}_{\mathrm{LGBM}}(x) = \sigma\left(F(x)\right) = \frac{1}{1 + e^{-F(x)}}.$$
Training uses a second-order Taylor expansion of the loss, as well as GOSS (Gradient-based One-Side Sampling) and EFB (Exclusive Feature Bundling) for computational efficiency.
The two outputs are combined by a weighted average, yielding an ensemble probability
$$\hat{p}(x) = w_1\,\hat{p}_{\mathrm{RF}}(x) + w_2\,\hat{p}_{\mathrm{LGBM}}(x),$$
with $w_1 = w_2 = 0.5$ (equal weights) in the reported experiments.
Final classification is by thresholding at 0.5: label 1 if $\hat{p}(x) \ge 0.5$, else 0 (Yang et al., 2021).
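A minimal sketch of this soft-voting step, using scikit-learn's VotingClassifier as a stand-in for the paper's combination logic; the dataset, estimator settings, and random seeds below are illustrative assumptions, not the configuration reported by Yang et al. (2021).

```python
from lightgbm import LGBMClassifier
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier, VotingClassifier
from sklearn.model_selection import train_test_split

# Synthetic stand-in for the behavioral dataset: ~9% positive (repurchase) class.
X, y = make_classification(n_samples=2000, n_features=5, n_informative=4,
                           n_redundant=1, weights=[0.91], random_state=0)
X_train, X_test, y_train, y_test = train_test_split(X, y, stratify=y, random_state=0)

rf = RandomForestClassifier(n_estimators=200, criterion="gini", random_state=0)
lgbm = LGBMClassifier(n_estimators=200, learning_rate=0.1, random_state=0)

# Soft voting averages the per-class probabilities with equal weights w1 = w2 = 0.5.
ensemble = VotingClassifier(estimators=[("rf", rf), ("lgbm", lgbm)],
                            voting="soft", weights=[0.5, 0.5])
ensemble.fit(X_train, y_train)

p_hat = ensemble.predict_proba(X_test)[:, 1]   # ensemble probability for class 1
y_pred = (p_hat >= 0.5).astype(int)            # threshold at 0.5
```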
2. Enhanced Feature Engineering: Improved RFM Model
RF-LGBM leverages an improved RFM (Recency, Frequency, Monetary) model expanded to five indicators for fine-grained description of purchase behavior:
- $R$ (Recency): $R = T_{\text{now}} - T_{\text{last}}$, time since the last purchase.
- $F$ (Frequency): count of purchases within the reference window.
- $M$ (Monetary): $M = \sum_j m_j$, total purchase value.
- $S$ (Span): $S = T_{\text{last}} - T_{\text{first}}$, duration of the customer-product relationship.
- $\bar{I}$ (Average Inter-purchase Interval): $\bar{I} = S / (F - 1)$ for $F > 1$, the mean gap between consecutive purchases.
The conceptual rationale is that lower $R$ and $\bar{I}$ denote more recent or habitual buying, while higher $F$, $M$, and $S$ indicate loyal, high-value customers.
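As an illustration, the five indicators can be derived from a raw transaction log roughly as follows; the column names, toy records, and reference date are assumptions for this sketch, not the paper's schema.

```python
import pandas as pd

# Hypothetical transaction log: one row per purchase event.
tx = pd.DataFrame({
    "customer_id": [1, 1, 1, 2, 2],
    "order_date": pd.to_datetime(
        ["2021-01-05", "2021-02-10", "2021-03-01", "2021-01-20", "2021-03-15"]),
    "amount": [30.0, 45.0, 25.0, 80.0, 60.0],
})
now = tx["order_date"].max() + pd.Timedelta(days=1)   # assumed reference time

dates = tx.groupby("customer_id")["order_date"]
features = pd.DataFrame({
    "R": (now - dates.max()).dt.days,                  # recency: days since last purchase
    "F": dates.count(),                                # frequency: number of purchases
    "M": tx.groupby("customer_id")["amount"].sum(),    # monetary: total spend
    "S": (dates.max() - dates.min()).dt.days,          # span: first-to-last purchase duration
})
# average inter-purchase interval; clip avoids division by zero for single-purchase customers
features["I"] = features["S"] / (features["F"] - 1).clip(lower=1)
print(features)
```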
3. Class Imbalance Handling: SMOTE-ENN Pipeline
The native prevalence of positive (repurchase) instances is approximately 9%. To address this, RF-LGBM applies a two-stage balancing technique:
- SMOTE (Synthetic Minority Over-sampling Technique): For each minority sample $x_i$, its $k$ nearest minority neighbors are selected, and synthetic samples are generated by
$$x_{\text{new}} = x_i + \lambda \left(x_{\text{nn}} - x_i\right), \qquad \lambda \sim U(0, 1),$$
where $x_{\text{nn}}$ is a randomly chosen neighbor.
- ENN (Edited Nearest Neighbor): Cleansing stage that applies a 5-NN classifier to the augmented data; samples with disagreement between observed and predicted class are removed.
Following SMOTE-ENN, the class distribution is approximately balanced, with roughly equal numbers of positives and negatives (1:1).
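A sketch of this two-stage balancing with imbalanced-learn, assuming the 5-neighbor settings described above; the synthetic dataset size and random seeds are illustrative.

```python
from collections import Counter
from imblearn.combine import SMOTEENN
from imblearn.over_sampling import SMOTE
from imblearn.under_sampling import EditedNearestNeighbours
from sklearn.datasets import make_classification

# Synthetic stand-in with ~9% positives, mirroring the reported prevalence.
X, y = make_classification(n_samples=5000, weights=[0.91], random_state=0)

# SMOTE over-samples the minority class, then ENN removes samples whose
# 5-NN prediction disagrees with their observed label.
resampler = SMOTEENN(
    smote=SMOTE(k_neighbors=5, random_state=0),
    enn=EditedNearestNeighbours(n_neighbors=5),
)
X_res, y_res = resampler.fit_resample(X, y)
print("before:", Counter(y), "after:", Counter(y_res))
```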
4. Hyperparameter Optimization via TPE
Hyperparameter search is automated via the Tree-structured Parzen Estimator (TPE). For each learner, the search space includes:
- RF: number of trees $\in$ UniformInteger[50, 500], maximum tree depth $\in$ UniformInteger[5, 30], split criterion $\in$ {gini, entropy}, features considered per split $\in$ UniformInteger[1, #features].
- Objective: maximize the F1 score on a held-out validation set.
TPE models the hyperparameter density with two distributions, $\ell(x)$ for good trials ($y < y^*$) and $g(x)$ for bad trials ($y \ge y^*$), and selects the next candidate by maximizing an expected-improvement criterion proportional to
$$\frac{\ell(x)}{g(x)},$$
where $y^*$ is a quantile of the observed objective values.
Empirical comparison shows TPE tuning completes in 31.25 s, a more than 450% speed-up over random search and grid search, while also achieving a higher F1 score.
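The search itself can be sketched with hyperopt's TPE implementation; the space below covers only the RF ranges listed above, and the dataset, split, and evaluation budget are illustrative assumptions rather than the authors' exact setup.

```python
from hyperopt import Trials, fmin, hp, tpe
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.metrics import f1_score
from sklearn.model_selection import train_test_split

X, y = make_classification(n_samples=3000, weights=[0.91], random_state=0)
X_tr, X_val, y_tr, y_val = train_test_split(X, y, stratify=y, random_state=0)

# Search space mirroring the RF ranges quoted above.
space = {
    "n_estimators": hp.uniformint("n_estimators", 50, 500),
    "max_depth": hp.uniformint("max_depth", 5, 30),
    "criterion": hp.choice("criterion", ["gini", "entropy"]),
}

def objective(params):
    model = RandomForestClassifier(
        n_estimators=int(params["n_estimators"]),
        max_depth=int(params["max_depth"]),
        criterion=params["criterion"],
        random_state=0,
    )
    model.fit(X_tr, y_tr)
    # hyperopt minimizes, so return the negative F1 score on the validation split
    return -f1_score(y_val, model.predict(X_val))

best = fmin(fn=objective, space=space, algo=tpe.suggest, max_evals=30, trials=Trials())
print("best hyperparameters:", best)
```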
5. Predictive Performance and Experimental Comparison
The following summarizes performance (mean test set metrics, 10 random splits):
| Model | Accuracy | Recall | F1 | Source |
|---|---|---|---|---|
| RF-LGBM | 0.871 | 0.952 | 0.859 | (Yang et al., 2021) |
| RF | 0.862 | 0.911 | 0.853 | (Yang et al., 2021) |
| LightGBM | 0.858 | 0.951 | 0.842 | (Yang et al., 2021) |
| XGBoost | -- | -- | ~0.83 | (Yang et al., 2021) |
| CNN-LSTM | 0.856 | 0.839 | 0.847 | H. Xiaoli et al. |
| LSTM | 0.802 | 0.842 | 0.822 | H. Xiaoli et al. |
| Teacher-Student | 0.9198 | -- | -- | Shen et al. |
RF-LGBM outperforms both its constituent single models and the published CNN-LSTM and LSTM baselines in terms of the F1 metric. This suggests that probabilistic soft-voting between random forests and gradient-boosted trees, when rigorously optimized and balanced, produces robust results on imbalanced, structured behavioral data.
6. Methodological Significance and Application Context
By combining: (1) an improved five-feature RFM profile, (2) SMOTE-ENN balancing, (3) TPE-driven hyperparameter tuning, and (4) a soft-voting ensemble of RF and LightGBM, RF-LGBM delivers a repurchase predictor with both an improved F1 score and a more than fourfold reduction in training time compared to traditional search strategies. The methodology is particularly suited to community e-commerce, characterized by volatile customer loyalty and severe class imbalance (Yang et al., 2021). A plausible implication is that this ensemble strategy can generalize to other domains exhibiting similar heterogeneity in behavioral data and label imbalance.