XGBoost in Predictive Credit Risk Modeling
- The paper demonstrates that combining PCA and SMOTEENN preprocessing with XGBoost can dramatically boost predictive accuracy on highly imbalanced credit risk datasets, achieving near-perfect F1 and AUC-ROC scores.
- The methodology leverages XGBoost’s second-order optimization and explicit regularization to manage missing values and reduce overfitting in complex tabular data.
- The approach emphasizes a structured data pipeline—using label encoding, feature selection, and hybrid resampling—to effectively balance classes and optimize model performance.
Predictive modeling using XGBoost refers to the application of the eXtreme Gradient Boosting library for supervised learning tasks—in particular, high-dimensional binary classification in structured credit risk, as exemplified by the comparative study “Advanced User Credit Risk Prediction Model using LightGBM, XGBoost and Tabnet with SMOTEENN” (Yu et al., 7 Aug 2024). In this paradigm, XGBoost’s tree boosting architecture is leveraged for its robust regularization, efficient second-order split finding, intrinsic handling of missingness, and ability to integrate seamlessly with class-imbalance strategies, dimensionality reduction, and synthetic sampling. The following sections review the pipeline, mathematical underpinnings, evaluation, and deployment considerations specific to XGBoost in this high-stakes tabular credit modeling context.
1. Data Architecture and Preprocessing Strategy
The core dataset in (Yu et al., 7 Aug 2024) comprises roughly 46,000 credit applications (45,318 labeled “approved” versus only 667 “not approved”), each with several dozen numerical and categorical descriptors (such as age, income, and credit history). Key challenges include this severe class imbalance, heterogeneous variable types, and substantial missingness.
Preprocessing for XGBoost is distinctive in several respects (a code sketch follows this list):
- Missing Value Handling: No external imputation is performed; missing values are passed as-is to XGBoost’s split learning, which natively infers missing-value directions during tree construction.
- Categorical Encoding: All categorical variables are label-encoded (integer-mapped), exploiting the boosting engine’s ability to efficiently handle integer splits.
- Class Imbalance: The pipeline comprises three sequential steps:
- Random under-sampling of the majority (“approved”) class to induce baseline balance.
- Principal Component Analysis (PCA) is fit on the under-sampled dataset, retaining >95% explained variance (specific component count not given).
- The SMOTEENN algorithm is run on the PCA-transformed space: first SMOTE (k=5) generates synthetic minority samples; Edited Nearest Neighbor (ENN) excises noise from the majority class.
- Feature Scaling: No scaling is applied post-PCA, as tree-based methods are agnostic to monotonic transformations.
- Feature Selection: Features with low Information Value (IV) are dropped before resampling, ensuring retained predictors are collectively informative for the class variable.
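As a concrete reference, the following is a minimal sketch of these preprocessing stages using scikit-learn and imbalanced-learn. It is not the authors' code: the target column name (`approved`), the omission of the unreported IV threshold, and the median imputation applied before PCA (which, unlike XGBoost, cannot accept missing values) are all assumptions made for illustration.

```python
# Minimal preprocessing sketch (not the paper's code): label encoding,
# majority under-sampling, PCA (>95% variance), then SMOTEENN (SMOTE k=5 by
# default) in the PCA-transformed space.
import pandas as pd
from sklearn.preprocessing import LabelEncoder
from sklearn.decomposition import PCA
from imblearn.under_sampling import RandomUnderSampler
from imblearn.combine import SMOTEENN

def preprocess(df: pd.DataFrame, target: str = "approved", random_state: int = 42):
    X, y = df.drop(columns=[target]), df[target]

    # Label-encode categorical columns; numeric NaNs are otherwise left for
    # XGBoost's native missing-value handling.
    for col in X.select_dtypes(include=["object", "category"]).columns:
        X[col] = LabelEncoder().fit_transform(X[col].astype(str))

    # (IV-based feature selection would be applied here; the paper does not
    # report its threshold, so it is omitted from this sketch.)

    # Step 1: random under-sampling of the majority ("approved") class.
    X_rus, y_rus = RandomUnderSampler(random_state=random_state).fit_resample(X, y)
    X_rus = pd.DataFrame(X_rus, columns=X.columns)  # ensure DataFrame for fillna below

    # Step 2: PCA retaining >95% explained variance (component count not
    # reported). PCA requires complete data, so a simple median fill is assumed.
    pca = PCA(n_components=0.95)
    X_pca = pca.fit_transform(X_rus.fillna(X_rus.median()))

    # Step 3: SMOTE oversampling of the minority class plus ENN cleaning.
    X_bal, y_bal = SMOTEENN(random_state=random_state).fit_resample(X_pca, y_rus)
    return X_bal, y_bal, pca
```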
2. XGBoost Objective and Regularization
At its core, XGBoost learns an additive ensemble of regression trees (CART), optimizing a regularized objective:

$$\mathcal{L}(\phi) = \sum_{i} l(\hat{y}_i, y_i) + \sum_{k} \Omega(f_k),$$

where $\Omega(f)$ is a regularization penalty for a tree $f$:

$$\Omega(f) = \gamma T + \tfrac{1}{2}\lambda \lVert w \rVert^2,$$

where $T$ is the number of leaves, $w$ is the vector of leaf weights, and $\gamma$, $\lambda$ are complexity and ridge penalties respectively.

The optimization at each boosting step $t$ uses a second-order Taylor approximation:

$$\mathcal{L}^{(t)} \simeq \sum_{i=1}^{n} \left[ g_i f_t(x_i) + \tfrac{1}{2} h_i f_t^2(x_i) \right] + \Omega(f_t),$$

with

$$g_i = \partial_{\hat{y}^{(t-1)}} l\big(y_i, \hat{y}^{(t-1)}\big), \qquad h_i = \partial^2_{\hat{y}^{(t-1)}} l\big(y_i, \hat{y}^{(t-1)}\big).$$

For a fixed tree structure, the optimal leaf weights are derived in closed form from the accumulated $g_i$ / $h_i$ statistics,

$$w_j^* = -\frac{\sum_{i \in I_j} g_i}{\sum_{i \in I_j} h_i + \lambda},$$

and the best split is selected by maximizing the gain formula:

$$\text{Gain} = \tfrac{1}{2}\left[\frac{G_L^2}{H_L + \lambda} + \frac{G_R^2}{H_R + \lambda} - \frac{(G_L + G_R)^2}{H_L + H_R + \lambda}\right] - \gamma,$$

where $G_L, H_L$ and $G_R, H_R$ are the sums of $g_i, h_i$ over the left and right children respectively.
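To make the closed-form expressions concrete, the small helper below evaluates the optimal leaf weight and the split gain from accumulated gradient/Hessian sums. It is purely illustrative (XGBoost performs these computations internally); the symbols mirror the formulas above.

```python
# Illustrative re-implementation of the leaf-weight and split-gain formulas;
# G and H denote sums of g_i and h_i over the instances reaching a node.
def optimal_leaf_weight(G: float, H: float, lam: float) -> float:
    """w_j* = -G_j / (H_j + lambda)."""
    return -G / (H + lam)

def split_gain(G_L: float, H_L: float, G_R: float, H_R: float,
               lam: float, gamma: float) -> float:
    """Gain of splitting a parent node into left/right children; the split is
    only worthwhile when the gain is positive (gamma acts as a threshold)."""
    def score(G: float, H: float) -> float:
        return G * G / (H + lam)
    return 0.5 * (score(G_L, H_L) + score(G_R, H_R)
                  - score(G_L + G_R, H_L + H_R)) - gamma

# Example: evaluate one candidate split with arbitrary node statistics.
print(split_gain(G_L=10.0, H_L=8.0, G_R=-12.0, H_R=9.0, lam=1.0, gamma=0.5))
```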
3. Hyperparameter Adaptation and Search
“Advanced adaptation and tuning” is reported but the exact search grid or Bayesian optimization scope is not specified in (Yu et al., 7 Aug 2024). Practitioners should consider:
- learning_rate ($\eta$): Shrinkage applied to each tree (typical grid: 0.01, 0.05, 0.1).
- max_depth: Maximum tree depth (3, 6, 9).
- subsample: Row sampling per tree (e.g., 0.8).
- colsample_bytree: Feature sampling per tree (e.g., 0.8).
- min_child_weight: Minimum sum of instance weight per node.
- gamma ($\gamma$): Minimum loss reduction required to make a split.
- lambda ($\lambda$): L2 regularization on leaf weights.
- n_estimators: Number of boosting rounds.
Model selection is performed via a hold-out validation set or $k$-fold cross-validation, optimizing AUC-ROC or (optionally) F1-score. Because the optimal hyperparameter values are not published, reproduction requires an independent grid or Bayesian search, as sketched below.
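Since the published grid is unavailable, the sketch below shows one reasonable reproduction strategy: an exhaustive grid search over the ranges listed above, scored by AUC-ROC under stratified 5-fold cross-validation. All grid values are illustrative assumptions, not the paper's settings, and `X_bal`, `y_bal` refer to the resampled training data from the preprocessing sketch.

```python
# Hedged grid-search sketch; the grid values are assumptions chosen to mirror
# the typical ranges listed above, not the settings used in the paper.
from sklearn.model_selection import GridSearchCV, StratifiedKFold
from xgboost import XGBClassifier

param_grid = {
    "learning_rate": [0.01, 0.05, 0.1],
    "max_depth": [3, 6, 9],
    "subsample": [0.8],
    "colsample_bytree": [0.8],
    "min_child_weight": [1, 5],
    "gamma": [0, 0.1],
    "reg_lambda": [1.0, 5.0],
    "n_estimators": [200, 500],
}

search = GridSearchCV(
    estimator=XGBClassifier(),
    param_grid=param_grid,
    scoring="roc_auc",                 # or "f1" if F1 is the selection target
    cv=StratifiedKFold(n_splits=5, shuffle=True, random_state=42),
    n_jobs=-1,
)
search.fit(X_bal, y_bal)               # resampled training data from the sketch above
print(search.best_params_, search.best_score_)
```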
4. Performance Metrics and Empirical Results
Model performance was evaluated using four complementary metrics—Precision, Recall, F1-Score, and AUC-ROC—across three principal scenarios:
| Preprocessing | F1 | Recall | Precision | AUC-ROC |
|---|---|---|---|---|
| Raw data (no PCA/SMOTEENN) | 0.6213 | 0.6213 | 0.6213 | 0.6565 |
| PCA only | 0.9754 | 0.9821 | 0.9725 | 0.6825 |
| PCA + SMOTEENN | 0.9940 | 0.9940 | 0.9941 | 0.9997 |
After PCA, F1-score jumps from ~0.62 to ~0.98. Incorporation of SMOTEENN further elevates F1 to 0.994 and AUC-ROC to 0.9997. Relative to other ensemble models (LightGBM, Random Forest, CatBoost, TabNet), XGBoost’s predictive accuracy is competitive—falling behind LightGBM only marginally in the best-case (final) scenario.
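A note on how these metrics are computed: precision, recall, and F1 are evaluated on hard labels (at the default 0.5 probability threshold), while AUC-ROC requires the predicted positive-class probabilities. The snippet below illustrates this with placeholder names (`model`, `X_test`, `y_test`) standing in for the tuned classifier and a held-out test split.

```python
# Computing the four reported metrics; `model`, `X_test`, `y_test` are
# placeholders for a fitted XGBoost classifier and held-out test data.
from sklearn.metrics import precision_score, recall_score, f1_score, roc_auc_score

y_pred = model.predict(X_test)                # hard labels at the 0.5 threshold
y_prob = model.predict_proba(X_test)[:, 1]    # positive-class probabilities

print("Precision:", precision_score(y_test, y_pred))
print("Recall:   ", recall_score(y_test, y_pred))
print("F1:       ", f1_score(y_test, y_pred))
print("AUC-ROC:  ", roc_auc_score(y_test, y_prob))  # uses probabilities, not labels
```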
5. Mechanistic Insights and Practical Recommendations
- Dimensionality reduction (PCA) accelerates convergence and reduces overfitting by concentrating information into uncorrelated components; this is reflected in the sharp increase in F1-score and observable stability in validation results.
- Hybrid resampling (SMOTEENN) is crucial for managing extreme class imbalance. Oversampling via SMOTE (linear interpolation in k-nearest-neighbor PCA-space) increases the representation of the “not approved” class, while ENN removes ambiguous majority class instances; together, they nearly eliminate false negatives and false positives.
- XGBoost’s second-order optimization and explicit regularization ($\gamma$, $\lambda$) grant robustness even after aggressive synthetic sampling and PCA projection.
- Choice between XGBoost and LightGBM: While LightGBM marginally outperforms XGBoost on F1 and AUC-ROC in the final (PCA + SMOTEENN) scenario, XGBoost remains attractive where custom loss functions or distributed GPU configurations are needed.
6. End-to-End Workflow and Implementation
The canonical XGBoost modeling pipeline for this risk classification task is:
- Ingest the raw applicant data.
- Compute IV for all features; exclude those below a discriminatory threshold.
- Randomly under-sample the majority class to balance the class ratio.
- Fit PCA to training features; transform both train and test sets to the top principal components.
- On PCA-reduced data, apply SMOTE to oversample minority, followed by ENN to prune majority class noise.
- Partition the preprocessed data via a stratified train/validation split or $k$-fold cross-validation.
- Define the hyperparameter search grid; for each parameter tuple:
  - Train XGBoost on the training fold.
  - Evaluate AUC-ROC (or F1) on the validation fold.
- Select optimal hyperparameter set based on validation performance.
- Retrain XGBoost on the full training data using the best-found settings.
- Predict on held-out test set and compute Precision, Recall, F1, and AUC-ROC.
This modular approach, in particular the PCA + SMOTEENN augmentation, is essential for achieving near-perfect accuracy and discrimination in credit approval tasks under extreme imbalance.
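A condensed sketch of the final retraining and scoring steps follows, reusing objects from the earlier snippets (`search`, `pca`, `X_bal`, `y_bal`) and a hypothetical raw test frame `X_test_raw`. Note that resampling is applied to the training data only; the held-out test set is merely projected with the training-fit PCA.

```python
# Final retrain-and-score sketch, assuming objects from the earlier snippets;
# SMOTEENN is applied to the training data only, never to the test set.
from xgboost import XGBClassifier

model = XGBClassifier(**search.best_params_)
model.fit(X_bal, y_bal)

# Same median-imputation assumption as in the preprocessing sketch, since
# PCA (unlike XGBoost) cannot accept missing values.
X_test_pca = pca.transform(X_test_raw.fillna(X_test_raw.median()))

y_prob = model.predict_proba(X_test_pca)[:, 1]
y_pred = (y_prob >= 0.5).astype(int)   # the threshold can be tuned to business needs
```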
7. Limitations, Scaling, and Generalization
Key limitations acknowledged in (Yu et al., 7 Aug 2024):
- The number of PCA components is not reported; replication may require empirical tuning to retain >95% variance.
- Final XGBoost hyperparameters and seed values are not disclosed.
- Operational use requires adaptation to bank-specific business logic (score thresholds, applicant volume).
- Scaling to larger datasets or production necessitates row-wise throughput optimization (distributed XGBoost) and potential inclusion of additional regularization (early stopping, monotone constraints).
Nonetheless, the procedural and mathematical substrates outlined—label encoding, missing value propagation to XGBoost, hybrid resampling, and second-order regularized boosting—constitute a field-validated template for structured, high-imbalance, real-world predictive modeling in credit risk and analogous domains.