
Taiwan Credit Default Dataset Overview

Updated 2 February 2026
  • The Taiwan Credit Default Dataset is an open-access benchmark containing detailed socio-demographic and financial attributes of 30,000 credit card clients, each annotated with a binary default status.
  • Preprocessing steps include min-max normalization, log transformations for skewed bill and payment amounts, and outlier clipping to stabilize machine learning model training.
  • The dataset has advanced research in explainable ML and fairness analysis, with studies on ARD techniques and proxy leakage that enhance credit risk prediction and interpretability.

The Taiwan Credit Default Dataset (also referred to as the Default of Credit Card Clients Dataset) is a canonical open-access dataset for research in credit risk modeling, algorithmic fairness, and explainable machine learning. Compiled from the 2005 portfolio of a major Taiwanese bank, it comprises detailed socio-demographic and financial attributes for 30,000 credit-card holders, each annotated with a binary indicator of subsequent default status. This dataset has become, via the UCI Machine Learning Repository, a critical benchmark for evaluating supervised learning techniques, feature selection strategies, and algorithmic audit methodologies in credit scoring contexts (Mbuvha et al., 2019, SD et al., 26 Jan 2026).

1. Structure and Content of the Dataset

The dataset captures a “snapshot” view of each client, reflecting demographic variables and six months of historical credit card activity, organized as follows:

| Feature group | Attributes (Xi) | Description / coding |
|---|---|---|
| Socio-demographic | X₂: SEX; X₃: EDUCATION; X₄: MARRIAGE; X₅: AGE | SEX (1 = male, 2 = female); EDUCATION (1–4); MARRIAGE (1–3); AGE (years) |
| Financial | X₁: LIMIT_BAL; X₆–X₁₁: PAY₀…PAY₅ | LIMIT_BAL (credit limit); PAY_t (–2…8 months delayed, ordinal) |
| Statement amounts | X₁₂–X₁₇: BILL_AMT₁…BILL_AMT₆ | Continuous, NT\$ |
| Payment amounts | X₁₈–X₂₃: PAY_AMT₁…PAY_AMT₆ | Continuous, NT\$ |
| Target | Y: default payment next month | 0 = no default, 1 = default |

The final data matrix has the structure $D = \{ (x^{(i)}, y^{(i)}) \}_{i=1}^{30\,000}$, with $x \in \mathbb{R}^{23}$ (mixed type) and $y \in \{0,1\}$. The target attribute (“default payment next month”) is moderately imbalanced: roughly 78% non-defaults and 22% defaults (23,364 and 6,636 records, respectively, in the UCI release) (Mbuvha et al., 2019, SD et al., 26 Jan 2026).
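
The grouping in the table can be written down as a small schema. The sketch below is only illustrative: the column names follow this article's PAY₀…PAY₅ indexing and are assumptions (the headers in a given release of the file may differ), and `all_features` and `class_prevalence` are hypothetical helper names.

```python
# Feature groups as listed in the table (X1..X23); names are assumed, not
# read from the data file.
FEATURE_GROUPS = {
    "socio_demographic": ["SEX", "EDUCATION", "MARRIAGE", "AGE"],
    "financial": ["LIMIT_BAL"] + [f"PAY_{t}" for t in range(6)],
    "statement_amounts": [f"BILL_AMT{t}" for t in range(1, 7)],
    "payment_amounts": [f"PAY_AMT{t}" for t in range(1, 7)],
}
TARGET = "default payment next month"


def all_features(groups=FEATURE_GROUPS):
    """Flatten the groups into a single ordered list of feature names."""
    return [name for names in groups.values() for name in names]


def class_prevalence(labels):
    """Fraction of records labelled 1 (default)."""
    return sum(labels) / len(labels)
```

Checking `len(all_features()) == 23` is a cheap guard against dropping or duplicating a column when the groups are edited.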

2. Data Preprocessing and Transformation

Studies using the dataset typically apply the following preprocessing steps before model fitting and evaluation:

  • Missing Data: The released dataset contains no missing values; all records are complete across features.
  • Feature Scaling: Attributes are typically rescaled to $[0,1]$ with min-max normalization, $X_j' = (X_j - \min(X_j)) / (\max(X_j) - \min(X_j))$ (Mbuvha et al., 2019). For the heavily skewed bill and payment amounts, a $\log(1 + x)$ transform is also employed (SD et al., 26 Jan 2026).
  • Categorical Encoding: SEX, EDUCATION, and MARRIAGE are kept as integer codes, min-max scaled, or one-hot encoded, depending on modeling requirements.
  • Outlier Handling: BILL_AMT and PAY_AMT are clipped or transformed at the 99.9th percentile to stabilize model training (SD et al., 26 Jan 2026).
  • Target Split: Splits such as 70/30 or stratified 80/20 are reported for holdout evaluation, with a test-set default prevalence of approximately 17–25% (Mbuvha et al., 2019, SD et al., 26 Jan 2026).
  • Balancing: Techniques such as class reweighting, SMOTE oversampling, and random undersampling are incorporated to address label imbalance in the training data (SD et al., 26 Jan 2026).
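
The steps listed above can be sketched in plain Python. This is a minimal illustration under stated assumptions, not the pipeline of either cited paper: the helper names are hypothetical, negative amounts are floored at zero before the log transform (an assumption made here so that `log(1 + x)` is defined), and the "balanced" class weights use the common `n / (k * n_c)` convention.

```python
import math
import random


def percentile(values, q):
    """Linearly interpolated percentile, q in [0, 100]."""
    s = sorted(values)
    pos = (len(s) - 1) * q / 100.0
    lo, hi = math.floor(pos), math.ceil(pos)
    return s[lo] + (s[hi] - s[lo]) * (pos - lo)


def clip_upper(values, q=99.9):
    """Cap values at the q-th percentile (outlier handling)."""
    cap = percentile(values, q)
    return [min(v, cap) for v in values]


def log1p_nonneg(values):
    """log(1 + x); negative amounts are floored at 0 first (assumption)."""
    return [math.log1p(max(v, 0.0)) for v in values]


def min_max(values):
    """Rescale to [0, 1]; a constant column maps to all zeros."""
    lo, hi = min(values), max(values)
    if hi == lo:
        return [0.0 for _ in values]
    return [(v - lo) / (hi - lo) for v in values]


def stratified_split(labels, test_frac=0.2, seed=0):
    """Return (train_idx, test_idx) preserving the class ratio."""
    rng = random.Random(seed)
    train, test = [], []
    for cls in sorted(set(labels)):
        idx = [i for i, y in enumerate(labels) if y == cls]
        rng.shuffle(idx)
        n_test = round(len(idx) * test_frac)
        test.extend(idx[:n_test])
        train.extend(idx[n_test:])
    return sorted(train), sorted(test)


def balanced_class_weights(labels):
    """Class reweighting: w_c = n / (k * n_c)."""
    n, classes = len(labels), sorted(set(labels))
    return {c: n / (len(classes) * labels.count(c)) for c in classes}
```

In practice the clipping threshold and the min-max bounds would be estimated on the training split only and then applied to the test split, so that no test-set information leaks into preprocessing.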

3. Applications in Credit Risk and Bayesian Neural Networks

The dataset has served as a benchmark for a range of supervised learning paradigms; notably, it underpins methodological advances in uncertainty-quantified, interpretable neural models (Mbuvha et al., 2019). Key contributions include:

  • Model Architecture: Input features feed into a single-hidden-layer multilayer perceptron (MLP) with 5 units. Formally,

$$h_j(x) = \Psi\left(z_j + \sum_{i=1}^{23} w_{ij} x_i\right), \quad o(x) = b + \sum_{j=1}^{5} v_j h_j(x)$$

with $o(x)$ mapped to a probability via a logistic sigmoid.

  • Training Paradigms:
    • Gaussian Approximation: The posterior over weights $P(w \mid D, \alpha, \beta)$ is approximated by $\mathcal{N}(w_{MP}, A^{-1})$, with $A = \nabla^2 M(w)$.
    • Hybrid Monte Carlo (HMC): HMC draws samples from the Bayesian posterior in weight space without imposing a Gaussian form; the samples are exact only asymptotically (Algorithm 1 in Mbuvha et al., 2019).
  • Automatic Relevance Determination (ARD): Each input group $c$ receives its own precision $\alpha_c$, giving the posterior

$$P(w \mid D, \{\alpha_c\}, \beta, H) \propto \exp\left[ -\beta E_D(w) - \sum_c \alpha_c E_{W_c}(w) \right]$$

The adaptively learned $\alpha_c$ values indicate feature importance inversely: a large $\alpha_c$ shrinks the weights attached to input $c$ toward zero, suppressing irrelevant inputs.

  • Key Feature Relevance: Both the HMC-ARD and Gaussian-ARD approaches assign the highest relevance to repayment status in the most recent period and three months prior (PAY₀, PAY₃), the credit limit (LIMIT_BAL), and the amount paid two months ago (PAY_AMT₂). EDUCATION is also noted as moderately influential.
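
The architecture and the ARD posterior above can be sketched as a forward pass plus an unnormalized negative log-posterior. Several choices here are assumptions rather than details taken from the source: `tanh` stands in for the unspecified activation $\Psi$, $E_D$ is taken to be the cross-entropy, each input forms its own group with $E_{W_c} = \tfrac{1}{2}\sum_j w_{cj}^2$, and the penalty on biases and output weights is omitted for brevity. All function names are hypothetical.

```python
import math


def sigmoid(a):
    """Numerically stable logistic sigmoid."""
    if a >= 0:
        return 1.0 / (1.0 + math.exp(-a))
    e = math.exp(a)
    return e / (1.0 + e)


def forward(x, W, z, v, b):
    """P(y=1 | x) for a single-hidden-layer MLP.

    W[i][j]: weight from input i to hidden unit j; z[j]: hidden bias;
    v[j]: hidden-to-output weight; b: output bias.
    """
    hidden = [
        math.tanh(z[j] + sum(W[i][j] * x[i] for i in range(len(x))))
        for j in range(len(z))
    ]
    o = b + sum(v[j] * hidden[j] for j in range(len(z)))
    return sigmoid(o)


def data_error(X, y, W, z, v, b, eps=1e-12):
    """E_D(w): summed cross-entropy over the data set (assumed form)."""
    total = 0.0
    for xi, yi in zip(X, y):
        p = min(max(forward(xi, W, z, v, b), eps), 1.0 - eps)
        total -= yi * math.log(p) + (1 - yi) * math.log(1.0 - p)
    return total


def ard_objective(X, y, W, z, v, b, alphas, beta=1.0):
    """M(w) = beta * E_D(w) + sum_c alpha_c * E_Wc(w), one group per input."""
    penalty = sum(
        alphas[c] * 0.5 * sum(w * w for w in W[c]) for c in range(len(W))
    )
    return beta * data_error(X, y, W, z, v, b) + penalty
```

With 23 inputs and 5 hidden units, `W` is a 23×5 nested list and `alphas` has 23 entries; raising `alphas[c]` increases the cost of any non-zero weight leaving input `c`, which is how ARD switches off irrelevant features.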

ARD-equipped Bayesian neural networks (BNNs) give the best reported predictive performance: AUC 0.7783 for HMC-ARD and 0.7753 for Gaussian-ARD, both substantially above the non-ARD baseline of 0.7079 (Mbuvha et al., 2019).
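
The Hybrid Monte Carlo paradigm named in this section can be illustrated with a generic sampler. This is a textbook-style sketch, not a reproduction of Algorithm 1 in Mbuvha et al.: a standard normal target stands in for the network posterior, and the step size, trajectory length, and function names are arbitrary assumptions. For a BNN, `u` would be the objective $M(w)$ and `grad_u` its gradient with respect to the weights.

```python
import math
import random


def leapfrog(q, p, grad_u, step, n_steps):
    """Simulate Hamiltonian dynamics with the leapfrog integrator."""
    q, p = list(q), list(p)
    g = grad_u(q)
    for _ in range(n_steps):
        p = [pi - 0.5 * step * gi for pi, gi in zip(p, g)]  # half momentum step
        q = [qi + step * pi for qi, pi in zip(q, p)]        # full position step
        g = grad_u(q)
        p = [pi - 0.5 * step * gi for pi, gi in zip(p, g)]  # half momentum step
    return q, p


def hmc_sample(u, grad_u, q0, n_samples, step=0.2, n_steps=10, seed=0):
    """Draw samples from exp(-u(q)); returns (samples, acceptance_rate)."""
    rng = random.Random(seed)
    q, samples, accepted = list(q0), [], 0
    for _ in range(n_samples):
        p = [rng.gauss(0.0, 1.0) for _ in q]                # resample momentum
        h_old = u(q) + 0.5 * sum(pi * pi for pi in p)
        q_new, p_new = leapfrog(q, p, grad_u, step, n_steps)
        h_new = u(q_new) + 0.5 * sum(pi * pi for pi in p_new)
        # Metropolis correction for the integrator's discretization error
        if rng.random() < math.exp(min(0.0, h_old - h_new)):
            q, accepted = q_new, accepted + 1
        samples.append(list(q))
    return samples, accepted / n_samples


def u_gauss(q):
    """Toy potential: negative log-density of a standard normal."""
    return 0.5 * sum(qi * qi for qi in q)


def grad_u_gauss(q):
    """Gradient of the toy potential."""
    return list(q)
```

Calling `hmc_sample(u_gauss, grad_u_gauss, [0.0, 0.0], 2000)` returns the chain and its acceptance rate; the Metropolis step is what makes the samples asymptotically exact despite the finite step size.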

4. Fairness Analysis and Proxy Leakage

Standard "fairness through blindness" interventions, which remove protected attributes such as SEX, do not eliminate discriminatory structure in predictions. Audits using SHapley Additive exPlanations (SHAP) and adversarial inverse modeling indicate that non-sensitive features, chiefly Marital Status, Age, and LIMIT_BAL, function as latent gender proxies (SD et al., 26 Jan 2026).

  • SHAP Results: For an XGBoost model trained without the explicit gender attribute, normalized mean absolute SHAP values identify Marital Status (≈ 0.195), Age (≈ 0.160), and LIMIT_BAL (≈ 0.110) as the primary contributors to prediction. Disaggregating by gender shows, for example, that Age carries substantially more predictive weight for women than for men (0.23 vs. 0.12 in SHAP magnitude), indicating that gendered information persists after redaction.
  • Proxy Leakage Quantification: Adversarial models trained to reconstruct the gender label from "neutral" features achieve ROC AUC ≈ 0.65 (well above chance), confirming the dataset contains structurally encoded gender signals beyond explicit indicators.

| Feature | SHAP magnitude (male) | SHAP magnitude (female) |
|---|---|---|
| Marital Status | 0.18 | 0.24 |
| Age | 0.12 | 0.23 |
| LIMIT_BAL | 0.11 | 0.09 |
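
The proxy-leakage test described above (reconstructing the gender label from "neutral" features and scoring the attempt by ROC AUC) can be sketched as follows. Everything here is an illustrative assumption: the data are synthetic, a plain logistic regression stands in for whatever adversary the cited study used, and the function names are hypothetical.

```python
import math
import random


def roc_auc(labels, scores):
    """P(random positive outscores random negative); ties count 0.5."""
    pos = [s for y, s in zip(labels, scores) if y == 1]
    neg = [s for y, s in zip(labels, scores) if y == 0]
    wins = sum(1.0 if p > n else 0.5 if p == n else 0.0
               for p in pos for n in neg)
    return wins / (len(pos) * len(neg))


def predict(x, w, b):
    """Logistic-regression probability (logit clamped for stability)."""
    a = b + sum(wj * xj for wj, xj in zip(w, x))
    a = max(-30.0, min(30.0, a))
    return 1.0 / (1.0 + math.exp(-a))


def fit_logistic(X, y, lr=0.1, epochs=200):
    """Batch gradient descent on the mean log-loss."""
    w, b = [0.0] * len(X[0]), 0.0
    for _ in range(epochs):
        gw, gb = [0.0] * len(w), 0.0
        for xi, yi in zip(X, y):
            err = predict(xi, w, b) - yi
            gw = [g + err * v for g, v in zip(gw, xi)]
            gb += err
        w = [wj - lr * g / len(X) for wj, g in zip(w, gw)]
        b -= lr * gb / len(X)
    return w, b


def make_synthetic(n=600, seed=0):
    """Synthetic data: one feature correlated with the hidden attribute."""
    rng = random.Random(seed)
    X, s = [], []
    for _ in range(n):
        g = rng.randint(0, 1)              # protected attribute (withheld)
        proxy = g + rng.gauss(0.0, 1.0)    # "neutral" feature that leaks g
        noise = rng.gauss(0.0, 1.0)        # feature unrelated to g
        X.append([proxy, noise])
        s.append(g)
    return X, s


def proxy_leakage_auc(X, s):
    """Fit the adversary on the first half, score it on the second half."""
    half = len(X) // 2
    w, b = fit_logistic(X[:half], s[:half])
    return roc_auc(s[half:], [predict(x, w, b) for x in X[half:]])
```

An AUC near 0.5 would mean the remaining features carry little information about the withheld attribute; values clearly above 0.5, such as the ≈ 0.65 reported for this dataset, signal proxy leakage. Scoring on held-out rows keeps the estimate from being inflated by overfitting.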

Because this leakage can coexist with near-ideal group metrics (e.g., Disparate Impact ≈ 1.0, Equalized Odds difference ≈ 0), standard fairness audits that rely on those metrics alone are inadequate for detecting or correcting the underlying bias.
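
For reference, the two audit metrics named above can be computed as in the sketch below. The definitions are one common convention and are assumptions here, since the source does not spell them out: Disparate Impact as the ratio of positive-prediction rates between two groups, and the Equalized Odds difference as the larger of the absolute gaps in true-positive and false-positive rates. Function names are hypothetical.

```python
def positive_rate(preds, mask):
    """Share of positive predictions among the records selected by mask."""
    selected = [p for p, m in zip(preds, mask) if m]
    return sum(selected) / len(selected)


def disparate_impact(preds, group):
    """Positive-prediction rate of group 1 divided by that of group 0."""
    rate_1 = positive_rate(preds, [g == 1 for g in group])
    rate_0 = positive_rate(preds, [g == 0 for g in group])
    return rate_1 / rate_0


def equalized_odds_difference(preds, labels, group):
    """max(|TPR gap|, |FPR gap|) between group 0 and group 1."""
    gaps = []
    for true_label in (1, 0):  # label 1 gives the TPR gap, label 0 the FPR gap
        rates = [
            positive_rate(
                preds,
                [g == grp and y == true_label for g, y in zip(group, labels)],
            )
            for grp in (0, 1)
        ]
        gaps.append(abs(rates[0] - rates[1]))
    return max(gaps)
```

Both metrics depend only on predictions, labels, and group membership, so they cannot reveal whether the model reconstructs the protected attribute internally from proxies.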

5. Methodological and Social Implications

The Taiwan Credit Default Dataset has catalyzed major methodological advances and critical fairness critiques in credit risk research:

  • Uncertainty and Interpretability: Bayesian neural models with ARD offer not only improved discrimination (higher AUC) but also a ranked, actionable view of feature relevance, supporting compliance and consumer-facing transparency requirements (Mbuvha et al., 2019).
  • Algorithmic Bias: The exposure of "proxy features" challenges the sufficiency of attribute removal (fairness through unawareness). The evidence suggests that fairness requires causally aware modeling and regular audits for proxy leakage via adversarial or explainable-AI frameworks (SD et al., 26 Jan 2026).
  • Recommendations: Suggested best practices now include structural accountability (tracking and quantifying proxy leakage), direct incorporation of causal regularization or counterfactual fairness objectives, and validation of results across heterogeneous and international datasets to rule out spurious proxy effects.

6. Limitations and Outlook

While the Taiwan Credit Default Dataset remains integral to model benchmarking, several limitations constrain generalizability:

  • Single Region, Single Snapshot: Findings are potentially idiosyncratic to Taiwanese socioeconomic structure and may not port directly to contexts with different cultural, regulatory, or temporal profiles.
  • Static Data: The lack of longitudinal or dynamic updates restricts studies of temporal drift and evolving proxy relationships.
  • Feature Granularity: No detailed behavioral or transactional data beyond six-month aggregates is present; causal inferences are thus only partially grounded.

Future research directions highlighted include the construction of formal causal graphs to isolate and mitigate proxy leakage, integration of temporal credit histories, and external validation on non-Taiwanese cohorts (SD et al., 26 Jan 2026). These avenues aim to both extend methodological rigor and ensure genuine equity in credit risk modeling.
