
Istanbul 2025 Q1 Synthetic Dataset

Updated 21 December 2025
  • Istanbul 2025 Q1 Synthetic Dataset is an open, differentially private resource of 100,000 synthetic micro-profiles generated to mirror official socio-demographic and telecom marginals.
  • It employs an iterative proportional fitting and retrieval-augmented pipeline to integrate granular socio-demographic attributes with behavioral features while ensuring statistical fidelity.
  • Empirical evaluations reveal that incorporating alternative non-financial features significantly improves credit risk prediction, promoting safer, inclusive lending strategies.

The Istanbul 2025 Q1 Synthetic Dataset is an open, differentially private resource of 100,000 micro-profiles statistically matched to first quarter 2025 Istanbul population and telecom usage marginals. It underpins empirical evaluations of alternative credit risk modeling for financially excluded or underbanked populations by blending granular socio-demographic attributes with behavioral features derived from telecom and consumer proxies. The dataset and pipeline, introduced in "Credit Risk Estimation with Non-Financial Features: Evidence from a Synthetic Istanbul Dataset" (Denknalbant et al., 14 Dec 2025), establish a transparent, reproducible framework for assessing the utility of non-financial features in credit decisioning under contemporary Turkish regulatory and market constraints.

1. Construction Methodology

The synthetic profiles are generated to closely mirror official Q1 2025 distributions published by TÜİK and aggregate telecom behavioral statistics. Generation proceeds in several stages:

  • Marginal Extraction: TÜİK micro-tables provide the empirical distributions for age, education, employment status, occupation, income bands, home district, and home ownership. Telecom operator dashboards supply recharge and call/text statistics per district.
  • Retrieval-Augmented Generation Pipeline: Public marginal summaries are embedded in a vector database. For each attribute synthesis stage (e.g., job-income sampling, device assignment, behavioral covariates), structured prompts and tabular contexts are fed to the OpenAI o3 API, which returns candidate values. These values are accepted into the synthetic record only if they comply with fixed economic sanity rules.
  • Privacy and Statistical Fidelity: Let $X$ denote the variable set, $F$ the set of seven socio-demographics, and $P^\mathrm{real}_f$ and $P^\mathrm{syn}_f$ the real and synthetic marginals for feature $f$. The sampling seeks to minimize total divergence:

$$\min_S \sum_{f\in F} D_{KL}(P^\mathrm{real}_f \| P^\mathrm{syn}_f)$$

Sampling is implemented via iterative proportional fitting (IPF) and random draws. To ensure $\epsilon$-differential privacy, every synthesized numeric value $x$ is perturbed:

$$x^\mathrm{syn} = x_\mathrm{IPF} + \mathrm{Laplace}(0, \Delta/\epsilon)$$

Hard economic filtering ensures, for instance, that individuals earning no more than the minimum wage are assigned only mid-tier or entry-level phones, and that car ownership is conditional on salary exceeding district-specific thresholds.

  • Ground-Truth Label Generation: Seven hybrid rules, combining volatility metrics for employment, device replacement cadence, rent-to-income ratios, and shopping activity, define the binary delinquency label via a scoring function.
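The IPF and Laplace steps described above can be sketched as follows. This is a minimal illustration, not the authors' pipeline: the two-way table, its target marginals, and the sensitivity and $\epsilon$ values are hypothetical, and the helper names `ipf`, `laplace_noise`, and `perturb` are introduced here only for the example. $\Delta$ is read as the sensitivity of the perturbed value, as in the standard Laplace mechanism.

```python
import random

def ipf(seed_table, row_targets, col_targets, iters=100, tol=1e-9):
    """Rescale a 2-D table until its row and column sums match the targets."""
    table = [row[:] for row in seed_table]
    for _ in range(iters):
        # Row step: scale each row to its target total.
        for i, target in enumerate(row_targets):
            s = sum(table[i])
            if s > 0:
                table[i] = [v * target / s for v in table[i]]
        # Column step: scale each column to its target total.
        for j, target in enumerate(col_targets):
            s = sum(row[j] for row in table)
            if s > 0:
                for row in table:
                    row[j] *= target / s
        # Converged once the row sums still match after the column step.
        err = max(abs(sum(table[i]) - t) for i, t in enumerate(row_targets))
        if err < tol:
            break
    return table

def laplace_noise(scale, rng):
    """Draw from Laplace(0, scale) as a scaled difference of two Exp(1) draws."""
    return scale * (rng.expovariate(1.0) - rng.expovariate(1.0))

def perturb(x_ipf, sensitivity, epsilon, rng):
    """x_syn = x_IPF + Laplace(0, sensitivity / epsilon)."""
    return x_ipf + laplace_noise(sensitivity / epsilon, rng)

# Hypothetical targets: 2 employment states x 3 income bands, 1,000 people.
seed = [[1.0, 1.0, 1.0], [1.0, 1.0, 1.0]]
row_targets = [700.0, 300.0]
col_targets = [500.0, 350.0, 150.0]
fitted = ipf(seed, row_targets, col_targets)

# Hypothetical income value, sensitivity, and privacy budget.
rng = random.Random(0)
noisy_income = perturb(42_000.0, sensitivity=1_000.0, epsilon=1.0, rng=rng)
```

Individuals would then be drawn from the fitted cell weights before the noise step. A smaller $\epsilon$ widens the Laplace scale, which strengthens privacy at the cost of fidelity.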

2. Dataset Schema and Variable Encoding

The 17-column schema comprises 16 features, partitioned into socio-demographic and behavioral blocks, plus what is presumably the binary delinquency label.

  • Socio-demographics (aligned to TÜİK marginals):
  1. age (integer, 18–80)
  2. education (categorical; High School/University/MSc/PhD)
  3. employment_status (Employed, Self-Employed, Unemployed)
  4. job (18 ISCO classes)
  5. monthly_income (TRY, 5,000–200,000)
  6. home_district (39 Istanbul districts)
  7. owns_home (boolean)
  • Alternative behavioral attributes:
  1. phone_model (~50 categorical models)
  2. phone_purchase_date (0–60 months prior)
  3. owns_car (boolean, ~30% true)
  4. car_brand (economy/mid/luxury)
  5. car_purchase_date (0–120 months prior)
  6. owns_credit_card (boolean, ~45% true)
  7. monthly_subscription_cost (TRY, mean ~75)
  8. online_shopping_frequency (0–20 per month)
  9. social_media_active (boolean, ~85% true)

Behavioral and financial variables are cast to standardized, model-ready representations: numerics are z-scored, dates converted to asset age in months, and booleans mapped to {0,1}. Categorical codes are one-hot for linear models or integer-encoded for tree-based methods.
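A minimal sketch of these encoding steps, using only the Python standard library, might look like the following. The three rows, the `(year, month)` representation of purchase dates, and the March 2025 reference month are assumptions made for illustration; the repository's notebooks may encode the columns differently.

```python
from statistics import mean, pstdev

def z_scores(values):
    """Standardize numerics to zero mean and unit (population) variance."""
    mu, sigma = mean(values), pstdev(values)
    return [(v - mu) / sigma if sigma > 0 else 0.0 for v in values]

def months_between(purchase, reference):
    """Asset age in whole months between two (year, month) pairs."""
    return (reference[0] - purchase[0]) * 12 + (reference[1] - purchase[1])

def one_hot(values):
    """One indicator column per category, for linear models."""
    cats = sorted(set(values))
    return cats, [[1 if v == c else 0 for c in cats] for v in values]

def integer_codes(values):
    """One integer per category, for tree-based models."""
    cats = sorted(set(values))
    index = {c: i for i, c in enumerate(cats)}
    return cats, [index[v] for v in values]

REFERENCE = (2025, 3)  # assumed reference month (end of Q1 2025)

# Hypothetical records using a few of the schema's column names.
rows = [
    {"monthly_income": 20000, "owns_car": False, "education": "High School", "phone_purchased": (2024, 3)},
    {"monthly_income": 40000, "owns_car": True, "education": "University", "phone_purchased": (2023, 3)},
    {"monthly_income": 60000, "owns_car": True, "education": "High School", "phone_purchased": (2022, 9)},
]

income_z = z_scores([r["monthly_income"] for r in rows])
owns_car = [int(r["owns_car"]) for r in rows]  # booleans mapped to {0, 1}
phone_age = [months_between(r["phone_purchased"], REFERENCE) for r in rows]
edu_levels, edu_one_hot = one_hot([r["education"] for r in rows])
_, edu_codes = integer_codes([r["education"] for r in rows])
```

In a cross-validated setup, the z-score mean and standard deviation would normally be estimated on the training folds only, to avoid leaking information from held-out data.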

3. Statistical Fidelity and Validation

Synthetic marginal distributions are compared against official marginals using Kullback–Leibler divergence and $L_1$ distance per feature. Reported divergences are:

| Feature | $D_{KL}$ | $L_1$ |
|---|---|---|
| age bins | 0.0021 | 0.018 |
| education | 0.0015 | 0.012 |
| employment | 0.0009 | 0.007 |

Across all features, the maximum $L_1$ distance is below 0.02 and the maximum $D_{KL}$ is below 0.003. All chi-squared goodness-of-fit tests are non-significant at $\alpha = 0.05$, indicating that the synthetic profiles are consistent with the population marginals within sampling error. This fidelity, coupled with the applied differential privacy mechanism, supports both analytical validity and the preservation of personal privacy.
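The per-feature checks can be illustrated with a short sketch. The three-category marginal and the synthetic counts below are hypothetical, not taken from the dataset, and the natural logarithm and the unnormalized $L_1$ sum are assumptions, since the source does not state the log base or normalization it uses.

```python
import math

def kl_divergence(p, q):
    """D_KL(p || q) with natural log; assumes q > 0 wherever p > 0."""
    return sum(pi * math.log(pi / qi) for pi, qi in zip(p, q) if pi > 0)

def l1_distance(p, q):
    """Sum of absolute differences between two discrete distributions."""
    return sum(abs(pi - qi) for pi, qi in zip(p, q))

def chi_square_stat(observed_counts, expected_probs):
    """Pearson goodness-of-fit statistic against expected proportions."""
    n = sum(observed_counts)
    return sum((o - n * e) ** 2 / (n * e)
               for o, e in zip(observed_counts, expected_probs))

# Hypothetical three-category marginal and counts from 100,000 synthetic records.
real = [0.25, 0.45, 0.30]
syn_counts = [25300, 44800, 29900]
syn = [c / sum(syn_counts) for c in syn_counts]

d_kl = kl_divergence(real, syn)
l1 = l1_distance(real, syn)
chi2 = chi_square_stat(syn_counts, real)

# With 3 categories there are 2 degrees of freedom; the 5% critical value is 5.991.
non_significant = chi2 < 5.991
```

With 100,000 records even small proportional deviations inflate the chi-squared statistic, so a non-significant result implies a very close match to the target marginal.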

4. Access and Reproducibility

The full dataset (100,000 records, 17 columns) and a modular, open-source Jupyter pipeline are available under an MIT license. The distribution includes:

  • data_generation.ipynb: Orchestrates all generation steps.
  • modeling_pipeline.ipynb: Implements all workflows for CatBoost, LightGBM, and XGBoost.
  • hyperopt_configs/: Contains Bayesian optimization scripts.
  • evaluation/: Provides scripts for metric computation and bootstrap confidence intervals.
  • dice_explanations.ipynb: Supports counterfactual analysis.

Repository: https://github.com/atalaydenknalbant/underbanked_risk_estimation

5. Benchmark Modeling Outcomes

Experiments apply five-fold stratified cross-validation, early stopping (no improvement for 100 rounds), and regularization to prevent overfitting. Three boosting algorithms are benchmarked on the socio-demographic block alone (Demo) and on the full feature set (Full):

| Model | Demo AUC | Demo $F_1$ | Full AUC | Full $F_1$ |
|---|---|---|---|---|
| CatBoost | 0.9520 | 0.8477 | 0.9648 | 0.9479 |
| LightGBM | 0.9513 | 0.8460 | 0.9654 | 0.9494 |
| XGBoost | 0.9420 | 0.8283 | 0.9645 | 0.9491 |

AUC is defined as $AUC = \int_0^1 TPR(FPR^{-1}(t))\,dt$, and balanced $F_1$ as $F_1 = \frac{2 \cdot \mathrm{Precision} \cdot \mathrm{Recall}}{\mathrm{Precision}+\mathrm{Recall}}$.

Including the alternative features (the behavioral block) raises AUC by roughly 1.3 to 2.3 percentage points and $F_1$ by roughly 10 to 12 points, depending on the model (e.g., $\Delta AUC_{(\mathrm{CatBoost})} = 0.9648 - 0.9520 = 0.0128$, $\Delta F_{1,(\mathrm{CatBoost})} = 0.9479 - 0.8477 = 0.1002$). Paired DeLong tests indicate that the AUC improvements are statistically significant ($p<0.001$).
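As a rough illustration of the metrics and gains discussed in this section, the sketch below computes AUC in its equivalent rank form (the probability that a randomly chosen positive is scored above a randomly chosen negative), computes $F_1$ from confusion counts, and recomputes the per-model gains from the reported table values. The toy labels, scores, and 0.5 threshold are hypothetical; only the `results` figures come from the benchmark table.

```python
def auc_rank(labels, scores):
    """AUC as the share of positive/negative pairs ranked correctly (ties count half)."""
    pos = [s for y, s in zip(labels, scores) if y == 1]
    neg = [s for y, s in zip(labels, scores) if y == 0]
    wins = sum((p > n) + 0.5 * (p == n) for p in pos for n in neg)
    return wins / (len(pos) * len(neg))

def f1_score(labels, preds):
    """F1 = 2 * Precision * Recall / (Precision + Recall)."""
    tp = sum(1 for y, p in zip(labels, preds) if y == 1 and p == 1)
    fp = sum(1 for y, p in zip(labels, preds) if y == 0 and p == 1)
    fn = sum(1 for y, p in zip(labels, preds) if y == 1 and p == 0)
    if tp == 0:
        return 0.0
    precision, recall = tp / (tp + fp), tp / (tp + fn)
    return 2 * precision * recall / (precision + recall)

# Hypothetical toy predictions, thresholded at 0.5.
labels = [0, 0, 1, 1, 0, 1]
scores = [0.1, 0.4, 0.35, 0.8, 0.2, 0.9]
preds = [int(s >= 0.5) for s in scores]
toy_auc = auc_rank(labels, scores)
toy_f1 = f1_score(labels, preds)

# Reported benchmark values: (Demo AUC, Demo F1, Full AUC, Full F1).
results = {
    "CatBoost": (0.9520, 0.8477, 0.9648, 0.9479),
    "LightGBM": (0.9513, 0.8460, 0.9654, 0.9494),
    "XGBoost": (0.9420, 0.8283, 0.9645, 0.9491),
}

# Gain from adding the behavioral block: (delta AUC, delta F1) per model.
gains = {
    model: (round(full_auc - demo_auc, 4), round(full_f1 - demo_f1, 4))
    for model, (demo_auc, demo_f1, full_auc, full_f1) in results.items()
}
```

The gains differ by model: XGBoost starts from the weakest demographic-only baseline and therefore shows the largest improvement, while the full-feature scores of the three models are nearly identical.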

6. Application and Significance

The Istanbul 2025 Q1 Synthetic Dataset provides lenders, regulators, and researchers with a transparent evidence base for assessing the risk discrimination attainable with privacy-respecting, non-bureau attributes. Empirical results suggest that a concise alternative data block approaches conventional bureau-level separation power in identifying consumer credit risk, thereby offering practical avenues to extend formal credit access among the underbanked in contexts where bureau histories are limited or unavailable (Denknalbant et al., 14 Dec 2025).

A plausible implication is that this pipeline enables the construction of risk models that are both fair (by population representativeness) and safe (by privacy and economic plausibility), directly supporting regulatory ambitions to address financial exclusion.

7. Limitations and Prospects

The dataset simulates observed marginals and behavior distributions, but the synthetic nature of records and the use of hybrid labeling rules mean that performance metrics may not transfer directly to real operational lending. The constructed delinquency labels embed assumptions about behavioral correlates of risk, potentially limiting generalizability to domains with different economic or cultural contexts. Further work could validate these results on acquisition portfolios or with supplementary alternative data sources.

In summary, the Istanbul 2025 Q1 Synthetic Dataset establishes rigorous precedents for open synthetic data generation, validation, and credit modeling among underbanked populations, and provides a platform for evaluating the effectiveness of alternative data-driven approaches to credit inclusion (Denknalbant et al., 14 Dec 2025).
