Stroke Prediction Dataset (SPD)

Updated 8 December 2025

SPD is a public clinical dataset with 11–12 features and approximately 5,000 anonymized patient records used for binary stroke prediction tasks.
Studies built on it typically apply imputation, outlier removal, and resampling to handle its severe class imbalance: only about 5–8% of records are stroke-positive.
The dataset is widely used to study ensemble models and explainability methods such as LIME and SHAP for stroke risk assessment in clinical ML research.

The Stroke Prediction Dataset (SPD) is a publicly available, tabular clinical dataset used extensively for developing and benchmarking machine learning models to predict stroke risk from structured demographic, lifestyle, and clinical variables. It features multiple releases with slight variations in sample size and preprocessing, but centers on 11–12 columns and between 4,900 and 5,110 anonymized patient records, capturing major non-imaging variables relevant to stroke epidemiology. SPD is distinguished by its pronounced class imbalance (≈5–8% positive stroke cases), motivating advanced resampling and modeling pipelines. Downstream tasks include binary stroke occurrence prediction and, in multimodal settings, integration with imaging or survival data for longitudinal risk or infarct segmentation tasks.

1. Dataset Structure and Composition

The standard version of SPD contains 11 features plus the target, as outlined in several recent studies (Akib et al., 1 Dec 2025, Tashkova et al., 14 May 2025, Islam et al., 18 May 2025). Variables fall into four groups, followed by the target:

Demographics: Age (years), Gender (Male/Female)
Clinical History: Hypertension (binary), Heart Disease (binary), Ever Married (binary)
Lifestyle and Sociodemographics: Work Type (Govt., Private, Self-employed, Children), Residence Type (Urban/Rural), Smoking Status (never/formerly/current/unknown)
Continuous Clinical Labs: Average Glucose Level (mg/dL), BMI (kg/m²)
Target: Stroke occurrence (0 = no, 1 = yes)

Class imbalance is severe: stroke-positive ratios range from 4.87% (249/5110) (Tashkova et al., 14 May 2025) to approximately 8.1% (412/5110) (Akib et al., 1 Dec 2025). No imaging data is included; variants focus on EHR-derived and low-cost tabular risk factors.

A compact representation of SPD's principal features is:

Name	Type	Description
age	Numerical	Age in years
gender	Categorical	Male or Female
hypertension	Binary	Clinical hypertension (0/1)
heart_disease	Binary	Clinical heart disease (0/1)
ever_married	Binary	Ever married (0/1 or yes/no)
work_type	Categorical	Govt., Private, Self-employed, etc.
Residence_type	Categorical	Urban or Rural
avg_glucose_level	Numerical	Plasma glucose (mg/dL)
bmi	Numerical	Body Mass Index (kg/m²)
smoking_status	Categorical	Never, Formerly, Current, Unknown
stroke	Binary	Target: stroke/no stroke

Sample size after preprocessing (e.g., outlier removal via IQR) varies (4,337–5,110 records) (Islam et al., 18 May 2025, Akib et al., 1 Dec 2025).
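The column table above can be mirrored as a typed record for loading. The following is a minimal sketch under stated assumptions: the `StrokeRecord` class, the `parse_row` helper, the `"N/A"` marker for missing BMI, and the two sample rows are all illustrative and not taken from any specific SPD release.

```python
import csv
import io
from dataclasses import dataclass
from typing import Optional


@dataclass
class StrokeRecord:
    """One SPD row; field names mirror the column table above."""
    gender: str
    age: float
    hypertension: int
    heart_disease: int
    ever_married: str
    work_type: str
    residence_type: str
    avg_glucose_level: float
    bmi: Optional[float]  # None when the cell is missing
    smoking_status: str
    stroke: int


def parse_row(row: dict) -> StrokeRecord:
    """Convert one csv.DictReader row (all strings) into a typed record."""
    raw_bmi = row["bmi"].strip()
    # Assumption: a missing BMI appears as an empty cell or the text "N/A".
    bmi = None if raw_bmi in ("", "N/A") else float(raw_bmi)
    return StrokeRecord(
        gender=row["gender"],
        age=float(row["age"]),
        hypertension=int(row["hypertension"]),
        heart_disease=int(row["heart_disease"]),
        ever_married=row["ever_married"],
        work_type=row["work_type"],
        residence_type=row["Residence_type"],
        avg_glucose_level=float(row["avg_glucose_level"]),
        bmi=bmi,
        smoking_status=row["smoking_status"],
        stroke=int(row["stroke"]),
    )


# Two made-up rows for illustration only (not real patient data).
SAMPLE_CSV = """\
gender,age,hypertension,heart_disease,ever_married,work_type,Residence_type,avg_glucose_level,bmi,smoking_status,stroke
Female,61,0,0,Yes,Self-employed,Rural,202.21,N/A,never smoked,1
Male,34,0,0,No,Private,Urban,88.5,24.3,Unknown,0
"""

records = [parse_row(r) for r in csv.DictReader(io.StringIO(SAMPLE_CSV))]
positive_rate = sum(r.stroke for r in records) / len(records)
print(f"{len(records)} records, stroke-positive rate {positive_rate:.2%}")
```

Keeping `bmi` optional makes the missing-value decision explicit at load time, so the imputation step can be chosen later rather than baked into parsing.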

2. Data Preprocessing and Quality Control

Preprocessing encompasses several stages:

Missing Value Imputation: Numerical variables use mean or median imputation; categorical/binary variables use mode. Some studies employ KNN or iterative imputation for complex dependencies (Akib et al., 1 Dec 2025, Tashkova et al., 14 May 2025). Notably, some releases report zero missing data in cleaned tables, while others impute BMI (3.93% missing) via Random Forest regression (Tashkova et al., 14 May 2025, Islam et al., 18 May 2025).
Outlier Removal: Standard IQR-based filtering excludes values more than 1.5×IQR below the first quartile or above the third quartile; for example, avg_glucose_level outliers (12%) and BMI outliers (0.8%) are removed, reducing N by ~13% (Islam et al., 18 May 2025).
Encoding and Scaling: Categorical variables (work type, smoking status) are one-hot or label-encoded; continuous features undergo standardization (zero mean, unit variance) (Akib et al., 1 Dec 2025, Tashkova et al., 14 May 2025). Smoking status may be ordinal encoded according to semantic ordering (Tashkova et al., 14 May 2025).
Class Imbalance Correction: Multiple resampling strategies are used, including Random Over-Sampling (ROS), Synthetic Minority Over-Sampling Technique (SMOTE, with k=5 neighbors), and Random Undersampling (RUS). Some pipelines combine these with cost-sensitive learning by reweighting model losses (Akib et al., 1 Dec 2025, Tashkova et al., 14 May 2025, Islam et al., 18 May 2025).
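The imputation, IQR filtering, and random over-sampling stages above can be sketched in a few lines of standard-library Python. This is an illustration under assumptions, not the pipeline of any cited study: the helper names, the toy glucose values, and the choice of the "inclusive" quantile method are all hypothetical.

```python
import random
import statistics


def impute_median(values):
    """Replace None entries with the median of the observed values."""
    observed = [v for v in values if v is not None]
    med = statistics.median(observed)
    return [med if v is None else v for v in values]


def iqr_bounds(values, k=1.5):
    """Return the fences (Q1 - k*IQR, Q3 + k*IQR)."""
    q1, _, q3 = statistics.quantiles(values, n=4, method="inclusive")
    iqr = q3 - q1
    return q1 - k * iqr, q3 + k * iqr


def random_oversample(rows, labels, seed=0):
    """Duplicate minority-class rows at random until both classes match in size."""
    rng = random.Random(seed)
    pos = [i for i, y in enumerate(labels) if y == 1]
    neg = [i for i, y in enumerate(labels) if y == 0]
    minority, majority = (pos, neg) if len(pos) < len(neg) else (neg, pos)
    extra = [rng.choice(minority) for _ in range(len(majority) - len(minority))]
    keep = list(range(len(rows))) + extra
    return [rows[i] for i in keep], [labels[i] for i in keep]


# Toy data (made up): one missing reading and one extreme reading.
glucose = [85.0, 92.0, None, 101.0, 88.0, 95.0, 300.0, 90.0]
stroke = [0, 0, 0, 0, 0, 0, 1, 1]

glucose = impute_median(glucose)
lower, upper = iqr_bounds(glucose)
mask = [lower <= g <= upper for g in glucose]
glucose_f = [g for g, m in zip(glucose, mask) if m]
stroke_f = [y for y, m in zip(stroke, mask) if m]

x_bal, y_bal = random_oversample(glucose_f, stroke_f)
print(len(glucose_f), "rows kept;", sum(y_bal), "of", len(y_bal), "positive after ROS")
```

In this toy run the IQR filter discards a stroke-positive row, a reminder that outlier removal on a feature such as glucose can thin an already small minority class before resampling.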

3. Feature Selection, Exploratory Data Analysis, and Importance Estimation

Feature selection leverages both correlation-based and model-based methods:

Correlation Analysis: Point-biserial coefficients identify features most strongly associated with stroke (e.g., age, hypertension, ever_married, heart_disease) (Islam et al., 18 May 2025). Features with $|r_{pb}| \ge 0.02$ are typically retained.
Gini Importance: Random Forest Gini impurity decrease identifies the most predictive variables; avg_glucose_level, bmi, and age consistently rank highest, with secondary contributions from smoking_status, work_type, and gender (Islam et al., 18 May 2025).
Dimensionality Reduction: Variance-thresholding or PCA mitigates multicollinearity (Akib et al., 1 Dec 2025).
Global Rankings: Ensemble interpretability (via LIME, SHAP, or tree gain scores) converges on age, glucose, and hypertension as the dominant predictors (Akib et al., 1 Dec 2025, Tashkova et al., 14 May 2025).
EDA Outputs: Among SPD's numerical features, only glucose shows moderate skew and kurtosis (skew ≈ +1.59); both are negligible for age and BMI. Categorical frequencies reflect population-level distributions (e.g., gender ≈58% female, ever_married ≈66% yes, work_type ≈57% private sector) (Islam et al., 18 May 2025).
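The correlation-based screening described above can be illustrated with a direct implementation of the point-biserial coefficient, $r_{pb} = \frac{M_1 - M_0}{s_n}\sqrt{\frac{n_1 n_0}{n^2}}$, where $s_n$ is the population standard deviation. The sketch below assumes a binary 0/1 target; the function name, the toy columns, and the `noise` feature are hypothetical, and only the 0.02 cutoff comes from the text.

```python
import math


def point_biserial(x, y):
    """Point-biserial correlation between continuous x and binary y (0/1)."""
    n = len(x)
    x1 = [xi for xi, yi in zip(x, y) if yi == 1]
    x0 = [xi for xi, yi in zip(x, y) if yi == 0]
    n1, n0 = len(x1), len(x0)
    mean_all = sum(x) / n
    # Population standard deviation (divide by n), as the formula requires.
    s_n = math.sqrt(sum((xi - mean_all) ** 2 for xi in x) / n)
    m1, m0 = sum(x1) / n1, sum(x0) / n0
    return (m1 - m0) / s_n * math.sqrt(n1 * n0 / n ** 2)


# Toy columns (made up). "noise" has equal group means, so its r_pb is 0.
stroke = [0, 0, 0, 0, 1, 0, 1, 1]
features = {
    "age": [25, 40, 52, 61, 67, 73, 79, 81],
    "noise": [1, 2, 3, 2, 2, 2, 1, 3],
}

THRESHOLD = 0.02  # retention cutoff quoted in the text
retained = [
    name for name, col in features.items()
    if abs(point_biserial(col, stroke)) >= THRESHOLD
]
print(retained)
```

The point-biserial coefficient is numerically identical to Pearson's correlation computed with the binary target coded as 0/1, so a general-purpose correlation routine gives the same ranking.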

4. Modeling Protocols and Evaluation

Ensemble and hybrid architectures are central due to class imbalance and feature redundancy:

Base Models: Random Forest (RF), ExtraTrees (ET), XGBoost (XGB), Logistic Regression (LR), LightGBM, SVC, and multilayer neural networks are optimized with stratified 5-fold cross-validation (Akib et al., 1 Dec 2025, Islam et al., 18 May 2025).
Ensembling:
- Soft Voting: Probabilistic outputs from RF, ET, and XGB are weighted by cross-validation rank; final probabilities are computed as $p_\text{ensemble}(c|x) = \sum_{m} w_m\,p_m(c|x)$, where $w_m$ is the weight of model $m$ and $p_m(c|x)$ is its predicted probability of class $c$, with decision thresholds optimized for F1-score (Akib et al., 1 Dec 2025).
- Stacking: Diverse base models (e.g., RF, XGB, SVC) are combined via logistic regression meta-learner (Islam et al., 18 May 2025).
Resampling: ROS or SMOTE is applied to the training data within each CV fold, while train/test splits are stratified to preserve the original class ratio (Akib et al., 1 Dec 2025, Tashkova et al., 14 May 2025, Islam et al., 18 May 2025).
Performance Metrics: Accuracy, Precision, Recall, F1-score, and ROC-AUC are universally applied. For segmentation tasks (ISLES'24), Dice coefficient, absolute volume difference, lesion-wise F1, and absolute lesion count difference are standard (Rosa et al., 20 Aug 2024).
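The soft-voting rule and F1-based threshold selection described above can be sketched as follows. This is an illustration under assumptions: the probabilities are made up, the weights `[3, 2, 1]` are one hypothetical way to turn cross-validation rank into weights (the source does not specify the mapping), and the threshold grid is arbitrary.

```python
def soft_vote(prob_lists, weights):
    """Weighted average of per-model positive-class probabilities."""
    total = sum(weights)
    norm = [w / total for w in weights]  # normalize so the weights sum to 1
    n = len(prob_lists[0])
    return [sum(w * probs[i] for w, probs in zip(norm, prob_lists)) for i in range(n)]


def f1_at(threshold, probs, labels):
    """F1-score of the positive class when predicting 1 for p >= threshold."""
    preds = [1 if p >= threshold else 0 for p in probs]
    tp = sum(1 for p, y in zip(preds, labels) if p == 1 and y == 1)
    fp = sum(1 for p, y in zip(preds, labels) if p == 1 and y == 0)
    fn = sum(1 for p, y in zip(preds, labels) if p == 0 and y == 1)
    if tp == 0:
        return 0.0
    precision = tp / (tp + fp)
    recall = tp / (tp + fn)
    return 2 * precision * recall / (precision + recall)


def best_threshold(probs, labels, grid=None):
    """Return the first grid threshold that maximizes F1."""
    grid = grid or [i / 100 for i in range(5, 96, 5)]
    return max(grid, key=lambda t: f1_at(t, probs, labels))


# Made-up validation labels and positive-class probabilities from three models.
labels = [0, 0, 0, 1, 0, 1, 0, 0]
p_rf = [0.10, 0.20, 0.05, 0.70, 0.32, 0.40, 0.15, 0.10]
p_et = [0.15, 0.10, 0.10, 0.60, 0.35, 0.45, 0.20, 0.05]
p_xgb = [0.05, 0.25, 0.10, 0.80, 0.20, 0.35, 0.10, 0.15]

p_ens = soft_vote([p_rf, p_et, p_xgb], weights=[3, 2, 1])
threshold = best_threshold(p_ens, labels)
print(f"threshold={threshold:.2f}, F1={f1_at(threshold, p_ens, labels):.2f}")
```

With a positive rate near 5–8%, the F1-optimal threshold often falls well below 0.5, which is why it is tuned rather than fixed. To avoid optimistic estimates, the threshold would normally be chosen on validation folds and then applied unchanged to held-out data.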

Table: Representative Ensemble Results on SPD (Akib et al., 1 Dec 2025, Tashkova et al., 14 May 2025, Islam et al., 18 May 2025)

Model	Accuracy (%)	Precision (%)	Recall (%)	F1-score (%)	AUC (%)
RF + ET + XGB Ensemble	99.09	98.22	100.00	99.10	100.00
Random Forest	98.51–99.02	98.53–99.7	99.7–100	98.53–99.7	100
XGBoost	91.87–97.63	~95.4	~87.5–97.7	~90.1–97.7	99.91
Hybrid Meta-Learner	97.2	—	—	97.15	—
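As a consistency check on the first row of the table, the F1-score is the harmonic mean of precision $P$ and recall $R$. Using the reported ensemble values $P = 0.9822$ and $R = 1.0000$:

$$F_1 = \frac{2PR}{P + R} = \frac{2 \times 0.9822 \times 1.0000}{0.9822 + 1.0000} = \frac{1.9644}{1.9822} \approx 0.9910$$

This agrees with the reported F1-score of 99.10%.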

5. Explainability and Variable Attribution

Interpretability is addressed via local and global explainability techniques:

LIME: For each test instance, ~5,000 perturbed samples are generated; a sparse linear surrogate is fit to derive local feature importance. Aggregated absolute LIME weights yield global feature rankings: age (≈0.47), hypertension (≈0.36), glucose (≈0.21) (Akib et al., 1 Dec 2025).
SHAP: Tree ensemble SHAP values identify similar top predictors, with gain-based measures highlighting age, glucose, and BMI (Tashkova et al., 14 May 2025).
Partial Dependence: Predicted risk increases for patients above age 50–55, by roughly 12–15% per decade after 50 (Akib et al., 1 Dec 2025). Hypertension raises the local log-odds by ~0.30–0.35, and glucose above 140 mg/dL adds ~0.15–0.20 to the predicted risk score.
Clinical Context: The prominence of age, hypertension, and glucose is consistent with established stroke epidemiology; hypertension and glycemic control are modifiable and are therefore natural targets for early intervention.
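To make the log-odds figures above more concrete, a shift in log-odds can be converted into a change in predicted probability. The sketch below assumes a hypothetical baseline risk of 5% (chosen near SPD's positive rate, not taken from any cited model) and applies the hypertension shifts of 0.30 and 0.35 quoted in the text; the function names are illustrative.

```python
import math


def logit(p):
    """Log-odds of a probability p in (0, 1)."""
    return math.log(p / (1 - p))


def sigmoid(z):
    """Inverse of logit: map log-odds back to a probability."""
    return 1 / (1 + math.exp(-z))


def shift_probability(p_base, delta_log_odds):
    """Probability after adding delta_log_odds to the baseline log-odds."""
    return sigmoid(logit(p_base) + delta_log_odds)


baseline = 0.05  # hypothetical baseline risk
for delta in (0.30, 0.35):
    p_new = shift_probability(baseline, delta)
    print(f"+{delta:.2f} log-odds: {baseline:.4f} -> {p_new:.4f}")
```

Under this assumed baseline, a +0.30 log-odds shift moves the predicted probability from 5% to roughly 6.6%. The same shift produces a larger absolute change at higher baselines, which is why log-odds contributions should not be read as fixed percentage-point increases.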

6. Access, Benchmark Cohorts, and Research Extensions

Dataset Access: SPD is available at https://www.kaggle.com/datasets/zzettrkalpakbal/full-filled-brain-stroke-dataset, requiring only a free account (Islam et al., 18 May 2025).
Benchmarking: Variants of SPD are integrated in challenges such as ISLES'24 for multimodal final infarct prediction, where tabular SPD data complements imaging (CT, MR) for supervised infarct segmentation (Rosa et al., 20 Aug 2024).
Applications and Limitations:
- Applications: SPD supports development of EHR-integrated alert systems, clinical dashboards, and simulation of intervention strategies (Akib et al., 1 Dec 2025, Islam et al., 18 May 2025). Models exhibit near-perfect discrimination within SPD but remain to be externally validated.
- Limitations: Cited limitations include the absence of imaging and some clinical labs, overfitting risk from over-sampling, and reduced real-world generalizability. SPD also under-represents certain stroke subtypes (e.g., posterior circulation and pediatric stroke) (Rosa et al., 20 Aug 2024).

7. Prospects and Future Directions

Current research trajectories target:

Acquisition of multicenter, heterogeneous cohorts for external validation and recalibration of SPD-trained models (Akib et al., 1 Dec 2025).
Integration of deep or unsupervised feature extraction (e.g., autoencoder embeddings) for improved nonlinear pattern capture.
Cloud deployment with accelerated XAI (e.g., precomputed LIME) to reduce inference latency.
Transition to longitudinal modeling (e.g., survival forests, Cox proportional-hazards models), extending SPD's utility to time-to-event tasks with interpretable risk trajectories.
Stacking ensembles, age-stratified models, and cost-sensitive learning are recommended to enhance sensitivity, reduce bias against the minority class, and facilitate clinical translation (Tashkova et al., 14 May 2025).
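Cost-sensitive learning, mentioned above, is often implemented by weighting each class inversely to its frequency. The sketch below uses the common "balanced" heuristic $w_c = n / (k \cdot n_c)$, with $n$ samples, $k$ classes, and $n_c$ samples in class $c$ (the same rule behind scikit-learn's `class_weight="balanced"`). The function name is hypothetical, the label vector is synthetic and built only from the 249/5110 ratio cited earlier, and the cited studies may use a different weighting.

```python
from collections import Counter


def balanced_class_weights(labels):
    """Weight each class by n / (k * n_c), so rarer classes weigh more."""
    n = len(labels)
    counts = Counter(labels)
    k = len(counts)
    return {c: n / (k * n_c) for c, n_c in counts.items()}


# Synthetic labels matching the 249 / 5110 positive ratio reported for SPD.
labels = [1] * 249 + [0] * (5110 - 249)
weights = balanced_class_weights(labels)
print({c: round(w, 3) for c, w in sorted(weights.items())})
```

Under this rule each class contributes the same total weight to the loss, so a misclassified stroke case costs roughly twenty times more than a misclassified non-stroke case at this ratio, without duplicating or synthesizing any records.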

Collectively, SPD remains a widely used, openly accessible foundation for stroke risk modeling in clinical ML research, with continued relevance as a benchmark for algorithmic innovation and reproducibility.