
Stroke Prediction Dataset (SPD)

Updated 8 December 2025
  • SPD is a public clinical dataset with 11–12 features and approximately 5,000 anonymized patient records used for binary stroke prediction tasks.
  • Studies using SPD apply preprocessing such as imputation, outlier removal, and resampling to address its severe class imbalance (only about 5–8% of records are stroke-positive).
  • The dataset underpins ensemble modeling and explainability methods like LIME and SHAP, facilitating robust stroke risk assessment in clinical ML research.

The Stroke Prediction Dataset (SPD) is a publicly available, tabular clinical dataset used extensively for developing and benchmarking machine learning models to predict stroke risk from structured demographic, lifestyle, and clinical variables. It features multiple releases with slight variations in sample size and preprocessing, but centers on 11–12 columns and between 4,900 and 5,110 anonymized patient records, capturing major non-imaging variables relevant to stroke epidemiology. SPD is distinguished by its pronounced class imbalance (≈5–8% positive stroke cases), motivating advanced resampling and modeling pipelines. Downstream tasks include binary stroke occurrence prediction and, in multimodal settings, integration with imaging or survival data for longitudinal risk or infarct segmentation tasks.

1. Dataset Structure and Composition

The standard version of SPD contains 11 features plus target, as outlined in several recent studies (Akib et al., 1 Dec 2025, Tashkova et al., 14 May 2025, Islam et al., 18 May 2025). Variables fall into four groups, plus the target:

  • Demographics: Age (years), Gender (Male/Female)
  • Clinical History: Hypertension (binary), Heart Disease (binary), Ever Married (binary)
  • Lifestyle and Sociodemographics: Work Type (Govt., Private, Self-employed, Children), Residence Type (Urban/Rural), Smoking Status (formerly/current/never)
  • Continuous Clinical Labs: Average Glucose Level (mg/dL), BMI (kg/m²)
  • Target: Stroke occurrence (0 = no, 1 = yes)

Class imbalance is severe: stroke-positive ratios range from 4.87% (249/5110) (Tashkova et al., 14 May 2025) to approximately 8.1% (412/5110) (Akib et al., 1 Dec 2025). No imaging data is included; variants focus on EHR-derived and low-cost tabular risk factors.
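
The arithmetic behind these ratios, and why they matter for evaluation, can be sketched as follows. The function names are illustrative assumptions; only the counts (249, 412, and 5110) come from the text above.

```python
def positive_rate(n_positive: int, n_total: int) -> float:
    """Fraction of records labelled stroke = 1."""
    return n_positive / n_total


def majority_baseline_accuracy(n_positive: int, n_total: int) -> float:
    """Accuracy of a trivial model that always predicts the majority class."""
    rate = positive_rate(n_positive, n_total)
    return max(rate, 1.0 - rate)


# Counts quoted in the text; the labels are only used for printing.
for label, n_pos in [("Tashkova et al.", 249), ("Akib et al.", 412)]:
    rate = positive_rate(n_pos, 5110)
    base = majority_baseline_accuracy(n_pos, 5110)
    print(f"{label}: positive rate {rate:.2%}, always-negative accuracy {base:.2%}")
```

With the 249/5110 split, a model that never predicts stroke already scores about 95% accuracy, which is why recall, F1-score, and ROC-AUC are more informative than accuracy on this dataset.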

A compact representation of SPD's principal features is:

| Name | Type | Description |
|---|---|---|
| age | Numerical | Age in years |
| gender | Categorical | Male or Female |
| hypertension | Binary | Clinical hypertension (0/1) |
| heart_disease | Binary | Clinical heart disease (0/1) |
| ever_married | Binary | Ever married (0/1 or yes/no) |
| work_type | Categorical | Govt., Private, Self-employed, etc. |
| Residence_type | Categorical | Urban or Rural |
| avg_glucose_level | Numerical | Plasma glucose (mg/dL) |
| bmi | Numerical | Body Mass Index (kg/m²) |
| smoking_status | Categorical | Never, Formerly, Current, Unknown |
| stroke | Binary | Target: stroke/no stroke |

Sample size after preprocessing (e.g., outlier removal via IQR) varies (4,337–5,110 records) (Islam et al., 18 May 2025, Akib et al., 1 Dec 2025).
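
A minimal sketch of IQR-based outlier removal is shown below, assuming the common 1.5 × IQR fences; the cited studies may use a different multiplier or quantile method. The helper names and the toy BMI values are hypothetical and are not taken from SPD.

```python
from statistics import quantiles


def iqr_bounds(values, k=1.5):
    """Return the (lower, upper) fences Q1 - k*IQR and Q3 + k*IQR."""
    q1, _, q3 = quantiles(values, n=4, method="inclusive")
    iqr = q3 - q1
    return q1 - k * iqr, q3 + k * iqr


def drop_outliers(records, column, k=1.5):
    """Keep only records whose `column` value lies within the IQR fences."""
    lower, upper = iqr_bounds([r[column] for r in records], k)
    return [r for r in records if lower <= r[column] <= upper]


# Toy records with one implausible BMI value (made-up data).
records = [{"bmi": v} for v in
           [22.0, 24.5, 26.1, 27.3, 28.0, 29.4, 30.2, 31.8, 33.5, 97.6]]
cleaned = drop_outliers(records, "bmi")
print(len(records), "->", len(cleaned))
```

Applying such a filter to several columns in sequence is one way a 5,110-record release can shrink to a smaller working sample.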

2. Data Preprocessing and Quality Control

Preprocessing encompasses several stages:

3. Feature Selection, Exploratory Data Analysis, and Importance Estimation

Feature selection leverages both correlation-based and model-based methods:

  • Correlation Analysis: Point-biserial coefficients identify features most strongly associated with stroke (e.g., age, hypertension, ever_married, heart_disease) (Islam et al., 18 May 2025). Features with $|r_{pb}| \geq 0.02$ are typically retained.
  • Gini Importance: Random Forest Gini impurity decrease identifies the most predictive variables; avg_glucose_level, bmi, and age consistently rank highest, with secondary contributions from smoking_status, work_type, and gender (Islam et al., 18 May 2025).
  • Dimensionality Reduction: Variance-thresholding or PCA mitigates multicollinearity (Akib et al., 1 Dec 2025).
  • Global Rankings: Ensemble interpretability (via LIME, SHAP, or tree gain scores) converges on age, glucose, and hypertension as the dominant predictors (Akib et al., 1 Dec 2025, Tashkova et al., 14 May 2025).
  • EDA Outputs: Among the numerical features, only glucose shows moderate skew and kurtosis (skew ≈ +1.59); both are negligible for age and BMI. Categorical frequencies reflect population-level distributions (e.g., gender ≈58% female, ever_married ≈66% yes, work_type ≈57% private sector) (Islam et al., 18 May 2025).
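
The correlation-based filter can be sketched as follows, using the textbook point-biserial formula r_pb = (M1 − M0) / s · sqrt(n1·n0 / n²), where s is the population standard deviation of the feature. The function names and toy data are assumptions for illustration; the 0.02 threshold is the one quoted above.

```python
from math import sqrt
from statistics import mean, pstdev


def point_biserial(x, y):
    """Point-biserial correlation between continuous x and binary y (0/1)."""
    ones = [xi for xi, yi in zip(x, y) if yi == 1]
    zeros = [xi for xi, yi in zip(x, y) if yi == 0]
    spread = pstdev(x)  # population standard deviation of the feature
    if not ones or not zeros or spread == 0:
        return 0.0  # correlation undefined; treat the feature as uninformative
    n = len(x)
    return (mean(ones) - mean(zeros)) / spread * sqrt(len(ones) * len(zeros) / n**2)


def select_features(columns, target, threshold=0.02):
    """Keep features whose |r_pb| meets the threshold."""
    scores = {name: point_biserial(vals, target) for name, vals in columns.items()}
    return {name: r for name, r in scores.items() if abs(r) >= threshold}


# Toy data (made up): one informative column and one constant column.
target = [0, 0, 0, 0, 1, 1]
columns = {
    "age": [25, 40, 33, 51, 68, 74],
    "constant": [1, 1, 1, 1, 1, 1],
}
print(select_features(columns, target))
```

The point-biserial coefficient is numerically identical to the Pearson correlation between the feature and the 0/1 target, so a general-purpose correlation routine gives the same ranking.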

4. Modeling Protocols and Evaluation

Ensemble and hybrid architectures are central due to class imbalance and feature redundancy:

  • Base Models: Random Forest (RF), ExtraTrees (ET), XGBoost (XGB), Logistic Regression (LR), LightGBM, SVC, and multilayer neural networks are optimized with stratified 5-fold cross-validation (Akib et al., 1 Dec 2025, Islam et al., 18 May 2025).
  • Ensembling:
    • Soft Voting: Probabilistic outputs from RF, ET, and XGB are weighted by cross-validation rank; final probabilities are computed as $p_\text{ensemble}(c|x) = \sum_{m} w_m\,p_m(c|x)$, with thresholds optimized for F1-score (Akib et al., 1 Dec 2025).
    • Stacking: Diverse base models (e.g., RF, XGB, SVC) are combined via logistic regression meta-learner (Islam et al., 18 May 2025).
  • Resampling: Models are trained on data balanced by random oversampling (ROS) or SMOTE within each CV fold, and train/test splits are stratified to preserve the original class ratios (Akib et al., 1 Dec 2025, Tashkova et al., 14 May 2025, Islam et al., 18 May 2025).
  • Performance Metrics: Accuracy, Precision, Recall, F1-score, and ROC-AUC are universally applied. For segmentation tasks (ISLES'24), Dice coefficient, absolute volume difference, lesion-wise F1, and absolute lesion count difference are standard (Rosa et al., 20 Aug 2024).
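
The resampling, weighted soft-voting, and F1 threshold steps above can be sketched in plain Python. This is a sketch under stated assumptions, not the cited authors' code: the function names, the rank-style weights (3, 2, 1), the threshold grid, and all probabilities are hypothetical, and random oversampling stands in for SMOTE.

```python
import random


def random_oversample(rows, label_key="stroke", seed=0):
    """Random oversampling (ROS): duplicate minority rows until classes match."""
    rng = random.Random(seed)
    pos = [r for r in rows if r[label_key] == 1]
    neg = [r for r in rows if r[label_key] == 0]
    minority, majority = (pos, neg) if len(pos) <= len(neg) else (neg, pos)
    if not minority:
        return list(rows)  # nothing to duplicate
    extra = [rng.choice(minority) for _ in range(len(majority) - len(minority))]
    return list(rows) + extra


def soft_vote(probs_by_model, weights):
    """p_ensemble(c|x) = sum_m w_m * p_m(c|x), with weights normalised to sum to 1."""
    total = sum(weights[m] for m in probs_by_model)
    n = len(next(iter(probs_by_model.values())))
    return [sum(weights[m] * probs_by_model[m][i] for m in probs_by_model) / total
            for i in range(n)]


def f1_score(y_true, y_pred):
    """F1 = 2*TP / (2*TP + FP + FN); returns 0.0 when undefined."""
    tp = sum(1 for t, p in zip(y_true, y_pred) if t == 1 and p == 1)
    fp = sum(1 for t, p in zip(y_true, y_pred) if t == 0 and p == 1)
    fn = sum(1 for t, p in zip(y_true, y_pred) if t == 1 and p == 0)
    denom = 2 * tp + fp + fn
    return 2 * tp / denom if denom else 0.0


def best_f1_threshold(y_true, probs, grid=None):
    """Scan a threshold grid and return (threshold, F1) for the first best score."""
    grid = grid or [i / 100 for i in range(5, 100, 5)]
    best_t, best_f1 = grid[0], -1.0
    for t in grid:
        score = f1_score(y_true, [1 if p >= t else 0 for p in probs])
        if score > best_f1:
            best_t, best_f1 = t, score
    return best_t, best_f1


# Balance the training fold only; the validation fold keeps its original ratio.
train_fold = [{"stroke": 0}] * 19 + [{"stroke": 1}]
balanced = random_oversample(train_fold)

# Hypothetical positive-class probabilities from three fitted models.
y_val = [0, 0, 1, 1]
probs = {"rf": [0.10, 0.40, 0.35, 0.80],
         "et": [0.20, 0.30, 0.45, 0.70],
         "xgb": [0.05, 0.50, 0.30, 0.90]}
ensemble = soft_vote(probs, {"rf": 3, "et": 2, "xgb": 1})
print(best_f1_threshold(y_val, ensemble))
```

Oversampling only the training portion of each fold keeps duplicated minority records out of the validation data, which would otherwise inflate the reported metrics.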

Table: Representative Ensemble Results on SPD (Akib et al., 1 Dec 2025, Tashkova et al., 14 May 2025, Islam et al., 18 May 2025)

Model Accuracy (%) Precision (%) Recall (%) F1-score (%) AUC (%)
RF + ET + XGB Ensemble 99.09 98.22 100.00 99.10 100.00
Random Forest 98.51–99.02 98.53–99.7 99.7–100 98.53–99.7 100
XGBoost 91.87–97.63 ~95.4 ~87.5–97.7 ~90.1–97.7 99.91
Hybrid Meta-Learner 97.2 97.15

5. Explainability and Variable Attribution

Interpretability is addressed via local and global explainability techniques:

  • LIME: For each test instance, ~5,000 perturbed samples are generated; a sparse linear surrogate is fit to derive local feature importance. Aggregated absolute LIME weights yield global feature rankings: age (≈0.47), hypertension (≈0.36), glucose (≈0.21) (Akib et al., 1 Dec 2025).
  • SHAP: Tree ensemble SHAP values identify similar top predictors, with gain-based measures highlighting age, glucose, and BMI (Tashkova et al., 14 May 2025).
  • Partial Dependence: Beyond age 50–55, the predicted risk score rises by roughly 12–15% per decade (Akib et al., 1 Dec 2025). Hypertension increases local log-odds by ~0.30–0.35, and glucose above 140 mg/dL adds ~0.15–0.20 to predicted risk.
  • Clinical Context: Age, hypertension, and glycemic control are robustly justified as primary targets for early intervention.
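
To make the log-odds figures concrete, the sketch below converts a log-odds increment into an odds ratio and a shifted probability. The 5% baseline risk is an assumption chosen to be close to the dataset's stroke prevalence, and `shift_risk` is a hypothetical helper, not part of any cited pipeline.

```python
from math import exp, log


def shift_risk(baseline_risk: float, delta_log_odds: float) -> float:
    """Apply a log-odds increment to a baseline probability."""
    logit = log(baseline_risk / (1.0 - baseline_risk)) + delta_log_odds
    return 1.0 / (1.0 + exp(-logit))


baseline = 0.05  # assumed baseline risk, roughly the SPD positive rate
for delta in (0.30, 0.35):
    print(f"+{delta:.2f} log-odds: odds ratio {exp(delta):.2f}, "
          f"risk {baseline:.1%} -> {shift_risk(baseline, delta):.1%}")
```

Under this assumption, a +0.30 to +0.35 log-odds contribution corresponds to an odds ratio of about 1.35 to 1.42 and moves a 5% baseline risk to just under 7%.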

6. Access, Benchmark Cohorts, and Research Extensions

7. Prospects and Future Directions

Current research trajectories target:

  • Acquisition of multicenter, heterogeneous cohorts for external validation and recalibration of SPD-trained models (Akib et al., 1 Dec 2025).
  • Integration of deep or unsupervised feature extraction (e.g., autoencoder embeddings) for improved nonlinear pattern capture.
  • Cloud deployment with accelerated XAI (e.g., precomputed LIME) to reduce inference latency.
  • Transition to longitudinal modeling (e.g., survival forests, Cox proportional-hazards models), extending SPD's utility to time-to-event tasks with interpretable risk trajectories.
  • Adoption of stacking ensembles, age-stratified models, and cost-sensitive learning to enhance sensitivity, reduce minority class bias, and facilitate clinical translation (Tashkova et al., 14 May 2025).
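
As one illustration of cost-sensitive learning, the sketch below computes inverse-frequency class weights with the formula w_c = n_samples / (n_classes · n_c), the same heuristic scikit-learn applies for `class_weight="balanced"`. The cited work does not specify this scheme, so treat it as an assumed example; the counts are the 249/5110 split quoted earlier.

```python
def balanced_class_weights(class_counts):
    """w_c = n_samples / (n_classes * n_c): rarer classes get larger weights."""
    n_samples = sum(class_counts.values())
    n_classes = len(class_counts)
    return {c: n_samples / (n_classes * n) for c, n in class_counts.items()}


# 249 stroke-positive records out of 5,110.
weights = balanced_class_weights({0: 5110 - 249, 1: 249})
print({c: round(w, 3) for c, w in weights.items()})
```

With these weights, misclassifying a stroke case costs roughly twenty times as much as misclassifying a non-stroke case, which pushes a classifier toward higher sensitivity.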

Collectively, SPD remains an indispensable, openly accessible foundation for stroke risk modeling in clinical ML research, with continued relevance as a benchmark for algorithmic innovation and reproducibility.
