Comparative Analysis of Stroke Prediction Models Using Machine Learning (2505.09812v1)

Published 14 May 2025 in cs.LG

Abstract: Stroke remains one of the most critical global health challenges, ranking as the second leading cause of death and the third leading cause of disability worldwide. This study explores the effectiveness of machine learning algorithms in predicting stroke risk using demographic, clinical, and lifestyle data from the Stroke Prediction Dataset. By addressing key methodological challenges such as class imbalance and missing data, we evaluated the performance of multiple models, including Logistic Regression, Random Forest, and XGBoost. Our results demonstrate that while these models achieve high accuracy, sensitivity remains a limiting factor for real-world clinical applications. In addition, we identify the most influential predictive features and propose strategies to improve machine learning-based stroke prediction. These findings contribute to the development of more reliable and interpretable models for the early assessment of stroke risk.

Summary

The paper "Comparative Analysis of Stroke Prediction Models Using Machine Learning" evaluates machine learning (ML) algorithms for forecasting stroke risk from demographic, clinical, and lifestyle data. The dataset is heavily imbalanced, with few recorded stroke events, and the paper addresses this and other methodological challenges that affect model accuracy and sensitivity in clinical applications.

Evaluation of Machine Learning Models

The authors systematically evaluate five machine learning models: Logistic Regression, Random Forest, Decision Tree, Support Vector Machine (SVM), and Extreme Gradient Boosting (XGBoost). The dataset comprises 5,110 records with 12 attributes. Class imbalance is a notable challenge, as only 4.87% of the records are positive stroke cases. To improve sensitivity without sacrificing accuracy, the authors apply three resampling techniques: oversampling, undersampling, and SMOTE (Synthetic Minority Over-sampling Technique).
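The three resampling strategies named above can be illustrated with a minimal, library-free sketch. This is not the authors' code: the toy two-feature data, the function names, and the neighbour count `k` are assumptions made for illustration, and `smote_like` is a simplified version of the SMOTE interpolation idea rather than a full implementation.

```python
import random

def random_oversample(pos, neg, rng):
    """Duplicate randomly chosen positives until both classes have equal size."""
    extra = [rng.choice(pos) for _ in range(len(neg) - len(pos))]
    return pos + extra, neg

def random_undersample(pos, neg, rng):
    """Keep a random subset of negatives the same size as the positive class."""
    return pos, rng.sample(neg, len(pos))

def smote_like(pos, neg, rng, k=3):
    """Add synthetic positives by interpolating between a positive and a near positive neighbour."""
    synthetic = []
    while len(pos) + len(synthetic) < len(neg):
        x = rng.choice(pos)
        # k nearest positive neighbours of x (excluding x itself), by squared distance
        neighbours = sorted(
            (p for p in pos if p is not x),
            key=lambda p: sum((a - b) ** 2 for a, b in zip(x, p)),
        )[:k]
        n = rng.choice(neighbours)
        u = rng.random()  # position along the segment from x to n
        synthetic.append(tuple(a + u * (b - a) for a, b in zip(x, n)))
    return pos + synthetic, neg

# Toy data (hypothetical): 95 negatives and 5 positives with two numeric features.
rng = random.Random(0)
neg = [(rng.uniform(20, 60), rng.uniform(70, 120)) for _ in range(95)]
pos = [(rng.uniform(60, 80), rng.uniform(100, 250)) for _ in range(5)]

over_pos, over_neg = random_oversample(pos, neg, rng)
under_pos, under_neg = random_undersample(pos, neg, rng)
smote_pos, smote_neg = smote_like(pos, neg, rng)
print(len(over_pos), len(under_neg), len(smote_pos))  # 95 5 95
```

Oversampling and SMOTE keep every original record, while undersampling discards most of the majority class, which is one plausible reason for the accuracy drop reported for undersampling.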

Data Handling and Preprocessing

The dataset presents challenges typical of healthcare data, such as class imbalance and missing values. The paper uses an Iterative Imputer with a Random Forest Regressor to predict missing BMI values rather than dropping incomplete records, which matters because positive cases are scarce. Categorical variables are encoded with binary, ordinal, or label encoding, depending on the characteristics of each variable.
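The difference between the three encoding schemes can be shown with a short sketch. The column names, category values, and the ordering chosen for the ordinal column are assumptions for illustration; the summary does not state which variable received which encoding.

```python
# Hypothetical column assignments; the paper's actual mapping is not given here.
BINARY = {"ever_married": {"No": 0, "Yes": 1}}
# Assumed ordering for an ordinal column (index in the list becomes the code).
ORDINAL = {"smoking_status": ["never smoked", "formerly smoked", "smokes"]}
LABEL = ["work_type"]

def fit_label_maps(records, columns):
    """Label encoding: integer codes assigned in sorted order of the observed values."""
    return {
        c: {v: i for i, v in enumerate(sorted({r[c] for r in records}))}
        for c in columns
    }

def encode(records):
    """Return copies of the records with categorical columns replaced by integers."""
    label_maps = fit_label_maps(records, LABEL)
    encoded = []
    for r in records:
        row = dict(r)
        for col, mapping in BINARY.items():
            row[col] = mapping[r[col]]
        for col, order in ORDINAL.items():
            row[col] = order.index(r[col])
        for col in LABEL:
            row[col] = label_maps[col][r[col]]
        encoded.append(row)
    return encoded

records = [
    {"ever_married": "Yes", "smoking_status": "smokes", "work_type": "Private"},
    {"ever_married": "No", "smoking_status": "never smoked", "work_type": "Govt_job"},
    {"ever_married": "Yes", "smoking_status": "formerly smoked", "work_type": "Self-employed"},
]
for row in encode(records):
    print(row)
```

Ordinal encoding preserves a meaningful order between categories, whereas label encoding assigns arbitrary integers, so tree-based models generally tolerate label codes better than linear models do.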

Model Optimization and Performance

Hyperparameters are tuned with RandomizedSearchCV to find a well-performing configuration for each model. Results varied across sampling techniques:

Oversampling: SVM and Random Forest produced the highest accuracy (99.28% and 99.02%, respectively).
SMOTE: Produced slightly lower accuracy but solid performance across models.
Undersampling: Reduced accuracy substantially, with Random Forest reaching only 74% and XGBoost 83%.

The research highlights ensemble models (Random Forest and XGBoost) as more effective under class imbalance, which informs their suitability for clinical deployment.
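The randomized hyperparameter search used for tuning can be sketched without any library. The code below is a hand-rolled illustration of the idea behind RandomizedSearchCV, not scikit-learn's API: `randomized_search`, `toy_score`, and the parameter grid are hypothetical, and `toy_score` stands in for a cross-validated model score.

```python
import random

def randomized_search(score_fn, param_distributions, n_iter=20, seed=0):
    """Sample n_iter random configurations and keep the best-scoring one."""
    rng = random.Random(seed)
    best_score, best_params = float("-inf"), None
    history = []
    for _ in range(n_iter):
        # Draw one value per hyperparameter (sampling with replacement across trials).
        params = {name: rng.choice(values) for name, values in param_distributions.items()}
        score = score_fn(params)
        history.append((score, params))
        if score > best_score:
            best_score, best_params = score, params
    return best_score, best_params, history

def toy_score(params):
    """Toy stand-in for a cross-validated score; peaks at max_depth=6, n_estimators=200."""
    return 1.0 - abs(params["max_depth"] - 6) / 10 - abs(params["n_estimators"] - 200) / 1000

space = {"max_depth": [2, 4, 6, 8, 10], "n_estimators": [50, 100, 200, 400]}
best_score, best_params, history = randomized_search(toy_score, space, n_iter=10)
print(best_params, round(best_score, 3))
```

Because only a random subset of configurations is tried, the result is the best configuration sampled, not a guaranteed optimum; raising `n_iter` trades compute for coverage of the search space.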

Feature Importance Analysis

Feature importance analysis indicates that age, average glucose level, and BMI are consistently the most significant predictors of stroke. In models restricted to elderly patients (aged 65-80), work type and glucose level emerged as critical, and heart disease and hypertension increased in significance. SHAP analysis supports these findings and shows that feature importance shifts in age-specific datasets, which suggests a need for predictive models tailored to demographic groups.
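SHAP attributions are based on Shapley values, which can be computed exactly for a small number of features by enumerating feature coalitions. The sketch below does this for a toy three-feature risk score; the scoring function, its weights, and the patient and reference values are invented for illustration and are not taken from the paper. The SHAP library itself uses faster approximations or model-specific algorithms.

```python
from itertools import combinations
from math import factorial

def shapley_values(predict, x, baseline):
    """Exact Shapley values; features outside a coalition take their baseline value."""
    n = len(x)

    def value(coalition):
        mixed = [x[i] if i in coalition else baseline[i] for i in range(n)]
        return predict(mixed)

    phi = []
    for i in range(n):
        others = [j for j in range(n) if j != i]
        total = 0.0
        for size in range(n):
            # Shapley weight for coalitions of this size that exclude feature i
            weight = factorial(size) * factorial(n - size - 1) / factorial(n)
            for subset in combinations(others, size):
                s = set(subset)
                total += weight * (value(s | {i}) - value(s))
        phi.append(total)
    return phi

def risk_score(features):
    """Hypothetical risk score with an age-glucose interaction term."""
    age, glucose, bmi = features
    return 0.03 * age + 0.01 * glucose + 0.02 * bmi + 0.0001 * age * glucose

patient = [72, 180, 31]     # hypothetical patient: age, average glucose, BMI
reference = [45, 100, 27]   # hypothetical reference (baseline) values
phi = shapley_values(risk_score, patient, reference)
for name, contribution in zip(["age", "avg_glucose", "bmi"], phi):
    print(name, round(contribution, 3))
```

The contributions sum to the difference between the patient's score and the reference score, which is the property that makes these attributions comparable across subgroups such as an elderly-only cohort.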

Implications and Future Directions

The paper advances ML-based stroke prediction through a comprehensive analysis of models and feature significance. It suggests that while high accuracy is achievable, sensitivity remains a pivotal challenge, particularly because clinical applications demand reliable and interpretable outcomes.
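The gap between accuracy and sensitivity follows directly from the class distribution. As a worked illustration, the sketch below scores a trivial classifier that never predicts a stroke. The count of 249 positive cases is an assumption obtained by rounding 4.87% of 5,110, and the function names are illustrative.

```python
def confusion_counts(y_true, y_pred):
    """Return (tp, fn, fp, tn) for binary labels where 1 marks a stroke case."""
    tp = sum(1 for t, p in zip(y_true, y_pred) if t == 1 and p == 1)
    fn = sum(1 for t, p in zip(y_true, y_pred) if t == 1 and p == 0)
    fp = sum(1 for t, p in zip(y_true, y_pred) if t == 0 and p == 1)
    tn = sum(1 for t, p in zip(y_true, y_pred) if t == 0 and p == 0)
    return tp, fn, fp, tn

def accuracy(tp, fn, fp, tn):
    return (tp + tn) / (tp + fn + fp + tn)

def sensitivity(tp, fn):
    """Recall on the positive class: share of true stroke cases that are detected."""
    return tp / (tp + fn) if (tp + fn) else 0.0

n_total = 5110
n_pos = round(0.0487 * n_total)          # about 249 positives (assumed from the stated rate)
y_true = [1] * n_pos + [0] * (n_total - n_pos)
y_pred = [0] * n_total                   # trivial model: always predict "no stroke"

tp, fn, fp, tn = confusion_counts(y_true, y_pred)
print(round(accuracy(tp, fn, fp, tn), 4))  # 0.9513
print(sensitivity(tp, fn))                 # 0.0
```

A model can therefore exceed 95% accuracy on this class distribution while detecting no stroke cases at all, which is why sensitivity is the more informative metric for clinical screening.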

Looking forward, the paper suggests exploring more complex datasets and further refining models to address variability and interpretability issues. It also points to the potential integration of such predictive models into clinical decision-making systems, where they could support early intervention planning and personalized stroke prevention strategies.

In conclusion, the findings underscore the importance of demographic-specific analysis and the need for continued research to translate these predictive capabilities into practical, real-world health interventions. By confronting existing challenges and refining methodologies, future work can improve upon early detection mechanisms, potentially reducing the global burden of stroke and enhancing patient outcomes.