Heart Disease Health Indicators Dataset
- The Heart Disease Health Indicators Dataset is a collection of large-scale, curated records from public health surveys, clinical registries, and engineered feature sets.
- It integrates clinical, behavioral, and demographic predictors such as blood pressure, BMI, and lifestyle factors to enable comprehensive cardiovascular risk analysis.
- The dataset drives advanced predictive modeling through ensemble methods, feature importance evaluation with SHAP, and tailored preprocessing strategies.
The Heart Disease Health Indicators Dataset is an umbrella term for several large-scale observational and curated datasets designed to support cardiovascular disease (CVD) risk modeling, early diagnosis, and population health analytics. Successive variants have grown in size and feature diversity and have moved closer to real-world screening and clinical use. While no single canonical dataset exists, the dominant variants (BRFSS-derived surveys, UCI/“Cleveland”-style clinical cohorts, and recent international or regional registries) share a focus on standardized, reproducible health indicators predictive of CVD risk in both population and clinical settings.
1. Dataset Structure and Sources
The Heart Disease Health Indicators Dataset typically refers to large-scale, cross-sectional cohorts constructed from public health surveillance systems or clinical registries. The most prominent instance is the 2015 BRFSS-derived dataset (Kaggle, Alex Teboul), which provides 229,781 adult records, each annotated with up to 25 features after feature engineering (Hasnat et al., 3 Nov 2025).
Key dataset variants include:
- BRFSS/Health Indicators Surveys: Cross-sectional US population, self-reported clinical and behavioral history, e.g., hypertension, smoking, physical inactivity.
- Kaggle/CDC 100K Dataset: 100,000 records, 22 variables (features include HeartDiseaseorAttack, HighBP, HighChol, BMI, Smoker, PhysActivity), with 9.3% positive class prevalence (Karmakar et al., 26 Jul 2024).
- International Clinical Registries: E.g., Bangladesh government hospital/diagnostic center datasets with multiclass CVD annotation (Haque et al., 6 Dec 2024).
- Composite Cohorts: Aggregations from Cleveland, Hungarian, Long Beach VA, Switzerland, Statlog datasets for model benchmarking.
Each dataset encodes one or several forms of the heart disease target: binary “disease or not”, disease subtype, or continuous risk.
2. Feature Set: Clinical, Behavioral, Demographic, and Engineered Predictors
Core feature groups are:
- Clinical (“objective” or “diagnosis-based”): Blood pressure, serum cholesterol, BMI, diabetes status, chronic disease history (Stroke, Asthma, KidneyDisease).
- Behavioral: Smoking status, physical activity, alcohol consumption, sleep time, diet (Fruits, Veggies).
- Demographic: Sex, age (continuous or in bins), income, education, race, age category.
- Self-rated health: GenHealth (1=excellent to 5=poor), MentalHealth, PhysicalHealth (days affected in past 30).
- Engineered features: BMI category, Health_Risk_Score (additive sum of hypertension, hypercholesterolemia, diabetes), BMI_BP_Interaction (product of BMI and hypertension indicator) (Hasnat et al., 3 Nov 2025).
The exact features and their encoding depend on the dataset, but most variants mix binary, ordinal, and continuous variables. Table 1 illustrates a representative subset from (Hasnat et al., 3 Nov 2025) and (Karmakar et al., 26 Jul 2024):
| Feature | Type | Coding / Range |
|---|---|---|
| HeartDiseaseorAttack | Binary | 0, 1 |
| HighBP, HighChol | Binary | 0, 1 |
| BMI | Continuous | 10–60 kg/m² |
| Smoker, Stroke | Binary | 0, 1 |
| PhysActivity | Binary | 0, 1 |
| GenHlth | Ordinal | 1 (best) – 5 (worst) |
| Age | Continuous or ordinal | Years; binned into age categories in some versions |
| Sex | Binary | 0 = female, 1 = male |
| Health_Risk_Score | Integer | 0–3 (sum of risk factor indicators) |
| BMI_BP_Interaction | Continuous | Product of BMI and HighBP |
This table is not exhaustive and columns may vary with dataset version.
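The two engineered columns in the table can be derived from the raw indicators with a few lines of code. The following is a minimal sketch, not the cited authors' implementation: it assumes each record is a plain dictionary, that diabetes has already been collapsed to a 0/1 flag under the hypothetical key `Diabetes`, and the helper name `add_engineered_features` is invented for illustration.

```python
def add_engineered_features(record):
    """Return a copy of one survey record with two engineered columns added."""
    out = dict(record)
    # Assumption: "Diabetes" is a 0/1 flag; some dataset versions use a
    # three-level code instead, which would need collapsing first.
    out["Health_Risk_Score"] = (
        record["HighBP"] + record["HighChol"] + record["Diabetes"]
    )
    # Product of BMI and the hypertension indicator: zero when HighBP == 0.
    out["BMI_BP_Interaction"] = record["BMI"] * record["HighBP"]
    return out


# Two made-up records, used only to exercise the helper.
records = [
    {"HighBP": 1, "HighChol": 1, "Diabetes": 0, "BMI": 31.0},
    {"HighBP": 0, "HighChol": 0, "Diabetes": 0, "BMI": 24.5},
]
engineered = [add_engineered_features(r) for r in records]
```

Because the interaction term is zero for every normotensive respondent, it effectively lets a model learn a BMI slope that applies only when hypertension is present.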
3. Data Preprocessing and Class Imbalance Strategies
Preprocessing steps are dictated by scale, missingness, and modeling objectives:
- Missing values: Median imputation for continuous; mode for categorical fields.
- Encoding: One-hot encoding of nominal/ordinal categorical fields (e.g., AgeCategory, Race, GenHealth); binary fields retained as 0/1.
- Normalization/scaling: Continuous variables standardized (zero mean, unit variance) for non-tree models, including neural networks.
- Outlier management: Extreme outliers excluded via clinical plausibility filters or IQR-based rules. For example, implausible values of diastolic > systolic blood pressure or extreme BMI outliers are removed (Ramesh et al., 29 Jul 2025).
- Feature engineering: Generation of interaction and summary features such as Health_Risk_Score and BMI_BP_Interaction, which demonstrated high predictive utility in ensemble models (Hasnat et al., 3 Nov 2025).
- Class imbalance: Addressed via:
  - Strategic inverse-frequency class weighting in loss functions (preferred for large datasets as in (Hasnat et al., 3 Nov 2025)).
  - Synthetic oversampling approaches (K-means SMOTE for some analyses (Karmakar et al., 26 Jul 2024)), but synthetic sampling is explicitly avoided for certain clinical deployment studies to preserve “real” data structure (Hasnat et al., 3 Nov 2025).
- Train/Test splits: Standard 80/20 hold-out splits or 10-fold cross-validation, stratified by outcome.
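Two of the steps above, median imputation and inverse-frequency class weighting, can be sketched as follows. This is an illustration under stated assumptions rather than the pipeline of any cited study: missing values are represented as `None`, and the weights use the common normalization `n_samples / (n_classes * class_count)`, which the source does not specify. The function names are hypothetical.

```python
from collections import Counter
from statistics import median


def impute_median(values):
    """Replace None entries with the median of the observed values."""
    observed = [v for v in values if v is not None]
    fill = median(observed)
    return [fill if v is None else v for v in values]


def inverse_frequency_weights(labels):
    """Weight each class by n_samples / (n_classes * class_count).

    Assumption: this particular normalization is one common choice; the
    cited work only states that weights are inverse to class frequency.
    """
    counts = Counter(labels)
    n_samples, n_classes = len(labels), len(counts)
    return {cls: n_samples / (n_classes * c) for cls, c in counts.items()}


# Toy inputs: a BMI column with gaps and a 9:1 imbalanced outcome.
bmi = [22.0, None, 30.0, 27.0, None]
labels = [0] * 9 + [1]

bmi_imputed = impute_median(bmi)
class_weights = inverse_frequency_weights(labels)
```

With a 9:1 imbalance the minority class receives a weight nine times that of the majority class, so each positive case contributes proportionally more to a weighted loss without any synthetic records being created.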
4. Predictive Modeling Approaches and Evaluation
A wide array of machine learning algorithms have been applied, with a contemporary emphasis on tree-based and ensemble methods:
- Ensemble Models: Weighted combinations of LightGBM, XGBoost, and CNN; ensemble weights determined via grid/Bayesian search for maximal validation AUC (Hasnat et al., 3 Nov 2025). Strategic weighting is key for maximizing recall in screening use-cases (e.g., 0.7/0.2/0.1 for LGBM/XGB/CNN).
- Random Forest, Decision Tree: Direct application to tabular data; achieves high performance with balanced or weighted data (Karmakar et al., 26 Jul 2024).
- Neural Architectures: 1D CNNs and RNNs for feature-sequence modeling. CNNs contribute additional predictive signal in ensemble frameworks but perform suboptimally alone for tabular CVD data.
- Logistic Regression, SVM: Serve as reference/baseline models; outperformed by advanced ensemble methods.
- Feature Selection: Chi-square tests, correlation matrices, and sequential forward/backward selection repeatedly identify hypertension, cholesterol, smoking, and physical inactivity as top predictors.
- Model evaluation: AUC (area under the receiver operating characteristic curve, up to 0.8371 (Hasnat et al., 3 Nov 2025)), recall (up to 80.0%), and F1-score (37.6%). Statistical significance of ensemble improvement over baselines assessed via bootstrap p-values.
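The weighted combination of base-model outputs described above can be sketched as a simple soft-voting blend. The example assumes each model has already produced positive-class probabilities for the same records; the probability values are made up, the function name `weighted_ensemble` is hypothetical, and only the 0.7/0.2/0.1 weighting is taken from the text.

```python
def weighted_ensemble(prob_lists, weights):
    """Blend per-model positive-class probabilities with fixed weights."""
    if len(prob_lists) != len(weights):
        raise ValueError("one weight per model is required")
    total = sum(weights)  # normalize so the blend stays in [0, 1]
    n_records = len(prob_lists[0])
    return [
        sum(w * probs[i] for w, probs in zip(weights, prob_lists)) / total
        for i in range(n_records)
    ]


# Illustrative probabilities for three records from three base models.
lgbm = [0.10, 0.80, 0.40]
xgb = [0.20, 0.70, 0.60]
cnn = [0.30, 0.90, 0.50]

blended = weighted_ensemble([lgbm, xgb, cnn], [0.7, 0.2, 0.1])
```

In practice the weights would be tuned on a validation split, and the decision threshold applied to the blended score would be lowered when recall matters more than precision.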
Standard metric definitions (as per (Hasnat et al., 3 Nov 2025, Karmakar et al., 26 Jul 2024)):
- $\mathrm{AUC} = \int_0^1 \mathrm{TPR}\, d(\mathrm{FPR})$: Empirical ROC integral
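For readers who want to check reported numbers by hand, the metrics named in this section can be computed directly from labels and scores. The sketch below uses the pairwise (Mann-Whitney) form of the empirical AUC, which equals the area under the empirical ROC curve, together with the usual precision, recall, and F1 definitions. The data and the 0.3 threshold are arbitrary illustrations, and the function names are hypothetical.

```python
def roc_auc(y_true, scores):
    """Empirical AUC as the fraction of correctly ranked (pos, neg) pairs.

    Ties count as half a win. The pairwise loop is quadratic, so this
    form suits small checks rather than survey-scale data.
    """
    pos = [s for y, s in zip(y_true, scores) if y == 1]
    neg = [s for y, s in zip(y_true, scores) if y == 0]
    if not pos or not neg:
        raise ValueError("both classes are required")
    wins = sum((p > n) + 0.5 * (p == n) for p in pos for n in neg)
    return wins / (len(pos) * len(neg))


def precision_recall_f1(y_true, y_pred):
    """Return (precision, recall, F1) for binary 0/1 labels."""
    tp = sum(1 for y, p in zip(y_true, y_pred) if y == 1 and p == 1)
    fp = sum(1 for y, p in zip(y_true, y_pred) if y == 0 and p == 1)
    fn = sum(1 for y, p in zip(y_true, y_pred) if y == 1 and p == 0)
    precision = tp / (tp + fp) if tp + fp else 0.0
    recall = tp / (tp + fn) if tp + fn else 0.0
    denom = precision + recall
    f1 = 2 * precision * recall / denom if denom else 0.0
    return precision, recall, f1


# Made-up labels and scores; 0.3 is an arbitrary screening threshold.
y_true = [0, 0, 1, 1, 0, 1]
scores = [0.1, 0.4, 0.35, 0.8, 0.2, 0.7]
y_pred = [1 if s >= 0.3 else 0 for s in scores]

auc = roc_auc(y_true, scores)
precision, recall, f1 = precision_recall_f1(y_true, y_pred)
```

On imbalanced data a low threshold can yield high recall alongside modest precision and F1, which is the pattern in the figures quoted above.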
5. Interpretation and Feature Importance
Interpretability is emphasized through surrogate modeling and quantitative feature attribution:
- SHAP (SHapley Additive exPlanations): Global and local SHAP values identify Age, self-rated General Health, Health_Risk_Score, and BMI_BP_Interaction as the highest-impact features (Hasnat et al., 3 Nov 2025). The SHAP decomposition formalism is:
  $$f(x) = \phi_0 + \sum_{i=1}^{M} \phi_i$$
  where $\phi_0$ is the baseline (expected) model output, $M$ is the number of features, and $\phi_i$ represents the contribution of the $i$th feature.
- Surrogate Decision Trees: Trained to mimic complex ensemble outputs (e.g., max_depth=4 tree matching 89.9% of ensemble predictions), enhancing explainability for clinicians. Decision boundaries often prioritize engineered interaction terms and age.
- Empirical Consistency: Top predictors (hypertension, hypercholesterolemia, smoking, physical inactivity) match established epidemiological risk factors (Karmakar et al., 26 Jul 2024). Engineered features, such as interaction terms between obesity and hypertension, frequently outperform single raw inputs.
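The additive property behind SHAP attributions can be verified on a small example by computing exact Shapley values. The sketch below does not use the `shap` library and is not the cited study's procedure: it assumes a hypothetical three-feature risk function (`toy_risk`) with a BMI-by-hypertension interaction, and it represents "missing" features by substituting a fixed baseline record, which is one of several conventions.

```python
from itertools import combinations
from math import factorial


def shapley_values(model, x, baseline):
    """Exact Shapley values; absent features take their baseline value."""
    n = len(x)

    def value(subset):
        mixed = [x[i] if i in subset else baseline[i] for i in range(n)]
        return model(mixed)

    phis = []
    for i in range(n):
        others = [j for j in range(n) if j != i]
        phi = 0.0
        for size in range(n):
            # Shapley weight for coalitions of this size.
            weight = factorial(size) * factorial(n - size - 1) / factorial(n)
            for subset in combinations(others, size):
                s = set(subset)
                phi += weight * (value(s | {i}) - value(s))
        phis.append(phi)
    return phis


def toy_risk(features):
    """Hypothetical risk score with a BMI x HighBP interaction term."""
    high_bp, high_chol, bmi = features
    return 0.05 + 0.10 * high_bp + 0.05 * high_chol + 0.004 * bmi * high_bp


x = [1, 1, 32.0]         # record being explained
baseline = [0, 0, 25.0]  # reference record standing in for "feature absent"
phi = shapley_values(toy_risk, x, baseline)

# Additivity: the attributions sum to f(x) minus the baseline output.
gap = toy_risk(x) - toy_risk(baseline)
```

Exact enumeration visits every feature subset, so it is only feasible for a handful of features; tree-based explainers avoid this cost by exploiting model structure.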
6. Access, Practical Applications, and Limitations
- Data Accessibility: Full records for the BRFSS/Kaggle-derived dataset (v1.0, 2015) are public domain (CC0 1.0). BIG-Dataset (Bangladesh controls) is available via Kaggle, though the multiclass HDD records may require author request (Haque et al., 6 Dec 2024).
- Screening and Risk Stratification:
  - Large-scale screening: High recall configurations are recommended (e.g., ensemble’s 80% recall) to avoid missed cases (Hasnat et al., 3 Nov 2025).
  - Clinical confirmatory use: Preference for models with higher F1-score and precision (e.g., LightGBM single model with precision 31.6%, F1-score 42.0%).
- Deployment Contexts: Population surveillance, real-time decision support, and EHR integration, with pipeline reproducibility from raw tabular entry to explainable output.
- Limitations:
  - Self-reported data (e.g., BRFSS) subject to recall and reporting bias.
  - Cross-sectional design limits temporal (incident) risk prediction.
  - Synthetic oversampling (SMOTE/ADASYN) is avoided in clinical screening models due to risks of feature-space distortion (Hasnat et al., 3 Nov 2025).
  - No ground-truth adjudication in survey-based datasets.
7. Extensions and Future Directions
Potential avenues for methodological improvement include:
- External Validation: Application to distinct EHR or clinical trial datasets for generalizability testing.
- Advanced Interaction Modeling: Deep learning on binarized symptom/risk matrices for earlier phenotype discovery (Haque et al., 6 Dec 2024).
- Explainability Innovation: Integration of additional XAI methods (counterfactuals, anchors, prototype-guided explanations) for enhancing clinician trust and transparency.
- Stratified Analysis: Subgroup modeling by sex, age, or ethnicity to confront population heterogeneity and reduce algorithmic bias.
- Longitudinal Integration: Temporal modeling with repeated measures to support incident risk prediction.
- Expanded Feature Set: Inclusion of emerging risk biomarkers and detailed behavioral trajectories to increase ecological validity.
In its current variants, the Heart Disease Health Indicators Dataset serves as a scalable, demographically detailed benchmark for methodological development, clinical validation, and translational implementation in CVD risk modeling and screening.