Bangla Symptoms-Disease Dataset for AI Diagnostics

Updated 24 January 2026

The Bangla Symptoms-Disease Dataset is a comprehensive resource featuring 758 curated associations among 85 diseases and 172 symptoms from diverse clinical and literature sources.
It employs a binary matrix schema with one-hot encoding and rigorous expert validation to ensure high-quality, linguistically accurate data for machine learning applications.
Ensemble models using this dataset achieve up to 98% accuracy, demonstrating its strong potential for enhancing AI-driven diagnostic systems and multilingual biomedical research.

The Bangla Symptoms-Disease Dataset is a publicly available, rigorously curated corpus of symptom–disease associations specifically designed to advance machine learning–based disease prediction, clinical decision support, and health informatics for Bangla-speaking populations. Encompassing 758 unique symptom–disease relations, 85 diseases, and 172 symptoms, the dataset offers the first large-scale, structured resource of its kind in Bengali. It is constructed from a blend of clinical surveys, medical literature, and region-specific sources, supporting applications in AI-driven diagnostics and multilingual biomedical research (Zannat et al., 17 Jan 2026, Shafi et al., 16 Jun 2025).

1. Dataset Construction and Structure

The dataset consists of 758 curated associations between 172 unique symptoms and 85 distinct diseases, yielding an average of approximately 8.9 symptoms per disease and 4.4 diseases per symptom (Zannat et al., 17 Jan 2026). Sourcing combines in-person surveys of Bangladeshi clinicians with scraping and manual curation of medical journals, health blogs, newspaper articles, and patient-support forums. Expert manual validation is used to check clinical consistency and linguistic accuracy.
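
As a quick check, the two averages follow directly from the reported counts (758 associations, 85 diseases, 172 symptoms):

$\frac{758}{85} \approx 8.92 \ \text{symptoms per disease}, \qquad \frac{758}{172} \approx 4.41 \ \text{diseases per symptom}$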

Data formatting adheres to a binary matrix schema: each CSV row encodes a single symptom profile as 172 binary (0/1) flags, with an additional column holding the associated disease label in Bangla Unicode. The cleaned dataset has the following layout:

Disease (Bangla)	Symptom_1 (e.g., Fever)	Symptom_2 (e.g., Cough)	…	Symptom_172
ডেঙ্গু	1	0	…	1

Missing values default to 0 (absence), with uniform Unicode normalization and manual correction of spelling and script artefacts. Each symptom profile is encoded as a binary indicator vector (one flag per symptom, so several flags may be set at once), and diseases are encoded as integer labels ( $\mathbf{x} \in \{0,1\}^{172}$ , $y \in \{0, \dots, 84\}$ ). Preprocessing includes PCA-based dimensionality reduction that retains 90% of the variance, reducing the effective dimension to approximately 50–60 (Zannat et al., 17 Jan 2026).
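
The encoding described above can be illustrated with a minimal sketch. It assumes a hypothetical three-symptom vocabulary (the real dataset has 172 columns), and the helper names `normalize` and `encode_profile` are illustrative, not part of the dataset's tooling. NFC is used here as one plausible choice of Unicode normalization form; the source does not state which form was applied.

```python
import unicodedata

import numpy as np

# Hypothetical vocabulary for illustration only (fever, cough, headache).
SYMPTOMS = ["জ্বর", "কাশি", "মাথাব্যথা"]


def normalize(text: str) -> str:
    """Apply NFC normalization and trim surrounding whitespace."""
    return unicodedata.normalize("NFC", text).strip()


# Map each normalized symptom name to its column position.
SYMPTOM_INDEX = {normalize(name): i for i, name in enumerate(SYMPTOMS)}


def encode_profile(reported: list[str]) -> np.ndarray:
    """Return a 0/1 vector; symptoms that are not reported stay 0 (absence)."""
    row = np.zeros(len(SYMPTOM_INDEX), dtype=np.int8)
    for name in reported:
        column = SYMPTOM_INDEX.get(normalize(name))
        if column is not None:  # names outside the vocabulary are ignored
            row[column] = 1
    return row


# Trailing whitespace is removed by normalize() before the lookup.
row = encode_profile(["জ্বর ", "মাথাব্যথা"])
```

Normalizing both the vocabulary and the incoming text means that visually identical strings typed with different code-point sequences map to the same column.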

2. Data Sources, Validation, and Quality Assurance

Data sources blend primary clinical consultation and systematically curated secondary literature (Shafi et al., 16 Jun 2025). The literature review covers PubMed, Google Scholar, Web of Science, and public health datasets (e.g., WHO case reports). Strict inclusion criteria require corroboration in peer-reviewed sources; anecdotal, unverified, or non-peer-reviewed entries are excluded.

Translation from English to Bangla employs initial automated machine translation (Google Translation API), followed by bilingual clinician review for terminology, spelling, and synonym harmonization. Clinical consistency and label accuracy are enforced by consensus adjudication, with final normalization to ICD-10 where applicable.

Quality control involves automatic binary value auditing, random expert review (Cohen’s $\kappa=0.91$ ), and benchmarking against English disease–symptom datasets. This process yields high inter-annotator agreement and ensures the resource's suitability for clinical and AI use cases (Shafi et al., 16 Jun 2025).
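
For readers unfamiliar with the agreement statistic, Cohen's kappa compares observed agreement $p_o$ with the agreement $p_e$ expected by chance, $\kappa = (p_o - p_e)/(1 - p_e)$. The sketch below computes it for two annotators on toy labels; the function name and the example labels are illustrative and do not reproduce the reported value of 0.91.

```python
import numpy as np


def cohen_kappa(rater_a, rater_b) -> float:
    """Cohen's kappa for two raters labelling the same items."""
    a = np.asarray(rater_a)
    b = np.asarray(rater_b)
    if a.shape != b.shape:
        raise ValueError("both raters must label the same items")
    categories = np.union1d(a, b)
    # Fraction of items on which the raters agree.
    observed = np.mean(a == b)
    # Chance agreement from each rater's marginal label frequencies.
    expected = sum(np.mean(a == c) * np.mean(b == c) for c in categories)
    # Undefined when expected == 1 (both raters use a single identical label).
    return float((observed - expected) / (1.0 - expected))


# Toy example: 1 = association accepted, 0 = rejected.
kappa = cohen_kappa([1, 1, 0, 0], [1, 0, 0, 0])
```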

3. Statistical Profile and Disease-Symptom Distributions

The distribution of symptoms per disease and diseases per symptom is markedly heterogeneous. For example:

Symptoms per Disease	#Diseases
1–5	12
6–10	40
11–15	22
>15	11

Diseases per Symptom	#Symptoms
1	28
2–5	110
>5	34

Symptom-rich diseases such as Dengue (41 symptoms), Diabetes (25), and Dysentery (22) reflect both prevalence and clinical complexity within the Bangladeshi context. Region-specific and zoonotic conditions (e.g., Nipah virus, CCHF) are explicitly represented, ensuring epidemiological relevance (Shafi et al., 16 Jun 2025).
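
Distribution tables of this kind can be derived from a disease-by-symptom association matrix. The following sketch uses a small hypothetical 4 x 6 matrix in place of the real 85 x 172 one; `bin_counts` is an illustrative helper, and the bin edges mirror those in the tables above.

```python
import numpy as np

# Hypothetical association matrix: rows are diseases, columns are symptoms.
A = np.array([
    [1, 1, 0, 0, 0, 0],
    [1, 1, 1, 1, 1, 1],
    [0, 1, 1, 0, 0, 0],
    [1, 0, 0, 0, 0, 1],
])

symptoms_per_disease = A.sum(axis=1)  # row sums
diseases_per_symptom = A.sum(axis=0)  # column sums


def bin_counts(values, edges, labels):
    """Count values in right-closed bins; the last bin is open-ended.

    Assumes every value is at least 1, so the first bin starts at 1.
    """
    idx = np.digitize(values, edges, right=True)
    counts = np.bincount(idx, minlength=len(labels))
    return dict(zip(labels, counts.tolist()))


disease_table = bin_counts(symptoms_per_disease, [5, 10, 15],
                           ["1-5", "6-10", "11-15", ">15"])
symptom_table = bin_counts(diseases_per_symptom, [1, 5],
                           ["1", "2-5", ">5"])
```

The bin counts in each table sum to the number of diseases and symptoms respectively, which is a useful sanity check (85 and 172 for the tables above).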

4. Mathematical Formalism and Machine Learning Methodologies

The dataset is modeled as a multiclass classification problem:

Input: $\mathbf{x} \in \{0,1\}^{172}$ (binary-encoded symptoms)
Target: $y \in \{1,\dots,K\}$ with $K = 85$ disease classes (the integer codes $0,\dots,84$ shifted by one)
Objective: Minimize the categorical cross-entropy loss over the $N$ training samples,

$\mathcal{L}(\theta) = -\sum_{i=1}^{N}\sum_{k=1}^{K} 1_{y^{(i)}=k}\log p_\theta(y=k \mid \mathbf{x}^{(i)})$
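
Because the indicator selects exactly one class per sample, the double sum reduces to the negative log-probability assigned to each sample's true class. A minimal NumPy sketch, assuming 0-based class indices and a predicted-probability array of shape (N, K); the function name and the clipping constant are illustrative choices:

```python
import numpy as np


def cross_entropy(probs, labels, eps=1e-12) -> float:
    """Summed categorical cross-entropy for 0-based integer labels."""
    probs = np.asarray(probs, dtype=float)
    labels = np.asarray(labels)
    # Pick the probability each sample assigns to its true class.
    picked = probs[np.arange(len(labels)), labels]
    # Clip to avoid log(0) when a model assigns zero probability.
    return float(-np.sum(np.log(np.clip(picked, eps, 1.0))))


# A uniform prediction over K = 85 classes costs log(85) per sample.
K = 85
uniform = np.full((3, K), 1.0 / K)
loss = cross_entropy(uniform, [0, 10, 84])
```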

Supervised models trained include Perceptron, Logistic Regression, Naive Bayes, Decision Tree, K-Nearest Neighbors $(K=5)$ , Passive Aggressive Classifier, Random Forest (70 trees, max depth 50), and linear Support Vector Machines. Hyperparameters are optimized via 5-fold cross-validation on the training split, using standardized and PCA-reduced features (Zannat et al., 17 Jan 2026).
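
The standardize, PCA, classify workflow can be sketched with scikit-learn as below. This is an assumption-laden illustration rather than the authors' code: the data are synthetic stand-ins, only two of the listed models are shown with the hyperparameters quoted above, and 5-fold cross-validation is used to score fixed settings rather than to search over them.

```python
import numpy as np
from sklearn.decomposition import PCA
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import cross_val_score
from sklearn.neighbors import KNeighborsClassifier
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler

# Synthetic stand-in data: 5 classes, 40 binary features, 20 rows per class.
rng = np.random.default_rng(0)
n_classes, n_features, per_class = 5, 40, 20
prototypes = rng.integers(0, 2, size=(n_classes, n_features))
y = np.repeat(np.arange(n_classes), per_class)
flips = rng.random((len(y), n_features)) < 0.05  # flip about 5% of the bits
X = np.logical_xor(prototypes[y], flips).astype(int)


def build(classifier):
    """Standardize, keep components explaining 90% of variance, then classify."""
    return make_pipeline(
        StandardScaler(),
        PCA(n_components=0.90, svd_solver="full"),
        classifier,
    )


models = {
    "knn": build(KNeighborsClassifier(n_neighbors=5)),
    "random_forest": build(
        RandomForestClassifier(n_estimators=70, max_depth=50, random_state=0)
    ),
}

# Mean accuracy over 5 stratified folds for each pipeline.
scores = {name: cross_val_score(model, X, y, cv=5).mean()
          for name, model in models.items()}
```

Placing the scaler and PCA inside the pipeline means they are refit on each training fold, so no information from the held-out fold leaks into preprocessing.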

Ensemble meta-classification is implemented via two strategies:

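Soft voting: Average the per-class probabilities of the $M$ constituent models and predict the class with the highest mean probability,

$P_j(\mathbf{x}) = \frac{1}{M} \sum_{m=1}^{M} p_{m,j}(\mathbf{x}), \quad \hat{y}(\mathbf{x}) = \arg\max_j P_j(\mathbf{x})$

Hard voting: Count the constituent models that predict each class and return the class with the most votes,

$V_j(\mathbf{x}) = \sum_{m=1}^{M} 1_{\,\hat{y}_m(\mathbf{x}) = j}, \quad \hat{y}(\mathbf{x}) = \arg\max_j V_j(\mathbf{x})$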
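
The two rules can disagree, which the following NumPy sketch illustrates. It assumes the constituent models' probabilities are stacked into an array of shape (M, N, K); the function names are illustrative, and ties are broken in favor of the lowest class index, which is a convention chosen here rather than one stated in the source.

```python
import numpy as np


def soft_vote(probs: np.ndarray) -> np.ndarray:
    """Average probabilities over models, then take the most probable class."""
    return probs.mean(axis=0).argmax(axis=1)


def hard_vote(probs: np.ndarray) -> np.ndarray:
    """Each model votes for its own argmax; return the most-voted class."""
    n_classes = probs.shape[2]
    votes = probs.argmax(axis=2)  # shape (M, N): one label per model and sample
    # One-hot the votes and sum over models to get per-class counts, shape (N, K).
    counts = (votes[..., None] == np.arange(n_classes)).sum(axis=0)
    return counts.argmax(axis=1)


# Three models, one sample, two classes: one confident model vs. two mild ones.
probs = np.array([
    [[0.9, 0.1]],
    [[0.4, 0.6]],
    [[0.4, 0.6]],
])
soft_pred = soft_vote(probs)
hard_pred = hard_vote(probs)
```

Here soft voting follows the single confident model (class 0), while hard voting follows the two-model majority (class 1).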

5. Empirical Performance and Benchmarking

Evaluation uses classification accuracy together with macro- and micro-averaged precision, recall, and $F_1$. 80% of records (607 samples) are used for model training and the remaining 20% are held out for testing; model validation uses 5-fold cross-validation.
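
Macro averaging computes each metric per class and then takes an unweighted mean, so rare diseases count as much as common ones. A minimal sketch under the assumption that a class with no predictions (or no true samples) contributes 0 to the corresponding metric; the function name and toy labels are illustrative:

```python
import numpy as np


def macro_scores(y_true, y_pred, n_classes: int) -> dict:
    """Accuracy plus macro-averaged precision, recall, and F1."""
    y_true = np.asarray(y_true)
    y_pred = np.asarray(y_pred)
    precisions, recalls, f1s = [], [], []
    for k in range(n_classes):
        tp = np.sum((y_pred == k) & (y_true == k))
        fp = np.sum((y_pred == k) & (y_true != k))
        fn = np.sum((y_pred != k) & (y_true == k))
        # Zero-denominator cases are scored as 0 (an assumption of this sketch).
        p = tp / (tp + fp) if tp + fp else 0.0
        r = tp / (tp + fn) if tp + fn else 0.0
        f = 2 * p * r / (p + r) if p + r else 0.0
        precisions.append(p)
        recalls.append(r)
        f1s.append(f)
    return {
        "accuracy": float(np.mean(y_true == y_pred)),
        "precision": float(np.mean(precisions)),
        "recall": float(np.mean(recalls)),
        "f1": float(np.mean(f1s)),
    }


scores = macro_scores([0, 0, 1, 1], [0, 1, 1, 1], n_classes=2)
```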

Model	Accuracy	Precision (macro)	Recall (macro)	F1 (macro)
Perceptron	0.97	0.93	0.92	0.92
Logistic Regression	0.97	0.96	0.95	0.95
Naive Bayes	0.94	0.91	0.90	0.90
Decision Tree	0.78	0.71	0.70	0.69
K-NN	0.96	0.93	0.91	0.91
Passive Aggressive	0.96	0.91	0.90	0.90
Random Forest	0.97	0.96	0.96	0.96
SVM (linear)	0.96	0.94	0.92	0.92
Soft/Hard Voting	0.98	0.97	0.97	0.97

Ensembles (both soft and hard voting) achieve the highest accuracy (98%), benefiting from the complementary strengths and variance reduction of their constituent models. A plausible implication is that further incremental gains may require either richer input modalities or more complex model architectures (Zannat et al., 17 Jan 2026).

6. Accessibility, Usage, and Recommended Applications

The dataset is publicly distributed under a CC-BY license on Mendeley Data (see DOI: 10.17632/rjgjh8hgrt.2), with CSV-formatted, UTF-8-encoded files compatible with Python pandas, R data.table, and other analytical frameworks. Immediate usage includes supervised disease prediction, benchmarking of multilingual medical-AI models, clinical decision support modules translating symptoms to likely diagnoses, and epidemiological surveillance tailored to Bangla-speaking regions (Zannat et al., 17 Jan 2026, Shafi et al., 16 Jun 2025).

A sample data-loading code snippet in Python:

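import pandas as pd
from sklearn.model_selection import train_test_split

# File and column names are illustrative; adjust them to the downloaded CSV.
df = pd.read_csv('bangla_symptom_disease.csv', encoding='utf-8')
X = df.drop(columns='Disease').values  # symptom flags, shape (758, 172)
y = df['Disease'].astype('category').cat.codes.values  # integer labels 0…84

# Hold out 20% of the rows for testing; the fixed seed makes the split reproducible.
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=42)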

7. Limitations and Future Directions

While the dataset serves as a foundational benchmark for symptom-based disease prediction in Bangla, several limitations exist: absence of direct patient-level EHR data, potential source bias favoring well-documented diseases, underrepresentation of rare or hyper-local conditions, and lack of demographic stratification or rich metadata (e.g., symptom severity, course, or prevalence) (Shafi et al., 16 Jun 2025). The static nature precludes automatic updates in response to emerging diseases.

Future directions include expansion to additional diseases and symptoms, incorporation of structured clinical notes and multimodal features (e.g., laboratory values, imaging), the deployment of deep learning architectures (feed-forward nets, transformers), and the implementation of crowdsourcing pipelines for continuous dataset enrichment (Zannat et al., 17 Jan 2026). Integration with real-world EHR data and the addition of fields for symptom duration and severity are also explicitly recommended for subsequent versions.

References

"Bridging the Gap in Bangla Healthcare: Machine Learning Based Disease Prediction Using a Symptoms-Disease Dataset" (Zannat et al., 17 Jan 2026)
"A Structured Dataset of Disease-Symptom Associations to Improve Diagnostic Accuracy" (Shafi et al., 16 Jun 2025)
