Diagnosis of diabetes using classification mining techniques

Published 12 Feb 2015 in cs.CE | (1502.03774v1)

Abstract: Diabetes has affected over 246 million people worldwide with a majority of them being women. According to the WHO report, by 2025 this number is expected to rise to over 380 million. The disease has been named the fifth deadliest disease in the United States with no imminent cure in sight. With the rise of information technology and its continued advent into the medical and healthcare sector, the cases of diabetes as well as their symptoms are well documented. This paper aims at finding solutions to diagnose the disease by analyzing the patterns found in the data through classification analysis by employing Decision Tree and Naïve Bayes algorithms. The research hopes to propose a quicker and more efficient technique of diagnosing the disease, leading to timely treatment of the patients.


Citations (323)


Summary

The paper reports that Naïve Bayes achieves higher accuracy (79.57%) than the J48 Decision Tree (76.96% under the same 70:30 split) in predicting diabetes from the Pima Indians Diabetes Database.
The study employs Decision Trees with the J48 algorithm and rigorous preprocessing, including feature selection and normalization, to enhance data consistency.
The findings highlight the practical potential of data mining techniques for early diabetes detection and suggest avenues for integrating advanced machine learning models.

Classification Mining Techniques for Diabetes Diagnosis

The paper, "Diagnosis of Diabetes Using Classification Mining Techniques," explores the use of two classification algorithms, Decision Trees and Naïve Bayes, to predict and diagnose diabetes in pregnant women. The study uses the Pima Indians Diabetes Database to find patterns that indicate the presence of diabetes. The research is motivated by the public health challenge posed by diabetes, which the World Health Organization estimates will affect over 380 million people globally by 2025. The paper positions itself as a contribution to earlier diagnosis, particularly for women, who experience more severe impacts from the disease.

Methodological Foundation

The study uses two data mining techniques, Decision Trees and Naïve Bayes, to classify records and predict diabetes. The Decision Tree, implemented with the J48 algorithm, represents the data as a tree in which the attribute tested at each node is the one with the highest Information Gain (IG), that is, the largest reduction in class entropy. Naïve Bayes, in contrast, takes a probabilistic approach based on Bayes' theorem: it combines the prior probability of each class with the probability of each feature given that class, treating the features as conditionally independent, and predicts the class with the highest resulting posterior. These techniques are selected for their interpretability and their ability to handle large datasets, as reported in prior work.
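The attribute-selection step can be illustrated with a minimal sketch in Python. It assumes categorical attributes and uses hypothetical toy records (`glucose`, `bmi`, `diabetic`), not the Pima data; the function names are illustrative and do not come from the paper or from Weka. Note also that J48 is Weka's implementation of C4.5, which normalizes this quantity into a gain ratio and handles numeric attributes by thresholding; the sketch shows only the plain information gain described above.

```python
import math
from collections import Counter


def entropy(labels):
    """Shannon entropy (in bits) of a list of class labels."""
    total = len(labels)
    return -sum((n / total) * math.log2(n / total)
                for n in Counter(labels).values())


def information_gain(values, labels):
    """Entropy reduction obtained by splitting `labels` on a categorical attribute."""
    total = len(labels)
    groups = {}
    for value, label in zip(values, labels):
        groups.setdefault(value, []).append(label)
    # Weighted average entropy of the subsets produced by the split
    remainder = sum(len(g) / total * entropy(g) for g in groups.values())
    return entropy(labels) - remainder


# Hypothetical toy records (assumption: discretized attributes, invented values)
glucose = ["high", "high", "high", "low", "low", "low"]
bmi = ["high", "low", "high", "low", "high", "low"]
diabetic = ["yes", "yes", "yes", "no", "no", "no"]

gains = {
    "glucose": information_gain(glucose, diabetic),
    "bmi": information_gain(bmi, diabetic),
}
# The attribute with the highest gain would be tested at the root node
best = max(gains, key=gains.get)
print(best, gains)
```

In this toy example `glucose` separates the two classes perfectly, so it has the larger gain and would be chosen as the root split.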

The dataset is the Pima Indians Diabetes Database, which contains eight attributes relevant to diabetes prediction, including plasma glucose concentration and body mass index. Preprocessing normalizes the attributes and replaces missing values to keep the data consistent. Feature selection with the CfsSubsetEval algorithm then retains the attributes that contribute most to prediction accuracy.
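The two preprocessing steps can be sketched as follows. The summary does not say how missing values were encoded or replaced, nor which normalization was used, so this sketch makes two assumptions: a reading of 0 marks a missing measurement and is replaced by the mean of the observed values, and normalization is min-max scaling to [0, 1]. The function names and the sample readings are hypothetical.

```python
def impute_zeros_with_mean(column):
    """Replace 0 entries (assumed to mean 'missing') with the mean of the non-zero entries."""
    observed = [x for x in column if x != 0]
    mean = sum(observed) / len(observed)
    return [mean if x == 0 else x for x in column]


def min_max_normalize(column):
    """Rescale values linearly so the minimum maps to 0 and the maximum to 1."""
    lo, hi = min(column), max(column)
    if hi == lo:
        # A constant column carries no information; map it to zeros
        return [0.0 for _ in column]
    return [(x - lo) / (hi - lo) for x in column]


# Hypothetical glucose readings; the 0 stands for a missing measurement
glucose = [100, 0, 150, 200]
filled = impute_zeros_with_mean(glucose)
scaled = min_max_normalize(filled)
print(filled, scaled)
```

Imputing before scaling matters here: if the placeholder 0 were left in, it would become the column minimum and compress every real reading toward 1.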

Results and Analysis

The paper compares the two models using two dataset-splitting techniques: 10-fold cross-validation and a 70:30 percentage split. For the J48 Decision Tree, cross-validation gives a classification accuracy of approximately 74.87%, and the percentage split gives 76.96%. The Naïve Bayes model, evaluated only with the 70:30 split, achieves a higher accuracy of 79.57%. These figures, together with the Kappa statistics and error rates, indicate that both models are effective and that Naïve Bayes is slightly more accurate on this dataset under the same split. The confusion matrices detail the classification performance in terms of false positives and false negatives, pointing to where the algorithms could be refined.
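As a rough illustration of how these quantities relate, the sketch below builds 10-fold index splits and derives accuracy and Cohen's Kappa from a 2x2 confusion matrix. The labels are hypothetical toy values, not the paper's results, and the helper names are illustrative. The folds here are contiguous and unshuffled, which is an assumption made for simplicity and may differ from what the evaluation tool used in the paper does.

```python
def k_fold_indices(n, k):
    """Yield (train, test) index lists for k contiguous folds of near-equal size."""
    sizes = [n // k + (1 if i < n % k else 0) for i in range(k)]
    start = 0
    for size in sizes:
        test = list(range(start, start + size))
        train = [i for i in range(n) if i < start or i >= start + size]
        yield train, test
        start += size


def confusion_counts(actual, predicted, positive=1):
    """Return (tp, fp, fn, tn) for a binary classification task."""
    tp = fp = fn = tn = 0
    for a, p in zip(actual, predicted):
        if p == positive:
            if a == positive:
                tp += 1
            else:
                fp += 1
        elif a == positive:
            fn += 1
        else:
            tn += 1
    return tp, fp, fn, tn


def accuracy(tp, fp, fn, tn):
    """Fraction of records classified correctly."""
    return (tp + tn) / (tp + fp + fn + tn)


def cohen_kappa(tp, fp, fn, tn):
    """Agreement between predictions and truth, corrected for chance agreement."""
    n = tp + fp + fn + tn
    observed = (tp + tn) / n
    expected = ((tp + fp) * (tp + fn) + (fn + tn) * (fp + tn)) / n ** 2
    return (observed - expected) / (1 - expected)


# Hypothetical labels (1 = diabetic, 0 = not diabetic); invented for illustration
actual = [1, 1, 1, 1, 0, 0, 0, 0, 0, 0]
predicted = [1, 1, 1, 0, 0, 0, 0, 0, 0, 1]

counts = confusion_counts(actual, predicted)
acc = accuracy(*counts)
kappa = cohen_kappa(*counts)
folds = list(k_fold_indices(len(actual), 10))
print(counts, acc, kappa, len(folds))
```

With cross-validation, each record is used for testing exactly once, whereas a 70:30 split tests on a single held-out portion; this is one reason the two accuracy figures for the same model can differ.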

Implications and Future Directions

The study has both practical and theoretical implications. Practically, the models provide a foundation for diabetes diagnosis in clinical settings and could support earlier detection. Theoretically, the work reinforces the viability of classification algorithms in medical data mining and suggests directions for improvement, such as more sophisticated models or hybrid approaches that combine the strengths of several algorithms.

Future research could expand upon this work by incorporating more diverse datasets from various demographics to enhance model generalizability. Moreover, adopting recent advances in machine learning, such as ensemble methods or deep learning techniques, might offer improved predictive capabilities. In particular, integrating temporal data could aid in capturing the progression of disease states, offering enriched predictive insights.

In conclusion, this paper contributes to the growing body of literature affirming the utility of data mining techniques in health informatics, specifically in diagnosing diabetes. By elucidating the methodological nuances and performance outcomes of Decision Trees and Naïve Bayes, it lays the groundwork for future advancements in predictive analytics within medical domains.
