Grader variability and the importance of reference standards for evaluating machine learning models for diabetic retinopathy (1710.01711v3)

Published 4 Oct 2017 in cs.CV

Abstract: Diabetic retinopathy (DR) and diabetic macular edema are common complications of diabetes which can lead to vision loss. The grading of DR is a fairly complex process that requires the detection of fine features such as microaneurysms, intraretinal hemorrhages, and intraretinal microvascular abnormalities. Because of this, there can be a fair amount of grader variability. There are different methods of obtaining the reference standard and resolving disagreements between graders, and while it is usually accepted that adjudication until full consensus will yield the best reference standard, the difference between various methods of resolving disagreements has not been examined extensively. In this study, we examine the variability in different methods of grading, definitions of reference standards, and their effects on building deep learning models for the detection of diabetic eye disease. We find that a small set of adjudicated DR grades allows substantial improvements in algorithm performance. The resulting algorithm's performance was on par with that of individual U.S. board-certified ophthalmologists and retinal specialists.

Authors (8)
  1. Jonathan Krause
  2. Varun Gulshan
  3. Ehsan Rahimy
  4. Peter Karth
  5. Kasumi Widner
  6. Lily Peng
  7. Greg S. Corrado
  8. Dale R. Webster
Citations (419)

Summary

Analyzing Grader Variability and Reference Standards in Machine Learning for Diabetic Retinopathy Detection

The paper presents a detailed study of variability in grading diabetic retinopathy (DR) and of the significance of reference standards in evaluating machine learning models designed for this purpose. Because DR is one of the leading causes of vision loss globally, its accurate detection is crucial. The paper analyzes different grading methods and consensus protocols, highlighting their influence on the performance of deep learning algorithms for DR detection.

The complex process of DR grading involves detecting minute features like microaneurysms and other retinal abnormalities. Previous research shows notable intergrader variability, emphasizing the need for consistent and reliable reference standards. Traditionally, a consensus among graders is achieved through a majority decision or an adjudication process. The paper emphasizes that adjudication, involving full consensus among specialists, offers a robust reference standard.
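
To make the contrast between consensus protocols concrete, the sketch below shows one simple way a majority-vote reference standard could be computed from independent grades. The function name, the plain-integer encoding of the 0-4 severity scale, and the tie-breaking rule are illustrative assumptions; adjudication, by contrast, has specialists discuss disagreements until full consensus is reached.

```python
from collections import Counter

def majority_vote(grades):
    """Return the most common DR grade among independent graders.

    grades: list of integer severity grades on the 0-4 scale.
    Ties are broken toward the lower (less severe) grade -- an arbitrary
    choice for this sketch, not a rule taken from the paper.
    """
    counts = Counter(grades)
    top = max(counts.values())
    return min(g for g, c in counts.items() if c == top)

# A clear majority resolves itself; full disagreement falls back to the
# tie-break, which is exactly the kind of case adjudication is meant to settle.
print(majority_vote([2, 2, 3]))   # -> 2
print(majority_vote([1, 2, 3]))   # -> 1 (tie-break only; no true consensus)
```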

Methodological Approach

The research utilized a comprehensive dataset of retinal images obtained from multiple clinical sites, including EyePACS-affiliated clinics and major eye hospitals in India. Images were graded under various protocols: by EyePACS-certified graders, by ophthalmologists, and through an adjudicated consensus among specialist retinal graders. Training the deep learning model involved partitioning the images into training and tuning sets, underscoring the importance of high-quality labels for effective model learning and validation.
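
A minimal sketch of how such a split might be organized is shown below, assuming a hypothetical per-image metadata table with `image_id`, `grade`, and `source` columns. The column names and label-source values are invented for illustration and are not taken from the paper's data pipeline.

```python
import pandas as pd

# Hypothetical per-image metadata; column names and source labels are invented.
images = pd.DataFrame({
    "image_id": ["img_001", "img_002", "img_003", "img_004"],
    "grade":    [0, 2, 3, 1],   # five-point DR severity (0-4)
    "source":   ["eyepacs_grader", "ophthalmologist",
                 "adjudicated", "eyepacs_grader"],
})

# Reserve the small, expensively adjudicated subset for tuning/evaluation and
# train on the larger pool of individually graded images.
tune_set = images[images["source"] == "adjudicated"]
train_set = images[images["source"] != "adjudicated"]
```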

The authors enhanced a convolutional neural network (CNN) for DR detection, extending previous work by Gulshan et al. The CNN was trained to predict a five-point DR severity scale, with improvements such as increased input resolution and more extensive data augmentation. An ensemble approach, averaging predictions from multiple trained networks, further improved model accuracy.
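
The sketch below illustrates the general shape of such a setup: a CNN backbone with a five-way softmax head, and an ensemble step that averages per-class probabilities across several trained models. InceptionV3 with random weights and the specific input resolution are stand-ins chosen for availability, not the paper's actual architecture or configuration.

```python
import numpy as np
import tensorflow as tf

NUM_CLASSES = 5    # five-point DR severity scale
INPUT_SIZE = 587   # illustrative high-resolution input size, not the paper's exact value

def build_member():
    """Build one ensemble member: a generic CNN backbone with a 5-way softmax head.

    InceptionV3 is used purely as a readily available stand-in for the
    architecture extended from Gulshan et al.
    """
    backbone = tf.keras.applications.InceptionV3(
        include_top=False, weights=None,
        input_shape=(INPUT_SIZE, INPUT_SIZE, 3), pooling="avg")
    outputs = tf.keras.layers.Dense(NUM_CLASSES, activation="softmax")(backbone.output)
    return tf.keras.Model(backbone.input, outputs)

def ensemble_predict(models, images):
    """Average per-class probabilities over several trained models (the ensemble step)."""
    probs = np.stack([m.predict(images, verbose=0) for m in models], axis=0)
    return probs.mean(axis=0)   # shape: (num_images, NUM_CLASSES)
```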

Results and Comparison

A significant finding of this paper is the performance improvement obtained when adjudicated consensus is used as the ground truth. The model demonstrated proficiency comparable to that of U.S. board-certified ophthalmologists and retinal specialists. For moderate or worse DR, the area under the curve (AUC) increased from 0.942 to 0.986 when the adjudicated consensus was applied. Higher-resolution input images also improved model accuracy, though gains were marginal above 450 x 450 pixels.
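
Computing a binary AUC such as "moderate or worse DR" from a five-class model is straightforward; the sketch below shows one common way to do it, assuming the severity scale is encoded as integers 0-4 and that summing the probabilities of grades 2-4 is an acceptable binarization (the paper's exact scoring rule is not reproduced here).

```python
import numpy as np
from sklearn.metrics import roc_auc_score

def moderate_or_worse_auc(reference_grades, predicted_probs):
    """AUC for the binary task 'moderate or worse DR' (grade >= 2 on a 0-4 scale).

    reference_grades: integer reference-standard grades, shape (n,).
    predicted_probs:  per-class probabilities from the model, shape (n, 5).
    """
    y_true = (np.asarray(reference_grades) >= 2).astype(int)
    y_score = np.asarray(predicted_probs)[:, 2:].sum(axis=1)  # P(grade 2, 3, or 4)
    return roc_auc_score(y_true, y_score)
```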

The comparison also revealed systematic differences in grading behavior: the automated method tended to overgrade referable DR relative to human graders. This suggests integration strategies in which a machine learning model serves as a primary screening tool and human graders verify positive cases, balancing sensitivity against specificity.
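
One way to realize such a screening-first workflow is to pick an operating point that guarantees a minimum sensitivity and route everything flagged positive to human review. The sketch below shows one possible threshold-selection routine; the function name, the 0.95 sensitivity target, and the error handling are assumptions made for illustration.

```python
import numpy as np
from sklearn.metrics import roc_curve

def pick_screening_threshold(y_true, y_score, min_sensitivity=0.95):
    """Pick the largest decision threshold whose sensitivity meets a target.

    y_true:  binary reference labels for referable DR (1 = referable).
    y_score: model scores for the positive class.
    Returns (threshold, sensitivity, specificity) at the chosen operating point.
    """
    fpr, tpr, thresholds = roc_curve(y_true, y_score)
    meets_target = tpr >= min_sensitivity
    if not meets_target.any():
        raise ValueError("No operating point reaches the requested sensitivity.")
    idx = np.argmax(meets_target)  # thresholds are sorted high-to-low, so this is the largest
    return thresholds[idx], tpr[idx], 1.0 - fpr[idx]

# Images scoring at or above the returned threshold would be referred for human review.
```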

Implications and Future Directions

The implications of this research are twofold: it underscores the critical role of accurate reference standards in training and validating deep learning models for medical imaging, and it signals a developmental pathway for AI systems intended for clinical applications. The paper offers insight into refining ground-truth determination processes, which is essential for improving machine learning performance in classifying medical images.

Future research will benefit from exploring the inclusion of additional imaging modalities, such as OCT, integrating data from diverse demographic and geographical populations, and evaluating the impact of using adjudicated consensus grades directly in the large-scale training dataset. Enhancing model generalizability and establishing robust, cost-efficient grading protocols remain key challenges for advancing this field.

In conclusion, the paper provides a structured pathway toward improved diabetic retinopathy detection with machine learning by refining grading processes and employing specialist consensus for reference standards. This foundation will be essential for developing future AI-driven healthcare solutions.