Replication study: Development and validation of deep learning algorithm for detection of diabetic retinopathy in retinal fundus photographs (1803.04337v3)

Published 12 Mar 2018 in cs.CV

Abstract: Replication studies are essential for validation of new methods, and are crucial to maintain the high standards of scientific publications, and to use the results in practice. We have attempted to replicate the main method in 'Development and validation of a deep learning algorithm for detection of diabetic retinopathy in retinal fundus photographs' published in JAMA 2016; 316(22). We re-implemented the method since the source code is not available, and we used publicly available data sets. The original study used non-public fundus images from EyePACS and three hospitals in India for training. We used a different EyePACS data set from Kaggle. The original study used the benchmark data set Messidor-2 to evaluate the algorithm's performance. We used the same data set. In the original study, ophthalmologists re-graded all images for diabetic retinopathy, macular edema, and image gradability. There was one diabetic retinopathy grade per image for our data sets, and we assessed image gradability ourselves. Hyper-parameter settings were not described in the original study. But some of these were later published. We were not able to replicate the original study. Our algorithm's area under the receiver operating curve (AUC) of 0.94 on the Kaggle EyePACS test set and 0.80 on Messidor-2 did not come close to the reported AUC of 0.99 in the original study. This may be caused by the use of a single grade per image, different data, or different not described hyper-parameter settings. This study shows the challenges of replicating deep learning, and the need for more replication studies to validate deep learning methods, especially for medical image analysis. Our source code and instructions are available at: https://github.com/mikevoets/jama16-retina-replication

Citations (168)

Summary

  • The paper attempts to replicate a deep learning model for diabetic retinopathy detection but reports substantially lower performance (AUC 0.94 on Kaggle EyePACS and 0.80 on Messidor-2, versus 0.99 in the original study).
  • The performance gap is attributed to factors such as proprietary versus public datasets, single-grade rather than multi-grade labels, undisclosed hyper-parameters, and differences in image normalization.
  • This replication study underscores the critical need for comprehensive methodological transparency and data accessibility to ensure reproducibility in deep learning research, especially for clinical applications.

Analysis of Replication Study: Deep Learning for Diabetic Retinopathy Detection

In this replication paper, Voets, Møllersen, and Bongo attempt to validate the results of a landmark paper on a deep learning algorithm for detecting diabetic retinopathy (DR) from retinal fundus images. The original study by Gulshan et al. demonstrated the use of convolutional neural networks (CNNs) in medical diagnostics, achieving high performance on proprietary datasets. The replication attempt, however, reveals significant challenges inherent in reproducing deep learning results, particularly in medical image analysis.

The original paper trained a CNN to detect referable diabetic retinopathy (rDR) in retinal images. The algorithm was trained on a large dataset (118,419 images) and validated on two test sets, EyePACS-1 and Messidor-2, achieving an AUC of 0.99 on both. The reported operating points balanced sensitivity and specificity, demonstrating the feasibility of deep learning for clinical evaluation of DR.
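
In code terms, this kind of pipeline reduces to a CNN trained for binary rDR classification. The sketch below, which assumes an Inception-v3 backbone, a 299x299 input, and a sigmoid output head, only illustrates the general setup; it is not either paper's exact architecture or hyper-parameter configuration.

```python
# Minimal, illustrative rDR classifier sketch (not either paper's exact model).
# Backbone, input size, head, optimizer, and learning rate are assumptions.
import tensorflow as tf

def build_rdr_model(input_shape=(299, 299, 3)):
    base = tf.keras.applications.InceptionV3(
        include_top=False, weights="imagenet", input_shape=input_shape)
    x = tf.keras.layers.GlobalAveragePooling2D()(base.output)
    out = tf.keras.layers.Dense(1, activation="sigmoid")(x)  # P(referable DR)
    model = tf.keras.Model(base.input, out)
    model.compile(optimizer=tf.keras.optimizers.Adam(1e-4),
                  loss="binary_crossentropy",
                  metrics=[tf.keras.metrics.AUC(name="auc")])
    return model
```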

In contrast, the replication reported markedly lower performance, with an AUC of 0.94 on the Kaggle EyePACS test set and 0.80 on Messidor-2. The authors attribute the gap partly to the absence of key methodological details in the original paper, including hyper-parameter settings, image normalization procedures, and grading protocols, and partly to differences in the data itself, since the replication relied on publicly accessible images from Kaggle EyePACS and Messidor-2 with a single grade per image.
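
Since the headline comparison is the AUC for binary rDR on each test set, a generic evaluation snippet (hypothetical variable names; not the papers' evaluation code) makes the metric concrete:

```python
# Hypothetical evaluation snippet, not taken from the replication repository.
# y_true: binary referable-DR labels for one test set (e.g. Messidor-2);
# y_score: the model's predicted rDR probabilities for the same images.
import numpy as np
from sklearn.metrics import roc_auc_score

def evaluate(y_true, y_score, threshold=0.5):
    y_true = np.asarray(y_true)
    y_pred = (np.asarray(y_score) >= threshold).astype(int)
    auc = roc_auc_score(y_true, y_score)
    sensitivity = np.mean(y_pred[y_true == 1] == 1)  # true-positive rate
    specificity = np.mean(y_pred[y_true == 0] == 0)  # true-negative rate
    return auc, sensitivity, specificity
```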

The root causes of the performance gap are multifaceted:

  • Data Disparity: The original paper utilized a proprietary dataset with multi-grade annotations for each image, while the replication paper employed a public dataset with single-grade annotations.
  • Hyper-parameter Indeterminacy: Hyper-parameters critical to neural network training were not fully disclosed in the original paper, potentially leading to suboptimal training in the replication attempt.
  • Normalization Methods: The replication explored various image normalization techniques, concluding that normalizing images to a [-1, 1] range was most effective, yet still insufficient to match the original paper's results.
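
Of these, normalization is the most concrete to illustrate. A minimal sketch of the scheme the replication authors found most effective, scaling 8-bit pixel intensities from [0, 255] to [-1, 1], is shown below; the resize to 299x299 and the file-loading details are assumptions, not taken from the paper.

```python
# Illustrative preprocessing sketch: map 8-bit fundus images to [-1, 1].
# Resizing to 299x299 (a typical Inception-v3 input) is an assumption.
import numpy as np
from PIL import Image

def load_normalized(path, size=(299, 299)):
    img = Image.open(path).convert("RGB").resize(size)
    x = np.asarray(img, dtype=np.float32)
    return x / 127.5 - 1.0  # 0 -> -1.0, 255 -> 1.0
```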

The failure to reproduce the original findings has broad implications for reproducibility in deep learning research and underscores the need for comprehensive methodological transparency. The paper urges that both datasets and source code be made available so that other researchers can conduct robust validations and build on published methods.

Practically, the findings counsel caution for stakeholders who rely on published results for clinical applications, since methodological opacity and restricted data access can undermine the replicability such applications require. Theoretically, the paper contributes to ongoing discussions about robust experimental documentation, which is essential for replicability in fields where algorithms have critical real-world consequences.

Looking ahead, further replication studies should focus on overcoming these challenges by promoting open access to training data and implementation details. Such efforts would not only improve reproducibility but also encourage collaboration on improving algorithmic performance for AI applications in healthcare.

This replication effort exemplifies the complexities in deep learning paper verification and underscores the necessity for a collaborative approach in the pursuit of scientific validity and methodological rigor.