
Exploring large scale public medical image datasets (1907.12720v1)

Published 30 Jul 2019 in eess.IV, cs.CV, and cs.LG

Abstract: Rationale and Objectives: Medical artificial intelligence systems are dependent on well characterised large scale datasets. Recently released public datasets have been of great interest to the field, but pose specific challenges due to the disconnect they cause between data generation and data usage, potentially limiting the utility of these datasets. Materials and Methods: We visually explore two large public datasets, to determine how accurate the provided labels are and whether other subtle problems exist. The ChestXray14 dataset contains 112,120 frontal chest films, and the MURA dataset contains 40,561 upper limb radiographs. A subset of around 700 images from both datasets was reviewed by a board-certified radiologist, and the quality of the original labels was determined. Results: The ChestXray14 labels did not accurately reflect the visual content of the images, with positive predictive values mostly between 10% and 30% lower than the values presented in the original documentation. There were other significant problems, with examples of hidden stratification and label disambiguation failure. The MURA labels were more accurate, but the original normal/abnormal labels were inaccurate for the subset of cases with degenerative joint disease, with a sensitivity of 60% and a specificity of 82%. Conclusion: Visual inspection of images is a necessary component of understanding large image datasets. We recommend that teams producing public datasets should perform this important quality control procedure and include a thorough description of their findings, along with an explanation of the data generating procedures and labelling rules, in the documentation for their datasets.

Citations (162)

Summary

Analysis of Public Medical Image Datasets: Label Accuracy and Challenges

The paper by Luke Oakden-Rayner provides a meticulous examination of large-scale public medical image datasets, particularly focusing on the ChestXray14 and MURA datasets. The emphasis of this research lies in identifying and scrutinizing the accuracy and quality of the labels accompanying these datasets, which are paramount for the training of reliable medical AI systems. The paper reveals significant discrepancies and limitations in the dataset labels, underscoring the importance of thorough quality control and documentation in dataset development.

The examination of the ChestXray14 dataset, comprising 112,120 frontal chest radiographs, demonstrates substantial inconsistency between the labels and the visual content of the images. The positive predictive values (PPVs) of the labels were mostly 10-30% lower than those documented in the original release. This discrepancy can be attributed to inaccuracies in the natural language processing techniques employed to derive labels from clinical reports, compounded by the fact that the reports themselves often omit findings that are deemed clinically irrelevant or already known to the treating clinicians.
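The review procedure amounts to estimating a positive predictive value: of the images the dataset labels positive for a finding, what fraction does the reviewing radiologist confirm? A minimal sketch of that calculation, using illustrative label lists rather than the paper's data:

```python
def positive_predictive_value(dataset_labels, reviewer_labels):
    """PPV: fraction of dataset-positive cases that the reviewer confirms.

    Both arguments are parallel sequences of 0/1 labels, where the
    reviewer's judgement is treated as the reference standard.
    """
    if len(dataset_labels) != len(reviewer_labels):
        raise ValueError("label sequences must have the same length")
    # Keep only the cases the dataset calls positive, then count how
    # many of those the reviewer agreed with.
    confirmed = [r for d, r in zip(dataset_labels, reviewer_labels) if d]
    if not confirmed:
        raise ValueError("no dataset-positive cases to evaluate")
    return sum(confirmed) / len(confirmed)

# Illustrative example: 5 images labelled positive for a finding,
# of which the reviewing radiologist confirms 3.
dataset = [1, 1, 1, 1, 1, 0, 0]
reviewer = [1, 0, 1, 1, 0, 0, 1]
print(positive_predictive_value(dataset, reviewer))  # 0.6
```

A PPV well below 1.0 on a visually reviewed subset, as reported for most ChestXray14 labels, means a large share of "positive" training examples do not actually show the finding.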

Label disambiguation failures further expose the limitations of the labeling methodology: the "emphysema" label was predominantly applied to cases of subcutaneous emphysema rather than the intended pulmonary emphysema. Similarly, the split between the airspace-opacity labels "consolidation" and "infiltration" proved arbitrary, as the two lack a clear clinical differentiation. The dataset also exhibited hidden stratification, notably among cases labeled "pneumothorax," where the presence of chest drains, marking already-treated cases, incorrectly inflated the measured PPVs.
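Hidden stratification of this kind can be surfaced by recomputing the accuracy metric within clinically distinct subgroups. A hypothetical sketch, using made-up records with a `has_drain` flag (not the paper's data):

```python
def ppv(records):
    """PPV over a list of dicts with 0/1 'label' and 'reviewer' keys."""
    confirmed = [r["reviewer"] for r in records if r["label"]]
    return sum(confirmed) / len(confirmed) if confirmed else float("nan")

# Hypothetical pneumothorax-labelled cases: 'label' is the dataset label,
# 'reviewer' the radiologist's confirmation, 'has_drain' whether a chest
# drain is visible (i.e. the pneumothorax was already treated).
cases = [
    {"label": 1, "reviewer": 1, "has_drain": True},
    {"label": 1, "reviewer": 1, "has_drain": True},
    {"label": 1, "reviewer": 1, "has_drain": True},
    {"label": 1, "reviewer": 1, "has_drain": False},
    {"label": 1, "reviewer": 0, "has_drain": False},
    {"label": 1, "reviewer": 0, "has_drain": False},
]

overall = ppv(cases)
with_drain = ppv([c for c in cases if c["has_drain"]])
without_drain = ppv([c for c in cases if not c["has_drain"]])
# The drain subgroup is near-perfect while the untreated subgroup is far
# weaker, so the headline PPV masks the split that matters clinically.
print(overall, with_drain, without_drain)
```

The design point is simply that a single aggregate metric can hide a subgroup on which the labels, or a model trained on them, perform much worse.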

On the other hand, the MURA dataset, consisting of 40,561 upper limb radiographs, exhibited higher label accuracy overall, though it still suffered from deficiencies related to degenerative joint disease: for this subset of cases, the normal/abnormal labels achieved a sensitivity of only 60% and a specificity of 82%. A likely explanation is that degenerative changes in older patients were reported as normal for age, leading to systematic under-labelling of abnormality.
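The sensitivity and specificity figures follow from the standard confusion-matrix definitions, treating the radiologist's review as the reference standard. A minimal sketch, with counts that are purely illustrative and chosen only to reproduce the reported 60%/82% rates:

```python
def sensitivity_specificity(tp, fp, tn, fn):
    """Sensitivity = TP / (TP + FN); specificity = TN / (TN + FP)."""
    sensitivity = tp / (tp + fn)
    specificity = tn / (tn + fp)
    return sensitivity, specificity

# Illustrative counts (not the paper's raw numbers): 50 truly abnormal
# cases of which 30 were labelled abnormal, and 50 truly normal cases
# of which 41 were labelled normal.
sens, spec = sensitivity_specificity(tp=30, fp=9, tn=41, fn=20)
print(f"sensitivity={sens:.2f}, specificity={spec:.2f}")
# sensitivity=0.60, specificity=0.82
```

A sensitivity of 0.60 means 40% of truly abnormal degenerative cases carried a "normal" label, which is consistent with the under-reporting explanation above.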

The analysis conducted by Oakden-Rayner posits that these labeling issues stem from a disconnect between data generation and usage, where medical expertise is indispensable for accurate interpretation and labeling of medical images. The paper advocates for the integration of visual inspection by qualified medical professionals as a quality assurance measure, ensuring the reliability of datasets prior to release. Furthermore, the development process, including labeling rules and any inherent limitations, should be meticulously documented.

By addressing these concerns, the potential utility of such datasets for the development of clinically applicable AI systems can be significantly enhanced. Future directions may involve the creation of visually verified test subsets, which could offer a reliable benchmark for evaluating model performance independent of flawed label sets. The establishment of standardized labeling protocols and documentation procedures is also highlighted as an essential practice that could facilitate the production and dissemination of high-quality medical datasets.

Ultimately, the paper underscores a crucial aspect of medical AI research: the integrity of datasets. Without accurate and well-characterized data, the advancement of AI systems in medical diagnostics and treatment is inherently limited. Therefore, a collaborative effort involving clinicians, data scientists, and AI researchers is necessary to overcome these challenges and unlock the full potential of AI in healthcare.

