- The paper introduced a large-scale challenge with over 12,500 images and novel metrics like Thresholded Jaccard and balanced accuracy for benchmarking skin lesion analysis.
- The paper revealed that despite high average segmentation scores, over 10% of images were not correctly segmented, underscoring challenges with certain lesion types.
- The paper highlighted underperformance in lesion attribute detection, stressing the need for clearer clinical definitions and improvements in machine learning methodologies.
Skin Lesion Analysis Toward Melanoma Detection 2018: A Challenge Hosted by the International Skin Imaging Collaboration (ISIC)
The 2018 ISIC Challenge, held at the MICCAI conference, represents a significant benchmarking effort in the domain of automated skin lesion analysis for melanoma detection. This summary discusses the methodologies, evaluation criteria, and results associated with the three main tasks in the challenge: lesion segmentation, lesion attribute detection, and disease classification.
Overview
The 2018 ISIC Challenge expanded the dataset and diagnostic labels from previous years, providing over 12,500 images for training and over 2,000 images for testing across three tasks. With 900 registered teams and 279 total submissions across the three tasks, this represents the largest challenge in this field to date.
Tasks and Evaluation Criteria
Lesion Segmentation: Task 1 involved segmenting lesions from dermoscopic images. The training set included 2,594 images with ground truth masks, while the validation and test sets included 100 and 1,000 images, respectively. The evaluation introduced "Thresholded Jaccard" to better account for interobserver variability: segmentations with a Jaccard index below 0.65 were counted as failures and received a score of zero.
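The thresholded metric described above can be sketched as follows. This is a minimal illustration assuming binary NumPy masks, not the organizers' reference implementation; the function names are illustrative:

```python
import numpy as np

def jaccard(pred, truth):
    """Jaccard index (intersection over union) between two binary masks."""
    pred, truth = pred.astype(bool), truth.astype(bool)
    union = np.logical_or(pred, truth).sum()
    if union == 0:
        return 1.0  # both masks empty: treat as perfect agreement
    return np.logical_and(pred, truth).sum() / union

def thresholded_jaccard(pred, truth, threshold=0.65):
    """Raw Jaccard, but zeroed out below the failure threshold,
    so segmentations outside acceptable agreement get no credit."""
    j = jaccard(pred, truth)
    return j if j >= threshold else 0.0
```

The effect is that a segmentation scoring, say, 0.60 contributes nothing to an algorithm's average, rather than inflating it with partial credit.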
Lesion Attribute Detection: Task 2 focused on detecting attributes within lesions. Training data comprised 2,594 images with masks for 5 attributes. As some attributes may not be present in certain images, the evaluation metric was adapted to compute Jaccard over the entire dataset rather than per image.
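The dataset-level adaptation above can be sketched by pooling pixels across all images before dividing, which avoids undefined or trivial per-image scores when an attribute is absent. A rough illustration under the same binary-mask assumption (illustrative names, not the official evaluation code):

```python
import numpy as np

def dataset_jaccard(preds, truths):
    """Jaccard computed over the entire dataset: accumulate
    intersection and union pixel counts across all images,
    then divide once at the end."""
    inter, union = 0, 0
    for p, t in zip(preds, truths):
        p, t = p.astype(bool), t.astype(bool)
        inter += np.logical_and(p, t).sum()
        union += np.logical_or(p, t).sum()
    return 1.0 if union == 0 else inter / union
```

With per-image averaging, an image where the attribute is absent and the prediction is empty would need an arbitrary convention (0 or 1); pooling sidesteps that choice entirely.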
Disease Classification: Task 3 involved classifying lesions into seven disease categories. The training set included 10,015 images, and the test sets were partitioned into internal (comprising data similar to training) and external (comprising data from new sources). Balanced accuracy was used to mitigate the impact of dataset prevalence on performance metrics.
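Balanced accuracy, as used above, is the mean of per-class recalls, so each disease category contributes equally regardless of its prevalence. A minimal sketch (equivalent in spirit to scikit-learn's `balanced_accuracy_score`, written out here for clarity):

```python
import numpy as np

def balanced_accuracy(y_true, y_pred):
    """Mean of per-class recalls: rare classes weigh as much
    as common ones, unlike raw accuracy."""
    recalls = []
    for c in np.unique(y_true):
        mask = (y_true == c)
        recalls.append(np.mean(y_pred[mask] == c))
    return float(np.mean(recalls))

# A degenerate classifier that always predicts the majority class
# scores 90% raw accuracy on a 9:1 split but only 50% balanced accuracy.
y_true = np.array([0] * 9 + [1])
y_pred = np.zeros(10, dtype=int)
```

This is exactly the failure mode the challenge guards against: on a test set dominated by benign nevi, always predicting "nevus" would look strong under raw accuracy.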
Results and Findings
Lesion Segmentation: With 112 submissions, the highest-performing segmentation algorithm achieved a Thresholded Jaccard of 0.802. Despite high average Jaccard scores (exceeding 0.8), even top methods fell below the 0.65 failure threshold on more than 10% of images. This indicates that high-performing algorithms still face significant challenges with certain types of lesions, particularly seborrheic keratoses.
Lesion Attribute Detection: Only 26 submissions were made to this task, with the top-performing algorithm achieving an average Jaccard of 0.473. The low performance here could be attributed to the poor inter-observer reliability of clinical dermoscopic attributes. This suggests a need either for more robust clinical definitions of dermoscopic features or for using machine learning to refine those definitions.
Disease Classification: This task received 141 submissions, with the highest balanced accuracy at 0.885. For many algorithms, performance differed markedly between the internal and external test sets, exposing gaps in generalization; notably, however, the top-performing algorithms showed little overfitting to internal data sources, suggesting that robust approaches can generalize well across diverse datasets. Furthermore, balanced accuracy proved crucial for ranking algorithms appropriately, as simpler measures such as raw accuracy or AUC can be misleading under class imbalance.
Implications and Future Directions
The evaluation metrics and task designs introduced in the 2018 challenge have set a new standard for future challenges. First, the Thresholded Jaccard metric offers a more nuanced view of segmentation performance by giving no credit to segmentations that fall below a clinically motivated agreement threshold. For classification, balanced accuracy and multiple test partitions provide deeper insights into the robustness and generalizability of algorithms.
Poor performance on attribute detection highlights an ongoing challenge in both clinical dermatology and machine learning. Further research must focus on improving the consistency of attribute definitions and exploring machine learning's role in achieving this.
From a regulatory perspective, these findings are critical. As machine learning tools see wider adoption in healthcare, understanding both the strengths and limitations of existing algorithms is essential for ensuring that these tools are safe and effective in clinical applications.
Conclusion
The ISIC 2018 Challenge represents a crucial step forward in the automated analysis of skin lesions. The methodologies and results presented in this challenge provide valuable insights into the state of current algorithms, highlighting significant areas of success and ongoing challenges. Future work and subsequent challenges will benefit from these findings, advancing the field toward more reliable and generalizable machine learning applications in medical imaging.