- The paper demonstrates that pretraining on the diverse SurgeNet dataset significantly outperforms ImageNet initialization, with gains of up to 36.8%.
- Methodologically, it pretrains a CAFormer-S18 model with the DINO framework on over 2.6 million frames from varied surgical procedures.
- Results indicate enhanced generalization and finer segmentation of anatomical structures, underscoring the value of diverse pretraining in surgical vision.
Exploring the Effect of Dataset Diversity in Self-Supervised Learning for Surgical Computer Vision
The paper "Exploring the Effect of Dataset Diversity in Self-Supervised Learning for Surgical Computer Vision" investigates the impact of dataset diversity within self-supervised learning (SSL) frameworks, specifically for applications in surgical computer vision. Recognizing the scarcity of representative annotated data as a major limitation in the field, the authors examine the improvements achievable by leveraging diverse, unannotated data.
Objective and Methodology
The primary objective of this research was to investigate how dataset diversity affects self-supervised learning models designed for surgical computer vision tasks. To achieve this, the authors constructed SurgeNet, a comprehensive dataset consisting of over 2.6 million frames derived from various surgical procedures. This dataset was utilized to pretrain a CAFormer-S18 model using the DINO framework, followed by evaluations on three downstream surgical applications: laparoscopic cholecystectomy (LC), robot-assisted radical prostatectomy (RARP), and robot-assisted minimally invasive esophagectomy (RAMIE).
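To make the pretraining setup concrete, the following is a minimal NumPy sketch of the DINO objective, not the authors' code: linear projections stand in for the CAFormer-S18 backbone, and the temperatures, momentum, and centering rate are illustrative defaults rather than the paper's settings. The core ingredients are shown: a teacher and a student see different augmented views, the student is trained to match the teacher's centered, sharpened output distribution, and the teacher tracks the student via an exponential moving average.

```python
import numpy as np

rng = np.random.default_rng(0)

def softmax(z, temp):
    z = z / temp - (z / temp).max(axis=-1, keepdims=True)
    e = np.exp(z)
    return e / e.sum(axis=-1, keepdims=True)

# Toy linear "networks" stand in for the CAFormer-S18 backbone + projection head.
dim, out_dim = 16, 8
W_student = rng.normal(size=(dim, out_dim))
W_teacher = rng.normal(size=(dim, out_dim))  # in full DINO both share one init
center = np.zeros(out_dim)                   # running center of teacher outputs

def dino_loss(view_t, view_s, t_teacher=0.04, t_student=0.1):
    """Cross-entropy between sharpened teacher and student distributions."""
    global center
    t_logits = view_t @ W_teacher
    t_out = softmax(t_logits - center, t_teacher)        # centered + sharpened
    s_out = softmax(view_s @ W_student, t_student)
    center = 0.9 * center + 0.1 * t_logits.mean(axis=0)  # anti-collapse centering
    return -(t_out * np.log(s_out + 1e-12)).sum(axis=-1).mean()

def ema_update(momentum=0.996):
    """Teacher weights track the student via an exponential moving average."""
    global W_teacher
    W_teacher = momentum * W_teacher + (1 - momentum) * W_student

frames = rng.normal(size=(4, dim))                     # a tiny "batch of frames"
view_a = frames + 0.1 * rng.normal(size=frames.shape)  # two augmented views
view_b = frames + 0.1 * rng.normal(size=frames.shape)

d0 = np.linalg.norm(W_teacher - W_student)
loss = dino_loss(view_a, view_b)
ema_update()
d1 = np.linalg.norm(W_teacher - W_student)
print(loss > 0 and d1 < d0)  # True: positive loss; teacher moved toward student
```

In the actual method, the student is updated by gradient descent on this loss over millions of SurgeNet frames with multi-crop augmentation; the sketch only illustrates the loss, centering, and EMA mechanics.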
Self-Supervised Learning Datasets:
- SurgeNet: 2,636,790 frames from 13 distinct surgical procedures.
- SurgeNet-CHOLEC: 250,655 frames from laparoscopic cholecystectomy procedures.
- SurgeNet-RAMIE: 377,287 frames from robot-assisted minimally invasive esophagectomy.
- SurgeNet-RARP: 382,416 frames from robot-assisted radical prostatectomy procedures.
These datasets were aggregated from both public and private sources, ensuring high data diversity and quantity.
Downstream Datasets:
- CholecSeg8k: 6,800 training frames and 1,280 test frames annotated for semantic segmentation from LC surgeries.
- RAMIE: 749 training frames and 120 test frames focusing on anatomy recognition in RAMIE procedures.
- RARP: 252 training frames and 30 test frames for anatomy recognition in RARP procedures.
Results and Discussion
The authors' findings strongly indicate that dataset diversity substantially improves SSL performance in surgical computer vision tasks. When comparing procedure-specific pretraining against ImageNet-based initialization, significant improvements were observed:
- LC: 13.8% improvement.
- RAMIE: 9.5% improvement.
- RARP: 36.8% improvement.
Moreover, extending the pretraining dataset to include more heterogeneous data (i.e., the entire SurgeNet dataset) resulted in further enhancement:
- LC: Additional 5.0% improvement.
- RAMIE: Additional 5.2% improvement.
- RARP: Additional 2.5% improvement.
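For clarity on how such percentages are read, a relative improvement compares two scores against the baseline. The snippet below uses hypothetical segmentation scores (the actual metric values are not reproduced here) to show the arithmetic:

```python
def relative_improvement(baseline: float, score: float) -> float:
    """Relative gain of `score` over `baseline`, as a percentage."""
    return 100.0 * (score - baseline) / baseline

# Hypothetical scores: ImageNet baseline 0.500 vs. procedure-specific 0.569
print(round(relative_improvement(0.500, 0.569), 1))  # 13.8
```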
Crucially, these results underscore that increased dataset diversity not only benefits model performance but also enhances generalization capabilities, particularly in scenarios involving smaller annotated datasets. Visual analyses conducted in the paper confirm that models pretrained on SurgeNet are more proficient at identifying and segmenting smaller, intricately detailed anatomical structures compared to those initialized with ImageNet.
The t-SNE visualization highlighted in the paper further substantiates the hypothesis that diverse pretraining enables the model to learn more nuanced, procedure-specific representations without explicit supervision. This finding is pivotal, as it suggests that such pretraining methodologies can yield models that are more adaptable to varied surgical contexts.
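An analysis of this kind can be reproduced in outline with scikit-learn's t-SNE. The sketch below uses synthetic stand-in embeddings (the paper's actual encoder features are not available here); in practice, the 128-D vectors would come from the SurgeNet-pretrained encoder applied to frames of each procedure:

```python
import numpy as np
from sklearn.manifold import TSNE

rng = np.random.default_rng(42)

# Hypothetical frame embeddings: 3 procedures, 50 frames each, 128-D features.
# Stand-ins for features from a SurgeNet-pretrained encoder.
procedures = ["LC", "RAMIE", "RARP"]
embeddings = np.vstack([
    rng.normal(loc=3.0 * i, scale=1.0, size=(50, 128)) for i in range(3)
])
labels = np.repeat(procedures, 50)

# Project to 2-D for visualization; well-separated clusters suggest the
# encoder has learned procedure-specific structure without any labels.
coords = TSNE(n_components=2, perplexity=30, random_state=0).fit_transform(embeddings)
print(coords.shape)  # (150, 2)
```

Plotting `coords` colored by `labels` then gives the kind of cluster map the paper uses to argue that procedure-specific structure emerges from self-supervision alone.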
Conclusion and Future Implications
This research offers compelling evidence that dataset diversity plays a crucial role in the effectiveness of SSL models in surgical computer vision. By constructing and utilizing SurgeNet, the authors demonstrated substantial performance gains across multiple tasks, highlighting the importance of diverse, comprehensive pretraining datasets. Given these results, future research should explore the potential of SurgeNet-pretrained models in other domains of surgical applications, such as phase recognition and action segmentation, where temporal dynamics are equally critical.
Additionally, there is an opportunity to investigate whether the learned representations from SurgeNet capture more complex relationships and temporal cues beyond spatial segmentation. Such inquiries could lead to more robust and versatile models for a broader array of surgical computer vision applications. The publicly available SurgeNet pretrained weights offer a valuable resource for the community, potentially serving as a superior alternative to traditional ImageNet-based initialization models in surgical contexts.