Foundation Model for Endoscopy Video Analysis via Large-scale Self-supervised Pre-train (2306.16741v4)

Published 29 Jun 2023 in cs.CV

Abstract: Foundation models have exhibited remarkable success in various applications, such as disease diagnosis and text report generation. To date, a foundation model for endoscopic video analysis is still lacking. In this paper, we propose Endo-FM, a foundation model specifically developed using massive endoscopic video data. First, we build a video transformer, which captures both local and global long-range dependencies across spatial and temporal dimensions. Second, we pre-train our transformer model using global and local views via a self-supervised manner, aiming to make it robust to spatial-temporal variations and discriminative across different scenes. To develop the foundation model, we construct a large-scale endoscopy video dataset by combining 9 publicly available datasets and a privately collected dataset from Baoshan Branch of Renji Hospital in Shanghai, China. Our dataset overall consists of over 33K video clips with up to 5 million frames, encompassing various protocols, target organs, and disease types. Our pre-trained Endo-FM can be easily adopted for a given downstream task via fine-tuning by serving as the backbone. With experiments on 3 different types of downstream tasks, including classification, segmentation, and detection, our Endo-FM surpasses the current state-of-the-art (SOTA) self-supervised pre-training and adapter-based transfer learning methods by a significant margin, such as VCL (3.1% F1, 4.8% Dice, and 5.5% F1 for classification, segmentation, and detection) and ST-Adapter (5.9% F1, 9.6% Dice, and 9.9% F1 for classification, segmentation, and detection). Code, datasets, and models are released at https://github.com/med-air/Endo-FM.

Citations (38)

Summary

The paper presents Endo-FM, a foundation model for endoscopy video analysis built on a video transformer architecture.
It is pre-trained with a self-supervised teacher-student strategy on a dataset of over 33,000 video clips (up to 5 million frames) drawn from 9 public datasets and one private dataset.
Across classification, segmentation, and detection, it outperforms prior self-supervised pre-training and adapter-based transfer learning methods, with gains of up to 9.9% F1 (detection, versus ST-Adapter).

Overview of the Paper on Endo-FM: A Foundation Model for Endoscopy Video Analysis

The paper introduces "Endo-FM," a foundation model tailored specifically to the analysis of endoscopic videos. Endoscopic video is increasingly used in disease diagnosis and surgical procedures, yet according to the authors no foundation model for this modality existed. Endo-FM addresses this gap with large-scale self-supervised pre-training on a dataset of endoscopic videos compiled from 9 publicly available datasets and one privately collected dataset. The combined dataset covers a range of endoscopic protocols, target organs, and disease types.

Methodological Contributions

Architecture: Endo-FM utilizes a video transformer architecture, an extension of the Vision Transformer (ViT), specifically designed to handle the spatial-temporal complexities of video data. By leveraging attention mechanisms, this architecture effectively captures long-range dependencies both spatially and temporally, making it well-suited for the dynamic content typical in endoscopic procedures.
Self-supervised Pre-training: The model is pre-trained using a novel self-supervised strategy employing global and local views to improve robustness against spatial-temporal variations encountered in endoscopic videos. Through a teacher-student framework, discrepancies between different spatial-temporal views are minimized, allowing the model to learn invariant features that are transferable across different endoscopic conditions.
Dataset Construction: The pre-training dataset consolidates nine publicly available datasets and a privately collected dataset from the Baoshan Branch of Renji Hospital in Shanghai, China. The result is over 33,000 video clips with up to 5 million frames, covering various protocols, target organs, and disease types.
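
As a rough illustration of the teacher-student idea described above, the sketch below assumes a DINO-style formulation: the teacher sees only global views, the student sees global and local views, and the student is trained to match the teacher's output distribution across different views of the same clip. The summary does not spell out the exact loss, so the function names, the temperatures, and the momentum value here are illustrative assumptions rather than Endo-FM's actual implementation. Views are represented only by frame indices and output logits; the transformer itself is omitted.

```python
import math
import random


def sample_view(num_frames, view_len, rng):
    """Pick a contiguous window of frame indices (a temporal crop of the clip)."""
    start = rng.randrange(num_frames - view_len + 1)
    return list(range(start, start + view_len))


def softmax(logits, temperature):
    """Numerically stable softmax with temperature scaling."""
    scaled = [x / temperature for x in logits]
    m = max(scaled)
    exps = [math.exp(x - m) for x in scaled]
    total = sum(exps)
    return [e / total for e in exps]


def log_softmax(logits, temperature):
    """Numerically stable log-softmax with temperature scaling."""
    scaled = [x / temperature for x in logits]
    m = max(scaled)
    log_z = m + math.log(sum(math.exp(x - m) for x in scaled))
    return [x - log_z for x in scaled]


def cross_view_loss(teacher_logits, student_logits, t_teacher=0.04, t_student=0.1):
    """Cross-entropy between the teacher's and the student's output distributions.

    The temperatures are assumed values, not taken from the paper.
    """
    p_teacher = softmax(teacher_logits, t_teacher)
    log_p_student = log_softmax(student_logits, t_student)
    return -sum(pt * lps for pt, lps in zip(p_teacher, log_p_student))


def total_loss(teacher_out, student_out):
    """Average the loss over all (teacher view, student view) pairs.

    teacher_out holds logits for the global views only. student_out holds
    logits for all views, with the same global views listed first, so a pair
    with equal indices is the same view and is skipped.
    """
    losses = [
        cross_view_loss(t, s)
        for i, t in enumerate(teacher_out)
        for j, s in enumerate(student_out)
        if i != j
    ]
    return sum(losses) / len(losses)


def ema_update(teacher_params, student_params, momentum=0.996):
    """Move teacher weights toward the student's with an exponential moving average."""
    return [momentum * t + (1.0 - momentum) * s
            for t, s in zip(teacher_params, student_params)]


# Hypothetical usage: one long (global) and one short (local) temporal crop.
rng = random.Random(0)
global_view = sample_view(num_frames=32, view_len=8, rng=rng)
local_view = sample_view(num_frames=32, view_len=4, rng=rng)
```

In this setup only the student would receive gradients; the teacher is updated through `ema_update` after each step. The loss is small when the student's prediction for one view agrees with the teacher's prediction for another, which is one way to encourage features that are stable under spatial and temporal changes.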

Experimental Results

Endo-FM was evaluated against current state-of-the-art (SOTA) self-supervised pre-training and adapter-based transfer learning methods on three downstream tasks: classification, segmentation, and detection. It surpasses VCL by 3.1% F1, 4.8% Dice, and 5.5% F1 on classification, segmentation, and detection respectively, and ST-Adapter by 5.9% F1, 9.6% Dice, and 9.9% F1 on the same tasks. These gains suggest that domain-specific pre-training on endoscopic video helps the model capture spatial-temporal relationships that transfer to downstream tasks.
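
For readers unfamiliar with the two metrics quoted above, here is a minimal sketch of how F1 and Dice are commonly computed on binary labels. It uses the standard textbook definitions; the paper's exact evaluation protocol (averaging, thresholds, matching rules for detection) is not described in this summary, and the function names are illustrative.

```python
def f1_score(pred, target):
    """F1 = 2*TP / (2*TP + FP + FN) over binary labels (1 = positive)."""
    tp = sum(1 for p, t in zip(pred, target) if p == 1 and t == 1)
    fp = sum(1 for p, t in zip(pred, target) if p == 1 and t == 0)
    fn = sum(1 for p, t in zip(pred, target) if p == 0 and t == 1)
    denom = 2 * tp + fp + fn
    return 0.0 if denom == 0 else 2 * tp / denom


def dice(pred_mask, target_mask):
    """Dice = 2*|P & T| / (|P| + |T|) over flattened binary masks."""
    intersection = sum(p * t for p, t in zip(pred_mask, target_mask))
    denom = sum(pred_mask) + sum(target_mask)
    # Two empty masks agree perfectly, so treat that case as a score of 1.
    return 1.0 if denom == 0 else 2 * intersection / denom


# Hypothetical 4-pixel example: one true positive, one false positive.
example_dice = dice([1, 1, 0, 0], [1, 0, 0, 0])
example_f1 = f1_score([1, 1, 0, 0], [1, 0, 0, 0])
```

On binary inputs the two formulas coincide (both examples evaluate to 2/3); the difference in practice is what they are applied to, with Dice typically computed over pixels of a segmentation mask and F1 over per-sample or per-detection decisions.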

Implications and Future Directions

The work has both practical and methodological implications. Practically, a pre-trained backbone such as Endo-FM could improve the accuracy of endoscopic video analysis and so support disease diagnosis and surgical decision-making. Methodologically, it offers a template for building domain-specific foundation models, which matters because medical imaging data differ substantially from the natural images and videos on which general-purpose models are trained.

Potential future directions could involve exploring the applicability of the Endo-FM methodology to other forms of medical imaging beyond endoscopy, adapting the model to other video-based medical diagnostics, or integrating text data, when available, to enhance the model's applicability in clinical environments where narrative reporting is critical. Further exploration into scalability and real-time application could also help transition the research from a theoretical framework to an operational tool in hospitals and clinics worldwide.

In summary, the paper contributes a specialized foundation model for endoscopy video analysis and reports consistent gains over prior self-supervised pre-training and adapter-based transfer learning methods. The results support the value of domain-specific pre-training in fields such as healthcare, where precision can have a substantial impact on patient outcomes.

GitHub

GitHub - med-air/Endo-FM: [MICCAI'23] Foundation Model for Endoscopy Video Analysis via Large-scale Self-supervised Pre-train (194 stars)