
VISTA3D: A Unified Segmentation Foundation Model For 3D Medical Imaging (2406.05285v3)

Published 7 Jun 2024 in cs.CV

Abstract: Foundation models for interactive segmentation in 2D natural images and videos have sparked significant interest in building 3D foundation models for medical imaging. However, the domain gaps and clinical use cases for 3D medical imaging require a dedicated model that diverges from existing 2D solutions. Specifically, such foundation models should support a full workflow that can actually reduce human effort. Treating 3D medical images as sequences of 2D slices and reusing interactive 2D foundation models seems straightforward, but 2D annotation is too time-consuming for 3D tasks. Moreover, for large cohort analysis, it's the highly accurate automatic segmentation models that reduce the most human effort. However, these models lack support for interactive corrections and lack zero-shot ability for novel structures, which is a key feature of "foundation". While reusing pre-trained 2D backbones in 3D enhances zero-shot potential, their performance on complex 3D structures still lags behind leading 3D models. To address these issues, we present VISTA3D, Versatile Imaging SegmenTation and Annotation model, that targets to solve all these challenges and requirements with one unified foundation model. VISTA3D is built on top of the well-established 3D segmentation pipeline, and it is the first model to achieve state-of-the-art performance in both 3D automatic (supporting 127 classes) and 3D interactive segmentation, even when compared with top 3D expert models on large and diverse benchmarks. Additionally, VISTA3D's 3D interactive design allows efficient human correction, and a novel 3D supervoxel method that distills 2D pretrained backbones grants VISTA3D top 3D zero-shot performance. We believe the model, recipe, and insights represent a promising step towards a clinically useful 3D foundation model. Code and weights are publicly available at https://github.com/Project-MONAI/VISTA.


Summary

  • The paper introduces a unified foundation model that integrates both automatic and interactive segmentation for 3D CT imaging.
  • The paper employs a dual-branch architecture with a shared encoder and decoupled decoders to achieve robust segmentation performance across 127 anatomical structures.
  • The paper demonstrates reduced annotation effort through interactive correction and promising cross-domain applicability through effective zero-shot segmentation.

VISTA3D: Versatile Imaging Segmentation and Annotation Model for 3D Computed Tomography

The paper presents VISTA3D, a versatile foundation model for 3D computed tomography (CT) image segmentation that addresses the limitations of existing approaches in medical imaging. The model integrates automatic segmentation and interactive annotation in a single architecture, improving both the precision and the adaptability of 3D segmentation for medical applications.

VISTA3D's primary strength is state-of-the-art segmentation performance across 127 anatomical structures and various lesion types, achieved by training on a carefully curated dataset of 11,454 CT volumes. The model offers strong out-of-the-box segmentation of common anatomical classes and also supports zero-shot segmentation of novel structures, which is typically difficult in medical imaging because data for rare conditions is scarce.

VISTA3D's architecture stands out for its dual-branch structure: an automatic branch that follows the encoder-decoder paradigm with a convolutional backbone (SegResNet), and an interactive branch that refines segmentations in response to user-provided prompts. The shared encoder keeps computation efficient and feature extraction coherent across both functions, while the decoupled decoders optimize task-specific performance.
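
To make the layout concrete, here is a minimal PyTorch-style sketch of a shared encoder feeding decoupled automatic and interactive heads. It is illustrative only: the real VISTA3D uses a SegResNet encoder and a more elaborate prompt encoding, and the module names, channel counts, and click-map prompt format below are assumptions, not the authors' implementation.

```python
import torch
import torch.nn as nn

class DualBranchSegmenter(nn.Module):
    """Illustrative shared-encoder / decoupled-decoder layout (not the official VISTA3D code)."""
    def __init__(self, in_ch=1, feat_ch=32, n_classes=127):
        super().__init__()
        # Shared 3D encoder; VISTA3D uses a SegResNet backbone, stubbed here with plain convs.
        self.encoder = nn.Sequential(
            nn.Conv3d(in_ch, feat_ch, 3, padding=1), nn.ReLU(),
            nn.Conv3d(feat_ch, feat_ch, 3, padding=1), nn.ReLU(),
        )
        # Automatic branch: per-voxel logits for the supported classes.
        self.auto_head = nn.Conv3d(feat_ch, n_classes, 1)
        # Interactive branch: fuses an extra user-prompt channel (e.g. a rasterized click map).
        self.inter_head = nn.Conv3d(feat_ch + 1, 1, 1)

    def forward(self, vol, click_map=None):
        feats = self.encoder(vol)                  # one encoding pass serves both branches
        auto_logits = self.auto_head(feats)        # automatic segmentation
        inter_logits = None
        if click_map is not None:                  # interactive refinement path
            inter_logits = self.inter_head(torch.cat([feats, click_map], dim=1))
        return auto_logits, inter_logits

model = DualBranchSegmenter()
vol = torch.randn(1, 1, 64, 64, 64)               # toy stand-in for a CT patch
clicks = torch.zeros(1, 1, 64, 64, 64)
clicks[0, 0, 32, 32, 32] = 1.0                    # a single positive user click
auto_out, inter_out = model(vol, clicks)
```

The design point the sketch captures is that the volume is encoded once, so adding interactive refinement costs little more than running the lightweight prompt head.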

Quantitatively, VISTA3D is competitive with dataset-specific expert models such as nnU-Net and Auto3DSeg, yielding comparable Dice scores across a diverse range of datasets. In particular, the model excels in zero-shot segmentation, benefiting from its novel use of supervoxels for generalization; this is most evident in cross-domain tests on non-human CT data, which showcase VISTA3D's potential for broader applicability.
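
For context, a supervoxel is a contiguous, roughly homogeneous 3D region. The paper's method distills supervoxels from 2D pretrained backbones; as a generic illustration of the concept only (not the paper's algorithm), scikit-image's SLIC can partition a volume into supervoxels:

```python
import numpy as np
from skimage.segmentation import slic

# Toy volume standing in for a CT scan. The paper derives supervoxels from
# distilled 2D-foundation-model features, not raw intensities; this is a
# generic SLIC illustration only.
vol = np.random.rand(64, 64, 64).astype(np.float32)

# channel_axis=None tells SLIC the input is a 3D grayscale volume, so the
# returned segments are supervoxels rather than 2D superpixels.
supervoxels = slic(vol, n_segments=200, compactness=0.1, channel_axis=None)
print(supervoxels.shape, supervoxels.max())  # (64, 64, 64), ~number of supervoxels
```

In a zero-shot setting, region proposals of this kind can be matched to a user prompt so that a single click selects a whole 3D region instead of one voxel.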

The model's efficiency extends to user-interactive correction of automatic outputs: it responds to a handful of user prompts, substantially reducing annotation effort. This capability is critical in medical imaging, where precision and adaptability are paramount.
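
Continuing the illustrative sketch above (reusing the hypothetical model and vol), a correction step might rasterize user clicks into the prompt channel and re-run only the interactive branch. The click encoding and threshold here are assumptions, not the official VISTA3D interface:

```python
import torch

def refine_with_clicks(model, vol, clicks):
    """Hypothetical correction step: rasterize user clicks into a prompt
    channel and re-run the interactive branch (illustrative only)."""
    click_map = torch.zeros_like(vol)
    for (z, y, x), positive in clicks:           # positive clicks add, negative erase
        click_map[0, 0, z, y, x] = 1.0 if positive else -1.0
    _, inter_logits = model(vol, click_map)
    return torch.sigmoid(inter_logits) > 0.5     # refined binary mask

# One positive click inside a missed region, one negative on a false positive.
clicks = [((20, 30, 30), True), ((40, 10, 10), False)]
refined_mask = refine_with_clicks(model, vol, clicks)
```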

Future work could extend both the automatic and interactive functionality, improve predictive accuracy across heterogeneous datasets, and broaden the model's applicability beyond CT to other medical imaging modalities. Ongoing research could also explore generative models or stronger transformer architectures to further boost VISTA3D's segmentation capabilities.

In conclusion, the VISTA3D model represents a notable progression in 3D CT image segmentation, merging sophistication in automatic processing with user-directed flexibility. It serves as a versatile foundation model, paving the way for future advancements in medical image analysis amidst ever-evolving clinical demands and technological innovations.
