
One Model to Rule them All: Towards Universal Segmentation for Medical Images with Text Prompts (2312.17183v3)

Published 28 Dec 2023 in eess.IV and cs.CV

Abstract: In this study, we aim to build up a model that can Segment Anything in radiology scans, driven by Text prompts, termed as SAT. Our main contributions are three-fold: (i) for dataset construction, we construct the first multi-modal knowledge tree on human anatomy, including 6502 anatomical terminologies; then we build up the largest and most comprehensive segmentation dataset for training, by collecting over 22K 3D medical image scans from 72 segmentation datasets, across 497 classes, with careful standardization on both image scans and label space; (ii) for architecture design, we propose to inject medical knowledge into a text encoder via contrastive learning, and then formulate a universal segmentation model, that can be prompted by feeding in medical terminologies in text form; (iii) as a result, we have trained SAT-Nano (110M parameters) and SAT-Pro (447M parameters), demonstrating comparable performance to 72 specialist nnU-Nets trained on each dataset/subsets. We validate SAT as a foundational segmentation model, with better generalization ability on external (unseen) datasets, and it can be further improved on specific tasks after fine-tuning adaptation. Compared with interactive segmentation models, for example MedSAM, a segmentation model prompted by text enables superior performance, scalability and robustness. As a use case, we demonstrate that SAT can act as a powerful out-of-the-box agent for LLMs, enabling visual grounding in clinical procedures such as report generation. All the data, code, and models in this work have been released.


Summary

  • The paper presents a universal segmentation model that integrates multimodal anatomical knowledge with text prompts to overcome the limitations of specialized models in 3D medical imaging.
  • It constructs the SAT-DS dataset with over 22,000 scans spanning 497 anatomical classes and employs a Transformer-based, knowledge-enhanced representation learning approach to align visual and textual features.
  • The SAT-Pro model, featuring 447M parameters, demonstrates competitive region-wise and class-wise performance against 72 specialist nnU-Nets, and shows promising zero-shot transfer capability in clinical settings.

The paper introduces a universal segmentation model, termed SAT, designed for 3D medical image segmentation using text prompts. The authors address the limitations of current "specialist" models, which are tailored to specific regions of interest (ROIs) and imaging modalities, and of interactive models that rely on real-time human intervention. The contributions of the paper span dataset construction, architecture design, and model evaluation.

The authors constructed a multi-modal knowledge tree on human anatomy, incorporating 6502 anatomical terminologies. They built a segmentation dataset, Segment Anything with Text Dataset (SAT-DS), comprising over 22,000 3D medical image scans from 72 segmentation datasets, standardized for image scans and label space, covering 497 classes.

For architecture, the paper formulates a universal segmentation model prompted by medical terminologies in text form, employing knowledge-enhanced representation learning. The model is evaluated across body regions, classes, and datasets, demonstrating performance comparable to 72 specialist nnU-Nets, each trained on an individual dataset and totaling 2.2B parameters. Two models of different sizes were trained: SAT-Nano and SAT-Pro. The authors have released all code and models.

Introduction:

  • Medical image segmentation is critical for clinical applications like diagnosis and treatment planning.
  • The need for automated segmentation methods is driven by the time-consuming nature of manual segmentation and increasing medical data volumes.
  • Deep learning has led to specialized segmentation models, but they lack adaptability in diverse clinical settings and require distinct preprocessing for each dataset.

The paper aims to address the limitations of current models by presenting a knowledge-enhanced universal model for 3D medical volume segmentation with text prompts. This model distinguishes itself from previous medical segmentation paradigms and can be applied in clinics or integrated with LLMs.

Contributions:

  • Dataset: The authors constructed a knowledge tree based on medical knowledge sources, encompassing anatomy concepts and definitions. They curated over 22,000 3D medical image scans with 302,000 anatomical segmentation annotations, covering 497 categories from 72 datasets, named SAT-DS.
  • Architecture: They built a universal medical segmentation model that uses text prompts for flexible segmentation across modalities. The model leverages knowledge-enhanced representation learning, aligning visual features with corresponding text descriptions in the latent space; the text embeddings are then used as queries in a Transformer-based architecture. Two models of different sizes were trained: SAT-Nano and SAT-Pro.
  • Evaluation: Comprehensive metrics were devised for universal medical segmentation, including region-wise, organ-wise, and dataset-wise averages. Experiments demonstrate that SAT-Pro, with 447M parameters, performs comparably to specialist nnU-Net models and generalizes zero-shot to clinical data. The proposed text encoder provides better guidance for universal medical segmentation on 3D inputs than LLMs tailored for medical tasks.
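The text-as-query design above can be pictured with a minimal sketch: embeddings of the prompted terminologies attend over per-voxel visual features, yielding one candidate mask per prompt. All names, dimensions, and the single-head attention here are illustrative, not the paper's exact architecture.

```python
import numpy as np

def softmax(x, axis=-1):
    e = np.exp(x - x.max(axis=axis, keepdims=True))
    return e / e.sum(axis=axis, keepdims=True)

def text_prompted_masks(text_queries, visual_feats):
    """text_queries: (Q, D) embeddings of prompted terminologies (e.g. "liver").
    visual_feats: (V, D) flattened per-voxel features from the visual backbone.
    Returns (Q, V) mask logits, one candidate mask per text prompt."""
    # Queries cross-attend over voxel features (scaled dot-product attention).
    attn = softmax(text_queries @ visual_feats.T / np.sqrt(text_queries.shape[1]))
    refined = attn @ visual_feats
    # Dot product between refined queries and voxel features gives mask logits.
    return refined @ visual_feats.T

rng = np.random.default_rng(0)
queries = rng.normal(size=(3, 64))    # e.g. prompts "liver", "pancreas", "spleen"
voxels = rng.normal(size=(512, 64))   # an 8x8x8 feature volume, flattened
print(text_prompted_masks(queries, voxels).shape)  # (3, 512)
```

Thresholding each row of the logits and reshaping back to the 3D grid would recover one binary mask per prompted class.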

Results:

  • The goal is to build a universal segmentation model for 3D medical images driven by text prompts. This universality should make it adaptable to clinical procedures with minimal extra effort, addressing a broad range of clinical needs.
  • SAT-DS covers 497 anatomical targets and lesions across 8 regions of the human body, drawn from 72 datasets. The authors trained SAT-Pro and SAT-Nano and compared them with nnU-Nets.
  • Evaluations were conducted from the perspective of anatomical regions, classes, and datasets.

Region-wise Results:

  • SAT-Pro consistently outperforms the nnU-Nets in three regions and shows segmentation performance comparable to the 72 nnU-Nets overall.
  • SAT-Pro is approximately 1/5 the size of the nnU-Net ensemble, while SAT-Nano is even smaller, only about 1/20 of the ensemble's size.

Class-wise Results:

  • SAT-Pro outperforms SAT-Nano on most classes and exceeds the nnU-Nets on 133/497 classes in DSC and 192/497 classes in NSD, including important segmentation classes such as liver, pancreas, and lumbar vertebrae.
  • Averaged over all 497 classes, SAT-Pro achieves 78.73 DSC, about a 4.26% improvement over SAT-Nano, and 77.71 NSD, about a 5.31% improvement over SAT-Nano.

Ablation Study:

  • Experiments were conducted to study the effect of different visual backbones and of domain knowledge.
  • To save computational cost, all experiments in this section were conducted on a subset of SAT-DS, termed SAT-DS-Nano, comprising 49 datasets, 13,303 images, 151,461 annotations, and 429 classes.

Effect of Visual Backbone:

  • In addition to the ConvNet-based U-Net, two alternative backbones, SwinUNETR and U-Mamba, were considered for medical segmentation.
  • U-Net-CPT outperforms U-Mamba-CPT slightly on both DSC (0.35) and NSD (0.22) scores averaged over all classes.
  • Both U-Net-CPT and U-Mamba-CPT exceed SwinUNETR-CPT by a significant margin.

Effect of Text Encoder:

  • The impact of domain knowledge on building a text encoder for medical universal segmentation task was investigated.
  • The authors trained three SAT-Nano models with three representative text encoders: the proposed text encoder pre-trained on the multimodal medical knowledge graph, MedCPT, and BERT-Base.
  • U-Net-Ours surpasses U-Net-CPT consistently on all regions and lesions, with notable margins in both DSC (+1.54) and NSD (+2.65) scores averaged over all classes.
  • The recall at 1 (R1) for BERT-Base is merely 0.08%; the R1 for MedCPT is 11.19%; by contrast, the proposed text encoder achieves 99.18% R1.
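The R1 figures can be read as nearest-neighbour retrieval accuracy between paired embeddings (e.g. a terminology and its definition). A minimal sketch, assuming cosine similarity; the paper's exact retrieval protocol may differ:

```python
import numpy as np

def recall_at_1(query_emb, gallery_emb):
    """query_emb[i] should retrieve gallery_emb[i]; R@1 is the fraction of
    queries whose nearest gallery item under cosine similarity is the match."""
    q = query_emb / np.linalg.norm(query_emb, axis=1, keepdims=True)
    g = gallery_emb / np.linalg.norm(gallery_emb, axis=1, keepdims=True)
    top1 = (q @ g.T).argmax(axis=1)           # index of nearest gallery item
    return (top1 == np.arange(len(q))).mean()

rng = np.random.default_rng(0)
anchors = rng.normal(size=(100, 32))
aligned = anchors + 0.01 * rng.normal(size=(100, 32))  # a well-aligned encoder
print(recall_at_1(aligned, anchors))  # 1.0 here; a random encoder scores near 1/100
```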

Qualitative Results in Different Scenarios:

  • GPT-4 was used to extract the anatomical targets of interest directly from real clinical reports and prompt SAT to segment them on the clinical images, forming a fully automatic pipeline.
  • The zero-shot performance of SAT-Pro was demonstrated on four cases randomly selected from clinical practice: abdominal MR, chest CT, abdominal CT, and lumbar spine CT examinations.
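The report-to-segmentation pipeline can be sketched as follows. The GPT-4 extraction step is stubbed here with simple keyword matching, and the final `sat_model` call is hypothetical, purely for illustration:

```python
# Hypothetical sketch: an LLM (stubbed by keyword matching against a small
# vocabulary) extracts anatomical targets from a free-text report; the
# extracted terms then become text prompts for the segmentation model.
KNOWN_TARGETS = {"liver", "pancreas", "spleen", "left kidney", "right kidney"}

def extract_targets(report: str) -> list:
    """Stand-in for the LLM extraction step."""
    text = report.lower()
    return sorted(t for t in KNOWN_TARGETS if t in text)

report = "Mild hepatomegaly; the liver and spleen show no focal lesion."
prompts = extract_targets(report)
print(prompts)  # ['liver', 'spleen']
# In the full pipeline the prompts would drive segmentation, e.g.:
# masks = sat_model(image, text_prompts=prompts)   # hypothetical call
```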

Discussion:

  • SAT-Pro demonstrates results comparable to an ensemble of 72 nnU-Nets, each specialized and trained on a single dataset, and even surpasses them on several regions and classes.
  • In both region-wise and class-wise evaluations, SAT-Pro shows a clear performance boost over SAT-Nano, outperforming the latter on most regions and classes, indicating that scaling laws also apply to universal medical segmentation.
  • Via knowledge injection, the proposed multi-modal knowledge graph on human anatomy enhances segmentation performance, especially on `tail' classes.
  • SAT-Pro can be applied directly to real clinical data outside the scope of SAT-DS, without extra annotation or fine-tuning, handling all of it with a single model.
  • The paper shows that SAT can segment targets extracted by GPT-4 from clinical reports, providing explainable, grounded reports for patients. This demonstrates the potential of SAT as a grounding tool for generalist medical artificial intelligence.

Limitations:

  • The performance of SAT-Pro still lags behind nnU-Net in some regions, including Brain, Spine, and Abdomen, and on many classes, especially lesions.
  • SAT currently supports only text as prompts and is thus not intended for scenarios requiring human interaction.
  • The distribution of SAT-DS is unbalanced.
  • The long-tail distribution of the assembled dataset collection remains challenging for building a universal segmentation method.

Related Work:

  • The paper discusses specialist medical image segmentation, generalized medical image segmentation, universal medical image segmentation, and knowledge-enhanced representation learning in medical image analysis.

Dataset:

  • The authors collect two types of data: medical domain knowledge to train the text encoder, and medical segmentation data.
  • For Domain Knowledge, the Unified Medical Language System (UMLS) was exploited, and search engines were also prompted to retrieve knowledge. The authors construct a multimodal medical knowledge tree in which concepts (both anatomical structures and lesions) are linked via relations and further extended with definitions describing their characteristics.
  • For Segmentation Dataset, the authors collected and integrated 72 diverse publicly available medical segmentation datasets, totaling 22,186 scans including both CT and MRI and 302,033 segmentation annotations spanning 8 different regions of the human body, termed SAT-DS.
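One way to picture the knowledge tree is as concepts carrying definitions and linked by anatomical relations such as part-of. The structure and entries below are illustrative, not the released schema:

```python
# Toy fragment of a multimodal anatomy knowledge tree (illustrative only).
knowledge_tree = {
    "abdomen": {"definition": "Region between thorax and pelvis.",
                "is_part_of": None},
    "liver": {"definition": "Largest solid abdominal organ.",
              "is_part_of": "abdomen"},
    "right lobe of liver": {"definition": "Larger of the two main hepatic lobes.",
                            "is_part_of": "liver"},
}

def ancestors(concept):
    """Walk is_part_of links up the anatomy hierarchy."""
    chain = []
    parent = knowledge_tree[concept]["is_part_of"]
    while parent is not None:
        chain.append(parent)
        parent = knowledge_tree[parent]["is_part_of"]
    return chain

print(ancestors("right lobe of liver"))  # ['liver', 'abdomen']
```

Linking each terminology to its definition and ancestors is what lets a text encoder learn relations between rarely annotated "tail" classes and their well-annotated neighbours.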

Method:

  • The paper considers two main stages: multimodal knowledge injection and universal segmentation training.
  • The authors structure the multimodal medical knowledge data and present details on using it for visual-language pre-training.
  • They then employ the text encoder to guide universal segmentation model training on the SAT-DS dataset.
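Knowledge injection via contrastive learning can be sketched with a symmetric InfoNCE objective that pulls each terminology embedding toward its paired definition embedding and pushes it away from mismatched pairs. This is a generic formulation under an assumed temperature, not necessarily the paper's exact loss:

```python
import numpy as np

def info_nce(term_emb, def_emb, tau=0.07):
    """Symmetric InfoNCE: the i-th terminology embedding should match the
    i-th definition embedding and repel all other pairings in the batch."""
    t = term_emb / np.linalg.norm(term_emb, axis=1, keepdims=True)
    d = def_emb / np.linalg.norm(def_emb, axis=1, keepdims=True)
    logits = t @ d.T / tau                 # pairwise cosine similarities
    labels = np.arange(len(t))
    def ce(lg):                            # cross-entropy with matched pairs as targets
        lg = lg - lg.max(axis=1, keepdims=True)
        logp = lg - np.log(np.exp(lg).sum(axis=1, keepdims=True))
        return -logp[labels, labels].mean()
    return 0.5 * (ce(logits) + ce(logits.T))

rng = np.random.default_rng(0)
terms = rng.normal(size=(8, 16))
matched = terms + 0.05 * rng.normal(size=(8, 16))   # well-aligned encoder outputs
# Aligned pairs should incur a much lower loss than random pairings.
print(info_nce(terms, matched) < info_nce(terms, rng.normal(size=(8, 16))))
```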

Implementation Details:

  • The authors implement the multimodal knowledge injection procedure progressively.
  • They normalize images to a unified voxel spacing and cap the number of text prompts sampled per batch at 32.
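Spacing normalization might look like the following nearest-neighbour sketch. It is illustrative only: production pipelines typically use trilinear interpolation for intensity volumes and reserve nearest-neighbour for label maps.

```python
import numpy as np

def resample_to_spacing(vol, spacing, target=(1.0, 1.0, 1.0)):
    """Nearest-neighbour resampling of a 3D volume from its native voxel
    spacing (mm per voxel, per axis) to a unified target spacing."""
    new_shape = tuple(int(round(s * sp / tsp))
                      for s, sp, tsp in zip(vol.shape, spacing, target))
    # For each output axis, pick the nearest source index.
    idx = [np.clip((np.arange(n) * vol.shape[i] / n).astype(int),
                   0, vol.shape[i] - 1)
           for i, n in enumerate(new_shape)]
    return vol[np.ix_(*idx)]

vol = np.zeros((10, 10, 10))
out = resample_to_spacing(vol, spacing=(2.0, 1.0, 1.0))  # 2 mm slices -> 1 mm
print(out.shape)  # (20, 10, 10)
```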

Experiment Settings:

  • They compare the performance of the proposed model with the strong nnU-Net baseline.
  • Evaluations were conducted along three dimensions: class-wise, region-wise, and dataset-wise.
  • Segmentation performance is quantitatively evaluated with a region metric and a boundary metric: the Dice Similarity Coefficient (DSC) and the Normalized Surface Distance (NSD), respectively.
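DSC measures volumetric overlap between the predicted and ground-truth masks; NSD instead scores boundary agreement within a distance tolerance and is omitted here for brevity. A minimal DSC sketch:

```python
import numpy as np

def dice(pred, gt):
    """Dice Similarity Coefficient between two binary masks:
    DSC = 2|P ∩ G| / (|P| + |G|)."""
    pred, gt = pred.astype(bool), gt.astype(bool)
    denom = pred.sum() + gt.sum()
    # Convention: two empty masks count as a perfect match.
    return 1.0 if denom == 0 else 2.0 * np.logical_and(pred, gt).sum() / denom

p = np.zeros((4, 4), bool); p[:2] = True   # predicted mask: top half
g = np.zeros((4, 4), bool); g[1:3] = True  # ground truth: middle rows
print(round(dice(p, g), 3))  # 0.5
```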

Conclusion:

  • The paper promotes progress in universal medical segmentation with text prompts and knowledge enhancement.
  • The authors build the largest and most comprehensive 3D medical segmentation dataset, and the first multi-modal knowledge tree for human anatomy.
  • The final solution, SAT-Pro, contains 447M parameters while demonstrating performance comparable to 72 specialist nnU-Nets.

In summary, the paper presents an approach to universal medical image segmentation using text prompts and knowledge enhancement, offering a solution to the limitations of specialized models. The results demonstrate competitive performance and generalization capabilities in clinical settings.