
One Model to Rule them All: Towards Universal Segmentation for Medical Images with Text Prompts (2312.17183v3)

Published 28 Dec 2023 in eess.IV and cs.CV

Abstract: In this study, we aim to build up a model that can Segment Anything in radiology scans, driven by Text prompts, termed as SAT. Our main contributions are three-fold: (i) for dataset construction, we construct the first multi-modal knowledge tree on human anatomy, including 6502 anatomical terminologies; then we build up the largest and most comprehensive segmentation dataset for training, by collecting over 22K 3D medical image scans from 72 segmentation datasets, across 497 classes, with careful standardization on both image scans and label space; (ii) for architecture design, we propose to inject medical knowledge into a text encoder via contrastive learning, and then formulate a universal segmentation model, that can be prompted by feeding in medical terminologies in text form; (iii) as a result, we have trained SAT-Nano (110M parameters) and SAT-Pro (447M parameters), demonstrating comparable performance to 72 specialist nnU-Nets trained on each dataset/subsets. We validate SAT as a foundational segmentation model, with better generalization ability on external (unseen) datasets, and it can be further improved on specific tasks after fine-tuning adaptation. Compared with interactive segmentation models, for example MedSAM, a segmentation model prompted by text enables superior performance, scalability and robustness. As a use case, we demonstrate that SAT can act as a powerful out-of-the-box agent for LLMs, enabling visual grounding in clinical procedures such as report generation. All the data, code, and models in this work have been released.


Summary

  • The paper presents a universal segmentation model that integrates multimodal anatomical knowledge with text prompts to overcome the limitations of specialized models in 3D medical imaging.
  • It constructs the SAT-DS dataset with over 22,000 scans spanning 497 anatomical classes and employs a Transformer-based, knowledge-enhanced representation learning approach to align visual and textual features.
  • The SAT-Pro model, featuring 447M parameters, demonstrates competitive region-wise and class-wise performance against 72 specialist nnU-Nets, and shows promising zero-shot transfer capability in clinical settings.

The paper introduces a universal segmentation model, termed SAT, designed for 3D medical image segmentation using text prompts. The authors address the limitations of current "specialist" models, which are tailored to specific regions of interest (ROIs) and imaging modalities, and of interactive models that rely on real-time human intervention. The contributions of the paper span dataset construction, architecture design, and model evaluation.

The authors constructed a multi-modal knowledge tree on human anatomy, incorporating 6502 anatomical terminologies. They built a segmentation dataset, Segment Anything with Text Dataset (SAT-DS), comprising over 22,000 3D medical image scans from 72 segmentation datasets, standardized for image scans and label space, covering 497 classes.

For architecture, the paper formulates a universal segmentation model prompted by medical terminologies in text form, employing knowledge-enhanced representation learning. The model is evaluated across body regions, classes, and datasets, demonstrating performance comparable to 72 specialist nnU-Nets, each trained on an individual dataset and totaling 2.2B parameters. Two models of different sizes were trained: SAT-Nano and SAT-Pro. The authors have released all code and models.

Introduction:

  • Medical image segmentation is critical for clinical applications like diagnosis and treatment planning.
  • The need for automated segmentation methods is driven by the time-consuming nature of manual segmentation and increasing medical data volumes.
  • Deep learning has led to specialized segmentation models, but they lack adaptability in diverse clinical settings and require distinct preprocessing for each dataset.

The paper aims to address the limitations of current models by presenting a knowledge-enhanced universal model for 3D medical volume segmentation with text prompts. This model distinguishes itself from previous medical segmentation paradigms and can be applied in clinics or integrated with LLMs.

Contributions:

  • Dataset: The authors constructed a knowledge tree based on medical knowledge sources, encompassing anatomy concepts and definitions. They curated over 22,000 3D medical image scans with 302,000 anatomical segmentation annotations, covering 497 categories from 72 datasets, named SAT-DS.
  • Architecture: They built a universal medical segmentation model that uses text prompts for flexible segmentation across modalities. The model leverages knowledge-enhanced representation learning, aligning visual features with corresponding text descriptions in the latent space; the text embeddings are then used as queries in a Transformer-based architecture. Two models of different sizes were trained: SAT-Nano and SAT-Pro.
  • Evaluation: Comprehensive metrics were devised for universal medical segmentation, including region-wise, organ-wise, and dataset-wise averages. Experiments demonstrate that SAT-Pro, with 447M parameters, performs comparably to specialist nnU-Net models and generalizes zero-shot to clinical data. The proposed text encoder provides better guidance for universal medical segmentation on 3D inputs than LLMs tailored for medical tasks.
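The text-as-query design above can be pictured with a minimal sketch: embeddings of the prompted terminologies attend over per-voxel visual features, yielding one candidate mask per prompt. All names, dimensions, and the single-head attention here are illustrative, not the paper's exact architecture.

```python
import numpy as np

def softmax(x, axis=-1):
    e = np.exp(x - x.max(axis=axis, keepdims=True))
    return e / e.sum(axis=axis, keepdims=True)

def text_prompted_masks(text_queries, visual_feats):
    """text_queries: (Q, D) embeddings of prompted terminologies (e.g. "liver").
    visual_feats: (V, D) flattened per-voxel features from the visual backbone.
    Returns (Q, V) mask logits, one candidate mask per text prompt."""
    # Queries cross-attend over voxel features (scaled dot-product attention).
    attn = softmax(text_queries @ visual_feats.T / np.sqrt(text_queries.shape[1]))
    refined = attn @ visual_feats
    # Dot product between refined queries and voxel features gives mask logits.
    return refined @ visual_feats.T

rng = np.random.default_rng(0)
queries = rng.normal(size=(3, 64))    # e.g. prompts "liver", "pancreas", "spleen"
voxels = rng.normal(size=(512, 64))   # an 8x8x8 feature volume, flattened
print(text_prompted_masks(queries, voxels).shape)  # (3, 512)
```

Thresholding each row of the logits and reshaping back to the 3D grid would recover one binary mask per prompted class.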

Results:

  • The goal is to build a universal segmentation model for 3D medical images driven by text prompts. This universality should make it adaptable to clinical procedures with minimal extra effort, addressing a broad range of clinical needs.
  • SAT-DS covers 497 anatomical targets and lesions across 8 regions of the human body, drawn from 72 datasets. The authors trained SAT-Pro and SAT-Nano and compared them with nnU-Nets.
  • Evaluations were conducted from the perspective of anatomical regions, classes, and datasets.

Region-wise Results:

  • SAT-Pro consistently outperforms the nnU-Nets in three regions and shows segmentation performance comparable to the 72 nnU-Nets overall.
  • SAT-Pro is approximately 1/5 the size of the nnU-Net ensemble, while SAT-Nano is even smaller, only about 1/20 of the ensemble's size.

Class-wise Results:

  • SAT-Pro outperforms SAT-Nano on most classes and exceeds the nnU-Nets on 133/497 classes in DSC and 192/497 classes in NSD, including important segmentation classes such as liver, pancreas, and lumbar vertebrae.
  • Averaged over all 497 classes, SAT-Pro achieves 78.73 DSC, about a 4.26% improvement over SAT-Nano, and 77.71 NSD, about a 5.31% improvement over SAT-Nano.

Ablation Study:

  • Experiments were conducted to study the effect of different visual backbones and of domain knowledge.
  • To save computational cost, all experiments in this section were conducted on a subset of SAT-DS, termed SAT-DS-Nano, comprising 49 datasets, 13,303 images, 151,461 annotations, and 429 classes.

Effect of Visual Backbone:

  • In addition to the ConvNet-based U-Net, two alternative backbones, SwinUNETR and U-Mamba, were considered for medical segmentation.
  • U-Net-CPT outperforms U-Mamba-CPT slightly on both DSC (0.35) and NSD (0.22) scores averaged over all classes.
  • Both U-Net-CPT and U-Mamba-CPT exceed SwinUNETR-CPT by a significant margin.

Effect of Text Encoder:

  • The impact of domain knowledge on building a text encoder for medical universal segmentation task was investigated.
  • The authors trained three SAT-Nano models with three representative text encoders: the proposed text encoder pre-trained on the multimodal medical knowledge graph, MedCPT, and BERT-Base.
  • U-Net-Ours surpasses U-Net-CPT consistently on all regions and lesions, with notable margins in both DSC (+1.54) and NSD (+2.65) scores averaged over all classes.
  • The recall at 1 (R1) for BERT-Base is merely 0.08%; the R1 for MedCPT is 11.19%; by contrast, the proposed text encoder achieves 99.18% R1.
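The R1 figures can be read as nearest-neighbour retrieval accuracy between paired embeddings (e.g. a terminology and its definition). A minimal sketch, assuming cosine similarity; the paper's exact retrieval protocol may differ:

```python
import numpy as np

def recall_at_1(query_emb, gallery_emb):
    """query_emb[i] should retrieve gallery_emb[i]; R@1 is the fraction of
    queries whose nearest gallery item under cosine similarity is the match."""
    q = query_emb / np.linalg.norm(query_emb, axis=1, keepdims=True)
    g = gallery_emb / np.linalg.norm(gallery_emb, axis=1, keepdims=True)
    top1 = (q @ g.T).argmax(axis=1)           # index of nearest gallery item
    return (top1 == np.arange(len(q))).mean()

rng = np.random.default_rng(0)
anchors = rng.normal(size=(100, 32))
aligned = anchors + 0.01 * rng.normal(size=(100, 32))  # a well-aligned encoder
print(recall_at_1(aligned, anchors))  # 1.0 here; a random encoder scores near 1/100
```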

Qualitative Results in Different Scenarios:

  • GPT-4 was used to extract the anatomical targets of interest directly from real clinical reports and prompt SAT to segment them on the clinical images, forming a fully automatic pipeline.
  • The zero-shot performance of SAT-Pro was demonstrated on four cases randomly selected from clinical practice: abdominal MR, chest CT, abdominal CT, and lumbar spine CT examinations.
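The report-to-segmentation pipeline can be sketched as follows. The GPT-4 extraction step is stubbed here with simple keyword matching, and the final `sat_model` call is hypothetical, purely for illustration:

```python
# Hypothetical sketch: an LLM (stubbed by keyword matching against a small
# vocabulary) extracts anatomical targets from a free-text report; the
# extracted terms then become text prompts for the segmentation model.
KNOWN_TARGETS = {"liver", "pancreas", "spleen", "left kidney", "right kidney"}

def extract_targets(report: str) -> list:
    """Stand-in for the LLM extraction step."""
    text = report.lower()
    return sorted(t for t in KNOWN_TARGETS if t in text)

report = "Mild hepatomegaly; the liver and spleen show no focal lesion."
prompts = extract_targets(report)
print(prompts)  # ['liver', 'spleen']
# In the full pipeline the prompts would drive segmentation, e.g.:
# masks = sat_model(image, text_prompts=prompts)   # hypothetical call
```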

Discussion:

  • SAT-Pro demonstrates results comparable to an ensemble of 72 nnU-Nets, each specialized and trained on a single dataset, and even surpasses them on several regions and classes.
  • In both region-wise and class-wise evaluations, SAT-Pro shows a clear performance boost over SAT-Nano, outperforming the latter on most regions and classes, indicating that scaling laws also apply to universal medical segmentation.
  • Via knowledge injection, the proposed multi-modal knowledge graph on human anatomy enhances segmentation performance, especially on `tail' classes.
  • SAT-Pro can be applied directly to real clinical data outside the scope of SAT-DS, without extra annotation or fine-tuning, handling all of it with a single model.
  • The paper shows that SAT can segment targets extracted by GPT-4 from clinical reports, providing explainable, grounded reports for patients. This demonstrates the potential of SAT as a grounding tool for generalist medical artificial intelligence.

Limitations:

  • The performance of SAT-Pro still lags behind nnU-Net in some regions, including Brain, Spine, and Abdomen, and on many classes, especially lesions.
  • SAT currently supports only text as prompts and is thus not intended for scenarios requiring human interaction.
  • The distribution of SAT-DS is unbalanced.
  • The long-tail distribution of the assembled dataset collection remains challenging for building a universal segmentation method.

Related Work:

  • The paper discusses specialist medical image segmentation, generalized medical image segmentation, universal medical image segmentation, and knowledge-enhanced representation learning in medical image analysis.

Dataset:

  • The authors collect two types of data: medical domain knowledge to train the text encoder, and medical segmentation data.
  • For Domain Knowledge, the Unified Medical Language System (UMLS) was exploited, and search engines were also prompted to retrieve knowledge. The authors construct a multimodal medical knowledge tree in which concepts (both anatomical structures and lesions) are linked via relations and further extended with definitions describing their characteristics.
  • For Segmentation Dataset, the authors collected and integrated 72 diverse publicly available medical segmentation datasets, totaling 22,186 scans including both CT and MRI and 302,033 segmentation annotations spanning 8 different regions of the human body, termed SAT-DS.
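One way to picture the knowledge tree is as concepts carrying definitions and linked by anatomical relations such as part-of. The structure and entries below are illustrative, not the released schema:

```python
# Toy fragment of a multimodal anatomy knowledge tree (illustrative only).
knowledge_tree = {
    "abdomen": {"definition": "Region between thorax and pelvis.",
                "is_part_of": None},
    "liver": {"definition": "Largest solid abdominal organ.",
              "is_part_of": "abdomen"},
    "right lobe of liver": {"definition": "Larger of the two main hepatic lobes.",
                            "is_part_of": "liver"},
}

def ancestors(concept):
    """Walk is_part_of links up the anatomy hierarchy."""
    chain = []
    parent = knowledge_tree[concept]["is_part_of"]
    while parent is not None:
        chain.append(parent)
        parent = knowledge_tree[parent]["is_part_of"]
    return chain

print(ancestors("right lobe of liver"))  # ['liver', 'abdomen']
```

Linking each terminology to its definition and ancestors is what lets a text encoder learn relations between rarely annotated "tail" classes and their well-annotated neighbours.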

Method:

  • The paper considers two main stages: multimodal knowledge injection and universal segmentation training.
  • The authors structure the multimodal medical knowledge data and present details on using it for visual-language pre-training.
  • They then employ the text encoder to guide universal segmentation model training on the SAT-DS dataset.
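Knowledge injection via contrastive learning can be sketched with a symmetric InfoNCE objective that pulls each terminology embedding toward its paired definition embedding and pushes it away from mismatched pairs. This is a generic formulation under an assumed temperature, not necessarily the paper's exact loss:

```python
import numpy as np

def info_nce(term_emb, def_emb, tau=0.07):
    """Symmetric InfoNCE: the i-th terminology embedding should match the
    i-th definition embedding and repel all other pairings in the batch."""
    t = term_emb / np.linalg.norm(term_emb, axis=1, keepdims=True)
    d = def_emb / np.linalg.norm(def_emb, axis=1, keepdims=True)
    logits = t @ d.T / tau                 # pairwise cosine similarities
    labels = np.arange(len(t))
    def ce(lg):                            # cross-entropy with matched pairs as targets
        lg = lg - lg.max(axis=1, keepdims=True)
        logp = lg - np.log(np.exp(lg).sum(axis=1, keepdims=True))
        return -logp[labels, labels].mean()
    return 0.5 * (ce(logits) + ce(logits.T))

rng = np.random.default_rng(0)
terms = rng.normal(size=(8, 16))
matched = terms + 0.05 * rng.normal(size=(8, 16))   # well-aligned encoder outputs
# Aligned pairs should incur a much lower loss than random pairings.
print(info_nce(terms, matched) < info_nce(terms, rng.normal(size=(8, 16))))
```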

Implementation Details:

  • The authors implement the multimodal knowledge injection procedure progressively.
  • They normalize images to a unified voxel spacing and cap the number of text prompts sampled per batch at 32.
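Spacing normalization might look like the following nearest-neighbour sketch. It is illustrative only: production pipelines typically use trilinear interpolation for intensity volumes and reserve nearest-neighbour for label maps.

```python
import numpy as np

def resample_to_spacing(vol, spacing, target=(1.0, 1.0, 1.0)):
    """Nearest-neighbour resampling of a 3D volume from its native voxel
    spacing (mm per voxel, per axis) to a unified target spacing."""
    new_shape = tuple(int(round(s * sp / tsp))
                      for s, sp, tsp in zip(vol.shape, spacing, target))
    # For each output axis, pick the nearest source index.
    idx = [np.clip((np.arange(n) * vol.shape[i] / n).astype(int),
                   0, vol.shape[i] - 1)
           for i, n in enumerate(new_shape)]
    return vol[np.ix_(*idx)]

vol = np.zeros((10, 10, 10))
out = resample_to_spacing(vol, spacing=(2.0, 1.0, 1.0))  # 2 mm slices -> 1 mm
print(out.shape)  # (20, 10, 10)
```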

Experiment Settings:

  • They compare the performance of the proposed model with the strong nnU-Net baseline.
  • Evaluations were conducted along three dimensions: class-wise, region-wise, and dataset-wise.
  • Segmentation performance is quantitatively evaluated with a region metric and a boundary metric: the Dice Similarity Coefficient (DSC) and the Normalized Surface Distance (NSD), respectively.
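DSC measures volumetric overlap between the predicted and ground-truth masks; NSD instead scores boundary agreement within a distance tolerance and is omitted here for brevity. A minimal DSC sketch:

```python
import numpy as np

def dice(pred, gt):
    """Dice Similarity Coefficient between two binary masks:
    DSC = 2|P ∩ G| / (|P| + |G|)."""
    pred, gt = pred.astype(bool), gt.astype(bool)
    denom = pred.sum() + gt.sum()
    # Convention: two empty masks count as a perfect match.
    return 1.0 if denom == 0 else 2.0 * np.logical_and(pred, gt).sum() / denom

p = np.zeros((4, 4), bool); p[:2] = True   # predicted mask: top half
g = np.zeros((4, 4), bool); g[1:3] = True  # ground truth: middle rows
print(round(dice(p, g), 3))  # 0.5
```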

Conclusion:

  • The paper promotes progress in universal medical segmentation with text prompts and knowledge enhancement.
  • The authors build the largest and most comprehensive 3D medical segmentation dataset, and the first multi-modal knowledge tree for human anatomy.
  • The final solution, SAT-Pro, contains 447M parameters while demonstrating performance comparable to 72 specialist nnU-Nets.

In summary, the paper presents an approach to universal medical image segmentation using text prompts and knowledge enhancement, offering a solution to the limitations of specialized models. The results demonstrate competitive performance and generalization capabilities in clinical settings.