OpenVocabCT: Open-Vocabulary CT Segmentation

Updated 1 April 2026
  • OpenVocabCT is an open-vocabulary segmentation framework that uses natural-language prompts to precisely segment 3D CT images.
  • It integrates a 3D image encoder, a domain-adapted text encoder, and a segmentation head with multi-granular contrastive learning to achieve high Dice scores.
  • The framework supports rapid clinical adaptation through prompt-based segmentation while addressing challenges like domain generalization and computational demands.

OpenVocabCT refers to a family of methodologies enabling open-vocabulary, text-driven segmentation and recognition in volumetric medical imaging, particularly 3D computed tomography (CT). These frameworks are characterized by universal, prompt-based segmentation: given any natural-language text prompt—potentially unseen during training—the model directly segments anatomical structures or pathological entities in CT volumes. The core of OpenVocabCT combines a large-scale 3D vision-language pretraining pipeline with multi-granular contrastive objectives derived from structured radiology reports, a dedicated 3D imaging backbone, and an extensible, language-guided segmentation head (Li et al., 8 Mar 2025).

1. Architecture and Data Pipeline

OpenVocabCT comprises three principal modules:

  1. 3D Image Encoder: A volumetric backbone, specifically STUNet-Large, processes an input CT volume $I\in\mathbb{R}^{D\times H\times W}$ to yield dense feature maps $\mathbf{V}\in\mathbb{R}^{C\times D'\times H'\times W'}$.
  2. Text Encoder: A domain-adapted BIOLORD encoder maps tokens or free-form prompts $T$ into an embedding $\mathbf{t}\in\mathbb{R}^C$.
  3. Segmentation Head: A lightweight connector (MLP or cross-attention) fuses image and language information, producing class-conditional logits via channelwise multiplication:

$$\text{Mask logits} = \mathbf{V} \odot \mathbf{q},$$

where $\mathbf{q}=\mathrm{MLP}(\mathbf{t})$.
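
As a rough illustration of the segmentation head, the following NumPy sketch applies a prompt-derived query to a toy feature map by channelwise multiplication. The shapes, the two-layer ReLU MLP, the random weights, and the final sum over channels (to obtain one logit per voxel) are all assumptions made for illustration and are not taken from the paper.

```python
import numpy as np

rng = np.random.default_rng(0)

# Toy sizes (assumed): C channels, D' x H' x W' feature-map resolution.
C, Dp, Hp, Wp = 8, 4, 6, 6
V = rng.standard_normal((C, Dp, Hp, Wp))  # stand-in for dense image features
t = rng.standard_normal(C)                # stand-in for the prompt embedding

# Hypothetical two-layer MLP connector mapping t to a query q of length C.
W1, b1 = rng.standard_normal((C, C)), np.zeros(C)
W2, b2 = rng.standard_normal((C, C)), np.zeros(C)

def mlp(x):
    hidden = np.maximum(W1 @ x + b1, 0.0)  # ReLU activation
    return W2 @ hidden + b2

q = mlp(t)

# Channelwise multiplication: q is broadcast over the three spatial axes.
weighted = V * q[:, None, None, None]      # shape (C, D', H', W')

# Assumption: reduce over channels so each voxel gets a single logit.
mask_logits = weighted.sum(axis=0)         # shape (D', H', W')
mask = mask_logits > 0.0                   # equivalent to sigmoid(logit) > 0.5
```

Under this assumption the head reduces to a per-voxel dot product between the feature vector and the query, so a different prompt changes only `q` while the image features `V` are reused.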

During pretraining, both encoders operate over the large-scale CT-RATE dataset, aligning features for universal text-driven tasks. At inference, a direct prompt (e.g., "Right adrenal gland") produces a segmentation mask for the queried anatomy without pre-specifying a closed vocabulary (Li et al., 8 Mar 2025).

2. Pretraining Dataset and Text Granularity

OpenVocabCT leverages the CT-RATE dataset, which comprises 50,188 CT–report pairs across 21,304 patients. Original radiology reports are decomposed using LLMs (GPT-4, Llama-3) into fine-grained, organ- or disease-level captions, so that each scan is paired with text at two levels of granularity:

  • Full-report level: capturing overall clinical impressions.
  • Organ-level: 5–10 short captions per scan, extracted and filtered for anatomical targets using RadLex entity matching.

This multi-granular corpus provides not only global (study-wide) supervision but also targeted language anchors for specific structures, enabling robust cross-modal alignment (Li et al., 8 Mar 2025).
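
To make the two granularities concrete, here is a minimal sketch of how one study's text could be organized. The report sentences are invented toy examples standing in for LLM-generated captions, and the small `ANATOMY_TERMS` set is a hypothetical stand-in for RadLex entity matching; neither reflects the actual CT-RATE processing code.

```python
# Hypothetical anatomical vocabulary standing in for RadLex entity matching.
ANATOMY_TERMS = {"lung", "heart", "liver", "aorta", "kidney"}

def organ_captions(sentences, terms=ANATOMY_TERMS, max_captions=10):
    """Keep sentences that mention at least one anatomical term."""
    kept = []
    for sentence in sentences:
        words = {w.strip(".,;:").lower() for w in sentence.split()}
        matched = sorted(words & terms)
        if matched:
            kept.append({"caption": sentence, "targets": matched})
    return kept[:max_captions]

# Invented toy sentences, not taken from any real report.
report_sentences = [
    "The heart is normal in size.",
    "No pleural effusion.",
    "A 4 mm nodule is seen in the right lung.",
]

study = {
    "full_report": " ".join(report_sentences),           # global supervision
    "organ_level": organ_captions(report_sentences),     # targeted anchors
}
```

Exact word matching is deliberately crude here; it only shows the filtering idea of discarding captions without an identifiable anatomical target.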

3. Multi-Granular Contrastive Learning

Pretraining optimizes two core objectives:

  • Standard CLIP Loss: Aligns global-pooled image embeddings $\mathbf{v}_i$ with full-report text embeddings $\mathbf{t}_i$ via symmetrical InfoNCE:

$$\mathcal{L}_{\mathrm{CLIP}} = \frac12\left(\mathcal{L}^{i2t}_{\mathrm{CLIP}} + \mathcal{L}^{t2i}_{\mathrm{CLIP}}\right).$$

  • Multi-Granular Contrastive Loss (MGCL): Aligns $\mathbf{v}_i$ to randomly sampled organ-level captions for higher intra-study discrimination, again using symmetric InfoNCE:

$$\mathcal{L}_{\mathrm{MGCL}} = \frac12\left(\mathcal{L}^{i2t}_{\mathrm{MGCL}} + \mathcal{L}^{t2i}_{\mathrm{MGCL}}\right).$$

The total pretraining loss combines both:

$$\mathcal{L}_{\mathrm{pretrain}} = \mathcal{L}_{\mathrm{CLIP}} + \mathcal{L}_{\mathrm{MGCL}}.$$

This regimen enables the model to generalize across synonymic labels, composite organ queries, and previously unseen phrasings at test time (Li et al., 8 Mar 2025).
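
The two contrastive objectives share the same symmetric InfoNCE form and differ only in which text embeddings are paired with the image embeddings. The NumPy sketch below illustrates this under several assumptions: the temperature of 0.07, the toy random embeddings, one sampled organ caption per study, and the unweighted sum of the two terms are illustrative choices, not values confirmed by the source.

```python
import numpy as np

def l2_normalize(x):
    return x / np.linalg.norm(x, axis=1, keepdims=True)

def log_softmax(z):
    z = z - z.max(axis=1, keepdims=True)  # subtract row max for stability
    return z - np.log(np.exp(z).sum(axis=1, keepdims=True))

def symmetric_info_nce(img, txt, temperature=0.07):
    """Mean of image-to-text and text-to-image InfoNCE.

    Row i of `img` is the positive pair of row i of `txt`;
    all other rows in the batch act as negatives.
    """
    logits = l2_normalize(img) @ l2_normalize(txt).T / temperature
    diag = np.arange(len(img))
    loss_i2t = -log_softmax(logits)[diag, diag].mean()
    loss_t2i = -log_softmax(logits.T)[diag, diag].mean()
    return 0.5 * (loss_i2t + loss_t2i)

rng = np.random.default_rng(0)
B, C = 4, 16
v = rng.standard_normal((B, C))                    # global-pooled image embeddings
t_report = v + 0.1 * rng.standard_normal((B, C))   # toy full-report embeddings
t_organ = v + 0.1 * rng.standard_normal((B, C))    # toy organ-caption embeddings

loss_clip = symmetric_info_nce(v, t_report)
loss_mgcl = symmetric_info_nce(v, t_organ)
loss_total = loss_clip + loss_mgcl                 # assumption: unweighted sum
```

Because the organ-level captions are sampled per study, repeated passes over the data would pair each volume with different captions, which is one plausible reading of how MGCL encourages finer discrimination than report-level alignment alone.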

4. Segmentation Objective and Fine-tuning

After pretraining, a STUNet decoder is affixed to the image encoder. For each target prompt, its encoded vector is projected to a set of channelwise segmentation queries. The outputs are supervised with a combination of softmax cross-entropy and Soft Dice loss:

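$$\mathcal{L}_{\mathrm{seg}} = \mathcal{L}_{\mathrm{CE}} + \mathcal{L}_{\mathrm{Dice}},$$

where

$$\mathcal{L}_{\mathrm{Dice}} = 1 - \frac{2\sum_{n} p_n g_n}{\sum_{n} p_n + \sum_{n} g_n},$$

with $p_n$ the predicted probability and $g_n$ the ground-truth label at voxel $n$.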
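
A combined softmax cross-entropy and Soft Dice objective can be sketched in NumPy as follows. The equal weighting of the two terms, the smoothing constant `eps`, the per-class averaging of Dice, and the toy label volume are assumptions for illustration; the source does not specify these details.

```python
import numpy as np

def softmax(logits, axis=0):
    z = logits - logits.max(axis=axis, keepdims=True)
    e = np.exp(z)
    return e / e.sum(axis=axis, keepdims=True)

def cross_entropy(probs, onehot, eps=1e-8):
    # Mean over voxels of -log p(true class).
    return float(-(onehot * np.log(probs + eps)).sum(axis=0).mean())

def soft_dice_loss(probs, onehot, eps=1e-5):
    spatial = tuple(range(1, probs.ndim))          # every axis except class
    inter = (probs * onehot).sum(axis=spatial)
    denom = probs.sum(axis=spatial) + onehot.sum(axis=spatial)
    dice = (2.0 * inter + eps) / (denom + eps)     # one Dice score per class
    return float(1.0 - dice.mean())

def segmentation_loss(logits, labels):
    """logits: (K, D, H, W) class scores; labels: (D, H, W) integer classes."""
    K = logits.shape[0]
    probs = softmax(logits, axis=0)
    onehot = np.moveaxis(np.eye(K)[labels], -1, 0)  # (K, D, H, W)
    # Assumption: the two terms are summed with equal weight.
    return cross_entropy(probs, onehot) + soft_dice_loss(probs, onehot)

rng = np.random.default_rng(0)
labels = rng.integers(0, 3, size=(4, 8, 8))
good_logits = 10.0 * np.moveaxis(np.eye(3)[labels], -1, 0)  # confident, correct
random_logits = rng.standard_normal((3, 4, 8, 8))           # uninformed guess

loss_good = segmentation_loss(good_logits, labels)
loss_random = segmentation_loss(random_logits, labels)
```

Cross-entropy penalizes each voxel independently, while the Dice term is computed per class over the whole volume, which is why the pair is commonly used together when target structures occupy few voxels.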

Fine-tuning leaves the vision and language backbones frozen, optimizing only the segmentation head and connector for robust prompt transfer (Li et al., 8 Mar 2025).

5. Empirical Performance and Ablation

OpenVocabCT is reported to outperform the compared baselines across nine public CT segmentation benchmarks. Representative Dice scores:

| Dataset | Metric (Dice %) | OpenVocabCT | nnUNet | UniMiSS | SAT Pro | CLIP-Driven |
|---|---|---|---|---|---|---|
| TotalSegmentator | Avg (5 groups) | 90.7 | 86.9 | 88.0 | 87.6 | 84.6 |
| Tumor tasks (MSD…) | Avg | 76.2 | 71.8 | - | 64.1 | 75.4 |
| FLARE22 | 13 organs | 90.3 | 89.9 | - | 88.8 | - |
| SegTHOR (unseen) | Avg | 85.1 | 84.0 | - | 79.1 | - |
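
To read the size of the gains directly off the table, the short script below computes, for each row, the margin of OpenVocabCT over the strongest baseline that has a reported value. The numbers are copied from the table; the dictionary layout and the helper name `margin_over_best_baseline` are illustrative.

```python
# Dice (%) values copied from the table; None marks entries shown as a dash.
results = {
    "TotalSegmentator": {"OpenVocabCT": 90.7, "nnUNet": 86.9, "UniMiSS": 88.0,
                         "SAT Pro": 87.6, "CLIP-Driven": 84.6},
    "Tumor tasks": {"OpenVocabCT": 76.2, "nnUNet": 71.8, "UniMiSS": None,
                    "SAT Pro": 64.1, "CLIP-Driven": 75.4},
    "FLARE22": {"OpenVocabCT": 90.3, "nnUNet": 89.9, "UniMiSS": None,
                "SAT Pro": 88.8, "CLIP-Driven": None},
    "SegTHOR": {"OpenVocabCT": 85.1, "nnUNet": 84.0, "UniMiSS": None,
                "SAT Pro": 79.1, "CLIP-Driven": None},
}

def margin_over_best_baseline(row, method="OpenVocabCT"):
    """Return (best baseline name, Dice-point margin of `method` over it)."""
    baselines = {k: v for k, v in row.items() if k != method and v is not None}
    best = max(baselines, key=baselines.get)
    return best, round(row[method] - baselines[best], 1)

margins = {name: margin_over_best_baseline(row) for name, row in results.items()}

for name, (best, gain) in margins.items():
    print(f"{name}: +{gain} Dice points over {best}")
```

On these four rows the margin over the best reported baseline ranges from 0.4 points (FLARE22, against nnUNet) to 2.7 points (TotalSegmentator, against UniMiSS).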

Ablation studies reveal:

  • MGCL (organ-level contrast) is critical for generalization to new prompts and synonyms; on merged or composite organ prompts, Dice gains reach +30–40 points over CLIP-Driven models.
  • The MLP connector yields the best trade-off: compared to cross-attention, it achieves the highest in-domain accuracy and the best out-of-domain generalization.
  • Pretraining the image encoder (rather than random or CLIP-only initialization) provides the strongest downstream segmentation performance (Li et al., 8 Mar 2025).

6. Clinical Implications, Limitations, and Extensions

OpenVocabCT enables prompt-based medical image segmentation: radiologists or algorithms can propose arbitrary target structures at inference (e.g., "inferior vena cava," "liver lesion"), with no need for additional supervised voxels or retraining. This facilitates rapid adaptation to new diagnostic needs, supports treatment planning, and permits evaluation across anatomical and pathological axes not seen during development.

Limitations include:

  • Current pretraining domains are limited (thoracic CT in CT-RATE); performance on cranial, PET, or MRI data requires further adaptation.
  • LLM-curated textual captions may embed bias unless carefully filtered.
  • Large memory and compute requirements for volumetric inference persist.

Future directions highlighted include extension to other modalities (PET, MRI), new body regions, and integration into real-time clinical workflows (Li et al., 8 Mar 2025).

7. Position Within the Vision-Language Landscape

OpenVocabCT establishes a new paradigm in medical vision-language modeling, distinguished by (a) organ-level language supervision via report decomposition, (b) 3D volume-native encoders rather than slice-adapted 2D architectures, and (c) a unified, prompt-driven inference interface. By demonstrating universal and open-vocabulary segmentation capabilities, OpenVocabCT closes the gap between medical image understanding and open-set, language-guided recognition paradigms developed in natural imaging domains (Li et al., 8 Mar 2025).

References (1)
