
CT-GLIP: Grounded 3D CT Vision-Language Pretraining

Updated 1 April 2026
  • CT-GLIP is a multimodal pretraining framework that aligns full-body 3D CT scan features with organ-specific radiology texts using contrastive objectives.
  • It integrates dual-modality architectures with CNN and ViT-based 3D encoders alongside a frozen BioClinicalBERT text encoder to generate normalized organ embeddings.
  • The method supports zero-shot organ recognition and abnormality detection, and improves downstream segmentation and detection over both training from scratch and CLIP initialization.

CT-GLIP (Grounded Language-Image Pretraining with CT scans) is a multimodal vision–language pretraining method for 3D medical imaging, specifically full-body computed tomography (CT) scans. The framework constructs organ-level image–text pairs, builds a large abnormality dictionary, and uses a set of contrastive objectives to align organ- and abnormality-level features with their diagnostic textual descriptions. Unlike prior Medical Vision-Language Pretraining (Med-VLP) approaches, which predominantly focused on 2D images of a single body region (e.g., chest X-rays), CT-GLIP targets the complex semantics and sparsity of 3D volumetric imaging by training on a large-scale multimodal dataset of CT scans paired with radiology reports. The result is an embedding space that supports zero-shot organ and abnormality recognition and improves fine-tuning performance on downstream clinical tasks (Lin et al., 2024).

1. Model Architecture and Optimization Objectives

CT-GLIP employs a dual-modality architecture that integrates both 3D vision encoders and a pretrained text encoder to facilitate grounded multimodal representation learning.

Vision Encoders:

  • CT-GLIP supports two types of 3D encoders:
    • CNN-based: nnU-Net backbone, using the highest-resolution feature map.
    • ViT-based: MiT (Medical Vision Transformer).
  • For a given 3D CT volume $V_i$, the encoder $f$ outputs a feature map $F_i$. Per-organ pooling via segmentation masks yields normalized embeddings for each organ:

$$v_{ij} = \frac{f(V_i)[\text{mask}_{ij}]}{\|f(V_i)[\text{mask}_{ij}]\|}$$

where $\text{mask}_{ij}$ refers to the mask for the $j$-th organ in scan $i$.
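
The per-organ pooling step can be illustrated with a minimal NumPy sketch. It assumes average pooling inside each mask (the text only says "pooling") and an integer label map in which 0 is background; the function name `pool_organ_embeddings` and the toy shapes are hypothetical and not taken from the paper.

```python
import numpy as np

def pool_organ_embeddings(features, label_map, num_organs, eps=1e-8):
    """Average features inside each organ mask, then L2-normalize.

    features:  (C, D, H, W) feature map F_i from the vision encoder.
    label_map: (D, H, W) integer map; 0 = background, j = organ j.
    Returns {organ_id: unit-norm embedding of shape (C,)}.
    Assumption: mean pooling; the source does not name the pooling operator.
    """
    embeddings = {}
    for j in range(1, num_organs + 1):
        mask = label_map == j
        if not mask.any():                       # organ absent from this scan
            continue
        pooled = features[:, mask].mean(axis=1)  # (C, N_voxels) -> (C,)
        embeddings[j] = pooled / (np.linalg.norm(pooled) + eps)
    return embeddings

# Toy example: 16 channels, two organs present, a third one absent.
rng = np.random.default_rng(0)
features = rng.normal(size=(16, 4, 8, 8))
label_map = np.zeros((4, 8, 8), dtype=int)
label_map[:, :4, :4] = 1
label_map[:, 4:, 4:] = 2
organ_embeddings = pool_organ_embeddings(features, label_map, num_organs=3)
```

Organs with an empty mask are skipped rather than given a zero vector, so a missing organ never contributes a degenerate embedding to the contrastive loss.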

Text Encoder:

  • A frozen expert clinical model, BioClinicalBERT, processes each organ or abnormality description $T_{ij}$:

$$t_{ij} = \frac{g(T_{ij})}{\|g(T_{ij})\|}$$

where $g$ denotes the encoder mapping.

Multimodal Contrastive Losses:

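  • The total training objective is a weighted sum of the organ-text alignment loss ($\mathcal{L}_{\text{OT}}$), the abnormality-text alignment loss ($\mathcal{L}_{\text{AT}}$), and the segmentation loss ($\mathcal{L}_{\text{seg}}$), with weighting coefficients $\lambda_1, \lambda_2, \lambda_3$; the contrastive terms use a temperature $\tau$:

$$\mathcal{L} = \lambda_1 \mathcal{L}_{\text{OT}} + \lambda_2 \mathcal{L}_{\text{AT}} + \lambda_3 \mathcal{L}_{\text{seg}}$$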

  • Organ-Text and Abnormality-Text Alignments:
    • Contrastive losses are defined across the batch and over organs or abnormalities, maximizing similarity of matched pairs versus negatives (with expanded denominator leveraging the abnormality dictionary).
    • The auxiliary segmentation objective combines cross-entropy and Dice loss, using pseudo-labels from TotalSegmentator.
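
As a rough illustration of how such an objective fits together, the sketch below implements a CLIP-style contrastive loss over matched image/text embeddings and a weighted sum of three scalar loss terms. The symmetric (image-to-text plus text-to-image) form, the temperature of 0.07, the unit weights, and the function names are all assumptions made for the example; the segmentation term is passed in as an already computed scalar.

```python
import numpy as np

def log_softmax(logits, axis=-1):
    """Numerically stable log-softmax along the given axis."""
    shifted = logits - logits.max(axis=axis, keepdims=True)
    return shifted - np.log(np.exp(shifted).sum(axis=axis, keepdims=True))

def contrastive_loss(v, t, temperature=0.07):
    """InfoNCE over N matched pairs; rows of v and t are unit-norm, row k of v matches row k of t.

    Assumption: symmetric CLIP-style loss with a hypothetical temperature.
    """
    logits = v @ t.T / temperature                 # (N, N) scaled cosine similarities
    idx = np.arange(len(v))
    image_to_text = -log_softmax(logits, axis=1)[idx, idx].mean()
    text_to_image = -log_softmax(logits, axis=0)[idx, idx].mean()
    return 0.5 * (image_to_text + text_to_image)

def total_loss(loss_ot, loss_at, loss_seg, weights=(1.0, 1.0, 1.0)):
    """Weighted sum of organ-text, abnormality-text, and segmentation terms."""
    w_ot, w_at, w_seg = weights
    return w_ot * loss_ot + w_at * loss_at + w_seg * loss_seg

# Perfectly aligned pairs give a near-zero loss; mismatched pairs give a large one.
aligned = contrastive_loss(np.eye(4), np.eye(4))
mismatched = contrastive_loss(np.eye(4), np.roll(np.eye(4), 1, axis=0))
```

Lowering the temperature sharpens the softmax, so the same similarity gap between a matched pair and its negatives produces a smaller loss.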

2. Construction of Multimodal Organ-Level Pairs

Organ Segmentation:

  • TotalSegmentator, an automated tool, is applied to each CT volume to produce up to 104 organ masks per scan.
  • Per-organ pooling is performed on the vision encoder’s highest-resolution output over these masks, yielding one embedding per organ.

Radiology Report Parsing:

  • LLaMA-2, coupled with manual verification, splits each radiology report into per-organ diagnostic sentences.
  • For organ-text alignment, templated sentences are generated (“This is a {organ} in the CT scan.”).
  • For abnormality-text alignment, the real per-organ diagnostic sentence from the report is used if an abnormality exists; otherwise, a normal-template sentence is inserted (“no evident abnormality in {organ}”).
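
The text-side pairing rule can be sketched as a small helper. The two template strings are quoted from the description above; the function name `build_organ_texts` and the convention that a normal organ is represented by a missing (`None` or empty) report sentence are assumptions for illustration.

```python
ORGAN_TEMPLATE = "This is a {organ} in the CT scan."
NORMAL_TEMPLATE = "no evident abnormality in {organ}"

def build_organ_texts(organ, report_sentence=None):
    """Return (organ_text, abnormality_text) for one organ in one scan.

    report_sentence: the per-organ diagnostic sentence parsed from the report,
    or None/empty when no abnormality was reported (hypothetical convention).
    """
    organ_text = ORGAN_TEMPLATE.format(organ=organ)
    if report_sentence and report_sentence.strip():
        abnormality_text = report_sentence.strip()
    else:
        abnormality_text = NORMAL_TEMPLATE.format(organ=organ)
    return organ_text, abnormality_text

# A normal organ falls back to the normal template.
organ_text, abnormality_text = build_organ_texts("spleen")
```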

3. Abnormality Dictionary and Hard Negative Sampling

Abnormality Dictionary Construction:

  • For each of the 104 organs, up to 512 diverse abnormality descriptions are compiled.
  • This dictionary serves as a hard negative mining resource for training, improving model discrimination.

Training Procedure:

  • For organs without abnormalities in a given scan, entries are sampled from the abnormality dictionary and used as hard negative texts.
  • These augment the denominator in the abnormality-text loss, enforcing finer distinction between similar pathologic and normal findings.

| Component | Function | Source/Method |
| --- | --- | --- |
| TotalSegmentator | 3D organ segmentation | Automated mask generation |
| LLaMA-2 + manual check | Per-organ report parsing | Diagnostic sentence extraction |
| Abnormality dictionary | Negative sampling, semantic coverage | Up to 512 entries per organ |
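
The effect of dictionary-based hard negatives on the loss denominator can be sketched as follows. This is a one-directional, single-organ illustration under stated assumptions: the dictionary contents are placeholder strings, the sample size `k` and the temperature are hypothetical, and the embeddings are toy 2D unit vectors rather than encoder outputs.

```python
import numpy as np

def sample_hard_negatives(dictionary, organ, k, rng):
    """Draw up to k distinct abnormality descriptions for one organ."""
    entries = dictionary[organ]
    k = min(k, len(entries))
    picks = rng.choice(len(entries), size=k, replace=False)
    return [entries[p] for p in picks]

def loss_with_extra_negatives(v, t_pos, t_neg, temperature=0.07):
    """Contrastive loss for one organ embedding v.

    v, t_pos: (C,) unit-norm organ and positive-text embeddings.
    t_neg:    (K, C) unit-norm embeddings of sampled dictionary texts,
              which enlarge the softmax denominator.
    """
    pos = v @ t_pos / temperature
    neg = t_neg @ v / temperature
    logits = np.concatenate([[pos], neg])
    logits = logits - logits.max()               # numerical stability
    return -(logits[0] - np.log(np.exp(logits).sum()))

# Placeholder dictionary; real entries would be abnormality descriptions.
dictionary = {"liver": [f"placeholder finding {n}" for n in range(10)]}
sampled = sample_hard_negatives(dictionary, "liver", k=4, rng=np.random.default_rng(0))

v = np.array([1.0, 0.0])
t_pos = np.array([1.0, 0.0])
easy = loss_with_extra_negatives(v, t_pos, np.array([[0.0, 1.0]]))
hard = loss_with_extra_negatives(v, t_pos, np.array([[0.8, 0.6]]))
```

A negative text that lies close to the organ embedding (`hard`) raises the loss far more than a dissimilar one (`easy`), which is why semantically similar dictionary entries push the model toward finer distinctions.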

4. Training Data, Preprocessing, and Hyperparameters

Dataset Composition:

  • Pretraining: 17,702 patients, 44,011 organ–text pairs, covering 104 organs.
  • Zero-shot evaluation: 1,130 patients (test set), focused on 16 prevalent abnormalities across 7 major organs (spleen, pancreas, aorta, gallbladder, kidney, liver, lung).
  • Downstream fine-tuning: Multi-cancer screening dataset of 700 noncontrast CTs over 7 cancer types, split into train/val/test (448/112/140).

Preprocessing:

  • 3D resampling to uniform voxel spacing, pseudo-segmentation via TotalSegmentator, automated templating for organ/abnormality texts.

Implementation Details:

  • Batch size: 8 on 4 × V100 GPUs.
  • Training epochs: 20.
  • Optimizer: Adam.
  • Learning rate schedule: cosine decay from an initial to a final value.
  • Text encoder is kept frozen throughout training.
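
For reference, a cosine learning-rate decay can be written in a few lines. The start and end values used below (`1e-3` and `1e-5`) and the step count are placeholders chosen for the example, not the settings used for CT-GLIP.

```python
import math

def cosine_decay_lr(step, total_steps, lr_start, lr_end):
    """Learning rate at `step`, decaying from lr_start to lr_end along a half cosine."""
    progress = min(max(step / total_steps, 0.0), 1.0)
    return lr_end + 0.5 * (lr_start - lr_end) * (1.0 + math.cos(math.pi * progress))

# Placeholder values for illustration only.
schedule = [cosine_decay_lr(s, 100, 1e-3, 1e-5) for s in (0, 50, 100)]
```

The schedule starts at `lr_start`, passes through the midpoint of the two values halfway through training, and ends at `lr_end`.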

5. Zero-Shot Performance and Downstream Evaluation

Zero-Shot Organ and Abnormality Recognition:

  • Organ classification: 104-way, top-1 accuracy.
  • Abnormality detection: Precision, sensitivity, F1, and AUC.
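
The evaluation protocol can be sketched as nearest-text classification in the shared embedding space plus standard metrics. This is a minimal illustration, assuming cosine similarity between unit-norm embeddings and binary abnormal/normal labels; the function names and toy inputs are hypothetical.

```python
import numpy as np

def zero_shot_classify(image_embeddings, text_embeddings):
    """Assign each image embedding (N, C) to its most similar text prompt (K, C)."""
    return (image_embeddings @ text_embeddings.T).argmax(axis=1)

def top1_accuracy(pred, target):
    """Fraction of predictions that match the target labels."""
    return float((np.asarray(pred) == np.asarray(target)).mean())

def binary_metrics(pred, target):
    """Precision, sensitivity (recall), and F1 for 0/1 abnormality labels."""
    pred, target = np.asarray(pred), np.asarray(target)
    tp = int(((pred == 1) & (target == 1)).sum())
    fp = int(((pred == 1) & (target == 0)).sum())
    fn = int(((pred == 0) & (target == 1)).sum())
    precision = tp / (tp + fp) if tp + fp else 0.0
    sensitivity = tp / (tp + fn) if tp + fn else 0.0
    denom = precision + sensitivity
    f1 = 2 * precision * sensitivity / denom if denom else 0.0
    return precision, sensitivity, f1

# Toy example: three class prompts, two organ embeddings.
text_embeddings = np.eye(3)
image_embeddings = np.array([[0.9, 0.1, 0.0], [0.0, 0.2, 0.98]])
predicted = zero_shot_classify(image_embeddings, text_embeddings)
accuracy = top1_accuracy(predicted, [0, 2])
metrics = binary_metrics([1, 1, 0, 0], [1, 0, 1, 0])
```

For the 104-way organ task, `text_embeddings` would hold one encoded organ prompt per class; for abnormality detection, the comparison would be between an organ's normal and abnormal text prompts.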

Performance Results:

  • CNN (nnUNet backbone):
    • Vanilla CLIP: 0% organ accuracy, AUC 52.23%.
    • +AT only: 0.03% organ accuracy, AUC 66.00%.
    • +AT+OT: 86.9% organ accuracy, AUC 66.76%.
    • +AT+OT+A-Dict: 86.2% organ accuracy, AUC 68.63%.
  • ViT (MiT backbone):
    • Vanilla CLIP: 0% organ accuracy, AUC 52.37%.
    • +AT+OT+A-Dict: 84.9% organ accuracy, AUC 71.90%.

Downstream Fine-Tuning (Multi-Cancer Screening):

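| Backbone | Init | Dice (%) | AUC (%) |
| --- | --- | --- | --- |
| nnUNet | Scratch | 29.88 | 82.12 |
| nnUNet | CLIP | 33.45 | 87.32 |
| nnUNet | CT-GLIP | 34.70 | 89.55 |
| MiT | Scratch | 22.68 | 80.94 |
| MiT | CLIP | 30.65 | 87.00 |
| MiT | CT-GLIP | 35.77 | 88.60 |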

This demonstrates consistent improvements over both training from scratch and vanilla CLIP-initialized models across segmentation and detection tasks (Lin et al., 2024).

6. Ablation Analyses, Insights, and Limitations

Contributions of Training Components:

  • Organ-text alignment (OT) is critical for high-accuracy organ recognition.
  • Abnormality-text alignment (AT), particularly with dictionary-augmented hard negatives, drives substantial increases in zero-shot abnormality detection (AUC).

Qualitative Results:

  • Nearest-neighbor retrieval in the joint embedding space enables plausible zero-shot classification of both organs (104 classes) and pathologies, illustrating effective semantic compositionality.

Limitations:

  • Pseudo-segmentation mask quality may affect feature pooling.
  • The abnormality dictionary, capped at 512 entries per organ, may not exhaustively represent rare or subtle diagnostic phrasing.
  • Report parsing remains partially manual or dependent on LLM outputs (LLaMA-2 plus human verification).

Future Directions:

  • Expansion of the abnormality dictionary for broader coverage.
  • Advanced LLM-based structuring of radiology reports for automation.
  • Integration of 3D detection/localization heads.
  • Extension to other imaging modalities (MRI, PET) and finer-grained spatial proposals.

7. Context and Significance

CT-GLIP extends Medical Vision-Language Pretraining to full-body 3D CT imaging, aligning learned representations of anatomy and pathology with natural language descriptions. In the reported experiments it achieves zero-shot organ and abnormality recognition and improves downstream transfer over both training from scratch and CLIP initialization, which suggests a route toward AI systems that generalize across rare pathologies and diverse anatomical sites. Its methodological contributions (organ-wise contrastive pairing, augmented textual negatives, and large-scale multimodal training) provide a basis for further research into scalable multimodal representation learning in medical imaging (Lin et al., 2024).
