CT-GLIP: Grounded 3D CT Vision-Language Pretraining

Updated 1 April 2026

CT-GLIP is a multimodal pretraining framework that aligns full-body 3D CT scan features with organ-specific radiology texts using contrastive objectives.
It integrates dual-modality architectures with CNN and ViT-based 3D encoders alongside a frozen BioClinicalBERT text encoder to generate normalized organ embeddings.
In the reported experiments, the method achieves zero-shot organ recognition and abnormality detection, and it improves downstream segmentation and detection over both training from scratch and CLIP initialization.

CT-GLIP (Grounded Language-Image Pretraining with CT scans) is a multimodal vision–language pretraining method designed for 3D medical imaging, specifically full-body computed tomography (CT) scans. The framework constructs organ-level image–text pairs, builds a large abnormality dictionary, and uses a set of contrastive objectives to align organ- and abnormality-level features with their diagnostic textual descriptions. Unlike prior Medical Vision-Language Pretraining (Med-VLP) approaches, which predominantly focused on 2D images of a single body region (e.g., chest X-rays), CT-GLIP targets the complex semantics and sparsity of 3D volumetric imaging by using a large-scale multimodal dataset of CT scans paired with radiology reports. The result is an embedding space that supports zero-shot organ and abnormality recognition and improves fine-tuning performance on downstream clinical tasks (Lin et al., 2024).

1. Model Architecture and Optimization Objectives

CT-GLIP employs a dual-modality architecture that integrates both 3D vision encoders and a pretrained text encoder to facilitate grounded multimodal representation learning.

Vision Encoders:

CT-GLIP supports two types of 3D encoders:
- CNN-based: nnU-Net backbone, using the highest-resolution feature map.
- ViT-based: MiT (Medical Vision Transformer).
For a given 3D CT volume $V_i$, the encoder $f$ outputs a feature map $F_i$. Per-organ pooling via segmentation masks yields normalized embeddings for each organ:

$v_{ij} = \frac{f(V_i)[\text{mask}_{ij}]}{\|f(V_i)[\text{mask}_{ij}]\|}$

where $\text{mask}_{ij}$ refers to the mask for the $j$-th organ in scan $i$.
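
The masked pooling step can be illustrated with a minimal NumPy sketch. The source only states that features are pooled per organ and normalized; the use of average pooling, the `(C, D, H, W)` layout, and the name `pool_organ_embedding` are assumptions made for illustration.

```python
import numpy as np


def pool_organ_embedding(feature_map, mask, eps=1e-8):
    """Average the features inside one organ mask, then L2-normalize.

    feature_map: (C, D, H, W) encoder output for one scan.
    mask:        (D, H, W) boolean organ mask at the same resolution.
    """
    if not mask.any():
        raise ValueError("organ mask is empty")
    organ_features = feature_map[:, mask]      # (C, number of voxels in the organ)
    pooled = organ_features.mean(axis=1)       # assumption: average pooling
    return pooled / (np.linalg.norm(pooled) + eps)


# Toy example with random features and a small box-shaped "organ".
rng = np.random.default_rng(0)
features = rng.normal(size=(16, 4, 8, 8))
mask = np.zeros((4, 8, 8), dtype=bool)
mask[1:3, 2:6, 2:6] = True
v = pool_organ_embedding(features, mask)
print(v.shape, float(np.linalg.norm(v)))
```

Because the result has unit length, a dot product with a normalized text embedding is a cosine similarity.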

Text Encoder:

A frozen expert clinical model, BioClinicalBERT, processes each organ or abnormality description $T_{ij}$:

$t_{ij} = \frac{g(T_{ij})}{\|g(T_{ij})\|}$

where $g$ denotes the encoder mapping.

Multimodal Contrastive Losses:

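The total training objective is a weighted sum of the organ-text alignment loss ($\mathcal{L}_{\text{OT}}$), the abnormality-text alignment loss ($\mathcal{L}_{\text{AT}}$), and the segmentation loss ($\mathcal{L}_{\text{seg}}$), with weighting coefficients $\lambda_1, \lambda_2, \lambda_3$; the contrastive terms use a temperature $\tau$:

$\mathcal{L} = \lambda_1 \mathcal{L}_{\text{OT}} + \lambda_2 \mathcal{L}_{\text{AT}} + \lambda_3 \mathcal{L}_{\text{seg}}$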

Organ-Text and Abnormality-Text Alignments:
- Contrastive losses are defined across the batch and over organs or abnormalities, maximizing similarity of matched pairs versus negatives (with expanded denominator leveraging the abnormality dictionary).
- The auxiliary segmentation objective combines cross-entropy and Dice loss, using pseudo-labels from TotalSegmentator.
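
A contrastive alignment loss of this kind is commonly written as an InfoNCE objective. The sketch below is a generic image-to-text version and not the paper's exact formulation: the function name `info_nce`, the one-directional form, and the temperature value `0.07` are illustrative assumptions.

```python
import numpy as np


def info_nce(image_emb, text_emb, temperature=0.07):
    """Image-to-text InfoNCE over N matched pairs.

    image_emb, text_emb: (N, D) L2-normalized arrays; row i of each is a pair.
    temperature: hypothetical value, not taken from the source.
    """
    logits = image_emb @ text_emb.T / temperature         # (N, N) scaled similarities
    logits = logits - logits.max(axis=1, keepdims=True)   # numerical stability
    log_prob = logits - np.log(np.exp(logits).sum(axis=1, keepdims=True))
    return float(-np.mean(np.diag(log_prob)))             # matched pairs lie on the diagonal


def unit(x):
    return x / np.linalg.norm(x, axis=-1, keepdims=True)


rng = np.random.default_rng(0)
organ_emb = unit(rng.normal(size=(4, 8)))
loss_matched = info_nce(organ_emb, organ_emb)                        # texts agree with images
loss_shuffled = info_nce(organ_emb, np.roll(organ_emb, 1, axis=0))   # pairs misaligned
print(loss_matched < loss_shuffled)
```

The loss is low when each organ embedding is most similar to its own text and high when the pairing is broken, which is the behavior the alignment objectives rely on.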

2. Construction of Multimodal Organ-Level Pairs

Organ Segmentation:

TotalSegmentator, an automated tool, is applied to each CT volume to produce up to 104 organ masks per scan.
Per-organ pooling is performed on the vision encoder’s highest-resolution output over these masks, yielding one embedding per organ.

Radiology Report Parsing:

LLaMA-2, coupled with manual verification, splits each radiology report into per-organ diagnostic sentences.
For organ-text alignment, templated sentences are generated (“This is a {organ} in the CT scan.”).
For abnormality-text alignment, the real per-organ diagnostic sentence from the report is used if an abnormality exists; otherwise, a normal-template sentence is inserted (“no evident abnormality in {organ}”).
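
The text selection rule above can be sketched in a few lines. The two template strings are quoted from the description; the function names and the example report sentence are hypothetical.

```python
from typing import Optional

ORGAN_TEMPLATE = "This is a {organ} in the CT scan."
NORMAL_TEMPLATE = "no evident abnormality in {organ}"


def organ_text(organ: str) -> str:
    """Templated sentence used for organ-text alignment."""
    return ORGAN_TEMPLATE.format(organ=organ)


def abnormality_text(organ: str, report_sentence: Optional[str]) -> str:
    """Use the parsed report sentence if one exists, else the normal template."""
    if report_sentence and report_sentence.strip():
        return report_sentence.strip()
    return NORMAL_TEMPLATE.format(organ=organ)


print(organ_text("liver"))
print(abnormality_text("spleen", None))
# Hypothetical report sentence, for illustration only.
print(abnormality_text("liver", " Hypodense lesion in the right lobe. "))
```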

3. Abnormality Dictionary and Hard Negative Sampling

Abnormality Dictionary Construction:

For each of the 104 organs, up to 512 diverse abnormality descriptions are compiled.
This dictionary serves as a hard negative mining resource for training, improving model discrimination.

Training Procedure:

For organs without abnormalities in a given scan, entries are sampled from the abnormality dictionary to serve as hard negative texts.
These augment the denominator in the abnormality-text loss, enforcing finer distinction between similar pathologic and normal findings.
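
As a rough illustration of how sampled dictionary entries enlarge the denominator, the sketch below draws entries for one organ and compares the loss with and without them. The dictionary contents, the sample size `k`, the temperature, and the random vectors standing in for encoded texts are all assumptions, not values from the source.

```python
import random
import numpy as np


def sample_hard_negatives(dictionary, organ, k, rng):
    """Draw up to k abnormality descriptions for one organ without replacement."""
    entries = dictionary[organ]
    return rng.sample(entries, min(k, len(entries)))


def contrastive_loss(v, t_pos, t_neg, temperature=0.07):
    """-log softmax of the positive text against all negatives for one organ.

    v, t_pos: (D,) normalized embeddings; t_neg: (M, D) normalized negatives.
    """
    logits = np.concatenate([[v @ t_pos], t_neg @ v]) / temperature
    logits = logits - logits.max()
    return float(np.log(np.exp(logits).sum()) - logits[0])


def unit(x):
    return x / np.linalg.norm(x, axis=-1, keepdims=True)


# Hypothetical dictionary entries for a normal spleen.
toy_dictionary = {"spleen": ["splenomegaly", "splenic cyst", "splenic infarct"]}
negatives = sample_hard_negatives(toy_dictionary, "spleen", k=2, rng=random.Random(0))

rng = np.random.default_rng(0)
v = unit(rng.normal(size=8))                          # organ embedding
t_pos = unit(v + 0.1 * rng.normal(size=8))            # "no evident abnormality" text
t_batch = unit(rng.normal(size=(3, 8)))               # in-batch negatives
t_dict = unit(rng.normal(size=(len(negatives), 8)))   # stand-ins for encoded dictionary texts

loss_batch_only = contrastive_loss(v, t_pos, t_batch)
loss_expanded = contrastive_loss(v, t_pos, np.vstack([t_batch, t_dict]))
print(negatives, loss_batch_only, loss_expanded)
```

Each extra negative adds a positive term to the softmax denominator, so the expanded loss is never smaller than the batch-only loss; the model must separate the normal text from plausible abnormal ones to bring it back down.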

Component	Function	Source/Method
TotalSegmentator	3D organ segmentation	Automated mask generation
LLaMA-2 + Manual check	Per-organ report parsing	Diagnostic sentence extraction
Abnormality dictionary	Negative sampling, semantic coverage	Up to 512 entries per organ

4. Training Data, Preprocessing, and Hyperparameters

Dataset Composition:

Pretraining: 17,702 patients, 44,011 organ–text pairs, covering 104 organs.
Zero-shot evaluation: 1,130 patients (test set), focused on 16 prevalent abnormalities across 7 major organs (spleen, pancreas, aorta, gallbladder, kidney, liver, lung).
Downstream fine-tuning: Multi-cancer screening dataset of 700 noncontrast CTs over 7 cancer types, split into train/val/test (448/112/140).

Preprocessing:

3D resampling to uniform voxel spacing, pseudo-segmentation via TotalSegmentator, automated templating for organ/abnormality texts.
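
Resampling to a uniform voxel spacing can be sketched with plain NumPy indexing. This is a nearest-neighbour version chosen for brevity; the source does not name the interpolation method, and the spacings below are hypothetical values in millimetres.

```python
import numpy as np


def resample_nearest(volume, spacing, target_spacing):
    """Nearest-neighbour resampling of a (D, H, W) volume to a target spacing."""
    spacing = np.asarray(spacing, dtype=float)
    target = np.asarray(target_spacing, dtype=float)
    new_shape = np.maximum(1, np.round(np.array(volume.shape) * spacing / target)).astype(int)
    # For each output voxel centre along an axis, pick the source voxel that contains it.
    idx = [
        np.minimum((np.arange(n) + 0.5) * t / s, size - 1).astype(int)
        for n, t, s, size in zip(new_shape, target, spacing, volume.shape)
    ]
    return volume[np.ix_(*idx)]


volume = np.arange(4 * 8 * 8).reshape(4, 8, 8)
resampled = resample_nearest(volume, spacing=(3.0, 1.0, 1.0), target_spacing=(1.5, 2.0, 2.0))
print(resampled.shape)
```

Nearest-neighbour sampling is the usual choice for label masks; intensity volumes are typically resampled with a smoother interpolation.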

Implementation Details:

Batch size: 8 on 4 × V100 GPUs.
Training epochs: 20.
Optimizer: Adam.
Learning rate schedule: cosine decay.
Text encoder is kept frozen throughout training.
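
A cosine decay schedule of the kind listed above can be sketched as follows. The start and end learning rates and the step count are hypothetical placeholders, not the values used in the paper.

```python
import math


def cosine_lr(step, total_steps, lr_max, lr_min):
    """Learning rate that decays from lr_max to lr_min along a half cosine."""
    progress = min(max(step / total_steps, 0.0), 1.0)
    return lr_min + 0.5 * (lr_max - lr_min) * (1.0 + math.cos(math.pi * progress))


# Hypothetical settings, for illustration only.
LR_MAX, LR_MIN, TOTAL_STEPS = 1e-3, 1e-5, 100
for step in (0, 50, 100):
    print(step, cosine_lr(step, TOTAL_STEPS, LR_MAX, LR_MIN))
```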

5. Zero-Shot Performance and Downstream Evaluation

Zero-Shot Organ and Abnormality Recognition:

Organ classification: 104-way, top-1 accuracy.
Abnormality detection: Precision, sensitivity, F1, and AUC.
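
The evaluation protocol can be sketched as nearest-prompt classification followed by standard metrics. The sketch assumes normalized embeddings and binary abnormality labels; the function names and toy inputs are hypothetical, and AUC is omitted for brevity.

```python
import numpy as np


def zero_shot_predict(organ_emb, prompt_emb):
    """Assign each organ embedding (N, D) to its most similar text prompt (K, D)."""
    return np.argmax(organ_emb @ prompt_emb.T, axis=1)


def top1_accuracy(pred, labels):
    return float(np.mean(np.asarray(pred) == np.asarray(labels)))


def binary_metrics(pred, labels):
    """Precision, sensitivity (recall), and F1 for binary predictions."""
    pred = np.asarray(pred, dtype=bool)
    labels = np.asarray(labels, dtype=bool)
    tp = int(np.sum(pred & labels))
    fp = int(np.sum(pred & ~labels))
    fn = int(np.sum(~pred & labels))
    precision = tp / (tp + fp) if tp + fp else 0.0
    sensitivity = tp / (tp + fn) if tp + fn else 0.0
    denom = precision + sensitivity
    f1 = 2 * precision * sensitivity / denom if denom else 0.0
    return precision, sensitivity, f1


# Toy setup: three class prompts and four organ embeddings.
prompt_emb = np.eye(3)
organ_emb = np.array([[0.9, 0.1, 0.0], [0.2, 0.8, 0.1], [0.1, 0.3, 0.7], [0.0, 0.6, 0.4]])
organ_emb = organ_emb / np.linalg.norm(organ_emb, axis=1, keepdims=True)
pred = zero_shot_predict(organ_emb, prompt_emb)
accuracy = top1_accuracy(pred, [0, 1, 2, 2])
precision, sensitivity, f1 = binary_metrics([1, 0, 1, 1], [1, 0, 0, 1])
print(pred, accuracy, precision, sensitivity, f1)
```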

Performance Results:

CNN (nnU-Net backbone):
- Vanilla CLIP: 0% organ accuracy, AUC 52.23%.
- +AT only: 0.03% organ accuracy, AUC 66.00%.
- +AT+OT: 86.9% organ accuracy, AUC 66.76%.
- +AT+OT+A-Dict: 86.2% organ accuracy, AUC 68.63%.
ViT (MiT backbone):
- Vanilla CLIP: 0% organ accuracy, AUC 52.37%.
- +AT+OT+A-Dict: 84.9% organ accuracy, AUC 71.90%.

Downstream Fine-Tuning (Multi-Cancer Screening):

Backbone	Initialization	Dice (%)	AUC (%)
nnU-Net	Scratch	29.88	82.12
nnU-Net	CLIP	33.45	87.32
nnU-Net	CT-GLIP	34.70	89.55
MiT	Scratch	22.68	80.94
MiT	CLIP	30.65	87.00
MiT	CT-GLIP	35.77	88.60

This demonstrates consistent improvements over both training from scratch and vanilla CLIP-initialized models across segmentation and detection tasks (Lin et al., 2024).
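
For reference, the Dice score reported in the table measures the overlap between a predicted and a reference mask. A minimal sketch for binary masks follows; the smoothing constant `eps` and the toy masks are assumptions.

```python
import numpy as np


def dice_score(pred, target, eps=1e-8):
    """Dice overlap between two boolean masks of the same shape."""
    pred = np.asarray(pred, dtype=bool)
    target = np.asarray(target, dtype=bool)
    intersection = np.logical_and(pred, target).sum()
    return float((2.0 * intersection + eps) / (pred.sum() + target.sum() + eps))


pred_mask = np.array([1, 1, 0, 0], dtype=bool)
true_mask = np.array([1, 0, 0, 0], dtype=bool)
score = dice_score(pred_mask, true_mask)   # 2 * 1 / (2 + 1)
print(score)
```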

6. Ablation Analyses, Insights, and Limitations

Contributions of Training Components:

Organ-text alignment ($\mathcal{L}_{\text{OT}}$) is critical for high-accuracy organ recognition.
Abnormality-text alignment ($\mathcal{L}_{\text{AT}}$), particularly with dictionary-augmented hard negatives, drives substantial increases in zero-shot abnormality detection (AUC).

Qualitative Results:

Nearest-neighbor retrieval in the joint embedding space enables plausible zero-shot classification of both organs (104 classes) and pathologies, illustrating effective semantic compositionality.

Limitations:

Pseudo-segmentation mask quality may affect feature pooling.
The abnormality dictionary, capped at 512 entries per organ, may not exhaustively represent rare or subtle diagnostic phrasing.
Report parsing remains partially manual or dependent on LLM outputs (LLaMA-2 plus human verification).

Future Directions:

Expansion of the abnormality dictionary for broader coverage.
Advanced LLM-based structuring of radiology reports for automation.
Integration of 3D detection/localization heads.
Extension to other imaging modalities (MRI, PET) and finer-grained spatial proposals.

7. Context and Significance

CT-GLIP represents a substantial advancement in Medical Vision-Language Pretraining by addressing full-body 3D CT imaging and aligning deep representations of complex anatomy and pathology with natural language descriptions. The framework demonstrates robust zero-shot capabilities and strong downstream transfer, opening pathways for AI systems to generalize across rare pathologies and diverse anatomical sites. Its methodological innovations—organ-wise contrastive pairing, augmented textual negatives, and large-scale multimodal training—lay the foundation for further research into scalable multimodal representation learning in medical imaging (Lin et al., 2024).

References

Lin et al. (2024). CT-GLIP: 3D Grounded Language-Image Pretraining with CT Scans and Radiology Reports for Full-Body Scenarios.
