Comprehensive Language-Image Pre-training (COLIPRI)

Updated 3 July 2026

The paper introduces a unified 3D vision-language encoder for CT imaging by combining contrastive image-text alignment, report generation, and masked image modeling.
It employs a specialized 3D ViT backbone, BiomedVLP text encoder, and transformer-based report decoder to effectively process volumetric data and radiology reports.
Experimental results show state-of-the-art performance in multimodal report generation, classification probing, and zero-shot tasks while addressing data scarcity in medical imaging.

Comprehensive Language-image Pre-training (COLIPRI) defines a methodology for creating unified 3D vision-language encoders tailored to medical CT imaging, specifically leveraging both image-only and paired image-text datasets. The approach addresses data scarcity by integrating multiple training objectives—contrastive image-text alignment, report generation, and self-supervised masked image modeling—along with domain-specific architectural choices. The resulting framework, the COLIPRI encoder family, demonstrates state-of-the-art performance in multimodal report generation, classification probing, and zero-shot evaluation on 3D medical imaging benchmarks, while remaining competitive in dense tasks such as semantic segmentation (Wald et al., 16 Oct 2025).

1. Model Components and Architecture

The COLIPRI framework is built around three key modules:

3D Vision Encoder ("Primus-M"): Employs a 3D ViT-like backbone with patch size $8 \times 8 \times 8$, converting volumetric CT data (e.g., $160^3$ voxels) into $20^3 = 8000$ tokens. Each patch is linearly projected to a 768-dimensional embedding and processed by a transformer stack of $L=24$ layers with $H=12$ attention heads and a feed-forward hidden size of $3072$. Absolute positional embeddings are omitted in the final COLIPRI-C variant to allow flexible input sizes at inference. Multi-head attention pooling (12 heads, "CLS-style" learned queries) aggregates the tokens into a single global vision embedding (dim = 768).
Text Encoder: Utilizes a BiomedVLP (BERT-based) text encoder, pretrained on chest X-ray captions, whose hidden size matches that of the vision encoder. Multi-head attention pooling (freshly trained per experiment) outputs a 768-dimensional text embedding.
Report Generation Decoder (COLIPRI-CR/CRM variants): A transformer decoder ("Eva02", depth $n_d \approx 3$–$6$, best $N=6$) whose text queries cross-attend to frozen vision tokens. Causal masking is used for next-token prediction and parallel masking for parallel captioning.
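
The token count quoted for the vision encoder follows directly from the patch grid. The sketch below reproduces that arithmetic for a non-overlapping 3D patchification; the function names are illustrative and not taken from the paper's code.

```python
def patch_grid(volume_shape, patch_size):
    """Number of non-overlapping patches along each axis of a 3D volume."""
    if any(dim % patch_size != 0 for dim in volume_shape):
        raise ValueError("each volume dimension must be divisible by the patch size")
    return tuple(dim // patch_size for dim in volume_shape)


def token_count(volume_shape, patch_size):
    """Total number of tokens produced by the patch embedding."""
    depth, height, width = patch_grid(volume_shape, patch_size)
    return depth * height * width


def voxels_per_patch(patch_size):
    """Voxels flattened into one patch before the linear projection."""
    return patch_size ** 3


tokens_patch8 = token_count((160, 160, 160), 8)    # 20 * 20 * 20 tokens
tokens_patch16 = token_count((160, 160, 160), 16)  # 10 * 10 * 10 tokens
print(tokens_patch8, tokens_patch16, voxels_per_patch(8))
```

With patch size 8, each token covers 512 voxels; doubling the patch size to 16 cuts the sequence length by a factor of eight.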

2. Pre-training Objectives and Optimization

The learning objective is a weighted composite of three loss terms, applied with alternating schedules and dataset availability:

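Vision–Language Contrastive Loss ($L_{\text{VL}}$): Implements the symmetric InfoNCE/CLIP loss for aligning vision and text embeddings. In its standard form, for a batch of $B$ paired embeddings:

$$L_{\text{VL}} = -\frac{1}{2B} \sum_{i=1}^{B} \left[ \log \frac{\exp(s_{ii}/\tau)}{\sum_{j=1}^{B} \exp(s_{ij}/\tau)} + \log \frac{\exp(s_{ii}/\tau)}{\sum_{j=1}^{B} \exp(s_{ji}/\tau)} \right]$$

with $s_{ij} = \langle v_i, t_j \rangle$ the similarity between the normalized vision embedding $v_i$ and text embedding $t_j$, and $\tau$ the temperature.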
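
As a minimal sketch of the symmetric InfoNCE/CLIP objective, the pure-Python function below averages the image-to-text and text-to-image cross-entropies over a batch of paired embeddings. The temperature value of 0.07 and all function names are assumptions for illustration; the paper's actual settings may differ.

```python
import math


def log_softmax(row):
    """Numerically stable log-softmax over a list of floats."""
    m = max(row)
    log_z = m + math.log(sum(math.exp(x - m) for x in row))
    return [x - log_z for x in row]


def normalize(vec):
    """Scale a vector to unit L2 norm."""
    norm = math.sqrt(sum(x * x for x in vec))
    return [x / norm for x in vec]


def symmetric_info_nce(vision, text, temperature=0.07):
    """Symmetric contrastive loss; vision[i] and text[i] form a positive pair."""
    v = [normalize(x) for x in vision]
    t = [normalize(x) for x in text]
    n = len(v)
    # Temperature-scaled cosine similarity matrix (rows: images, columns: texts).
    sim = [[sum(a * b for a, b in zip(v[i], t[j])) / temperature
            for j in range(n)] for i in range(n)]
    sim_t = [list(col) for col in zip(*sim)]
    image_to_text = -sum(log_softmax(sim[i])[i] for i in range(n)) / n
    text_to_image = -sum(log_softmax(sim_t[i])[i] for i in range(n)) / n
    return 0.5 * (image_to_text + text_to_image)


aligned = symmetric_info_nce([[1.0, 0.0], [0.0, 1.0]], [[1.0, 0.0], [0.0, 1.0]])
swapped = symmetric_info_nce([[1.0, 0.0], [0.0, 1.0]], [[0.0, 1.0], [1.0, 0.0]])
print(aligned < swapped)
```

Correctly paired embeddings drive the loss toward zero, while a batch in which every embedding is identical yields $\log B$, the value of a uniform guess.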

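Report Generation Loss ($L_{\text{RRG}}$): Standard causal language-modeling cross-entropy over the “Findings” section tokens $w_1, \dots, w_T$, conditioned on the vision tokens $V$:

$$L_{\text{RRG}} = -\frac{1}{T} \sum_{t=1}^{T} \log p_\theta(w_t \mid w_{<t}, V)$$

Combined in the total loss with weight $\lambda_{\text{RRG}}$.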
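
The following sketch illustrates the two ingredients of a causal language-modeling loss: the lower-triangular attention mask and the mean negative log-likelihood of the target tokens. It assumes the decoder has already produced a probability distribution per position; the names `causal_mask` and `causal_lm_loss` are hypothetical.

```python
import math


def causal_mask(length):
    """mask[i][j] is True when position i may attend to position j (j <= i)."""
    return [[j <= i for j in range(length)] for i in range(length)]


def causal_lm_loss(step_probs, target_ids):
    """Mean negative log-likelihood of the target tokens.

    step_probs[t] is the predicted distribution over the vocabulary for
    token t, given all earlier tokens (and, in COLIPRI, the vision tokens).
    """
    nll = [-math.log(step_probs[t][token]) for t, token in enumerate(target_ids)]
    return sum(nll) / len(nll)


# A uniform predictor over a 4-word vocabulary scores log(4) per token.
uniform = [[0.25, 0.25, 0.25, 0.25] for _ in range(3)]
print(causal_lm_loss(uniform, [0, 2, 1]))
print(causal_mask(3))
```

Parallel captioning, by contrast, would drop the triangular constraint so that all positions are predicted at once.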

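Vision-Only Masked Autoencoder Loss ($L_{\text{MAE}}$): Self-supervised masked autoencoding on a randomly masked subset $\mathcal{M}$ of the 3D patches, with the reconstruction error measured on the masked patches only:

$$L_{\text{MAE}} = \frac{1}{|\mathcal{M}|} \sum_{p \in \mathcal{M}} \lVert x_p - \hat{x}_p \rVert_2^2$$

Introduced with weight $\lambda_{\text{MAE}}$ during the final 25% of training to limit interference with the vision–language objective.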
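
A masked-autoencoder loss can be sketched in two steps: sample which patches to hide, then score the reconstruction on the hidden patches only. The example below uses uniform random masking for simplicity (not block masking), and the 75% ratio is taken from the ablation summary later in this page; the function names are illustrative.

```python
import random


def sample_mask(num_patches, mask_ratio, seed=0):
    """Return a list of booleans; True marks a masked (hidden) patch."""
    rng = random.Random(seed)
    num_masked = round(num_patches * mask_ratio)
    masked = set(rng.sample(range(num_patches), num_masked))
    return [index in masked for index in range(num_patches)]


def masked_mse(patches, reconstructions, mask):
    """Squared L2 reconstruction error, averaged over masked patches only."""
    total, count = 0.0, 0
    for original, rebuilt, is_masked in zip(patches, reconstructions, mask):
        if is_masked:
            total += sum((a - b) ** 2 for a, b in zip(original, rebuilt))
            count += 1
    return total / count


mask = sample_mask(8000, 0.75)
print(sum(mask))  # number of masked patches out of 8000

patches = [[1.0, 1.0], [2.0, 2.0]]
rebuilt = [[1.0, 1.0], [0.0, 0.0]]
print(masked_mse(patches, rebuilt, [False, True]))
```

Visible patches contribute nothing to the loss, so the encoder is rewarded only for inferring hidden content from context.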

The overall pretraining alternates between batches for (i) vision–language (paired) CLIP plus RRG and (ii) vision-only (MAE) updates. The weighted sum per iteration is:

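$$L = L_{\text{VL}} + \lambda_{\text{RRG}} L_{\text{RRG}} + \lambda_{\text{MAE}} L_{\text{MAE}}$$

where $\lambda_{\text{RRG}}$ and $\lambda_{\text{MAE}}$ weight the report generation and masked autoencoder terms.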
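
One way to picture the alternating schedule is a function that decides, per step, whether a paired or a vision-only batch is drawn, with vision-only batches appearing only in the final 25% of training. The strict one-to-one alternation and the weight values (0.5 and 0.1) below are assumptions made for this sketch, not settings reported by the source.

```python
def batch_kind(step, total_steps, mae_start_fraction=0.75):
    """Return 'paired' or 'vision_only' for a given training step.

    Assumption: once the MAE phase starts, odd steps use vision-only data.
    """
    in_mae_phase = step >= int(mae_start_fraction * total_steps)
    if in_mae_phase and step % 2 == 1:
        return "vision_only"
    return "paired"


def iteration_loss(kind, l_vl=0.0, l_rrg=0.0, l_mae=0.0,
                   lambda_rrg=0.5, lambda_mae=0.1):
    """Weighted loss for one iteration; the lambda values are hypothetical."""
    if kind == "paired":
        return l_vl + lambda_rrg * l_rrg
    return lambda_mae * l_mae


total_steps = 1000
kinds = [batch_kind(step, total_steps) for step in range(total_steps)]
print(kinds.count("paired"), kinds.count("vision_only"))
```

Delaying the vision-only updates keeps the early phase of training focused on image-text alignment.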

3. Datasets and Inductive Biases

Datasets

CT-RATE: 25,692 non-contrast chest CTs (50,988 reconstructions) paired with full radiology reports (Findings + Impression).
NLST: $\approx$73,000 low-dose chest CTs (no reports, two reconstructions each).
Downstream: CT-RATE test (1,000), RAD-ChestCT (3,630 volumes, 16-class multi-abnormality).

Inductive Biases

Patch size $8 \times 8 \times 8$: Preserves high in-plane resolution.
Sentence shuffle + LLM-generated report shortening: Regularization for handling report length, narrowing the domain gap between long training reports and short zero-shot prompts.
Multi-scale masking for MAE: Block masking encourages locality.
Cropped field-of-view: Smaller crops ($128^3$–$160^3$) encourage more robust semantic features.
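
Sentence shuffling is simple to illustrate: split a report into sentences and permute their order, so the text encoder cannot rely on a fixed report layout. The splitter below is deliberately naive (it breaks on sentence-ending punctuation followed by whitespace) and is an assumption of this sketch, not the paper's preprocessing.

```python
import random
import re


def split_sentences(report):
    """Naive sentence splitter on '.', '!' or '?' followed by whitespace."""
    parts = re.split(r"(?<=[.!?])\s+", report.strip())
    return [part.strip() for part in parts if part.strip()]


def shuffle_sentences(report, seed=None):
    """Return the report with its sentences in a random order."""
    rng = random.Random(seed)
    sentences = split_sentences(report)
    rng.shuffle(sentences)
    return " ".join(sentences)


report = ("No pleural effusion. The heart size is normal. "
          "Mild emphysema in the upper lobes.")
print(shuffle_sentences(report, seed=0))
```

A production pipeline would need a splitter that handles abbreviations and measurements, which this regular expression does not.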

4. Implementation, Hyperparameters, and Compute

Optimizer: AdamW, with weight decay set separately for CLIP and RRG training.
Batch size: 8 (250k steps) or 16 (125k steps) for CLIP+RRG; 32 for RRG fine-tuning.
Learning rates: A base rate for batch size 8, scaled for batch size 16, with a cosine-decay schedule and warm-up; RRG fine-tuning and MAE use separate learning rates.
Hardware: All experiments conducted on a single node with 4 × A100 80GB GPUs.
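
The two batch configurations process the same number of samples, and the learning-rate schedule combines a warm-up with cosine decay. The sketch below shows one common form of such a schedule; the base rate of 1e-4, the 1,000 warm-up steps, and the linear warm-up shape are placeholder assumptions, since the source values are not available here.

```python
import math


def lr_at(step, base_lr, warmup_steps, total_steps, min_lr=0.0):
    """Linear warm-up to base_lr, then cosine decay to min_lr."""
    if step < warmup_steps:
        return base_lr * (step + 1) / warmup_steps
    progress = (step - warmup_steps) / max(1, total_steps - warmup_steps)
    return min_lr + 0.5 * (base_lr - min_lr) * (1.0 + math.cos(math.pi * progress))


# Both CLIP+RRG configurations see the same number of training samples.
samples_batch8 = 8 * 250_000
samples_batch16 = 16 * 125_000
print(samples_batch8 == samples_batch16)

# Hypothetical schedule for the batch-16 run.
for step in (0, 999, 60_000, 125_000):
    print(step, lr_at(step, base_lr=1e-4, warmup_steps=1_000, total_steps=125_000))
```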

5. Evaluation and Results

Multimodal Report Generation

BLEU-1/4: COLIPRI-CRM achieves 0.191/0.038, vs. CT-CHAT baseline 0.204/0.041.
ROUGE-L: 0.219 (CRM) vs. 0.237 (CT-CHAT).
METEOR: 0.240 (CRM), 0.260 (CT-CHAT).
Clinical factuality (RadBERT Macro F1): CRM = 0.404, best baseline = 0.386.
RadFact-CT F1(+): CRM = 0.269, baseline = 0.201.

Classification Probing

Metric	COLIPRI-C (CT-RATE)	Merlin (CT-RATE)	COLIPRI-C (RAD-ChestCT)	Merlin (RAD-ChestCT)
AUPRC	52.9	50.8	39.5	38.1
AUROC	79.4	78.1	66.8	67.3
BA	71.2	69.5	63.4	65.1
F1	71.6	69.2	59.7	60.0

Zero-Shot Classification

Native prompts: COLIPRI-C reaches AUROC = 77.8 and w-F1 = 75.2 on CT-RATE (fVLM: 77.9/75.8).
Short prompts: AUROC = 67.7 and w-F1 = 63.8, a drop of 10.1 and 11.4 points respectively, demonstrating prompt sensitivity.
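
A common CLIP-style recipe for zero-shot abnormality classification compares the image embedding against a positive and a negative text prompt and applies a softmax over the two similarities. The sketch below follows that generic recipe; whether COLIPRI uses exactly this scoring rule is an assumption, as are the temperature and the toy embeddings.

```python
import math


def cosine(a, b):
    """Cosine similarity between two vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b)


def zero_shot_probability(image_emb, pos_prompt_emb, neg_prompt_emb, temperature=0.07):
    """Probability that the finding is present, from a two-way softmax."""
    s_pos = cosine(image_emb, pos_prompt_emb) / temperature
    s_neg = cosine(image_emb, neg_prompt_emb) / temperature
    m = max(s_pos, s_neg)  # subtract the max for numerical stability
    e_pos, e_neg = math.exp(s_pos - m), math.exp(s_neg - m)
    return e_pos / (e_pos + e_neg)


# Toy embeddings: the image points toward the "finding present" prompt.
image = [0.9, 0.1]
present = [1.0, 0.0]
absent = [0.0, 1.0]
print(zero_shot_probability(image, present, absent))
```

Because the decision depends entirely on the prompt embeddings, changing the prompt wording shifts the scores, which is consistent with the prompt sensitivity reported above.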

Semantic Segmentation (Dice, avg. foreground)

Model	LiTS	Lung	KiTS23
From scratch	85.1	71.6	80.8
MAE-pretrained	88.3	76.4	84.3
COLIPRI-C	85.9	72.0	81.7
COLIPRI-CRM	86.1	71.1	82.0
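
For reference, the Dice score averaged over foreground classes can be computed as sketched below on flattened label maps. This is a generic definition with illustrative function names; the benchmark's exact evaluation protocol (for example, per-case versus pooled averaging) is not specified here. The table reports the score as a percentage.

```python
def dice_for_label(pred, target, label):
    """Dice overlap for one class; None if the class is absent from both."""
    pred_fg = [p == label for p in pred]
    target_fg = [t == label for t in target]
    intersection = sum(1 for p, t in zip(pred_fg, target_fg) if p and t)
    denominator = sum(pred_fg) + sum(target_fg)
    if denominator == 0:
        return None
    return 2.0 * intersection / denominator


def mean_foreground_dice(pred, target, background=0):
    """Average Dice over all non-background labels present in either map."""
    labels = sorted((set(pred) | set(target)) - {background})
    scores = [dice_for_label(pred, target, label) for label in labels]
    scores = [score for score in scores if score is not None]
    return sum(scores) / len(scores) if scores else None


prediction = [0, 1, 1, 2]
reference = [0, 1, 2, 2]
print(100.0 * mean_foreground_dice(prediction, reference))
```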

6. Ablation Studies and Analysis

Language augmentations: Sentence shuffling and LLM-shortening improve retrieval R@1 from 51 → 56 → 60 and zero-shot AUROC from 57 → 61 → 66.
Input FOV / patch size / pooling: Smaller crops ($128^3$–$160^3$) outperform $192^3$ in retrieval; patch size 16 causes a 5-point degradation; attention pooling yields the best zero-shot performance (77.8 vs. 72.3 for max-pooling).
Objective trade-offs: Softmax outperforms sigmoid for CLIP loss; moderate augmentation and batch 16/125k steps preferred.
Report Generation: A shallow decoder and parallel captioning give the best balance between retrieval and classification.
MAE: Optimal with a 75% mask ratio, block masking, and late-stage application, together with a tuned decoder configuration and loss weight.

7. Limitations, Future Directions, and Clinical Implications

Limitations

Zero-shot classification shows marked prompt sensitivity: short prompts significantly degrade performance.
Addition of self-supervised MAE does not boost segmentation when combined with CLIP/RRG.
Segmentation improvements from COLIPRI remain modest compared to MAE pre-training alone.

Potential Improvements and Directions

Investigating prompt-contrastive training and prompt-invariant embedding spaces to mitigate prompt sensitivity.
Replacing MAE with embedding-level masked modeling (e.g., iBOT, DINOv2/v3) to better integrate global and local learning objectives.
Extending generative modeling to other report sections (Impression generation, radiology templates).
Clinical applications include open-vocabulary detection, image-to-report interfaces, zero-shot triage, and radiologist decision support through similar case retrieval.

In summary, COLIPRI operationalizes a three-pronged pretraining strategy—contrastive alignment, generative modeling, self-supervised masking—to achieve leading results in 3D chest CT report generation and classification, while highlighting ongoing challenges in prompt robustness and dense task generalization (Wald et al., 16 Oct 2025).


References (1)

Wald et al. (2025). Comprehensive language-image pre-training for 3D medical image understanding.
