
Comprehensive Language-Image Pre-training (COLIPRI)

Updated 3 July 2026
  • The paper introduces a unified 3D vision-language encoder for CT imaging by combining contrastive image-text alignment, report generation, and masked image modeling.
  • It employs a specialized 3D ViT backbone, BiomedVLP text encoder, and transformer-based report decoder to effectively process volumetric data and radiology reports.
  • Experimental results show state-of-the-art performance in multimodal report generation, classification probing, and zero-shot tasks while addressing data scarcity in medical imaging.

Comprehensive Language-image Pre-training (COLIPRI) defines a methodology for creating unified 3D vision-language encoders tailored to medical CT imaging, specifically leveraging both image-only and paired image-text datasets. The approach addresses data scarcity by integrating multiple training objectives—contrastive image-text alignment, report generation, and self-supervised masked image modeling—along with domain-specific architectural choices. The resulting framework, the COLIPRI encoder family, demonstrates state-of-the-art performance in multimodal report generation, classification probing, and zero-shot evaluation on 3D medical imaging benchmarks, while remaining competitive in dense tasks such as semantic segmentation (Wald et al., 16 Oct 2025).

1. Model Components and Architecture

The COLIPRI framework is built around three key modules:

  • 3D Vision Encoder ("Primus-M"): Employs a 3D ViT-like backbone with patch size $8 \times 8 \times 8$, converting volumetric CT data (e.g., $160^3$ voxels) into $\approx 8000$ tokens. Each patch is linearly projected to a 768-dimensional embedding, followed by a transformer stack of $L=24$ layers with $H=12$ attention heads and a hidden size of $3072$. Absolute positional embeddings are omitted in the final COLIPRI-C variant for flexible input sizing at inference. Multi-head attention pooling (12 heads, "CLS-style" learned queries) aggregates the tokens into a single global vision embedding (dim = 768).
  • Text Encoder: Uses a BiomedVLP transformer text encoder, pretrained on chest X-ray captions, with hidden size matched to the vision encoder. Multi-head attention pooling (freshly trained per experiment) outputs a 768-dimensional text embedding.
  • Report Generation Decoder (COLIPRI-CR/CRM variants): A transformer decoder ("Eva02", depth $n_d \approx 3$–$6$, best $N=6$) cross-attends from text queries to frozen vision tokens. Causal masking is used for next-token prediction and parallel masking for parallel captioning.
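
The token count quoted for the vision encoder follows from the patch grid. As a quick check, assuming non-overlapping patches and no extra class token (the helper name `patch_token_count` is illustrative, not from the paper):

```python
def patch_token_count(volume_shape, patch_size):
    """Count non-overlapping patches that tile a volume exactly."""
    count = 1
    for dim, patch in zip(volume_shape, patch_size):
        if dim % patch != 0:
            raise ValueError(f"{dim} is not divisible by patch size {patch}")
        count *= dim // patch
    return count


# A 160^3 crop with 8x8x8 patches gives a 20 x 20 x 20 grid.
tokens = patch_token_count((160, 160, 160), (8, 8, 8))
print(tokens)  # 8000

# Each patch flattens to 8 * 8 * 8 = 512 voxels before the linear projection to 768 dims.
voxels_per_patch = 8 * 8 * 8
print(voxels_per_patch)  # 512
```

The same helper shows how strongly patch size drives sequence length: with patch size 16, a $160^3$ crop would yield only 1000 tokens.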

2. Pre-training Objectives and Optimization

The learning objective is a weighted composite of three loss terms, applied on alternating schedules depending on which data (paired image-text or image-only) is available:

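  1. Vision–Language Contrastive Loss ($L_{\text{VL}}$): Implements the symmetric InfoNCE/CLIP loss for aligning vision and text embeddings:

$$L_{\text{VL}} = -\frac{1}{2B} \sum_{i=1}^{B} \left[ \log \frac{\exp(s_{ii})}{\sum_{j=1}^{B} \exp(s_{ij})} + \log \frac{\exp(s_{ii})}{\sum_{j=1}^{B} \exp(s_{ji})} \right]$$

with $s_{ij} = \langle v_i, t_j \rangle / \tau$ the temperature-scaled similarity between vision embedding $v_i$ and text embedding $t_j$, and $B$ the batch size.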

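  2. Report Generation Loss ($L_{\text{RRG}}$): Standard causal language-modeling cross-entropy over the “Findings” section tokens:

$$L_{\text{RRG}} = -\sum_{t=1}^{T} \log p_\theta(w_t \mid w_{<t}, V)$$

where $w_t$ are the report tokens and $V$ the vision tokens. Combined in the total loss with weight $\lambda_{\text{RRG}}$.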

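  3. Vision-Only Masked Autoencoder Loss ($L_{\text{MAE}}$): Self-supervised masked autoencoding on randomly masked 3D patches (75% mask ratio in the best ablation setting):

$$L_{\text{MAE}} = \frac{1}{|M|} \sum_{i \in M} \lVert \hat{x}_i - x_i \rVert_2^2$$

where $M$ is the set of masked patches and $\hat{x}_i$ the reconstruction of patch $x_i$. Introduced with weight $\lambda_{\text{MAE}}$ during the final 25% of training to limit interference with the vision-language alignment objective.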
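
To make the contrastive term in item 1 concrete, here is a minimal pure-Python sketch of a symmetric InfoNCE loss. It assumes cosine similarity and a fixed temperature of 0.07 (a common CLIP default, not a value taken from the paper); the function names are illustrative, and a real implementation would use a tensor library.

```python
import math


def l2_normalize(vec):
    """Scale a vector to unit length (zero vectors are returned unchanged)."""
    norm = math.sqrt(sum(x * x for x in vec))
    return [x / norm for x in vec] if norm > 0 else list(vec)


def log_softmax(row):
    """Numerically stable log-softmax of a list of logits."""
    peak = max(row)
    log_sum = peak + math.log(sum(math.exp(x - peak) for x in row))
    return [x - log_sum for x in row]


def symmetric_info_nce(vision, text, temperature=0.07):
    """Average of image-to-text and text-to-image cross-entropy.

    vision[i] and text[i] are assumed to be a matched pair; all other
    combinations in the batch act as negatives.
    """
    v = [l2_normalize(x) for x in vision]
    t = [l2_normalize(x) for x in text]
    n = len(v)
    # logits[i][j]: cosine similarity of image i and report j, divided by temperature.
    logits = [[sum(a * b for a, b in zip(v[i], t[j])) / temperature
               for j in range(n)] for i in range(n)]
    logits_t = [list(col) for col in zip(*logits)]
    image_to_text = -sum(log_softmax(logits[i])[i] for i in range(n)) / n
    text_to_image = -sum(log_softmax(logits_t[i])[i] for i in range(n)) / n
    return 0.5 * (image_to_text + text_to_image)


aligned = symmetric_info_nce([[1.0, 0.0], [0.0, 1.0]], [[1.0, 0.0], [0.0, 1.0]])
swapped = symmetric_info_nce([[1.0, 0.0], [0.0, 1.0]], [[0.0, 1.0], [1.0, 0.0]])
print(aligned < swapped)  # True
```

Correctly paired embeddings give a loss near zero, while swapping the reports across images drives it up, which is the signal the encoder is trained on.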

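The overall pretraining alternates between (i) paired vision–language batches, which update the CLIP and RRG objectives, and (ii) vision-only batches, which update the MAE objective. The weighted sum per iteration is:

$$L = L_{\text{VL}} + \lambda_{\text{RRG}} L_{\text{RRG}} + \lambda_{\text{MAE}} L_{\text{MAE}}$$

where $L_{\text{VL}}$, $L_{\text{RRG}}$, and $L_{\text{MAE}}$ are the contrastive, report generation, and masked autoencoder losses, and $\lambda_{\text{RRG}}$ and $\lambda_{\text{MAE}}$ are the weights of the latter two.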
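
The alternation and the late introduction of the MAE term can be sketched as follows. This is a hedged illustration only: the weight values, the step count used in the example, and the names `mae_weight` and `batch_loss` are assumptions, and the exact alternation pattern is not specified here.

```python
def mae_weight(step, total_steps, lambda_mae, start_fraction=0.75):
    """Return the MAE weight: zero until the final quarter of training."""
    return lambda_mae if step >= start_fraction * total_steps else 0.0


def batch_loss(kind, step, total_steps, losses, lambda_rrg, lambda_mae):
    """Weighted loss for one batch.

    kind: "paired" (image + report) or "image_only".
    losses: dict of already-computed scalar losses for this batch.
    """
    if kind == "paired":
        return losses["vl"] + lambda_rrg * losses["rrg"]
    if kind == "image_only":
        return mae_weight(step, total_steps, lambda_mae) * losses["mae"]
    raise ValueError(f"unknown batch kind: {kind}")


# Hypothetical weights and loss values, chosen only to exercise the code.
TOTAL_STEPS = 125_000
early = batch_loss("image_only", 10_000, TOTAL_STEPS, {"mae": 0.8}, 0.5, 0.1)
late = batch_loss("image_only", 100_000, TOTAL_STEPS, {"mae": 0.8}, 0.5, 0.1)
paired = batch_loss("paired", 10_000, TOTAL_STEPS, {"vl": 2.0, "rrg": 3.0}, 0.5, 0.1)
print(early, paired)  # 0.0 3.5
```

Image-only batches contribute nothing before the 75% mark, so the contrastive and generative objectives shape the encoder first.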

3. Datasets and Inductive Biases

Datasets

  • CT-RATE: 25,692 non-contrast chest CTs (50,988 reconstructions) paired with full radiology reports (Findings + Impression).
  • NLST: $\approx$73,000 low-dose chest CTs (no reports, two reconstructions each).
  • Downstream: CT-RATE test (1,000), RAD-ChestCT (3,630 volumes, 16-class multi-abnormality).

Inductive Biases

  • Patch size $8 \times 8 \times 8$: Preserves high in-plane resolution.
  • Sentence shuffle + LLM-generated report shortening: Regularization for handling report length, narrowing the domain gap between training reports and short zero-shot prompts.
  • Multi-scale masking for MAE: Block masking encourages locality.
  • Cropped field-of-view: Smaller crops ($128^3$–$160^3$) enhance robust semantic feature learning.
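
Sentence shuffling, the simpler of the two language augmentations listed above, can be sketched in a few lines. The splitting rule (sentence-ending punctuation followed by whitespace) and the function names are assumptions for illustration; the paper's own preprocessing may differ.

```python
import random
import re


def split_sentences(report):
    """Split on sentence-ending punctuation followed by whitespace."""
    parts = re.split(r"(?<=[.!?])\s+", report.strip())
    return [p for p in parts if p]


def shuffle_sentences(report, rng):
    """Return the report with its sentences in a random order."""
    sentences = split_sentences(report)
    rng.shuffle(sentences)
    return " ".join(sentences)


# Made-up example text, not taken from CT-RATE.
report = "Lungs are clear. No pleural effusion. A 3.5 mm nodule is seen in the right upper lobe."
augmented = shuffle_sentences(report, random.Random(0))
print(augmented)
```

Because the split requires whitespace after the period, decimal measurements such as "3.5 mm" stay intact.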

4. Implementation, Hyperparameters, and Compute

  • Optimizer: AdamW, with separate weight-decay settings for CLIP and RRG.
  • Batching: Batch size 8 (250k steps) or 16 (125k steps) for CLIP+RRG; 32 for RRG fine-tuning.
  • Learning Rates: A base rate for batch size 8, rescaled for batch size 16, with warm-up followed by a cosine-decay schedule; RRG fine-tuning and MAE use their own rates.
  • Hardware: All experiments conducted on a single node with 4 × A100 80GB GPUs.
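
A warm-up plus cosine-decay schedule of the kind described above can be sketched as below. The peak rate and warm-up length are placeholders (the actual values are not reproduced here), linear warm-up is an assumption, and `learning_rate` is an illustrative name.

```python
import math


def learning_rate(step, total_steps, peak_lr, warmup_steps):
    """Linear warm-up to peak_lr, then cosine decay to zero."""
    if step < warmup_steps:
        return peak_lr * (step + 1) / warmup_steps
    progress = (step - warmup_steps) / max(1, total_steps - warmup_steps)
    return 0.5 * peak_lr * (1.0 + math.cos(math.pi * min(1.0, progress)))


# Both batch settings in the text see the same number of training samples.
samples_small_batch = 8 * 250_000
samples_large_batch = 16 * 125_000
print(samples_small_batch == samples_large_batch)  # True

# Placeholder values: the real peak rate and warm-up length are not reproduced here.
PEAK_LR, WARMUP, TOTAL = 1e-4, 1_000, 125_000
print(learning_rate(WARMUP, TOTAL, PEAK_LR, WARMUP))  # 0.0001
```

The two batch configurations are compute-matched at 2,000,000 samples each, which is why the step count halves when the batch size doubles.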

5. Evaluation and Results

Multimodal Report Generation

  • BLEU-1/4: COLIPRI-CRM achieves 0.191/0.038, vs. CT-CHAT baseline 0.204/0.041.
  • ROUGE-L: 0.219 (CRM) vs. 0.237 (CT-CHAT).
  • METEOR: 0.240 (CRM) vs. 0.260 (CT-CHAT).
  • Clinical factuality (RadBERT Macro F1): CRM = 0.404, best baseline = 0.386.
  • RadFact-CT F1(+): CRM = 0.269, baseline = 0.201.

Classification Probing

| Model (dataset) | AUPRC | AUROC | BA | F1 |
|---|---|---|---|---|
| COLIPRI-C (CT) | 52.9 | 79.4 | 71.2 | 71.6 |
| Merlin (CT) | 50.8 | 78.1 | 69.5 | 69.2 |
| COLIPRI-C (RAD) | 39.5 | 66.8 | 63.4 | 59.7 |
| Merlin (RAD) | 38.1 | 67.3 | 65.1 | 60.0 |

Zero-Shot Classification

  • Native prompts: COLIPRI-C reaches AUROC = 77.8 and w-F1 = 75.2 on CT (fVLM: 77.9/75.8).
  • Short prompts: AUROC = 67.7, w-F1 = 63.8 (a drop of 10.1 and 11.4 points, respectively), demonstrating prompt sensitivity.
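
One common way to run zero-shot classification with a contrastive encoder is to compare the image embedding against a "present" and an "absent" prompt embedding for each finding. The sketch below assumes that setup, which may differ from the protocol used in the paper; the toy vectors stand in for encoder outputs and the temperature is a placeholder.

```python
import math


def cosine(a, b):
    """Cosine similarity of two non-zero vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm = math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(y * y for y in b))
    return dot / norm


def zero_shot_probability(image_emb, present_emb, absent_emb, temperature=0.07):
    """Softmax over the similarities to the 'present' and 'absent' prompts."""
    s_present = cosine(image_emb, present_emb) / temperature
    s_absent = cosine(image_emb, absent_emb) / temperature
    peak = max(s_present, s_absent)  # subtract the max for numerical stability
    e_present = math.exp(s_present - peak)
    e_absent = math.exp(s_absent - peak)
    return e_present / (e_present + e_absent)


# Toy 3-dimensional embeddings standing in for encoder outputs.
image = [0.9, 0.1, 0.2]
present = [1.0, 0.0, 0.0]  # stands in for a prompt stating the finding is present
absent = [0.0, 1.0, 0.0]   # stands in for a prompt stating the finding is absent
print(zero_shot_probability(image, present, absent) > 0.5)  # True
```

Since the prediction depends entirely on where the prompt embeddings land, a change in prompt style (long report-like text versus a short phrase) can shift scores, consistent with the sensitivity reported above.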

Semantic Segmentation (Dice, avg. foreground)

| Model | LiTS | Lung | KiTS23 |
|---|---|---|---|
| From scratch | 85.1 | 71.6 | 80.8 |
| MAE-pretrained | 88.3 | 76.4 | 84.3 |
| COLIPRI-C | 85.9 | 72.0 | 81.7 |
| COLIPRI-CRM | 86.1 | 71.1 | 82.0 |
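
For reference, the Dice score behind these numbers measures overlap between a predicted and a ground-truth mask. A minimal sketch for a single binary class follows (flat 0/1 sequences; the empty-mask convention is an assumption); the table values would be this quantity averaged over foreground classes and expressed as a percentage.

```python
def dice_score(prediction, target):
    """Dice overlap of two equally sized binary masks given as flat sequences."""
    if len(prediction) != len(target):
        raise ValueError("masks must have the same number of voxels")
    intersection = sum(1 for p, t in zip(prediction, target) if p and t)
    total = sum(1 for p in prediction if p) + sum(1 for t in target if t)
    # Two empty masks agree perfectly by convention.
    return 1.0 if total == 0 else 2.0 * intersection / total


pred = [1, 1, 0, 0, 1]
truth = [1, 0, 0, 0, 1]
print(dice_score(pred, truth))  # 0.8
```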

6. Ablation Studies and Analysis

  • Language augmentations: Sentence shuffling and LLM-shortening improve retrieval R@1 from 51 → 56 → 60 and zero-shot AUROC from 57 → 61 → 66.
  • Input FOV / patch size / pooling: Smaller crops ($128^3$–$160^3$) outperform $192^3$ in retrieval; patch size 16 degrades performance by 5 points; attention pooling yields the best zero-shot performance (77.8 vs. 72.3 for max pooling).
  • Objective trade-offs: Softmax outperforms sigmoid for the CLIP loss; moderate augmentation and batch size 16 with 125k steps are preferred.
  • Report Generation: A shallow decoder with parallel captioning gives the best retrieval/classification balance.
  • MAE: Optimal with a 75% mask ratio, block masking, a tuned decoder configuration and loss weight, and late-stage application.

7. Limitations, Future Directions, and Clinical Implications

Limitations

  • Zero-shot classification shows marked prompt sensitivity: short prompts significantly degrade performance.
  • Addition of self-supervised MAE does not boost segmentation when combined with CLIP/RRG.
  • Segmentation improvements from COLIPRI remain modest compared to MAE pre-training alone.

Potential Improvements and Directions

  • Investigating prompt-contrastive training and prompt-invariant embedding spaces to mitigate prompt sensitivity.
  • Replacing MAE with embedding-level masked modeling (e.g., iBOT, DINOv2/v3) to better integrate global and local learning objectives.
  • Extending generative modeling to other report sections (Impression generation, radiology templates).
  • Clinical applications include open-vocabulary detection, image-to-report interfaces, zero-shot triage, and radiologist decision support through similar case retrieval.

In summary, COLIPRI operationalizes a three-pronged pretraining strategy—contrastive alignment, generative modeling, self-supervised masking—to achieve leading results in 3D chest CT report generation and classification, while highlighting ongoing challenges in prompt robustness and dense task generalization (Wald et al., 16 Oct 2025).
