SPECTRE: CT Representation via SSL & Cross-Modal Pretraining

Updated 28 November 2025
  • The paper introduces SPECTRE, a dual-transformer model utilizing self-supervised and cross-modal pretraining for extracting robust 3D CT features.
  • It employs innovative volumetric tokenization and hierarchical design to balance local detail with global context, ensuring computational efficiency.
  • Empirical results show significant improvements in zero-shot classification and segmentation tasks across diverse clinical benchmarks.

Self-Supervised & Cross-Modal Pretraining for CT Representation Extraction (SPECTRE) refers to a suite of methods for learning general-purpose, high-quality CT image representations using unlabeled or weakly labeled volumetric data and heterogeneous clinical sources. SPECTRE approaches address the unique computational and semantic challenges of 3D CT by combining scalable transformer-based self-supervised learning (SSL), vision-language pretraining via cross-modal contrast, and explicit architectural and data augmentations for large-scale medical imaging. The result is a representation backbone that is robust across organ systems, generalizes across diverse CT tasks, and enables strong zero-shot or low-shot clinical performance.

1. Dual-Transformer Architectures and Volumetric Tokenization

SPECTRE systems employ hierarchical, dual-transformer designs optimized for volumetric CT. The "local transformer" extracts high-resolution patch-wise features from 3D windows, while a "global transformer" fuses context across the volume by aggregating window-level summaries. Volumetric tokenization uses anisotropic patch sizes (typically $16\times16\times8$ voxels) to accommodate the inherent slice spacing of CT, reducing attention complexity while preserving anatomical structure. Three-dimensional rotary positional embeddings inject spatial coordinates directly into the model, and "box jittering" augments the spatial input for improved robustness. This dual-scale approach keeps computation tractable (scaling well below $O(N^2)$ in the number of input tokens) and supports both local detail and long-range context modeling (Claessens et al., 21 Nov 2025).
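
A minimal sketch of anisotropic volumetric tokenization, assuming a PyTorch Conv3d patch embedding; the embedding dimension and shapes are illustrative choices, not the released SPECTRE code.

```python
import torch
import torch.nn as nn

class VolumetricTokenizer(nn.Module):
    """Split a CT volume into anisotropic 3D patches and embed each as a token.

    The 16x16x8 patch size follows the paper's description; the embedding
    dimension and the Conv3d trick are illustrative choices.
    """

    def __init__(self, in_channels=1, embed_dim=768, patch_size=(16, 16, 8)):
        super().__init__()
        # A strided Conv3d is equivalent to flattening non-overlapping
        # patches and applying a shared linear projection to each one.
        self.proj = nn.Conv3d(in_channels, embed_dim,
                              kernel_size=patch_size, stride=patch_size)

    def forward(self, volume):
        # volume: (B, 1, H, W, D), with H, W, D divisible by the patch size.
        x = self.proj(volume)                  # (B, C, H/16, W/16, D/8)
        grid = x.shape[2:]                     # token grid, kept for 3D pos. embeddings
        tokens = x.flatten(2).transpose(1, 2)  # (B, N_tokens, C)
        return tokens, grid

tokenizer = VolumetricTokenizer()
ct = torch.randn(2, 1, 256, 256, 64)           # two synthetic CT crops
tokens, grid = tokenizer(ct)
print(tokens.shape, grid)                      # tokens: (2, 2048, 768); grid: 16x16x8
```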

2. Pretraining Objectives: Self-Supervised and Cross-Modal Losses

SPECTRE pretraining proceeds in two stages:

Self-supervised pretraining (stage 1):

  • DINO-style teacher–student distillation learns invariances across random volumetric crops.
  • iBOT masked patch distillation reconstructs masked 3D patches from visible neighbors, forcing the encoder to learn fine anatomical and texture features.
  • The KoLeo regularizer promotes diversity among learned features.
  • The aggregated SSL loss:

$$\mathcal{L}_{\rm SSL} = \mathcal{L}_{\rm DINO} + \mathcal{L}_{\rm iBOT} + 0.1\,\mathcal{L}_{\rm KoLeo}$$
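
A schematic combination of the stage-1 terms, assuming precomputed DINO and iBOT scalar losses and a DINOv2-style Kozachenko-Leonenko (KoLeo) estimator; everything here is a simplified stand-in, not the authors' code.

```python
import torch
import torch.nn.functional as F

def koleo_regularizer(features, eps=1e-8):
    """KoLeo regularizer: encourages features to spread out by penalizing
    small nearest-neighbor distances (as in DINOv2)."""
    z = F.normalize(features, dim=-1)              # (N, C), unit sphere
    dist = torch.cdist(z, z)                       # pairwise L2 distances
    dist.fill_diagonal_(float("inf"))              # ignore self-distance
    nn_dist = dist.min(dim=1).values               # distance to nearest neighbor
    return -torch.log(nn_dist + eps).mean()

def spectre_ssl_loss(dino_loss, ibot_loss, student_features, koleo_weight=0.1):
    """Aggregate stage-1 loss: L_DINO + L_iBOT + 0.1 * L_KoLeo.

    dino_loss and ibot_loss are assumed to be precomputed scalars from the
    teacher-student distillation heads (omitted here for brevity).
    """
    return dino_loss + ibot_loss + koleo_weight * koleo_regularizer(student_features)

# Toy usage with stand-in scalar losses and random student features:
loss = spectre_ssl_loss(torch.tensor(2.3), torch.tensor(1.7),
                        torch.randn(32, 768))
print(loss)
```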

Cross-modal vision-language pretraining (stage 2):

  • SigLIP contrastive losses align volumetric CT representations with paired radiology report embeddings.
  • Texts are processed into LLM-paraphrased, multi-section reports, with both local and global image embeddings aligned to their corresponding linguistic counterparts.
  • The SigLIP loss is symmetric and relies on scaled cosine similarities:

$$\mathcal{L}_{\rm SigLIP} = \frac{1}{2}\left(\mathcal{L}_{v \to t} + \mathcal{L}_{t \to v}\right)$$

where $\mathcal{L}_{v \to t}$ is the negative log-likelihood over image–text pairs (Claessens et al., 21 Nov 2025).
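
A minimal sketch of the symmetric loss as written above, read in the CLIP-style softmax form; note that the SigLIP paper itself replaces the softmax with a pairwise sigmoid. The temperature value and shapes are illustrative.

```python
import torch
import torch.nn.functional as F

def symmetric_contrastive_loss(img_emb, txt_emb, temperature=0.07):
    """Symmetric image-text contrastive loss over scaled cosine similarities:
    L = 0.5 * (L_v->t + L_t->v), with matched pairs on the diagonal."""
    v = F.normalize(img_emb, dim=-1)          # (B, C) volume embeddings
    t = F.normalize(txt_emb, dim=-1)          # (B, C) report embeddings
    logits = v @ t.T / temperature            # scaled cosine similarities
    targets = torch.arange(v.size(0), device=v.device)
    loss_v2t = F.cross_entropy(logits, targets)    # image -> text NLL
    loss_t2v = F.cross_entropy(logits.T, targets)  # text -> image NLL
    return 0.5 * (loss_v2t + loss_t2v)

loss = symmetric_contrastive_loss(torch.randn(8, 768), torch.randn(8, 768))
print(loss)
```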

This staged objective ensures that volumetric encoders extract features that are both geometrically consistent (via SSL) and semantically meaningful (via cross-modal vision-language alignment).

3. Cross-Modal Alignment and Multi-Granularity Fusion Strategies

A core principle in SPECTRE-style frameworks is the explicit alignment of image subregions (patches or slices) to granular units of clinical semantics—typically sentences, words, or report sections. Methods such as Similarity-driven Cross-Granularity Pre-training (SimCroP) extend standard global alignment to finer levels:

  • Each report sentence is encoded via a text encoder; each CT patch is encoded via a vision encoder.
  • Cosine similarity between sentence and patch embeddings identifies subregions most relevant to each semantic unit.
  • The top-K matching patches are aggregated, and alignment is enforced with symmetric InfoNCE (contrastive) loss.
  • Cross-granularity fusion combines global (instance) representations with word–patch level features using cross-attention, ensuring that both coarse and fine semantic cues are integrated for downstream decoding (Wang et al., 10 Sep 2025).

This multi-granularity fusion is crucial for spatially sparse clinical findings and improves both organ-level and focal-lesion tasks.
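
A sketch of the SimCroP-style sentence-patch matching step under assumed shapes; the value of `k`, the mean pooling, and the temperature are illustrative choices, not the published hyperparameters.

```python
import torch
import torch.nn.functional as F

def sentence_patch_alignment(patch_emb, sent_emb, k=8, temperature=0.07):
    """For each sentence, find its top-k most similar CT patches by cosine
    similarity, mean-pool them into a sentence-specific visual embedding,
    and apply a symmetric InfoNCE loss between sentences and pooled patches.
    """
    p = F.normalize(patch_emb, dim=-1)        # (N_patches, C)
    s = F.normalize(sent_emb, dim=-1)         # (N_sents, C)
    sim = s @ p.T                             # (N_sents, N_patches) similarities
    topk = sim.topk(k, dim=1).indices         # indices of best-matching patches
    pooled = F.normalize(p[topk].mean(dim=1), dim=-1)   # (N_sents, C)

    logits = s @ pooled.T / temperature       # sentence vs. pooled-patch scores
    targets = torch.arange(s.size(0), device=s.device)
    return 0.5 * (F.cross_entropy(logits, targets) +
                  F.cross_entropy(logits.T, targets))

loss = sentence_patch_alignment(torch.randn(2048, 512), torch.randn(6, 512))
print(loss)
```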

4. Large-Scale Data Sources, Augmentation, and Computational Scalability

SPECTRE pipelines leverage massive public CT datasets, encompassing various anatomical regions (e.g., chest, abdomen) and clinical contexts. Key preprocessing includes resampling to standard voxel grids, Hounsfield unit clipping, intensity normalization, and on-the-fly 3D cropping. Data augmentations such as random windowing, flipping, smoothing, noise, and report paraphrasing (via LLMs with LoRA adapters) are applied to both images and text.
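
A minimal sketch of this preprocessing chain using SciPy; the target spacing and HU window below are placeholder values, not the paper's settings.

```python
import numpy as np
from scipy.ndimage import zoom

def preprocess_ct(volume_hu, spacing, target_spacing=(2.0, 1.0, 1.0),
                  hu_window=(-1000.0, 1000.0)):
    """Illustrative CT preprocessing: resample to a standard voxel grid,
    clip Hounsfield units, and normalize intensities to [0, 1].

    volume_hu is ordered (slices, rows, cols); spacing gives the voxel size
    in mm along the same axes.
    """
    # Resample to the target voxel spacing (order=1: trilinear interpolation).
    factors = [s / t for s, t in zip(spacing, target_spacing)]
    volume = zoom(volume_hu.astype(np.float32), factors, order=1)

    # Clip to the HU window and rescale to [0, 1].
    lo, hi = hu_window
    volume = np.clip(volume, lo, hi)
    return (volume - lo) / (hi - lo)

vol = np.random.randint(-1024, 2000, size=(64, 256, 256)).astype(np.float32)
out = preprocess_ct(vol, spacing=(2.5, 0.8, 0.8))
print(out.shape, out.min(), out.max())
```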

Computational scalability is ensured via:

  • Windowed local attention (linear in number of windows).
  • Global attention on compact scan summaries (quadratic only in number of windows).
  • Mixed-precision training, memory adapters, and direct storage integration for large-batch learning (batch sizes up to 1536 for SSL) (Claessens et al., 21 Nov 2025).
  • All experiments are conducted on multi-node high-memory GPU clusters.
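
As a back-of-the-envelope illustration (our arithmetic, with assumed window sizes, not figures from the paper): a $256\times256\times64$ crop tokenized into $16\times16\times8$ patches yields $N = 2048$ tokens. Partitioning these into $W = 32$ windows of $n = 64$ tokens costs on the order of $W n^2 = 32 \cdot 64^2 \approx 1.3\times10^5$ attention operations locally plus $W^2 \approx 10^3$ globally, versus $N^2 \approx 4.2\times10^6$ for full attention, a roughly 30-fold reduction.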

5. Downstream Evaluation Benchmarks and Performance

SPECTRE-style models are evaluated across multiple CT benchmarks:

  • Zero-shot cancer biomarker classification: Models trained without explicit target supervision enable direct kNN or linear-probe evaluation on datasets such as LUNA16, DLCS, NSCLC-Radiomics, C4KC-KiTS, and Colorectal-Liver-Metastases (a minimal probe sketch follows this list). SPECTRE achieves mean AUC gains of 2–5 points versus prior foundation models (Claessens et al., 21 Nov 2025).
  • Volumetric semantic segmentation: Encoder-only Mask Transformer (SEoMT) architectures finetuned with SPECTRE backbones achieve state-of-the-art Dice on AMOS-CT, KiTS23, LiTS, and WORD.
  • Zero-shot text-to-image retrieval: The vision-language alignment of features enables natural-language search over large CT image cohorts, with SPECTRE improving Recall@5 by over 5× relative to previous CT-CLIP-style models.
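
A minimal frozen-feature kNN probe using scikit-learn, assuming embeddings have already been extracted by the pretrained encoder; the value of `k` and the cosine metric are illustrative.

```python
import numpy as np
from sklearn.neighbors import KNeighborsClassifier
from sklearn.metrics import roc_auc_score

def knn_probe_auc(train_feats, train_labels, test_feats, test_labels, k=20):
    """Frozen-backbone kNN probe: fit a cosine-distance kNN on precomputed
    CT embeddings and report AUC. No encoder weights are updated."""
    clf = KNeighborsClassifier(n_neighbors=k, metric="cosine")
    clf.fit(train_feats, train_labels)
    scores = clf.predict_proba(test_feats)[:, 1]   # positive-class probability
    return roc_auc_score(test_labels, scores)

# Toy usage with random stand-in embeddings (real features would come from
# the frozen SPECTRE encoder):
rng = np.random.default_rng(0)
auc = knn_probe_auc(rng.normal(size=(200, 768)), rng.integers(0, 2, 200),
                    rng.normal(size=(50, 768)), rng.integers(0, 2, 50))
print(f"AUC: {auc:.3f}")
```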

Empirical results consistently demonstrate that self-supervised and cross-modal objectives yield representations with superior generalization across multi-organ, multi-phase, and multi-granularity tasks.

6. Design Insights, Ablations, and Comparative Lessons

Analyses and ablation studies across SimCroP, SPECTRE, and related frameworks (e.g., DAE, MEDFORM, MAE+BiXLSTM) yield concrete design recommendations:

  • Multi-modal masked modeling (e.g., MAE/iBOT or MIM/MLM) consistently improves representation quality, especially in low-label regimes; a minimal masking sketch follows this list (Wang et al., 10 Sep 2025, Mazher et al., 29 Aug 2025).
  • Explicit sentence–patch (or higher-level) alignment mitigates mode collapse and improves localization of clinically significant structures.
  • Cross-granularity fusion is critical for robust transfer from organ-level (coarse) to nodule or vascular (fine) segmentation and detection.
  • Ablations confirm SSL is necessary for stable few-shot learning; omitting cross-modal alignment degrades downstream AUC by 5–7 points (Jung et al., 22 Jan 2025).
  • Cross-modal contrastive learning (e.g., SigLIP, InfoNCE, CMCL) enhances generalization and transfer by grounding image features in clinical semantics—modality-specific as in PET/CT or multi-modal as in CT/MRI (Valanarasu et al., 2023, Mazher et al., 29 Aug 2025).
  • Local masking and low-level augmentations (as in DAE) further improve anatomical detail recovery and stability after finetuning (Valanarasu et al., 2023).
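
A minimal sketch of the random token masking underlying MAE/iBOT-style objectives; the masking ratio and the zero mask token are illustrative (in practice the mask token is learned).

```python
import torch

def random_patch_mask(tokens, mask_ratio=0.4):
    """Randomly replace a fraction of patch tokens with a mask token.

    A masked-patch objective would then ask the student to predict teacher
    features (iBOT) or voxel values (MAE) at the masked positions.
    """
    B, N, C = tokens.shape
    mask = torch.rand(B, N, device=tokens.device) < mask_ratio  # True = masked
    mask_token = torch.zeros(C, device=tokens.device)           # learnable in practice
    masked = torch.where(mask.unsqueeze(-1), mask_token, tokens)
    return masked, mask

tokens = torch.randn(2, 2048, 768)
masked, mask = random_patch_mask(tokens)
print(masked.shape, mask.float().mean().item())   # ~0.4 of tokens masked
```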

A plausible implication is that comprehensive CT representation pipelines should integrate both SSL and multi-level cross-modal objectives, tuned masking/aggregation strategies, and domain-adaptive architectural components.

7. Practical Implications and Future Directions

SPECTRE sets a precedent for fully open, transformer-based 3D CT foundation models that dispense with proprietary data and manual labeling bottlenecks. Feature backbones pretrained with these objectives serve as drop-in encoders for downstream classification, detection, segmentation, or retrieval pipelines across radiology and oncology. Future research avenues include scaling foundation models to billion-parameter regimes, integrating more structured EHR supervision, domain adaptation across modalities such as MRI and PET, and end-to-end pipelines combining CT feature extraction with LLMs for clinical reporting and decision support.

References: (Claessens et al., 21 Nov 2025, Wang et al., 10 Sep 2025, Jung et al., 22 Jan 2025, Mazher et al., 29 Aug 2025, Valanarasu et al., 2023)
