Task-Agnostic CT Pretraining (TAP-CT)
- The paper introduces a task-agnostic pretraining paradigm that leverages large-scale CT and video datasets to build robust CT foundation models without disease-specific biases.
- It employs architectural innovations like 3D vision transformers and hybrid CNNs to capture volumetric features and support multi-task radiological analysis.
- The approach demonstrates state-of-the-art zero-shot and fine-tuned performance on segmentation, classification, and retrieval, setting new baselines for radiological AI.
Task-Agnostic Pretraining of CT Foundation Models (TAP-CT) refers to a suite of methodologies for developing large-scale, general-purpose computed tomography (CT) models by leveraging self-supervised or cross-domain objectives that do not encode explicit task- or disease-specific biases at pretraining. TAP-CT frameworks have demonstrated robust zero-shot and fine-tuned performance across diverse CT analysis problems, including segmentation, classification, retrieval, and multimodal integration, driven by the scale and heterogeneity of unlabeled medical imaging corpora, and by architectural advances in 3D vision transformers and convolutional neural networks. This paradigm supports highly transferable, frozen CT representations that minimize the need for extensive downstream fine-tuning or retraining, establishing new baselines for low-label and multi-task radiological AI development (Veenboer et al., 30 Nov 2025).
1. Foundations and Motivation
Task-agnostic pretraining of CT foundation models arose in response to several limitations of conventional supervised pretraining in medical imaging. First, CT volumes are resource-intensive to label, and existing medical CT benchmarks rarely exceed tens of thousands of annotated studies, which is insufficient for robust 3D representation learning at the parameter scales of modern neural networks. Second, supervised pretraining on in-domain CT (e.g., via lesion or organ segmentation tasks) may overfit to the distribution of specific annotation protocols or clinical endpoints, hindering generalization to new pathology or to different imaging tasks (Ke et al., 2023, Veenboer et al., 30 Nov 2025).
TAP-CT strategies relinquish task-specific supervision during pretraining, instead adopting self-supervised/contrastive objectives or pretraining on large, unrelated source data (e.g., natural scene video). For instance, Kinetics-400 video pretraining exposes 3D models to generic spatiotemporal cues, akin to radiologists scrolling through slice stacks, enabling feature hierarchies that transfer into pulmonary embolism (PE) and nodule detection when subsequently fine-tuned on chest CT (Ke et al., 2023). Similarly, large-scale CT collections can be exploited using intra-volume or cross-modal contrastive learning, producing label-agnostic embeddings (Pai et al., 15 Jan 2025, Claessens et al., 21 Nov 2025, Veenboer et al., 30 Nov 2025).
2. Architectural Adaptations and Pretraining Objectives
TAP-CT systems span diverse architectures, unified by targeted adaptations for volumetric data and the demands of self-supervised learning.
- 3D ViT Adaptations: TAP-CT extends 2D Vision Transformer (ViT) architectures to the volumetric domain by dividing inputs into non-overlapping 3D patches, applying a depth-aware variant of positional encoding (e.g., trilinear interpolation of a 2D grid), and leveraging both global/local crops and cuboid masking for augmentation (see the sketch after this list) (Veenboer et al., 30 Nov 2025). Token count and computational costs are mitigated through hierarchical attention (local/global windows) and anisotropic patching (Claessens et al., 21 Nov 2025).
- Convolutional and Hybrid Backbones: CNN-based architectures such as 3D ResNets, SegResNets, or specialized modules (e.g., R(2+1)D, PENet) have been equipped with self-supervised pretraining heads, multi-instance learning (MIL) aggregators, or sequence-processing blocks for CT data (Ke et al., 2023, Jung et al., 22 Jan 2025).
- Projection and Alignment Heads: Foundation models employ projection MLPs for mapping features into contrastive embedding spaces, as well as dedicated heads for downstream classification or segmentation, with parameter initialization and training schedules designed to separate representation learning from task-specific adaptation (Jung et al., 22 Jan 2025, Pai et al., 15 Jan 2025, Claessens et al., 21 Nov 2025).
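The 3D patchification and depth-aware positional encoding referenced in the list above can be sketched as follows. This is a minimal PyTorch illustration assuming a ViT-style backbone; the names (`PatchEmbed3D`, `interpolate_pos_embed_3d`), patch sizes, and embedding dimension are placeholders rather than any released implementation.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class PatchEmbed3D(nn.Module):
    """Split a CT volume into non-overlapping 3D patches and project to tokens.

    Illustrative sketch: a strided Conv3d is the standard way to implement
    non-overlapping cuboid patchification for a volumetric ViT.
    """
    def __init__(self, patch_size=(4, 16, 16), in_chans=1, embed_dim=768):
        super().__init__()
        self.proj = nn.Conv3d(in_chans, embed_dim,
                              kernel_size=patch_size, stride=patch_size)

    def forward(self, x):                      # x: (B, 1, D, H, W)
        x = self.proj(x)                       # (B, C, D', H', W')
        grid = x.shape[2:]                     # token grid (D', H', W')
        x = x.flatten(2).transpose(1, 2)       # (B, N, C) with N = D'*H'*W'
        return x, grid

def interpolate_pos_embed_3d(pos_embed_2d, grid_3d):
    """One way to realize a depth-aware positional encoding: lift a learned
    2D positional grid to the 3D token grid by trilinear interpolation.

    pos_embed_2d: (1, Hp*Wp, C) learned embeddings on a square (Hp, Wp) grid.
    grid_3d:      target (D', H', W') token grid of the volume.
    """
    n, c = pos_embed_2d.shape[1], pos_embed_2d.shape[2]
    hp = wp = int(n ** 0.5)
    # Reshape to a (1, C, 1, Hp, Wp) "volume" with a singleton depth axis,
    # then trilinearly resize to the target 3D grid.
    pe = pos_embed_2d.reshape(1, hp, wp, c).permute(0, 3, 1, 2).unsqueeze(2)
    pe = F.interpolate(pe, size=grid_3d, mode="trilinear", align_corners=False)
    return pe.flatten(2).transpose(1, 2)       # (1, D'*H'*W', C)
```

For example, a (96, 256, 256) crop with (4, 16, 16) patches yields a (24, 16, 16) token grid, i.e., 6,144 tokens per volume.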
Pretraining objectives in TAP-CT are unified by being label-agnostic:
| Method | Core Objective | Loss Function(s) |
|---|---|---|
| TAP-CT (ViT+SSL) | Self-distillation (DINO), masked token rec. | Teacher-student cross-entropy (DINO); masked-token prediction |
| MEDFORM | SimCLR slice SSL, cross-modal contrastive | NT-Xent; cross-modal InfoNCE |
| CT-FM | Intra-sample SimCLR (patch contrastive) | NT-Xent (InfoNCE) |
| SPECTRE | DINO-style SSL (local), SigLIP VLA (global) | DINO cross-entropy; SigLIP sigmoid contrastive |
Losses are often derived from cross-entropy over temperature-scaled similarities (NT-Xent/InfoNCE) or KL divergence between teacher and student outputs across augmented views. Architecture-specific regularization (e.g., KoLeo for prototype diversity) is optionally applied (Veenboer et al., 30 Nov 2025, Claessens et al., 21 Nov 2025, Jung et al., 22 Jan 2025, Pai et al., 15 Jan 2025).
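As a concrete reference point for these objectives, the NT-Xent/InfoNCE loss reduces to a temperature-scaled cross-entropy over pairwise cosine similarities. The sketch below is a generic PyTorch formulation assuming two augmented views per sample, not the exact loss of any cited framework.

```python
import torch
import torch.nn.functional as F

def nt_xent_loss(z1, z2, temperature=0.1):
    """NT-Xent / InfoNCE loss over a batch of paired views.

    z1, z2: (B, D) projection-head outputs for two augmentations of the
    same B samples (e.g., two crops of the same CT volume or patch).
    """
    z1, z2 = F.normalize(z1, dim=1), F.normalize(z2, dim=1)
    z = torch.cat([z1, z2], dim=0)                      # (2B, D)
    sim = z @ z.t() / temperature                       # scaled cosine similarities
    sim.fill_diagonal_(float("-inf"))                   # a sample cannot match itself
    b = z1.size(0)
    # The positive for index i is its counterpart in the other view.
    targets = torch.cat([torch.arange(b, 2 * b), torch.arange(0, b)]).to(z.device)
    return F.cross_entropy(sim, targets)
```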
3. Pretraining Data and Augmentation Pipelines
TAP-CT methodologies scale along two main data axes: (a) massive unlabeled CT corpora, and/or (b) large, out-of-domain video/image datasets.
- CT Volumes: Models have been pretrained on up to 148,000 3D CT volumes spanning thousands of patients and dozens of anatomical and disease cohorts (e.g., NLST, LIDC-IDRI, TCGA, AMOS, INSPECT, FDG-PET-CT), with volumes preprocessed by intensity clamping (e.g., [–1000, +1000] HU), resampling to isotropic/anisotropic voxel spacing, and normalization (Pai et al., 15 Jan 2025, Veenboer et al., 30 Nov 2025, Claessens et al., 21 Nov 2025). Patch extraction and multi-scale cropping/augmentation (resize, random crop, flips, blur, intensity jitter) are universally adopted.
- Video Corpora and Multimodal Data: Kinetics-400 (306,245 human-action video clips) is used for pretraining 3D models, leveraging the analogy between temporal structure in video and axial progression in CT (Ke et al., 2023). In multimodal frameworks (e.g., MEDFORM, SPECTRE), radiology reports or clinical numeric data are processed via LLM-based or MLP encoders, with cross-modal contrastive pairing against CT embeddings (Jung et al., 22 Jan 2025, Claessens et al., 21 Nov 2025).
- Augmentation Strategies: Photometric and volumetric augmentations are critical for invariance—masking ratios, gamma augmentation, and blur are tuned. DINO/DINOv2-style pipelines apply global and local views, masking up to 50% of patches, and GPU-accelerated pipelines ensure scalability (Veenboer et al., 30 Nov 2025, Claessens et al., 21 Nov 2025).
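The preprocessing and multi-view augmentation steps above can be illustrated with a short, plain-PyTorch sketch; the HU window, crop sizes, and jitter ranges are illustrative defaults, not the exact pipelines of the cited works.

```python
import torch

def preprocess_ct(volume_hu, clip=(-1000.0, 1000.0)):
    """Clamp a CT volume to a Hounsfield-unit window and rescale to [0, 1]."""
    v = torch.clamp(volume_hu, clip[0], clip[1])
    return (v - clip[0]) / (clip[1] - clip[0])

def random_crop_3d(volume, crop=(96, 96, 96)):
    """Sample a random cuboid crop from a (D, H, W) volume (assumes volume >= crop)."""
    d, h, w = volume.shape
    cd, ch, cw = crop
    zd = torch.randint(0, d - cd + 1, (1,)).item()
    zh = torch.randint(0, h - ch + 1, (1,)).item()
    zw = torch.randint(0, w - cw + 1, (1,)).item()
    return volume[zd:zd + cd, zh:zh + ch, zw:zw + cw]

def two_views(volume_hu, global_crop=(96, 160, 160), local_crop=(48, 96, 96)):
    """Produce one global and one local view with simple photometric jitter."""
    v = preprocess_ct(volume_hu)
    g_view = random_crop_3d(v, global_crop)
    l_view = random_crop_3d(v, local_crop)
    # Random axial flip and mild intensity jitter as lightweight invariances.
    if torch.rand(1) < 0.5:
        g_view = torch.flip(g_view, dims=[-1])
    l_view = torch.clamp(l_view * (0.9 + 0.2 * torch.rand(1)), 0.0, 1.0)
    return g_view, l_view
```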
4. Downstream Transfer and Evaluation
TAP-CT establishes a benchmark for the transferability and universality of learned CT representations. Models are evaluated via frozen-feature linear probing or minimal adaptation across multiple tasks (a linear-probe sketch follows the results below).
- Segmentation: Linear heads trained on top of frozen 3D ViT features achieve macro-averaged Dice scores up to 0.724 (AMOS22, (Z,512,512) resolution), exceeding prior baselines (e.g., Curia) by +9.3–23 pp. on the same protocol (Veenboer et al., 30 Nov 2025). Whole-body segmentation with CT-FM achieves mean Dice up to 0.9058 after fine-tuning, with >90% of anatomical regions improved compared to supervised pretraining (Pai et al., 15 Jan 2025).
- Classification: CT-FM and TAP-CT report AUCs of 0.876 (LUNA16) and 0.855 (LUNA25), matching or outperforming prior methods for nodule malignancy, trauma detection, and PET/CT tumor classification (Veenboer et al., 30 Nov 2025, Pai et al., 15 Jan 2025).
- Retrieval and Zero-shot: SPECTRE, using SigLIP for vision-language alignment, offers Recall@5 of 17.5% (vs 2.9% for CT-CLIP) on text-to-CT retrieval (Claessens et al., 21 Nov 2025). CT-FM clusters embeddings anatomically and supports content-based retrieval and patch-level semantic search (Pai et al., 15 Jan 2025).
- Robustness and Few-shot: TAP-CT and MEDFORM report stable features under test-retest evaluation and strong low-shot transfer (e.g., AUROC ≈ 0.659 with k=5 breast T-stage labels) (Jung et al., 22 Jan 2025, Pai et al., 15 Jan 2025).
- Comparison to 2D/2.5D: 3D TAP-CT models not only outperform randomly initialized 3D baselines but also surpass strong 2D slice-based models (e.g., Swin-T outperforms LRCN by +0.118 AUROC on RSNA PE detection with the full training set) (Ke et al., 2023, Veenboer et al., 30 Nov 2025).
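The frozen-feature linear probing referenced at the start of this section amounts to caching pooled encoder features and fitting only a linear head. A minimal sketch, assuming a pretrained encoder that maps a CT crop to a feature vector; the encoder interface, class count, and hyperparameters are placeholders.

```python
import torch
import torch.nn as nn

@torch.no_grad()
def extract_features(encoder, loader, device="cuda"):
    """Run a frozen encoder over a dataloader and cache pooled features."""
    encoder.eval()
    feats, labels = [], []
    for x, y in loader:                      # x: (B, 1, D, H, W)
        feats.append(encoder(x.to(device)).cpu())
        labels.append(y)
    return torch.cat(feats), torch.cat(labels)

def linear_probe(features, labels, num_classes, epochs=100, lr=1e-3):
    """Train only a linear head on top of cached, frozen features (full batch)."""
    head = nn.Linear(features.size(1), num_classes)
    opt = torch.optim.AdamW(head.parameters(), lr=lr, weight_decay=1e-4)
    loss_fn = nn.CrossEntropyLoss()
    for _ in range(epochs):
        opt.zero_grad()
        loss = loss_fn(head(features), labels)
        loss.backward()
        opt.step()
    return head
```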
5. Implementation Details and Model Release
Implementations adapt DINOv2-style dual-view training for 3D, with volumetric patching, trilinear positional encodings, heavy augmentation, and 3D masking. Hyperparameters include batch sizes up to 2048 (H100 nodes), cosine decayed learning rates, AdamW optimizers, variable weight decay, drop-path, and layerwise learning rate decay (Veenboer et al., 30 Nov 2025). Training regimes often exceed 3,000 GPU hours. Model checkpoints, configuration files, feature extractors, and evaluation scripts have been open-sourced for full reproducibility and benchmarking (see https://huggingface.co/fomofo/tap-ct-b-3d) (Veenboer et al., 30 Nov 2025).
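Layerwise learning-rate decay, as mentioned above, is commonly implemented by giving each transformer block its own parameter group with a geometrically decayed learning rate. The sketch below follows that common recipe with illustrative attribute names (`model.blocks`, `model.head`), not the released training code.

```python
import torch

def layerwise_lr_groups(blocks, head_params, base_lr=1e-3, decay=0.75,
                        weight_decay=0.05):
    """Build AdamW parameter groups where earlier blocks get smaller LRs.

    blocks: list of transformer blocks ordered from input to output.
    """
    groups = [{"params": head_params, "lr": base_lr, "weight_decay": weight_decay}]
    n = len(blocks)
    for i, block in enumerate(blocks):
        scale = decay ** (n - i)             # earlier blocks get the smallest LRs
        groups.append({"params": block.parameters(),
                       "lr": base_lr * scale,
                       "weight_decay": weight_decay})
    return groups

# Usage sketch: cosine-decayed AdamW over the grouped parameters.
# optimizer = torch.optim.AdamW(layerwise_lr_groups(model.blocks,
#                                                   model.head.parameters()))
# scheduler = torch.optim.lr_scheduler.CosineAnnealingLR(optimizer, T_max=num_steps)
```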
6. Impact, Comparative Analysis, and Future Directions
TAP-CT frameworks have established a paradigm shift in CT model development, advocating for pretraining on very large, often out-of-domain or cross-modal datasets to build feature hierarchies decoupled from any explicit diagnostic endpoint. Empirical results confirm that out-of-domain video pretraining, unsupervised CT contrastive learning, and cross-modal alignment not only avoid overfitting to small, labeled CT datasets, but also surpass both scratch and in-domain supervised pretraining—even as fine-tuning set size increases (Ke et al., 2023, Claessens et al., 21 Nov 2025, Veenboer et al., 30 Nov 2025).
Comparative ablation confirms the efficacy of intra-sample contrastive learning, hierarchical attention for scaling transformers, and MIL or ABMIL heads for volumetric aggregation. TAP-CT approaches demonstrate robust anatomy-aware clustering, semantic interpretability, and genericity across distinct pathologies and endpoints (Pai et al., 15 Jan 2025, Claessens et al., 21 Nov 2025, Veenboer et al., 30 Nov 2025). A plausible implication is a reduced reliance on expert annotation and an improved baseline for low-resource settings.
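For the ABMIL heads mentioned above, the standard gated-attention formulation pools a bag of patch or slice embeddings with learned attention weights before a bag-level prediction. The sketch below follows that generic formulation with placeholder dimensions, rather than any cited model's exact head.

```python
import torch
import torch.nn as nn

class ABMILHead(nn.Module):
    """Gated attention-based MIL pooling over a bag of patch/slice embeddings."""
    def __init__(self, in_dim=768, attn_dim=128, num_classes=2):
        super().__init__()
        self.attn_V = nn.Linear(in_dim, attn_dim)
        self.attn_U = nn.Linear(in_dim, attn_dim)
        self.attn_w = nn.Linear(attn_dim, 1)
        self.classifier = nn.Linear(in_dim, num_classes)

    def forward(self, h):                     # h: (B, N, D) bag of N embeddings
        # Gated attention scores: w^T (tanh(V h) * sigmoid(U h))
        a = self.attn_w(torch.tanh(self.attn_V(h)) * torch.sigmoid(self.attn_U(h)))
        a = torch.softmax(a, dim=1)           # (B, N, 1), sums to 1 over the bag
        pooled = (a * h).sum(dim=1)           # (B, D) attention-weighted pooling
        return self.classifier(pooled), a    # logits and per-instance attention
```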
Prospective research will likely extend these methods to richer multimodal contexts (e.g., fusing imaging with genomics or longitudinal EHR), refine pretraining losses and augmentations for atypical scanning protocols, and further optimize computational efficiency for deployment on clinical infrastructure.
Selected References
- “Video Pretraining Advances 3D Deep Learning on Chest CT Tasks” (Ke et al., 2023)
- “MEDFORM: A Foundation Model for Contrastive Learning of CT Imaging and Clinical Numeric Data in Multi-Cancer Analysis” (Jung et al., 22 Jan 2025)
- “Vision Foundation Models for Computed Tomography” (Pai et al., 15 Jan 2025)
- “Scaling Self-Supervised and Cross-Modal Pretraining for Volumetric CT Transformers” (Claessens et al., 21 Nov 2025)
- “TAP-CT: 3D Task-Agnostic Pretraining of Computed Tomography Foundation Models” (Veenboer et al., 30 Nov 2025)