RenalCLIP: CT Vision-Language Model for Kidney Cancer
- RenalCLIP is a disease-centric CT and language model that integrates renal-specific imaging and radiology report data to enhance kidney cancer diagnosis.
- It delivers consistent performance across 10 clinical tasks including anatomical assessment, malignancy classification, and survival prediction, outperforming baseline models.
- The model supports zero-shot diagnosis, cross-modal retrieval, and report generation, significantly reducing reliance on extensive annotated datasets.
RenalCLIP is a disease-centric, CT-based vision-language foundation model purpose-built for precision oncology in kidney cancer. It was developed and validated using 27,866 CT scans from 8,809 patients across nine Chinese medical centers and The Cancer Imaging Archive (TCIA), and is designed to support characterization, diagnosis, and prognosis of renal masses through a two-stage pre-training strategy that first enhances image and text encoders with domain-specific knowledge and then aligns them through a contrastive learning objective (Tao et al., 22 Aug 2025). In the reported evaluation, RenalCLIP delivered consistent, externally generalizable performance across 10 core clinical tasks spanning anatomical assessment, diagnostic classification, and survival prediction, and also supported zero-shot diagnosis, cross-modal retrieval, and radiology report generation (Tao et al., 22 Aug 2025).
1. Clinical motivation and intended role
The model addresses a specific problem in urologic oncology: the non-invasive assessment of increasingly incidentally discovered renal masses. Diagnostic uncertainty in this setting frequently leads to overtreatment of benign or indolent tumors. The reported clinical context is that up to 20% of surgically resected renal masses prove benign, with associated unnecessary morbidity, cost, and loss of renal function (Tao et al., 22 Aug 2025). Biopsy is described as invasive, often non-diagnostic, and vulnerable to missing biologic aggressiveness due to intratumoral heterogeneity (Tao et al., 22 Aug 2025).
RenalCLIP is framed as a response to the limitations of general-purpose CT foundation models, which are described as lacking the fine-grained, renal oncology-specific knowledge needed to reliably distinguish indolent from aggressive tumors (Tao et al., 22 Aug 2025). Its disease-centric formulation is based on injecting renal-specific semantics from radiology reports into the image encoder and aligning CT images with clinical text in a shared embedding space (Tao et al., 22 Aug 2025). This suggests that the model is not intended as a generic radiology representation learner, but as a kidney cancer–specific system optimized for semantically nuanced discrimination across the preoperative workflow.
Its intended clinical impact spans three linked functions. First, it targets automated anatomical assessment through the R.E.N.A.L. nephrometry score to standardize surgical planning. Second, it targets diagnostic classification to reduce overtreatment of benign or indolent masses. Third, it targets recurrence-free, disease-specific, and overall survival prediction to inform personalized follow-up and adjuvant therapy decisions (Tao et al., 22 Aug 2025). The paper further emphasizes zero-shot and data-efficient learning as mechanisms to reduce dependence on scarce expert annotations and to streamline clinical integration (Tao et al., 22 Aug 2025).
2. Data resources, cohorts, and annotation pipeline
The total dataset comprises 27,866 preoperative CT scans from 8,809 patients. Data were collected from nine Chinese medical centers plus TCIA (Tao et al., 22 Aug 2025). Pre-training used four centers: Zhongshan Hospital Fudan University, First Affiliated Hospital of Zhejiang University, Qilu Hospital of Shandong University, and Linyi City People’s Hospital. External validation used Xiamen Branch of Zhongshan Hospital, Shandong Cancer Hospital, Ruijin North Hospital, First People’s Hospital of Lianyungang, Zhangye People’s Hospital, and TCIA (Tao et al., 22 Aug 2025).
The inclusion criterion was preoperative multi-phase CT with at least one contrast-enhanced phase. Exclusion criteria were neoadjuvant therapy or embolization, hereditary RCC, and severe CT artifacts (Tao et al., 22 Aug 2025). The pre-training dataset contained 6,867 patients and 21,819 scans, split into 6,367 patients with 20,278 scans for training and 500 patients with 1,541 scans for validation. The downstream fine-tuning and evaluation dataset contained 1,942 patients and 6,047 scans, including an internal cohort of 400 patients with 1,240 scans for training and 100 patients with 316 scans for validation, five external proprietary cohorts, and a TCIA cohort of 425 patients with 1,047 scans (Tao et al., 22 Aug 2025).
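The cohort counts above can be cross-checked with simple arithmetic. The sketch below uses only the figures quoted in the text. The size of the five external proprietary cohorts is not stated directly, so the value derived here by subtraction is an inference that assumes the downstream cohorts are disjoint.

```python
# Cohort sizes as quoted in the text, stored as (patients, scans).
pretrain_train = (6367, 20278)
pretrain_val = (500, 1541)
pretrain_total = (6867, 21819)
internal_train = (400, 1240)
internal_val = (100, 316)
tcia = (425, 1047)
downstream_total = (1942, 6047)
overall_total = (8809, 27866)


def add(*cohorts):
    """Element-wise sum of (patients, scans) tuples."""
    return tuple(sum(values) for values in zip(*cohorts))


def subtract(total, *parts):
    """Remove the listed cohorts from a total, element-wise."""
    used = add(*parts)
    return tuple(t - u for t, u in zip(total, used))


# Derived by subtraction; assumes the downstream cohorts do not overlap.
external_proprietary = subtract(downstream_total, internal_train, internal_val, tcia)

print(add(pretrain_train, pretrain_val) == pretrain_total)
print(add(pretrain_total, downstream_total) == overall_total)
print(external_proprietary)
```

The two equality checks confirm that the reported splits are internally consistent with the headline totals of 8,809 patients and 27,866 scans.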
CT acquisition was handled as a multiphase problem involving non-contrast, arterial, venous, and delayed phases. All phases were co-registered to the arterial-phase grid, or venous if arterial was unavailable, using linear interpolation in 3D Slicer BRAINSResample (Tao et al., 22 Aug 2025). For primary downstream evaluations, single-phase inputs prioritized arterial and then venous phases for fair comparison. In ablation, late fusion averaged logits across available phases at inference after single-phase training (Tao et al., 22 Aug 2025). The paper notes that slice thickness, vendor and reconstruction kernels, and kVp/mAs were not specified, although all volumes were resampled to a common grid (Tao et al., 22 Aug 2025).
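The phase handling described above can be illustrated with a minimal sketch. The function names are hypothetical. Raising an error when neither an arterial nor a venous phase exists is an assumption, since the text does not say what happens in that case.

```python
from typing import Dict, List, Sequence

# Priority order stated in the text: arterial first, then venous.
PHASE_PRIORITY = ("arterial", "venous")


def select_single_phase(available: Sequence[str]) -> str:
    """Pick the single-phase input used for the primary evaluations."""
    for phase in PHASE_PRIORITY:
        if phase in available:
            return phase
    # Assumption: the source does not describe a fallback beyond venous.
    raise ValueError("no arterial or venous phase available")


def late_fusion(phase_logits: Dict[str, List[float]]) -> List[float]:
    """Average per-class logits over whichever phases are available."""
    if not phase_logits:
        raise ValueError("at least one phase is required")
    rows = list(phase_logits.values())
    n_classes = len(rows[0])
    if any(len(row) != n_classes for row in rows):
        raise ValueError("all phases must produce the same number of logits")
    return [sum(row[c] for row in rows) / len(rows) for c in range(n_classes)]


print(select_single_phase(["non-contrast", "venous", "delayed"]))
print(late_fusion({"arterial": [2.0, 0.0], "venous": [0.0, 1.0]}))
```

Because fusion happens only at inference, each phase is scored by the same single-phase model and no fusion parameters are trained.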
The preprocessing pipeline was centered on region-of-interest localization. Initial segmentation used nnU-Net to segment kidneys, cysts, and tumors (Tao et al., 22 Aug 2025). In pre-training, the ROI center was the geometric center of any segmented foreground, or, if no mask was available, a single manual point at the kidney center by a radiologist. In downstream tasks, the ROI center was defined on the axial slice with the largest primary lesion cross-section, with radiologist review of nnU-Net outputs and manual correction when needed (Tao et al., 22 Aug 2025). The cropped 3D volume was centered on the ROI and sized to 140 × 140 × 160 mm, then resampled to 1.0 × 1.0 × 5.0 mm, yielding an input of 140 × 140 × 32 voxels. HU windowing used level 50 HU and width 500 HU, followed by intensity normalization (Tao et al., 22 Aug 2025). Training augmentation included random crop to 128 × 128 × 32, affine transforms, intensity scaling and shifts, gamma contrast, and task-dependent horizontal flipping; horizontal flip was excluded during pre-training to preserve left/right semantics (Tao et al., 22 Aug 2025).
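Two of the numeric steps above can be checked with a short sketch: the voxel grid implied by the crop size and spacing, and the HU window. The source only says that windowing is followed by intensity normalization, so the min-max scaling to [0, 1] used here is an assumption, and the function names are hypothetical.

```python
def crop_shape_voxels(size_mm, spacing_mm):
    """Voxel count per axis for a physical crop at a given spacing."""
    return tuple(round(size / spacing) for size, spacing in zip(size_mm, spacing_mm))


def window_hu(hu, level=50.0, width=500.0):
    """Clip a Hounsfield value to the window, then scale to [0, 1].

    The [0, 1] scaling is an assumed normalization, not confirmed by the source.
    """
    low = level - width / 2.0   # -200 HU for level 50, width 500
    high = level + width / 2.0  # 300 HU for level 50, width 500
    clipped = min(max(hu, low), high)
    return (clipped - low) / (high - low)


print(crop_shape_voxels((140, 140, 160), (1.0, 1.0, 5.0)))
print([window_hu(value) for value in (-1000.0, -200.0, 50.0, 300.0, 1200.0)])
```

With level 50 HU and width 500 HU the window spans -200 to 300 HU, so air and dense bone both saturate while soft-tissue contrast within the kidney is preserved.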
The annotation pipeline combined radiology reports, report-derived structured attributes, pathology, and survival outcomes. Radiology reports were in native Chinese, split into left and right kidney descriptions using an LLM prompt, and then programmatically translated to English. Reports were available for all cohorts except Lianyun and TCIA (Tao et al., 22 Aug 2025). Fourteen key imaging questions with options, including location, size, cystic versus solid, enhancement pattern, necrosis, margins, likely benign or malignant, and invasion, were extracted with GPT-4o-guided parsing into one-hot labels for image encoder pre-training (Tao et al., 22 Aug 2025). Malignancy and aggressiveness labels were defined using pathology and a curated taxonomy that mapped WHO/ISUP grade, invasion, necrosis, sarcomatoid features, and stage; subtypes included ccRCC, papillary RCC, chromophobe, oncocytoma, AML, and cystic entities (Tao et al., 22 Aug 2025). Stage used AJCC 8th edition, grade used WHO/ISUP, and for available slides, three pathologists re-reviewed subtypes and grades using WHO 2022 criteria (Tao et al., 22 Aug 2025). Outcomes were recurrence-free survival, disease-specific survival, and overall survival (Tao et al., 22 Aug 2025). Label quality relied on expert curation and re-review, but formal inter-rater reliability metrics were not reported (Tao et al., 22 Aug 2025).
3. Architecture and training methodology
RenalCLIP couples a 3D image encoder with an LLM-based text encoder (Tao et al., 22 Aug 2025). The image encoder uses a 3D ResNet-18 backbone adapted for volumetric CT, with 35,087,424 parameters in the backbone used for downstream fine-tuning (Tao et al., 22 Aug 2025). Training uses 3D patches of 128 × 128 × 32 voxels, center-cropped at inference. A projection head maps the backbone features to a global one-dimensional embedding aligned with the text feature dimension for contrastive learning (Tao et al., 22 Aug 2025).
The text encoder is based on Llama 3 with 8B parameters, 32 transformer layers, embedding dimension 4096, and hidden dimension 14,336 (Tao et al., 22 Aug 2025). It is adapted through LLM2Vec, which replaces causal attention with a bidirectional structure and returns sentence embeddings by averaging the final hidden states (Tao et al., 22 Aug 2025). During cross-modal contrastive pre-training, the text encoder is kept frozen and no additional projection head is added (Tao et al., 22 Aug 2025).
The pre-training pipeline has two stages. Stage 1 is uni-modal knowledge enhancement. For the image encoder, this is implemented as multi-task supervised learning on the 14 report-derived imaging attributes using cross-entropy loss (Tao et al., 22 Aug 2025). For the text encoder, the process proceeds sequentially in two LoRA-based steps: masked language modeling trained on MIMIC-CXR reports to improve medical language understanding, followed by SimCSE contrastive sentence embedding to refine the closeness of semantically similar radiology text (Tao et al., 22 Aug 2025). Stage 2 performs vision-language alignment with a CLIP-style symmetric InfoNCE loss on image-report pairs (Tao et al., 22 Aug 2025).
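The Stage 1 image objective can be sketched as one cross-entropy term per attribute head. This is a pure-Python illustration, not the authors' implementation. Averaging the 14 per-attribute losses with equal weight is an assumption, since the source does not say how the heads are combined.

```python
import math


def cross_entropy(logits, target):
    """Negative log-softmax probability of the target class."""
    peak = max(logits)
    log_sum_exp = peak + math.log(sum(math.exp(x - peak) for x in logits))
    return log_sum_exp - logits[target]


def multitask_loss(head_logits, targets):
    """Mean cross-entropy over attribute heads (equal weighting is assumed)."""
    if len(head_logits) != len(targets):
        raise ValueError("one target is required per attribute head")
    losses = [cross_entropy(logits, t) for logits, t in zip(head_logits, targets)]
    return sum(losses) / len(losses)


# Toy example: 14 heads with three options each and uninformative logits.
uniform_heads = [[0.0, 0.0, 0.0] for _ in range(14)]
print(multitask_loss(uniform_heads, [0] * 14))
```

Each head may have a different number of options in practice, which this formulation supports because every head is scored independently.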
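The similarity function is cosine similarity between an image embedding $v$ and a text embedding $t$,

$$s(v, t) = \frac{v^\top t}{\lVert v \rVert \, \lVert t \rVert},$$

and the symmetric contrastive objective over $N$ image-text pairs with learnable temperature $\tau$ is, in the standard CLIP form,

$$\mathcal{L} = -\frac{1}{2N} \sum_{i=1}^{N} \left[ \log \frac{\exp\left(s(v_i, t_i)/\tau\right)}{\sum_{j=1}^{N} \exp\left(s(v_i, t_j)/\tau\right)} + \log \frac{\exp\left(s(v_i, t_i)/\tau\right)}{\sum_{j=1}^{N} \exp\left(s(v_j, t_i)/\tau\right)} \right],$$

with losses applied in both image-to-text and text-to-image directions (Tao et al., 22 Aug 2025).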
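A minimal pure-Python sketch of a CLIP-style symmetric InfoNCE loss follows. It is an illustration of the general technique rather than the paper's code. The temperature is fixed at 0.07 purely for demonstration, whereas the text describes it as learnable, and embeddings are assumed to be non-zero vectors.

```python
import math


def cosine(u, v):
    """Cosine similarity of two non-zero vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    return dot / (math.sqrt(sum(a * a for a in u)) * math.sqrt(sum(b * b for b in v)))


def diagonal_cross_entropy(logits):
    """Mean cross-entropy where row i's correct column is i."""
    total = 0.0
    for i, row in enumerate(logits):
        peak = max(row)
        log_sum_exp = peak + math.log(sum(math.exp(x - peak) for x in row))
        total += log_sum_exp - row[i]
    return total / len(logits)


def symmetric_info_nce(image_embs, text_embs, temperature=0.07):
    """Average of the image-to-text and text-to-image contrastive losses."""
    n = len(image_embs)
    logits = [[cosine(image_embs[i], text_embs[j]) / temperature for j in range(n)]
              for i in range(n)]
    transposed = [list(column) for column in zip(*logits)]
    return 0.5 * (diagonal_cross_entropy(logits) + diagonal_cross_entropy(transposed))


aligned = symmetric_info_nce([[1.0, 0.0], [0.0, 1.0]], [[1.0, 0.0], [0.0, 1.0]])
swapped = symmetric_info_nce([[1.0, 0.0], [0.0, 1.0]], [[0.0, 1.0], [1.0, 0.0]])
print(aligned < swapped)
```

Matched pairs sit on the diagonal of the similarity matrix, so the loss falls as each scan becomes more similar to its own report than to the other reports in the batch.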
Training details are reported explicitly, although the learning-rate and AdamW coefficient values are not reproduced here. Image pre-training used AdamW, batch size 300, 200 epochs, a cosine schedule with 10% warmup, and a single A100 80GB GPU, with the best checkpoint selected by macro-AUC over the 14 attributes (Tao et al., 22 Aug 2025). Text pre-training used LLM2Vec with MLM at batch size 32 for 1,000 iterations, then SimCSE at batch size 128 for 1,000 iterations, both with LoRA, bf16, and gradient checkpointing (Tao et al., 22 Aug 2025). Cross-modal pre-training used four A100 80GB GPUs, a per-GPU batch size of 1,024 in bf16, AdamW with separate learning rates for the backbone and the projection head, weight decay 0, a cosine schedule with 10% warmup, 100 epochs, and gradient clipping at norm 0.2, with the best checkpoint chosen by validation retrieval recall (Tao et al., 22 Aug 2025). Downstream fine-tuning used a single A100 80GB GPU for 100 epochs with AdamW and hyperparameter selection by grid search over learning rate and batch size on the internal validation set (Tao et al., 22 Aug 2025).
4. Clinical task formulation
The evaluation protocol spans 10 core tasks across the kidney cancer workflow (Tao et al., 22 Aug 2025). Five characterization tasks correspond to the R.E.N.A.L. nephrometry score components: Radius, Exophytic/endophytic, Nearness to collecting system, Anterior/posterior, and Location relative to polar lines (Tao et al., 22 Aug 2025). Radius is categorized as ≤4 cm, 4–7 cm, or ≥7 cm; Exophytic/endophytic as ≥50% exophytic, <50% exophytic, or 100% endophytic; Nearness as ≥7 mm, 4–7 mm, or ≤4 mm; and Anterior/posterior as anterior, posterior, or neither (Tao et al., 22 Aug 2025).
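As an illustration of how three of these components become discrete classes, the sketch below follows the standard R.E.N.A.L. nephrometry cut-offs (radius at 4 cm and 7 cm, nearness at 7 mm and 4 mm, exophytic fraction at 50%). The function names and the 1 to 3 point encoding are assumptions for illustration and are not taken from the paper's code.

```python
def radius_points(diameter_cm):
    """R component: maximal tumor diameter in centimeters."""
    if diameter_cm <= 4.0:
        return 1
    if diameter_cm < 7.0:
        return 2
    return 3


def exophytic_points(exophytic_fraction):
    """E component: fraction of the tumor lying outside the renal contour."""
    if exophytic_fraction >= 0.5:
        return 1
    if exophytic_fraction > 0.0:
        return 2
    return 3  # entirely endophytic


def nearness_points(distance_mm):
    """N component: distance to the collecting system or sinus in millimeters."""
    if distance_mm >= 7.0:
        return 1
    if distance_mm > 4.0:
        return 2
    return 3


print(radius_points(3.2), exophytic_points(0.6), nearness_points(5.0))
```

Higher points indicate greater anatomical complexity, which is why each component can be treated as a three-class prediction target.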
The diagnostic component contains two binary tasks: malignancy classification, defined as benign versus malignant mass, and aggressiveness classification, defined as indolent versus aggressive using criteria such as invasion, tumor thrombus, high grade, and necrosis (Tao et al., 22 Aug 2025). These tasks operationalize the paper’s central claim that semantic alignment with radiology reports can help distinguish biologically consequential tumor states rather than only histologic subtype labels.
The prognostic component contains three survival tasks: recurrence-free survival, disease-specific survival, and overall survival (Tao et al., 22 Aug 2025). Recurrence-free survival is defined as time from surgery to local or distant recurrence or RCC-related death; disease-specific survival is time from surgery to death from RCC; and overall survival is time from surgery to death from any cause (Tao et al., 22 Aug 2025). The prognostic head is trained using a Cox proportional hazards model on recurrence-free survival only, and the resulting risk score is then applied to all endpoints (Tao et al., 22 Aug 2025). The partial log-likelihood is
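$$\ell(\theta) = \sum_{i=1}^{N} \delta_i \left[ r_i - \log \sum_{j \in \mathcal{R}(t_i)} \exp(r_j) \right],$$

where $\delta_i$ is the event indicator and $\mathcal{R}(t_i)$ is the risk set at event time $t_i$ (Tao et al., 22 Aug 2025). Concordance is measured using the C-index,

$$C = \frac{1}{\lvert \mathcal{P} \rvert} \sum_{(i,j) \in \mathcal{P}} \mathbb{1}\left[ r_i > r_j \right],$$

where $\mathcal{P}$ is the set of comparable pairs, in which patient $i$ has an observed event before time $t_j$, and $r_i$ is the predicted risk score (Tao et al., 22 Aug 2025).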
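Both quantities can be computed directly from follow-up times, event indicators, and predicted risk scores. The sketch below is a plain-Python illustration of the Cox partial log-likelihood and Harrell's C-index, not the paper's implementation. Scoring tied risks as 0.5 is a common convention assumed here, and tied event times are handled without any correction.

```python
import math


def cox_partial_log_likelihood(times, events, risks):
    """Sum over observed events of risk minus log-sum-exp over the risk set."""
    total = 0.0
    for i, (t_i, observed) in enumerate(zip(times, events)):
        if not observed:
            continue
        risk_set = [risks[j] for j, t_j in enumerate(times) if t_j >= t_i]
        peak = max(risk_set)
        total += risks[i] - (peak + math.log(sum(math.exp(r - peak) for r in risk_set)))
    return total


def concordance_index(times, events, risks):
    """Fraction of comparable pairs ranked correctly (tied risks count 0.5)."""
    concordant, comparable = 0.0, 0
    for i in range(len(times)):
        if not events[i]:
            continue  # a pair is comparable only if the earlier time is an event
        for j in range(len(times)):
            if times[i] < times[j]:
                comparable += 1
                if risks[i] > risks[j]:
                    concordant += 1.0
                elif risks[i] == risks[j]:
                    concordant += 0.5
    if comparable == 0:
        raise ValueError("no comparable pairs")
    return concordant / comparable


times, events = [1.0, 2.0, 3.0], [1, 1, 0]
print(concordance_index(times, events, [3.0, 2.0, 1.0]))
print(cox_partial_log_likelihood(times, events, [0.0, 0.0, 0.0]))
```

A C-index of 0.5 corresponds to random ranking and 1.0 to perfect ranking, which is the scale on which the reported prognostic results should be read.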
This task design places RenalCLIP across characterization, diagnosis, and prognosis rather than limiting it to a single endpoint. A plausible implication is that the shared representation is intended to encode both anatomical and semantic features sufficiently broadly to support multiple downstream decision points from the same preoperative CT input.
5. Performance, external generalization, and comparative evaluation
Across characterization tasks, RenalCLIP achieved the best macro-averaged AUCs on internal validation for Radius (0.908), Exophytic/endophytic (0.646), Anterior/posterior (0.857), and Location relative to polar lines (0.747), and was competitive on Nearness (0.713 versus CNN 0.723) (Tao et al., 22 Aug 2025). On the combined external cohorts, it was best on Radius (0.902), Exophytic/endophytic (0.610), Nearness (0.715), and Location (0.727), and competitive on Anterior/posterior (0.754 versus CT-FM 0.757) (Tao et al., 22 Aug 2025). On TCIA, it was described as consistently superior or highly competitive, with Radius AUC 0.920 reported as an example (Tao et al., 22 Aug 2025).
For malignancy diagnosis on the combined external cohorts, RenalCLIP achieved AUC 0.841, PR AUC 0.941, sensitivity 0.827, specificity 0.735, and F1 0.876 (Tao et al., 22 Aug 2025). The margin over the strongest baselines increased under external validation, with the paper reporting +13.8% AUC relative to CT-FM (0.841 versus 0.739) and +17.3% relative to CNN (0.841 versus 0.717) (Tao et al., 22 Aug 2025). On TCIA malignancy classification, RenalCLIP obtained AUC 0.680, outperforming CT-FM at 0.609, CNN at 0.586, Merlin at 0.593, and CT-CLIP at 0.593 (Tao et al., 22 Aug 2025).
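The quoted percentage margins are relative gains rather than absolute AUC differences, which a short calculation makes explicit. The helper name below is hypothetical; the AUC values are the ones quoted above.

```python
def relative_gain_percent(model_score, baseline_score):
    """Relative improvement of the model over a baseline, in percent."""
    return 100.0 * (model_score - baseline_score) / baseline_score


external_auc = {"RenalCLIP": 0.841, "CT-FM": 0.739, "CNN": 0.717}

for baseline in ("CT-FM", "CNN"):
    gain = relative_gain_percent(external_auc["RenalCLIP"], external_auc[baseline])
    print(baseline, round(gain, 1))
```

The corresponding absolute differences are about 0.10 and 0.12 AUC, so the relative figures should not be read as percentage points.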
For aggressiveness, the combined external result was AUC 0.703, PR AUC 0.460, sensitivity 0.713, specificity 0.613, and F1 0.506, with a consistent advantage across cohorts (Tao et al., 22 Aug 2025). On TCIA aggressiveness classification, RenalCLIP achieved AUC 0.661, the best reported value (Tao et al., 22 Aug 2025). The paper highlights a clinically oriented result: in TCIA, only RenalCLIP’s aggressiveness predictions significantly stratified recurrence-free survival, with a hazard ratio of 2.23, whereas the baselines did not (Tao et al., 22 Aug 2025).
Prognostic performance is reported using the C-index. For recurrence-free survival, RenalCLIP achieved 0.864 internally, 0.671 on the combined external cohorts, and 0.726 on TCIA (Tao et al., 22 Aug 2025). On TCIA, the relative improvement over the best baseline, CT-FM, was reported as 22.6% for recurrence-free survival (0.726 versus 0.592), 6.3% for disease-specific survival (0.690 versus 0.649), and 4.3% for overall survival (0.650 versus 0.623) (Tao et al., 22 Aug 2025). Kaplan–Meier analysis showed clear and statistically significant recurrence-free survival separation on TCIA, with a hazard ratio of 3.7, and the RenalCLIP risk score remained independently prognostic after adjustment for TNM stage and WHO/ISUP grade in multivariate Cox analysis, with a hazard ratio of 2.27 (Tao et al., 22 Aug 2025). Time-dependent analysis on TCIA further showed consistent leadership in time-dependent AUC, C-index, and Brier score for recurrence-free survival, with a similar advantage for disease-specific survival and overall survival during most of follow-up (Tao et al., 22 Aug 2025).
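For readers unfamiliar with the Kaplan–Meier analysis mentioned above, the following sketch shows the product-limit estimator on toy data. It is a generic illustration with hypothetical values, not a reproduction of the paper's curves. In a risk-stratified analysis the estimator would be run separately on the high-risk and low-risk groups.

```python
def kaplan_meier(times, events):
    """Return (time, survival probability) at each observed event time."""
    survival = 1.0
    curve = []
    event_times = sorted({t for t, observed in zip(times, events) if observed})
    for t in event_times:
        at_risk = sum(1 for x in times if x >= t)
        failures = sum(1 for x, observed in zip(times, events) if x == t and observed)
        survival *= 1.0 - failures / at_risk
        curve.append((t, survival))
    return curve


# Toy follow-up in months; 1 marks an observed recurrence, 0 marks censoring.
toy_times = [1, 2, 3, 4]
toy_events = [1, 0, 1, 1]
print(kaplan_meier(toy_times, toy_events))
```

Censored patients leave the risk set without lowering the curve, which is what distinguishes this estimator from a simple fraction of patients without recurrence.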
The comparative baselines included CNN as a randomly initialized 3D ResNet-18 image-only model, CT-FM as a SegResNet-based SimCLR variant pre-trained on 148k whole-body CTs, CT-CLIP as a 3D ViT with a text encoder pre-trained on 50k chest CTs, and Merlin as a 3D abdominal vision-language model trained on approximately 15k abdominal CTs with EHR codes and reports (Tao et al., 22 Aug 2025). The reported pattern is that RenalCLIP’s margins widened under external validation, which the paper interprets as stronger resilience to distribution shift than general CT foundation models (Tao et al., 22 Aug 2025). This suggests that disease-specific semantic pre-training may be particularly beneficial when deployment settings differ from the development environment.
6. Zero-shot capability, retrieval, report generation, and data efficiency
Beyond supervised fine-tuning, RenalCLIP supports cross-modal retrieval, zero-shot diagnosis, and radiology report generation (Tao et al., 22 Aug 2025). For retrieval, text-to-image and image-to-text tasks were performed using cosine similarity in the shared embedding space and evaluated with Recall@1, Recall@3, and Recall@5 on cohorts with reports (Tao et al., 22 Aug 2025). The paper reports strong improvements across five cohorts; for example, the mean text-to-image Recall@5 was 0.120 for RenalCLIP versus 0.051 for Merlin and 0.004 for CT-CLIP (Tao et al., 22 Aug 2025).
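Recall@K for retrieval can be computed from a query-by-candidate similarity matrix. The sketch below assumes that query i's true match is candidate i, which is the usual setup for paired scan-report data. The function name and toy matrix are hypothetical.

```python
def recall_at_k(similarity, k):
    """Share of queries whose paired candidate ranks within the top k."""
    hits = 0
    for i, row in enumerate(similarity):
        ranked = sorted(range(len(row)), key=lambda j: row[j], reverse=True)
        if i in ranked[:k]:
            hits += 1
    return hits / len(similarity)


# Toy cosine similarities: rows are text queries, columns are candidate scans.
toy_similarity = [
    [0.9, 0.1, 0.0],
    [0.8, 0.2, 0.1],
    [0.1, 0.2, 0.3],
]
print(recall_at_k(toy_similarity, 1), recall_at_k(toy_similarity, 2))
```

Recall@K depends on the size of the candidate pool, so absolute values are comparable only across models evaluated on the same cohort.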
Zero-shot diagnosis used a prompt ensemble with 20 sentence templates and 5 class phrases per class, yielding 100 prompts per class (Tao et al., 22 Aug 2025). Two evaluation strategies were used: a deterministic maximum-similarity ensemble and stochastic prompt sampling with bootstrap (Tao et al., 22 Aug 2025). On TCIA malignancy classification, the zero-shot AUC was 0.730 for RenalCLIP, compared with 0.664 for Merlin and 0.469 for CT-CLIP. On TCIA aggressiveness classification, the corresponding values were 0.657, 0.591, and 0.412 (Tao et al., 22 Aug 2025). The paper states that RenalCLIP was robust across strategies and produced the strongest zero-shot AUCs on internal, external, and TCIA datasets (Tao et al., 22 Aug 2025).
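The prompt ensemble and the deterministic maximum-similarity rule can be sketched as follows. The templates and class phrases here are hypothetical placeholders, not the paper's wording, and the similarities are assumed to be precomputed cosine scores between one image embedding and each prompt embedding.

```python
from itertools import product


def build_prompts(templates, phrases):
    """Cross every sentence template with every class phrase."""
    return [template.format(phrase) for template, phrase in product(templates, phrases)]


def max_similarity_scores(similarities_by_class):
    """Deterministic ensemble: keep the best-matching prompt for each class."""
    return {label: max(sims) for label, sims in similarities_by_class.items()}


def predict(similarities_by_class):
    """Return the class whose best prompt is most similar to the image."""
    scores = max_similarity_scores(similarities_by_class)
    return max(scores, key=scores.get)


# Hypothetical placeholders standing in for 20 templates and 5 class phrases.
templates = [f"Template {i}: the CT shows {{}}." for i in range(20)]
malignant_phrases = [f"malignant phrase {j}" for j in range(5)]
prompts = build_prompts(templates, malignant_phrases)

print(len(prompts))
print(predict({"benign": [0.10, 0.30], "malignant": [0.20, 0.25]}))
```

The stochastic alternative described in the text would instead sample prompts per class and bootstrap the resulting scores, which exposes how sensitive the prediction is to prompt wording.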
Data efficiency is a central property of the model. Fine-tuning with 20–40% labeled data achieved or exceeded the fully fine-tuned performance of baselines across malignancy and aggressiveness (Tao et al., 22 Aug 2025). For malignancy, the paper states that RenalCLIP’s zero-shot performance surpassed the peak of all baselines even after their full-data fine-tuning, and that with only 20% of the data, RenalCLIP already matched or exceeded baseline peaks (Tao et al., 22 Aug 2025). This is consistent with the abstract’s statement that in the diagnostic classification task it required only 20% training data to achieve the peak performance of all baseline models even after those baselines were fully fine-tuned on 100% of the data (Tao et al., 22 Aug 2025).
For report generation, the architecture consists of a frozen RenalCLIP image backbone, a linear projector, and a BioMistral-7B LLM fine-tuned with LoRA (Tao et al., 22 Aug 2025). Training proceeds in two steps: first aligning the projector with frozen encoders on the pre-training dataset, then fine-tuning the projector and LoRA parameters on the internal training cohort, using an instruction-based prompt to produce Findings and Impression (Tao et al., 22 Aug 2025). The decoding strategy follows standard LLM generation, and RenalCLIP achieved the highest BLEU-1, BLEU-2, BLEU-4, METEOR, and ROUGE-L across internal and multiple external cohorts compared with RadFM, CT-CHAT, GPT-4o, and MedGemma-4B (Tao et al., 22 Aug 2025).
Ablation analysis provides insight into what drives these capabilities. Removing domain-specific image pre-training or text pre-training degraded performance across retrieval, zero-shot, and fine-tuning tasks (Tao et al., 22 Aug 2025). Image pre-training contributed the largest and most consistent gains, while renal-specific text pre-training synergistically improved cross-modal retrieval and deterministic zero-shot classification (Tao et al., 22 Aug 2025). The paper also notes a nuanced effect under stochastic prompt sampling, where text pre-training slightly increased prompt sensitivity (Tao et al., 22 Aug 2025). For multi-phase imaging, late fusion produced modest improvements over single-phase input in some settings, but no phase combination among A, AV, NAV, and NAVD dominated across tasks and cohorts; simple averaging sometimes diluted signal (Tao et al., 22 Aug 2025). The authors therefore characterize single-phase arterial input, or venous if arterial is unavailable, as a strong pragmatic baseline (Tao et al., 22 Aug 2025).
7. Limitations, interpretability, deployment, and future directions
The study is retrospective, and the paper states that prospective real-world validation is needed for regulatory adoption and clinical integration (Tao et al., 22 Aug 2025). It also identifies population and site bias, since pre-training data were predominantly drawn from Chinese centers, while noting demonstrated cross-ethnic performance on TCIA and the need for broader global validation including underrepresented ancestries (Tao et al., 22 Aug 2025). Phase variability and acquisition heterogeneity are described as potential noise sources, with learnable multi-phase fusion proposed as a promising extension (Tao et al., 22 Aug 2025).
Interpretability is treated cautiously. The study focused on outcome-grounded validation through Kaplan–Meier curves, time-dependent metrics, and multivariate Cox analysis, but explicit saliency or attention visualizations such as Grad-CAM were not reported (Tao et al., 22 Aug 2025). A common misconception would be to equate the reported clinical relevance with mechanistic explainability; the paper explicitly separates these issues by demonstrating prognostic utility without providing saliency-based attribution maps. It further identifies improved explainability as a natural future direction (Tao et al., 22 Aug 2025).
Other risks involve annotation and generation quality. Report-derived attributes depend on LLM extraction, and while pathologic re-review improved quality, inter-rater metrics were not reported (Tao et al., 22 Aug 2025). Report generation, despite outperforming the included baselines, still shows occasional factual inconsistencies, so human oversight remains essential (Tao et al., 22 Aug 2025). These caveats delimit the scope of the current evidence: the model is reported as strong in quantitative benchmarking, but not as a fully autonomous clinical system.
Practical deployment details are also specified. The 3D ResNet-18 backbone has approximately 35M parameters and uses 128 × 128 × 32 inputs, which the paper characterizes as enabling feasible inference on modern GPUs, although exact latency is not reported (Tao et al., 22 Aug 2025). The implementation uses standard PyTorch and MONAI pipelines, and ROI centering can be automated with nnU-Net followed by lightweight radiologist quality assurance (Tao et al., 22 Aug 2025). Code, pretrained weights, and tutorials are available at the project repository, while training details, hyperparameters, prompts, and augmentation recipes are described as fully documented to support replication; the paper does not specify the license and directs readers to the repository for licensing information (Tao et al., 22 Aug 2025).
The future work identified in the paper includes radiogenomic prediction such as BAP1 and PBRM1, therapy response modeling, rare subtype detection, and improved explainability (Tao et al., 22 Aug 2025). Taken together, these directions indicate that RenalCLIP is positioned not merely as a benchmark model for renal mass classification, but as a kidney cancer–specific multimodal representation framework whose current evidence base is strongest for preoperative CT-driven characterization, diagnosis, prognosis, and associated zero-shot and generative tasks (Tao et al., 22 Aug 2025).