Jolia: 3D CT Vision–Language Model
- Jolia is a 3D CT vision–language model that integrates chest and abdominal CT volumes with full radiology reports using detailed concept-level alignments.
- It employs ConQuer to perform mask-free contrastive learning per anatomical region, preserving the organ-wise structure of CT data.
- The model improves global representation and localized analysis, supporting applications such as findings classification, report generation, and exploratory localization.
Jolia is a 3D CT vision–language foundation model trained on chest and abdominal CT volumes paired with full radiology reports. Its central technical contribution is ConQuer (Concept Queries), a pretraining scheme that augments CLIP-style global image–text alignment with concept-level alignments, one per anatomical region, while remaining mask-free and using no spatial supervision. In the reported instantiation, concepts are anatomical regions, reports are decomposed into concept-specific sections, and the model is optimized with both a global contrastive loss and a set of per-concept contrastive losses. This design is intended to preserve the organ-wise structure of volumetric CT and of long, multi-section radiology reports, a structure that single-token global alignment compresses away (Khlaut et al., 23 Jun 2026).
1. Concept and positioning within CT vision–language pretraining
Jolia is formulated against the limitations of standard CLIP-style CT pretraining. In that regime, an entire 3D volume and an entire report are each encoded into a single global embedding, and pretraining is driven by a global contrastive objective. For radiology, this is a structurally coarse abstraction: one CT commonly spans dozens of organs and structures, and one report typically contains multiple organ-specific findings. A single image token aligned to a single text token therefore mixes anatomical and pathological content that is semantically local rather than globally homogeneous (Khlaut et al., 23 Jun 2026).
The motivating claim is not merely that CT reports are long, but that they are structured. Findings are often expressed organ by organ (lungs, liver, kidneys, mediastinum, retroperitoneum, and so forth), so global text pooling collapses what is effectively a set of localized supervision signals into one vector. Jolia addresses this by retaining the usual global alignment while introducing one alignment channel per concept. The resulting representation is intended to disentangle “which finding belongs where,” rather than forcing the model to encode all organs and abnormalities in a single latent summary.
A common misconception is that this kind of organ-aware alignment requires segmentation masks or an external segmenter. Jolia is explicitly presented as a mask-free alternative: it does not use RadSAM or a segmentation module, unlike approaches such as fVLM, TotalFM, and CT-GLIP. Instead, localization is induced by concept-wise contrastive learning itself. Another misconception is that Jolia replaces global CLIP training; it does not. The method preserves the global CLIP objective and adds concept-level supervision as an auxiliary constraint.
2. ConQuer: report decomposition, concept queries, and optimization
ConQuer operates in three stages: report processing, image-side concept pooling, and joint optimization. A concept is defined as any semantic unit intended for localization; in the reported experiments, concepts are anatomical regions. The paper defines a taxonomy of anatomical concepts, including individual organs, larger compartments, and vessels. Report structuring is performed with a two-step GPT‑5.2 pipeline. The first stage performs exam-type and taxonomy routing, detecting modality or subtype and mapping the study to a modality-specific organ and finding taxonomy. The second stage performs finding extraction and tagging: it splits the Findings section into atomic, single-sentence observations and tags each sentence with a finding_category and an organ. The finding categories come from 252 unified labels (80 chest, 172 abdomen), and the organ tags are mapped by hand to the 102 concepts. Sentences assigned to the same concept are concatenated into a concept-specific text section; concepts absent from a report are masked out of the concept-level loss for that sample (Khlaut et al., 23 Jun 2026).
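The grouping step at the end of the report pipeline can be illustrated with a minimal sketch. It assumes the tagging stage has already produced one record per sentence; the example sentences, the organ-to-concept mapping, and the three-concept subset are all hypothetical stand-ins for the paper's 102-concept taxonomy, and the function name `build_concept_sections` is invented for illustration.

```python
from collections import defaultdict

# Hypothetical tagged sentences, shaped as the tagging stage might emit them.
# The field names follow the text (finding_category, organ); the values are made up.
tagged = [
    {"sentence": "The liver is enlarged.", "finding_category": "hepatomegaly", "organ": "liver"},
    {"sentence": "No focal hepatic lesion.", "finding_category": "normal", "organ": "liver"},
    {"sentence": "Left renal cyst measuring 2 cm.", "finding_category": "renal_cyst", "organ": "left kidney"},
]

# Hypothetical hand-written organ -> concept mapping and an illustrative concept subset.
ORGAN_TO_CONCEPT = {"liver": "liver", "left kidney": "kidneys", "right kidney": "kidneys"}
CONCEPTS = ["liver", "kidneys", "spleen"]


def build_concept_sections(tagged_sentences):
    """Concatenate sentences per concept and flag which concepts are present."""
    grouped = defaultdict(list)
    for item in tagged_sentences:
        concept = ORGAN_TO_CONCEPT.get(item["organ"])
        if concept is not None:  # organs outside the mapping are skipped
            grouped[concept].append(item["sentence"])
    sections = {c: " ".join(grouped[c]) for c in CONCEPTS if grouped[c]}
    mask = {c: c in sections for c in CONCEPTS}  # False -> excluded from the concept loss
    return sections, mask


sections, mask = build_concept_sections(tagged)
print(sections)
print(mask)
```

The boolean mask is what allows a concept with no text in a given report to be dropped from the per-concept loss for that sample.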
This report parser is quantitatively validated on 50 chest + 50 abdomen reports, with precision/recall around 0.96/0.94 on chest and 0.98/0.95 on abdomen for finding extraction. That validation matters because Jolia’s concept-level supervision is only as reliable as the decomposition of reports into concept sections. The method therefore depends on an upstream language-processing layer, rather than on structured reports already present in the source datasets.
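On the image side, the encoder produces patch tokens at one or more scales, together with a global token. For each scale $s$, ConQuer instantiates $K$ learnable query tokens, one per concept:
$$q_k^{(s)} \in \mathbb{R}^{d_s}, \quad k = 1, \dots, K.$$
A single-head cross-attention module is then applied with query $q_k^{(s)}$ and keys/values equal to the scale-specific patch tokens $X^{(s)}$. After LayerNorm, the output for concept $k$ at scale $s$ is
$$z_k^{(s)} = \mathrm{LayerNorm}\big(\mathrm{CrossAttn}(q_k^{(s)}, X^{(s)}, X^{(s)})\big).$$
For multi-scale encoders with $S$ scales, the scale-specific outputs are concatenated into a concept embedding
$$z_k = \big[z_k^{(1)}; \dots; z_k^{(S)}\big].$$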
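The concept pooling described above can be sketched in plain Python. This is a simplified illustration rather than the paper's implementation: keys and values are taken to be the raw patch tokens with no learned projections, LayerNorm has no affine parameters, the queries are random instead of learned, and the toy sizes (4 dimensions, 3 concepts, the per-scale token counts) are hypothetical.

```python
import math
import random


def softmax(xs):
    m = max(xs)
    exps = [math.exp(x - m) for x in xs]
    total = sum(exps)
    return [e / total for e in exps]


def layer_norm(v, eps=1e-5):
    mean = sum(v) / len(v)
    var = sum((x - mean) ** 2 for x in v) / len(v)
    return [(x - mean) / math.sqrt(var + eps) for x in v]


def concept_pool(query, patches):
    """Single-head cross-attention of one query over patch tokens, then LayerNorm.

    Simplification: keys and values are the raw patch tokens (no learned projections).
    Returns the pooled vector and the attention weights over patches.
    """
    d = len(query)
    scores = [sum(q * p for q, p in zip(query, patch)) / math.sqrt(d) for patch in patches]
    weights = softmax(scores)
    pooled = [sum(w * patch[j] for w, patch in zip(weights, patches)) for j in range(d)]
    return layer_norm(pooled), weights


random.seed(0)
dim, num_concepts = 4, 3       # toy sizes; the text reports 192 dims per scale
patches_per_scale = [8, 4, 2]  # hypothetical token counts for a 3-scale encoder

concept_embeddings = [[] for _ in range(num_concepts)]
for n_tokens in patches_per_scale:
    patches = [[random.gauss(0, 1) for _ in range(dim)] for _ in range(n_tokens)]
    # One query per concept and per scale; these would be learnable parameters in practice.
    queries = [[random.gauss(0, 1) for _ in range(dim)] for _ in range(num_concepts)]
    for k, query in enumerate(queries):
        pooled, _ = concept_pool(query, patches)
        concept_embeddings[k].extend(pooled)  # concatenate across scales

print(len(concept_embeddings[0]))  # dim * number of scales
```

With 192-dimensional tokens and three scales, the same concatenation would give a 576-dimensional concept embedding, matching the projected text dimension quoted in the text.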
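Text features are produced by Qwen3‑Embedding‑8B, which is frozen during training. For each sample, the full report and every non-empty concept section are encoded into raw 4096-dimensional embeddings and then linearly projected to 576 dimensions, yielding a global text embedding $t$ and concept embeddings $t_k$. Training then combines a global CLIP loss with a concept-wise CLIP loss. For concept $k$, the similarity between the image-side concept embedding $z_{k,i}$ of sample $i$ and the text-side concept embedding $t_{k,j}$ of sample $j$ is defined by
$$S^{(k)}_{ij} = \mathrm{sim}_k(z_{k,i}, t_{k,j}),$$
where $\mathrm{sim}_k(u, v) = \cos(u, v) / \tau_k$ is cosine similarity scaled by a learnable temperature $\tau_k$ specific to concept $k$. The concept-level loss is the average of symmetric InfoNCE losses over the set $\mathcal{K}_B$ of concepts that have at least two valid samples in the batch:
$$\mathcal{L}_{\text{concept}} = \frac{1}{|\mathcal{K}_B|} \sum_{k \in \mathcal{K}_B} \mathcal{L}_{\text{InfoNCE}}\big(S^{(k)}\big).$$
The global loss is a standard symmetric InfoNCE objective on the $[\mathrm{CLS}]$ token and full-report embedding:
$$\mathcal{L}_{\text{global}} = \mathcal{L}_{\text{InfoNCE}}\big(S^{\text{global}}\big), \quad S^{\text{global}}_{ij} = \cos\big(z_{[\mathrm{CLS}],i}, t_j\big) / \tau.$$
The total training loss is
$$\mathcal{L} = \lambda_{\text{global}}\, \mathcal{L}_{\text{global}} + \lambda_{\text{concept}}\, \mathcal{L}_{\text{concept}},$$
with fixed weights $\lambda_{\text{global}}$ and $\lambda_{\text{concept}}$. The intended effect is that global alignment remains primary while concept-level alignment imposes organ-wise disentangling.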
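The objective described above can be sketched as follows. This is a minimal illustration under stated assumptions: the temperature of 0.07 and the loss weights of 1.0 are placeholders and not the paper's values, temperatures are fixed here rather than learned, and missing concept text is represented by `None`. All function names are invented for the example.

```python
import math


def cosine(u, v):
    dot = sum(a * b for a, b in zip(u, v))
    return dot / (math.sqrt(sum(a * a for a in u)) * math.sqrt(sum(b * b for b in v)))


def cross_entropy_diag(logits):
    """Mean cross-entropy where the target of row i is column i."""
    loss = 0.0
    for i, row in enumerate(logits):
        m = max(row)
        log_z = m + math.log(sum(math.exp(x - m) for x in row))
        loss += log_z - row[i]
    return loss / len(logits)


def symmetric_info_nce(img, txt, temperature):
    """Average of image-to-text and text-to-image InfoNCE on scaled cosine similarities."""
    logits = [[cosine(u, v) / temperature for v in txt] for u in img]
    transposed = [list(col) for col in zip(*logits)]
    return 0.5 * (cross_entropy_diag(logits) + cross_entropy_diag(transposed))


def concept_loss(img_by_concept, txt_by_concept, temperatures):
    """Average symmetric InfoNCE over concepts with at least two valid samples.

    txt_by_concept[k][i] is None when sample i has no text for concept k.
    """
    losses = []
    for k, (imgs, txts) in enumerate(zip(img_by_concept, txt_by_concept)):
        valid = [i for i, t in enumerate(txts) if t is not None]
        if len(valid) < 2:
            continue  # too few samples to form negatives for this concept
        losses.append(symmetric_info_nce([imgs[i] for i in valid],
                                         [txts[i] for i in valid],
                                         temperatures[k]))
    return sum(losses) / len(losses) if losses else 0.0


# Toy batch: 3 samples, 2 concepts, 2-D embeddings (all values made up).
cls_img = [[1.0, 0.0], [0.0, 1.0], [1.0, 1.0]]
report_txt = [[0.9, 0.1], [0.1, 0.9], [0.7, 0.7]]
img_by_concept = [cls_img, cls_img]
txt_by_concept = [report_txt, [None, [0.0, 1.0], None]]  # concept 1 has one valid sample

lambda_global, lambda_concept = 1.0, 1.0  # placeholder weights, not the paper's values
l_global = symmetric_info_nce(cls_img, report_txt, temperature=0.07)
l_concept = concept_loss(img_by_concept, txt_by_concept, temperatures=[0.07, 0.07])
total = lambda_global * l_global + lambda_concept * l_concept
print(l_global, l_concept, total)
```

In the toy batch the second concept is skipped because only one sample has text for it, which mirrors the "at least two valid samples" condition in the text.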
3. Architecture, data, and training configuration
The main Jolia configuration uses the Atlas 3D transformer as its visual backbone. The input CT is resampled to 1.5 mm isotropic spacing and cropped to a fixed voxel grid. Atlas uses fixed-size 3D patches, a 3-scale hierarchy, 192-dimensional patch tokens per scale, and a 576-dimensional global [CLS] vector formed by pooling and concatenating across scales. The backbone has approximately 21.8M parameters. ConQuer is also instantiated with a multi-scale 3D ResNet‑101 at 48.2M parameters and a single-scale 3D ViT‑B at 120M parameters. The cross-attention query mechanism adds only approximately 0.6M parameters, described as about 3% of Atlas and 1% of ResNet‑101 (Khlaut et al., 23 Jun 2026).
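The dimension and overhead figures in this paragraph can be checked with a few lines of arithmetic. The sketch below uses only numbers quoted in the text; the variable names are illustrative.

```python
scales, dim_per_scale = 3, 192
global_dim = scales * dim_per_scale  # pooled tokens concatenated across scales
print(global_dim)

query_params_m = 0.6  # approximate size of the cross-attention query mechanism, in millions
backbones_m = {"Atlas": 21.8, "ResNet-101": 48.2}
overhead = {name: round(100 * query_params_m / size, 1) for name, size in backbones_m.items()}
print(overhead)  # percentage overhead per backbone
```

The computed overheads are roughly 2.8% and 1.2%, consistent with the rounded "about 3%" and "1%" quoted above.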
Jolia is pretrained on 74,434 public CT–report pairs drawn from CT‑RATE, INSPECT, and Merlin‑Abd‑CT. The dataset counts listed are: CT‑RATE: 24,128 train / 1,564 test; INSPECT: 23,248 train; and Merlin‑Abd‑CT: 20,357 train / 5,137 test. External datasets are used only for evaluation: EXT‑Chest‑CT: 30,873 train / 3,851 test with 80 findings, and EXT‑Abd‑CT: 6,503 train / 811 test with 172 findings. The paper emphasizes that Jolia uses only public datasets for pretraining, in contrast to several comparison systems that use larger private corpora.
Optimization is performed on 8 × NVIDIA H100 GPUs with global batch size 48 (6 per GPU) for 120,000 steps, described as 120 “epochs” of 1,000 steps. The optimizer is AdamW with weight decay 0.05, no gradient clipping, and a fixed base learning rate, with training run in bfloat16 mixed precision. The schedule is warmup–plateau–decay, with 8 epochs of warmup, a stable plateau, and 8 epochs of cooldown to a lower final learning rate. Data augmentation is not emphasized; the claimed novelty lies in the alignment mechanism rather than aggressive augmentation design.
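A warmup, plateau, and decay schedule of this shape can be sketched as below. The epoch counts come from the text; the learning-rate values `BASE_LR` and `MIN_LR` are placeholders, and the linear shape of both ramps is an assumption, since the text does not specify it.

```python
def lr_at_epoch(epoch, base_lr, min_lr, total_epochs=120, warmup_epochs=8, cooldown_epochs=8):
    """Warmup-plateau-decay schedule with linear ramps (the ramp shape is an assumption)."""
    if epoch < warmup_epochs:
        return base_lr * (epoch + 1) / warmup_epochs
    decay_start = total_epochs - cooldown_epochs
    if epoch < decay_start:
        return base_lr
    progress = (epoch - decay_start + 1) / cooldown_epochs
    return base_lr + (min_lr - base_lr) * progress


BASE_LR, MIN_LR = 1e-4, 1e-6  # placeholder values, not taken from the paper
schedule = [lr_at_epoch(e, BASE_LR, MIN_LR) for e in range(120)]

steps_per_epoch, gpus, per_gpu_batch = 1000, 8, 6
print(len(schedule) * steps_per_epoch)  # total optimizer steps
print(gpus * per_gpu_batch)             # global batch size
print(schedule[0], schedule[60], schedule[-1])
```

The two printed totals reproduce the 120,000 steps and the global batch size of 48 stated in the text.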
4. Representations, interpretability, and operational use
Jolia exposes three frozen visual representations at inference time: the global [CLS] token, the concept-level query tokens (Query), one per concept, and the concatenated representation [CLS]+Query, described as the “finding-anchored representation” for a given concept (Khlaut et al., 23 Jun 2026). This interface reflects the model’s dual design. Whole-volume tasks can use [CLS]; organ-specific tasks can use [CLS]+Query; and purely localized analyses can focus on Query alone.
Because each concept is realized as a cross-attention query over spatial tokens, the model produces concept-specific attention weights that can be visualized as 3D attention maps. The paper reports anatomically coherent attention for concepts such as liver, lungs, kidneys, colon, and hip regions, while also showing failure cases for pancreas and spleen, where attention spreads to neighboring organs. The stated explanation is anatomical: these organs are smaller, adjacent to other structures, and often associated with many normal findings, which weakens the contrastive signal. Importantly, interpretability is demonstrated qualitatively. No formal IoU-based localization metric is reported; the evidence consists of visualizations and qualitative checks against segmentation masks or radiologist judgment.
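Turning a concept's attention weights into a 3D map is essentially a reshape. The sketch below assumes the patch tokens are stored in row-major (depth, height, width) order and uses a toy 2 × 2 × 3 token grid; the real token ordering and grid size depend on the encoder and are not given in the text, and both function names are invented.

```python
def to_3d_map(weights, grid):
    """Reshape a flat list of attention weights into nested lists of shape (D, H, W)."""
    d, h, w = grid
    if len(weights) != d * h * w:
        raise ValueError("number of weights does not match the token grid")
    return [[[weights[(z * h + y) * w + x] for x in range(w)]
             for y in range(h)]
            for z in range(d)]


def peak_location(weights, grid):
    """Return the (z, y, x) index of the most attended token."""
    _, h, w = grid
    idx = max(range(len(weights)), key=weights.__getitem__)
    z, rem = divmod(idx, h * w)
    y, x = divmod(rem, w)
    return z, y, x


grid = (2, 2, 3)  # toy token grid (depth, height, width)
weights = [0.02, 0.03, 0.05, 0.05, 0.05, 0.05, 0.05, 0.40, 0.10, 0.10, 0.05, 0.05]
volume = to_3d_map(weights, grid)
print(peak_location(weights, grid))
```

For display over the CT, such a token-level map would then be upsampled to voxel resolution; that step is omitted here.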
Operationally, the model is intended to support at least four downstream modes. First, it can be used as a CT feature extractor, with either global or organ-conditioned embeddings. Second, it supports linear probing and lightweight fine-tuning by mapping each finding to its associated concept and training a small head on the [CLS]+Query representation for that concept. Third, it can serve as the visual front-end for report generation by projecting frozen Jolia features into an LLM. Fourth, the attention maps can be used for exploratory localization or as weak supervision for segmentation or detection. The released weights are hosted at https://huggingface.co/raidium/Jolia.
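The linear-probing mode can be illustrated with a small logistic-regression head on precomputed frozen features. This sketch does not use the released checkpoint or any loading API: the feature vectors are made-up toy values, the finding-to-concept mapping is hypothetical, and the plain gradient-descent trainer stands in for whatever probe training the paper actually uses.

```python
import math

FINDING_TO_CONCEPT = {"hepatomegaly": "liver", "renal_cyst": "kidneys"}  # hypothetical mapping


def finding_features(cls_token, concept_tokens, finding):
    """Concatenate the global token with the query token of the finding's concept."""
    concept = FINDING_TO_CONCEPT[finding]
    return list(cls_token) + list(concept_tokens[concept])


def sigmoid(z):
    return 1.0 / (1.0 + math.exp(-z))


def predict(w, b, x):
    return sigmoid(sum(wi * xi for wi, xi in zip(w, x)) + b)


def train_linear_probe(features, labels, lr=0.5, steps=500):
    """Logistic regression trained with full-batch gradient descent."""
    dim, n = len(features[0]), len(features)
    w, b = [0.0] * dim, 0.0
    for _ in range(steps):
        grad_w, grad_b = [0.0] * dim, 0.0
        for x, y in zip(features, labels):
            err = predict(w, b, x) - y
            for j in range(dim):
                grad_w[j] += err * x[j]
            grad_b += err
        w = [wi - lr * g / n for wi, g in zip(w, grad_w)]
        b -= lr * grad_b / n
    return w, b


# Toy frozen features for four exams: (global token, per-concept tokens, label).
exams = [
    ([0.1, 0.2], {"liver": [1.0, 0.0], "kidneys": [0.0, 0.0]}, 1),
    ([0.2, 0.1], {"liver": [0.9, 0.1], "kidneys": [0.0, 0.0]}, 1),
    ([0.1, 0.1], {"liver": [-1.0, 0.0], "kidneys": [0.0, 0.0]}, 0),
    ([0.2, 0.2], {"liver": [-0.8, 0.2], "kidneys": [0.0, 0.0]}, 0),
]
X = [finding_features(cls, tokens, "hepatomegaly") for cls, tokens, _ in exams]
y = [label for _, _, label in exams]
w, b = train_linear_probe(X, y)
print([round(predict(w, b, x), 3) for x in X])
```

Because the encoder is frozen, only the weights `w` and bias `b` of the head are trained, one head per finding.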
5. Empirical performance profile
The strongest results reported for Jolia are in findings classification, especially under linear probing. In the principal comparison averaged across Merlin‑Abd‑CT, EXT‑Abd‑CT, CT‑RATE, and EXT‑Chest‑CT, the reported AUROC values are 82.42 for the Baseline CLIP (Atlas), 83.12 for Jolia‑[CLS], and 84.12 for Jolia (Atlas, [CLS]+Query). The same table lists 82.92 for Pillar‑0‑Best, 80.69 for SPECTRE, 76.94 for COLIPRI, 78.45 for Merlin, 72.58 for CT‑FM, 70.73 for fVLM, and 70.35 for CT‑CLIP. By dataset, the Baseline CLIP to Jolia gains are 81.66 → 83.59 on Merlin‑Abd‑CT, 75.01 → 77.39 on EXT‑Abd‑CT, 85.48 → 86.44 on CT‑RATE, and 87.53 → 89.06 on EXT‑Chest‑CT (Khlaut et al., 23 Jun 2026).
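The averages quoted above follow directly from the per-dataset values, as the short check below shows. It uses only numbers from this paragraph; the abdomen and chest grouping of the four datasets is inferred from their names.

```python
datasets = ["Merlin-Abd-CT", "EXT-Abd-CT", "CT-RATE", "EXT-Chest-CT"]
baseline = [81.66, 75.01, 85.48, 87.53]  # Baseline CLIP (Atlas)
jolia = [83.59, 77.39, 86.44, 89.06]     # Jolia (Atlas, [CLS]+Query)

mean_baseline = round(sum(baseline) / len(baseline), 2)
mean_jolia = round(sum(jolia) / len(jolia), 2)
gains = {name: round(j - b, 2) for name, b, j in zip(datasets, baseline, jolia)}

abdomen_gain = (gains["Merlin-Abd-CT"] + gains["EXT-Abd-CT"]) / 2
chest_gain = (gains["CT-RATE"] + gains["EXT-Chest-CT"]) / 2

print(mean_baseline, mean_jolia)
print(gains)
print(round(abdomen_gain, 3), round(chest_gain, 3))
```

The per-dataset gains come out larger on the two abdominal datasets than on the two chest datasets.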
The reported ablations are notable because they separate representation choice from training objective. For Atlas and ResNet‑101, adding ConQuer improves the performance of [CLS] alone, implying that concept-level supervision improves even the global token. The concatenated [CLS]+Query representation is consistently best, while Query alone is often competitive and, for ResNet‑101, surpasses the CLIP baseline on all four datasets. A separate concept-granularity study reports average AUROC 82.27 for the baseline CLIP, 83.36 for K-means clusters (K=32), 83.83 for natural anatomical groupings (K=10), 83.90 for natural anatomical groupings (K=32), and 83.90 for the default fine organ-level (K=102). The stated conclusion is that any localization helps over global CLIP, anatomically meaningful concepts help slightly more, and gains saturate around K=32–102.
On cross-center transfer, the paper evaluates linear probes trained on one center and tested on another under a unified taxonomy. The reported average external AUROC values are 72.61 for the baseline CLIP, 77.05 for Jolia‑[CLS], and 75.88 for Jolia (Atlas), compared with 75.00 for Pillar‑0‑Best, 72.01 for SPECTRE, 71.25 for COLIPRI, 67.73 for Merlin, 61.79 for CT‑FM, 59.75 for CT‑CLIP, and 57.86 for fVLM. The result is somewhat non-intuitive: the best external average is achieved by Jolia‑[CLS], not by the concatenated representation. A plausible implication is that concept-aware pretraining improves the robustness of the global token itself, and that the most transportable representation under institution shift need not be the most localized one.
The zero-shot picture is more mixed. In the short zero-shot setting with 8 positive/negative template pairs and prompt averaging to avoid prompt cherry-picking, Jolia achieves 78.04 on Merlin‑Abd‑CT, which is the best value reported there, but only 62.06 on EXT‑Abd‑CT and 74.18 on CT‑RATE, where COLIPRI is best in chest zero-shot with 76.98 on CT‑RATE and 73.34 on EXT‑Chest‑CT. The paper attributes COLIPRI’s chest advantage to explicit training with zero-shot oriented text augmentations. In the long zero-shot setting, where prototypes are formed by averaging embeddings from 50 positive and 50 negative reports, Jolia reaches 83.03 on Merlin‑Abd‑CT, 72.52 on EXT‑Abd‑CT, 79.05 on CT‑RATE, and 79.40 on EXT‑Chest‑CT with Atlas. With ResNet‑101+ConQuer, the model reaches 82.69 on EXT‑Chest‑CT, which is reported as the best among all models.
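Both zero-shot settings reduce to comparing an image embedding against a positive and a negative text prototype, each formed by averaging several text embeddings. The sketch below assumes the score is the difference between the two cosine similarities; the paper's exact scoring rule is not given in this text, and the 2-D embeddings are toy values.

```python
import math


def mean_vector(vectors):
    n = len(vectors)
    return [sum(v[j] for v in vectors) / n for j in range(len(vectors[0]))]


def cosine(u, v):
    dot = sum(a * b for a, b in zip(u, v))
    return dot / (math.sqrt(sum(a * a for a in u)) * math.sqrt(sum(b * b for b in v)))


def zero_shot_score(image_emb, positive_embs, negative_embs):
    """Positive score favours the finding being present (scoring rule is an assumption)."""
    pos_proto = mean_vector(positive_embs)  # averaged positive templates or reports
    neg_proto = mean_vector(negative_embs)  # averaged negative templates or reports
    return cosine(image_emb, pos_proto) - cosine(image_emb, neg_proto)


# Toy embeddings: in the short setting these would be 8 template pairs,
# in the long setting 50 positive and 50 negative report embeddings.
image = [1.0, 0.2]
positives = [[1.0, 0.0], [0.9, 0.1]]
negatives = [[0.0, 1.0], [0.1, 0.9]]
score = zero_shot_score(image, positives, negatives)
print(score > 0)
```

Averaging over many prompts or reports is what removes the dependence on any single hand-picked prompt.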
For report generation, Jolia is paired with a Qwen3.5‑9B decoder in a two-stage pipeline: first training only the projector with both encoder and LLM frozen, then performing LoRA fine-tuning of the LLM and projector while keeping the encoder frozen. On 5,125 test exams from Merlin‑Abd‑CT, the reported metrics for Jolia are BLEU 0.119, ROUGE-L 0.323, BERTScore 0.567, RadGraph-F1 0.317, GREEN 0.324, and CRIMSON -0.194. The paper states that Jolia leads on 4/6 metrics, with a particularly large relative gain on RadGraph‑F1 over Merlin (in-domain), which scores 0.237. However, CRIMSON remains negative for all models, including Merlin’s -0.138, indicating persistent underreporting of positive findings.
For image–text retrieval, Jolia is described as competitive but not the main strength. On CT‑RATE, Jolia (Atlas, [CLS]) reports R@1 I→T: 29.79 and T→I: 20.43, while the best chest retrieval is attributed to COLIPRI or SPECTRE depending on direction. On Merlin‑Abd‑CT, Jolia reports R@1 I→T: 43.69 and T→I: 27.75; these values exceed several baselines but trail Merlin (64.00) and Pillar‑0‑Abd (48.90) in some directions. The explanation given is architectural rather than incidental: the text encoder is frozen and the training objective includes no retrieval-specific loss.
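Recall@1 in both directions can be computed from a single image-by-text similarity matrix. The sketch below assumes one-to-one pairing, so that image i matches report i; the 3 × 3 similarity values are made up to show that the two directions can disagree.

```python
def recall_at_k(similarity, k=1):
    """similarity[i][j] scores query i against candidate j; query i's match is candidate i."""
    hits = 0
    for i, row in enumerate(similarity):
        ranked = sorted(range(len(row)), key=lambda j: row[j], reverse=True)
        if i in ranked[:k]:
            hits += 1
    return 100.0 * hits / len(similarity)


# Rows are images, columns are reports (toy values).
sim = [
    [0.90, 0.10, 0.30],
    [0.20, 0.80, 0.40],
    [0.50, 0.75, 0.70],
]
image_to_text = recall_at_k(sim)
text_to_image = recall_at_k([list(col) for col in zip(*sim)])  # transpose for T->I
print(round(image_to_text, 2), round(text_to_image, 2))
```

In this toy matrix the third image ranks the wrong report first, while every report still ranks its own image first, so I→T and T→I recall differ.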
6. Limitations, open questions, and likely research trajectories
The limitations described for Jolia are methodological, not merely empirical. First, the gains from ConQuer are larger on abdomen than on chest. The reported interpretation is that chest reports typically cover fewer organs and are less detailed, so there is less concept structure to exploit. This suggests that the utility of concept-level alignment depends on the semantic granularity present in both the image domain and the reporting style (Khlaut et al., 23 Jun 2026).
Second, Jolia depends on an LLM-based report splitter, specifically a GPT‑5.2 pipeline. The paper explicitly notes that it is unclear how sensitive ConQuer is to the choice and quality of the splitter, and that this introduces a dependency and a potential bias source. Since the per-concept contrastive targets are derived from these structured report sections, any systematic parsing error can distort supervision at scale.
Third, the text backbone is a frozen Qwen3‑Embedding‑8B model that is not domain-finetuned for radiology. The paper notes that radiology reports are somewhat out-of-distribution for it, and that joint training or adaptation of the text encoder could further improve performance. Fourth, the 102 anatomical concepts are sufficient for chest and abdomen but do not cover all of medicine, and smaller or adjacent organs such as pancreas and spleen exhibit less precise attention. Fifth, although the method is argued to be extensible to whole-body CT, MRI, and concepts beyond anatomy, those extensions are not evaluated in the present work.
The report-generation results also expose a clinically relevant bias: Jolia, like comparison systems such as Merlin, tends to underreport positive findings and overproduce “normal” statements, as reflected by uniformly negative CRIMSON scores. This is not a marginal issue, because the model’s strongest gains are on representation learning rather than faithful, exhaustive generation.
Several future directions are stated explicitly. These include scaling ConQuer to whole-body CT with richer concept sets, extending concepts beyond anatomy to pathology categories and tumor subtypes, jointly training or fine-tuning text encoders, exploring cheaper report-structuring pipelines such as smaller LLMs or rule-based plus LLM hybrids, and using concept attention maps as weak labels for segmentation or detection. More broadly, Jolia suggests a shift in medical vision–language pretraining from purely global alignment toward structured contrastive objectives that preserve the internal organization of clinical data rather than compressing it away.