Med3DInsight: 3D Medical Pretraining

Updated 4 July 2026

Med3DInsight is a pretraining framework that transfers 2D slice-text semantics into volumetric encoders for CT and MRI.
It employs a Plane-Slice-Aware Transformer to align full 3D representations with 2D image and text embeddings, improving downstream segmentation and classification.
The framework integrates mini-batch partial optimal transport and reconstruction losses to boost data efficiency and model robustness.

Med3DInsight is a pretraining framework for 3D medical image understanding that transfers semantic supervision from 2D multimodal LLMs into volumetric encoders for CT and MRI. Its central design treats a 3D volume, a sampled 2D slice, and a slice-level text description as a training triplet, then aligns the 3D representation with 2D image and text embeddings through a Plane-Slice-Aware Transformer (PSAT). In the original formulation, Med3DInsight was introduced as a backbone-agnostic method for improving downstream 3D segmentation and classification by marrying existing 3D image encoders with 2D vision-LLMs; a later formulation retained this core idea while replacing the original alignment stage with partial optimal transport and an explicit reconstruction term, emphasizing scalable multimodal pretraining without human annotations (Chen et al., 2024, Chen et al., 11 Sep 2025).

1. Conceptual basis and problem setting

Med3DInsight addresses a specific weakness of volumetric medical representation learning: standard 3D CNN and transformer pipelines can achieve strong supervised performance, yet their features are not explicitly grounded in language and often require substantial labeled volumetric data. The original formulation argues that existing 3D medical encoders have limited semantic understanding and poor data efficiency, while the later formulation sharpens the critique by noting that many 3D self-supervised methods are dominated by low-level pixel- or patch-level objectives rather than high-level anatomical or pathological semantics (Chen et al., 2024, Chen et al., 11 Sep 2025).

The framework is motivated by the asymmetry between 2D and 3D multimodal learning. Vision-language systems such as CLIP and GPT-4V can connect images to text, but they are designed for 2D inputs. Directly applying them to 3D medical volumes is obstructed by dimensionality mismatch, a representation gap between volumetric and slice-level embeddings, the geometric dependence of any slice on its anatomical plane and slice position, and the scarcity of native 3D image-text corpora (Chen et al., 2024).

This places Med3DInsight in a broader line of work that reuses strong 2D models for volumetric analysis, but with a different objective. Slice-aggregation approaches such as attention pooling over all $H+W+D$ slices emphasize inspectable scan-level prediction, while later slice-transformer designs adapt 2D self-supervised encoders such as DINOv2 to stacks of 32 slices for classification and saliency analysis. Med3DInsight instead uses 2D image-text semantics as a pretraining signal for a general 3D encoder, rather than as the final inference architecture itself (Ziller et al., 2023, Müller-Franzes et al., 2024).

2. Triplet construction and pretraining pipeline

The original framework constructs training data as triplets

$D_k : \bigl(V_k, I_k^{(i,j)}, T_k^{(i,j)}\bigr),$

where $V_k$ is a 3D volume, $I_k^{(i,j)}$ is a 2D slice from plane $i$ at slice index $j$, and $T_k^{(i,j)}$ is a text description for that slice. The planes are coronal, sagittal, and axial. For each triplet, a 3D encoder produces

$h^V_k = f_V(V_k),$

a 2D image encoder produces

$h^I_{i,j,k} = f_I\!\left(I_k^{(i,j)}\right),$

and the text branch is intended to produce the slice-description embedding

$h^T_{i,j,k} = f_T\!\left(T_k^{(i,j)}\right).$

One printed expression for $h^T_{i,j,k}$ in the paper appears to be a typo, since the surrounding text clearly ties the text feature to the generated description rather than to the image itself (Chen et al., 2024).

In the 2024 formulation, pretraining uses 3DSeg-8, a public collection of about 2K 3D medical images spanning MRI and CT and multiple body parts. All slices are extracted from all three planes, and one slice is sampled during each training iteration; GPT-4V generates a detailed description for that slice, and CLIP image and text encoders supply a shared 2D vision-language embedding space. The paper states that GPT-4V and CLIP are frozen during Med3DInsight pretraining, although the exact CLIP fine-tuning schedule on generated slice-text pairs is not fully specified (Chen et al., 2024).

The later formulation makes the triplet-generation stage more explicit. It builds a corpus of 24,140 triplets from 3DSeg-8 and M3D, including MRI (4,587) and CT (18,330), and uses the GPT-4V prompt “Describe the image in fewer than 100 words.” It samples one 2D slice per volume, not multiple neighboring slices, on the grounds that adjacent slices are often redundant and that cross-volume diversity is more useful than dense within-volume sampling (Chen et al., 11 Sep 2025).
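The triplet construction described above can be sketched in a few lines. This is a minimal illustration, not the authors' pipeline: the `volume[z][y][x]` layout, the mapping of planes to axes, and the names `Triplet`, `extract_slice`, `sample_triplet`, and `fake_describe` are all assumptions made for the example, and `fake_describe` merely stands in for a call to a vision-language model such as GPT-4V.

```python
import random
from dataclasses import dataclass

# Assumed axis convention for a volume stored as volume[z][y][x].
# Real datasets need orientation-aware handling.
PLANES = ("axial", "coronal", "sagittal")
PROMPT = "Describe the image in fewer than 100 words."  # prompt quoted in the text


@dataclass
class Triplet:
    volume: list      # V_k, nested lists indexed [z][y][x]
    plane: str        # plane i
    slice_index: int  # slice position j
    image: list       # I_k^{(i,j)}, a 2D nested list
    text: str         # T_k^{(i,j)}


def extract_slice(volume, plane, j):
    """Return the 2D slice at position j along the given plane."""
    if plane == "axial":      # fix z
        return [row[:] for row in volume[j]]
    if plane == "coronal":    # fix y
        return [vol_z[j][:] for vol_z in volume]
    if plane == "sagittal":   # fix x
        return [[row[j] for row in vol_z] for vol_z in volume]
    raise ValueError(f"unknown plane: {plane}")


def num_slices(volume, plane):
    depth, height, width = len(volume), len(volume[0]), len(volume[0][0])
    return {"axial": depth, "coronal": height, "sagittal": width}[plane]


def sample_triplet(volume, describe, rng=random):
    """Sample one slice per volume and pair it with a generated description."""
    plane = rng.choice(PLANES)
    j = rng.randrange(num_slices(volume, plane))
    image = extract_slice(volume, plane, j)
    return Triplet(volume, plane, j, image, describe(image, PROMPT))


def fake_describe(image, prompt):
    # Hypothetical placeholder for the slice-captioning model.
    return f"placeholder description of a {len(image)}x{len(image[0])} slice"


# Toy 2 x 3 x 4 volume whose voxel value encodes its own (z, y, x) position.
volume = [[[z * 100 + y * 10 + x for x in range(4)] for y in range(3)] for z in range(2)]
triplet = sample_triplet(volume, fake_describe, random.Random(0))
```

Sampling a single `(plane, j)` pair per volume mirrors the one-slice-per-volume choice reported for the later formulation; denser sampling would only require calling `sample_triplet` more than once per volume.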

3. Plane-Slice-Aware Transformer

PSAT is the architectural bridge that lets Med3DInsight align a full 3D representation to a single 2D slice and its text description without collapsing volumetric context. Rather than globally pooling the volume feature or processing slices independently, PSAT uses learnable queries, self-attention among those queries, cross-attention from queries to the 3D volume feature, and an MLP projection head that maps the resulting query outputs into the same embedding space as the CLIP image and text features (Chen et al., 2024).
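The query-based attention pattern described above can be illustrated with a deliberately small sketch. It assumes single-head attention with no learned projection matrices, residual connections around both attention steps, and toy dimensions; it omits the MLP projection head and the positional conditioning. None of these choices come from the paper, which does not print its attention equations, and the name `psat_like_block` is hypothetical.

```python
import math


def dot(u, v):
    return sum(a * b for a, b in zip(u, v))


def softmax(scores):
    peak = max(scores)  # subtract the max for numerical stability
    exps = [math.exp(s - peak) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]


def attend(queries, keys, values):
    """Single-head scaled dot-product attention without learned projections."""
    scale = 1.0 / math.sqrt(len(keys[0]))
    outputs = []
    for q in queries:
        weights = softmax([dot(q, k) * scale for k in keys])
        outputs.append([sum(w * v[d] for w, v in zip(weights, values))
                        for d in range(len(values[0]))])
    return outputs


def add(xs, ys):
    return [[a + b for a, b in zip(x, y)] for x, y in zip(xs, ys)]


def psat_like_block(queries, volume_tokens):
    """Self-attention among queries, then cross-attention to volume tokens."""
    queries = add(queries, attend(queries, queries, queries))
    return add(queries, attend(queries, volume_tokens, volume_tokens))


# Toy sizes: 2 queries and 5 volume tokens, all of dimension 4.
queries = [[0.1, 0.0, -0.1, 0.2], [0.0, 0.3, 0.1, -0.2]]
volume_tokens = [[0.01 * (t + d) for d in range(4)] for t in range(5)]
query_outputs = psat_like_block(queries, volume_tokens)
```

The output keeps one vector per query regardless of how many volume tokens there are, which is what lets a fixed-size projection head map the result into the 2D embedding space.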

Its geometric specificity comes from plane-slice positional conditioning. The paper defines a plane-slice position embedding, written here as

$E \in \mathbb{R}^{C \times P \times S},$

where $C$ is the embedding dimension, $P$ is the number of planes, and $S$ is the number of slices. This tensor is initialized to zero. For a selected slice $I_k^{(i,j)}$, only the embedding corresponding to that slice’s plane and slice position is injected into the 3D volume feature and the learnable queries. The practical purpose is to disambiguate which 2D view of the 3D anatomy is being aligned, so that a coronal slice near one end of the volume is not treated as equivalent to an axial slice through the middle (Chen et al., 2024).
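The lookup-and-inject behaviour of the plane-slice position embedding can be sketched as follows. The `[plane][slice][dim]` storage layout, the toy sizes, additive injection, and the helper names `make_position_table` and `inject` are assumptions for illustration; the source only states that the tensor is zero-initialized and that the entry for the selected plane and slice is injected into the volume feature and the queries.

```python
NUM_PLANES, MAX_SLICES, DIM = 3, 8, 4  # illustrative sizes, not values from the paper


def make_position_table(num_planes, max_slices, dim):
    """Zero-initialised table holding one vector per (plane, slice) pair."""
    return [[[0.0] * dim for _ in range(max_slices)] for _ in range(num_planes)]


def inject(tokens, table, plane, slice_idx):
    """Add the embedding of the selected (plane, slice) to every token."""
    pos = table[plane][slice_idx]
    return [[t + p for t, p in zip(token, pos)] for token in tokens]


table = make_position_table(NUM_PLANES, MAX_SLICES, DIM)
volume_tokens = [[1.0, 2.0, 3.0, 4.0], [5.0, 6.0, 7.0, 8.0]]
queries = [[0.5, 0.5, 0.5, 0.5]]

# At initialisation every entry is zero, so injection is the identity.
unchanged = inject(volume_tokens, table, plane=0, slice_idx=3)

# Stand-in for a gradient update that touched only the (plane 0, slice 3) entry.
table[0][3] = [0.1, 0.0, 0.0, 0.0]
conditioned_tokens = inject(volume_tokens, table, plane=0, slice_idx=3)
conditioned_queries = inject(queries, table, plane=0, slice_idx=3)

# A different plane at the same slice index still sees its own (zero) entry.
other_slice = inject(volume_tokens, table, plane=2, slice_idx=3)
```

Because only one table entry participates per training step, each `(plane, slice)` vector is updated only by the examples sampled at that location, so the embedding learns location-specific offsets.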

The original experiments use learnable queries with token dimension 512. In the later formulation, nnFormer is identified as the main 3D visual encoder paired with PSAT. Across both versions, the module is treated as a reusable adaptor rather than a replacement for the 3D backbone, which is why the framework is presented as being easily integrated into existing 3D medical image understanding networks (Chen et al., 2024, Chen et al., 11 Sep 2025).

4. Objectives, alignment strategy, and the later extension

The original Med3DInsight pretraining objective is described as a contrastive loss that aligns projected 3D volume features with both the sampled slice image embedding and the generated text embedding. However, the paper does not print the explicit contrastive formula, nor does it provide full self-attention or cross-attention equations for PSAT. Downstream segmentation is trained with cross-entropy loss and Dice loss, simply averaged, while downstream classification uses cross-entropy loss (Chen et al., 2024).
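Since the paper does not print its contrastive formula, the following is only one plausible reading: a CLIP-style symmetric InfoNCE loss applied twice, once between projected 3D features and slice image embeddings and once between projected 3D features and text embeddings. The temperature value, the equal weighting of the two terms, and the function names are assumptions, not details from the source.

```python
import math


def dot(u, v):
    return sum(a * b for a, b in zip(u, v))


def normalize(v):
    norm = math.sqrt(dot(v, v))
    return [x / norm for x in v]


def info_nce(anchors, targets, temperature):
    """Mean cross-entropy of matching anchors[k] to targets[k] within a batch."""
    loss = 0.0
    for k, anchor in enumerate(anchors):
        logits = [dot(anchor, t) / temperature for t in targets]
        peak = max(logits)
        log_denominator = peak + math.log(sum(math.exp(l - peak) for l in logits))
        loss += log_denominator - logits[k]
    return loss / len(anchors)


def alignment_loss(vol, img, txt, temperature=0.07):
    """Symmetric 3D-to-image plus 3D-to-text contrastive loss (assumed form)."""
    vol, img, txt = ([normalize(v) for v in batch] for batch in (vol, img, txt))
    vol_img = 0.5 * (info_nce(vol, img, temperature) + info_nce(img, vol, temperature))
    vol_txt = 0.5 * (info_nce(vol, txt, temperature) + info_nce(txt, vol, temperature))
    return vol_img + vol_txt


# Toy batch of two triplets with 2D embeddings.
vol = [[1.0, 0.0], [0.0, 1.0]]
img = [[2.0, 0.1], [0.1, 2.0]]
txt = [[1.0, 0.2], [0.2, 1.0]]

matched = alignment_loss(vol, img, txt)
mismatched = alignment_loss(vol, img[::-1], txt[::-1])  # pairs deliberately swapped
```

On this toy batch the correctly paired embeddings give a much smaller loss than the swapped ones, which is the behaviour a contrastive alignment objective is meant to reward.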

The later formulation makes the optimization problem explicit and changes the alignment mechanism. For the projected 3D embedding of each volume, it replaces ordinary contrastive matching with mini-batch partial optimal transport (mPOT). The transport plan is governed by three quantities: the transported mass, which fixes how much of the total batch mass must be matched; the ground cost between embeddings; and an entropic regularization weight. The ground metric is Mahalanobis rather than Euclidean, and the transport plans are computed by Bregman projection. A reconstruction term preserves low-level 3D information, and the total objective combines the mPOT alignment terms with this reconstruction term.

This change is motivated by the fact that a single 2D slice is only a partial observation of a 3D volume and that GPT-generated descriptions can be noisy; partial optimal transport therefore relaxes the assumption that every matched pair must be fully and exactly aligned (Chen et al., 11 Sep 2025).
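To make the partial-transport idea concrete, the sketch below solves the textbook entropic partial optimal transport problem: minimize $\langle \Pi, C \rangle - \epsilon H(\Pi)$ subject to $\Pi \mathbf{1} \le a$, $\Pi^\top \mathbf{1} \le b$, and $\mathbf{1}^\top \Pi \mathbf{1} = s$, using a standard Dykstra-style sequence of Bregman projections and a Mahalanobis ground cost. The symbols, the identity Mahalanobis matrix, the uniform marginals, and the values of `mass`, `reg`, and `iters` are assumptions for illustration and are not taken from the paper, whose exact formulation may differ.

```python
import math


def mahalanobis(x, y, M):
    """sqrt((x - y)^T M (x - y)) for a positive semi-definite matrix M."""
    d = [a - b for a, b in zip(x, y)]
    Md = [sum(m_rc * d_c for m_rc, d_c in zip(row, d)) for row in M]
    return math.sqrt(max(sum(a * b for a, b in zip(d, Md)), 0.0))


def _mul(A, B):
    return [[x * y for x, y in zip(ra, rb)] for ra, rb in zip(A, B)]


def _div(A, B):
    return [[x / y for x, y in zip(ra, rb)] for ra, rb in zip(A, B)]


def partial_ot(cost, a, b, mass, reg=0.1, iters=200):
    """Entropic partial OT plan via Dykstra-style Bregman projections.

    Constraints: row sums <= a, column sums <= b, total mass == mass.
    Very large cost/reg ratios can underflow exp() to zero; this sketch
    does not guard against that.
    """
    n, m = len(a), len(b)
    K = [[math.exp(-cost[i][j] / reg) for j in range(m)] for i in range(n)]
    total = sum(map(sum, K))
    K = [[v * mass / total for v in row] for row in K]
    q1, q2, q3 = ([[1.0] * m for _ in range(n)] for _ in range(3))
    for _ in range(iters):
        # Projection onto {row sums <= a}.
        prev, K = K, _mul(K, q1)
        rows = [sum(row) for row in K]
        K1 = [[v * min(a[i] / rows[i], 1.0) for v in row] for i, row in enumerate(K)]
        q1 = _div(_mul(q1, prev), K1)
        # Projection onto {column sums <= b}.
        prev, K1 = K1, _mul(K1, q2)
        cols = [sum(col) for col in zip(*K1)]
        K2 = [[v * min(b[j] / cols[j], 1.0) for j, v in enumerate(row)] for row in K1]
        q2 = _div(_mul(q2, prev), K2)
        # Projection onto {total transported mass == mass}.
        prev, K2 = K2, _mul(K2, q3)
        total = sum(map(sum, K2))
        K = [[v * mass / total for v in row] for row in K2]
        q3 = _div(_mul(q3, prev), K)
    return K


# Toy mini-batch: three 3D embeddings and three slice embeddings that coincide.
identity = [[1.0, 0.0, 0.0], [0.0, 1.0, 0.0], [0.0, 0.0, 1.0]]
vol_emb = [[1.0, 0.0, 0.0], [0.0, 1.0, 0.0], [0.0, 0.0, 1.0]]
img_emb = [row[:] for row in vol_emb]
cost = [[mahalanobis(v, s, identity) for s in img_emb] for v in vol_emb]
uniform = [1.0 / 3] * 3
plan = partial_ot(cost, uniform, uniform, mass=0.5)
```

With `mass` below 1, only part of each marginal is transported, so a noisy or weakly related pair does not have to absorb mass the way it would under a full coupling. A learned Mahalanobis matrix would replace `identity` in practice.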

A plausible implication is that the 2025 variant resolves one of the main methodological gaps of the 2024 paper. The earlier version established the geometric bridge and the downstream benefit, whereas the later version formalized the multimodal alignment stage mathematically and treated noisy slice-text supervision as a first-class optimization problem rather than as a conventional contrastive pairing assumption (Chen et al., 2024, Chen et al., 11 Sep 2025).

5. Empirical performance and ablation evidence

The 2024 paper reports consistent gains across segmentation and classification benchmarks when Med3DInsight is used to pretrain the backbone. On MM-WHS cardiac segmentation, nnFormer improves from 85.9 average Dice to 88.6 with Med3DInsight; on CHAOS liver segmentation, nnFormer improves from 91.9 to 94.1; and on OASIS brain segmentation, nnFormer improves from 92.4 to 94.7. For Alzheimer’s disease classification on OASIS2, ViT improves from 80.5 Accuracy / 82.6 AUC to 81.4 / 84.3, while Swin-ViT improves from 82.8 / 84.4 to 84.1 / 85.7 (Chen et al., 2024).

The later formulation expands evaluation to ten segmentation datasets and two classification datasets. Averaged across the ten segmentation tasks, Med3DInsight reports 88.59 DSC and 2.49 HD95, compared with vox2vec at 87.52 DSC and 2.96 HD95. On OASIS2 classification it reaches 86.93 Accuracy and 89.52 AUC; the second classification benchmark is PPMI (Chen et al., 11 Sep 2025).

Setting	Comparison	Result
OASIS segmentation (2024)	nnFormer vs nnFormer + Med3DInsight	92.4 vs 94.7 Dice
CHAOS segmentation (2024)	nnFormer vs nnFormer + Med3DInsight	91.9 vs 94.1 Dice
OASIS2 classification (2024)	Swin-ViT vs Swin-ViT + Med3DInsight	82.8/84.4 vs 84.1/85.7 Accuracy/AUC
10-dataset segmentation average (2025)	vox2vec vs Med3DInsight	87.52/2.96 vs 88.59/2.49 DSC/HD95
OASIS2 classification (2025)	PCRLv2 vs Med3DInsight	84.71/87.92 vs 86.93/89.52 Accuracy/AUC
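The absolute differences implied by the table can be computed directly. The short script below only restates the table's numbers; the dictionary keys and variable names are illustrative, and the 2025 rows compare Med3DInsight against another pretraining method rather than against the same backbone without pretraining.

```python
# (reference, Med3DInsight) pairs copied from the table above; higher is better.
higher_is_better = {
    "OASIS segmentation 2024, Dice": (92.4, 94.7),
    "CHAOS segmentation 2024, Dice": (91.9, 94.1),
    "OASIS2 classification 2024, Accuracy": (82.8, 84.1),
    "OASIS2 classification 2024, AUC": (84.4, 85.7),
    "10-dataset segmentation 2025, DSC": (87.52, 88.59),
    "OASIS2 classification 2025, Accuracy": (84.71, 86.93),
    "OASIS2 classification 2025, AUC": (87.92, 89.52),
}
hd95_2025 = (2.96, 2.49)  # HD95 is a distance, so lower is better

gains = {name: round(ours - ref, 2) for name, (ref, ours) in higher_is_better.items()}
hd95_reduction = round(hd95_2025[0] - hd95_2025[1], 2)

for name, gain in gains.items():
    print(f"{name}: +{gain:.2f}")
print(f"10-dataset segmentation 2025, HD95: -{hd95_reduction:.2f}")
```

The 2024 segmentation gains are a little over two Dice points, while the 2025 margin over vox2vec on the ten-dataset average is about one DSC point alongside a reduction of roughly half a unit in HD95.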

Ablation studies identify PSAT as the decisive architectural component. In the original OASIS ablation, direct projection without transformer or positional embedding yields 92.5 average Dice, adding the query transformer yields 93.9, and full PSAT with plane-slice position embedding yields 94.7. The later paper ablates the training objective: reconstruction alone gives 91.79 average Dice, mPOT alignment alone gives 93.84, and the combined objective gives 94.75. Using contrastive alignment in place of mPOT yields 93.87, against 94.75 with mPOT. It also reports that one slice per volume performs better than three or five, with 94.75 average Dice for one slice, 94.44 for three, and 91.96 for five (Chen et al., 2024, Chen et al., 11 Sep 2025).

6. Position in the literature, limitations, and outlook

Med3DInsight’s defining contribution is not a new volumetric backbone but a new pretraining recipe: use 2D slice-text semantics to improve 3D medical encoders while preserving 3D context. This makes it distinct from slice-based classifiers that aggregate 2D features for a final decision, and from purely volumetric self-supervised methods that never leave the 3D visual domain. This suggests that Med3DInsight occupies a specific middle ground between 2D multimodal supervision and native 3D representation learning (Ziller et al., 2023, Müller-Franzes et al., 2024, Chen et al., 11 Sep 2025).

The limitations are explicit. The 2024 paper leaves several components under-specified, including the exact contrastive loss, the detailed PSAT attention equations, and the CLIP fine-tuning protocol. It also relies on generated slice descriptions rather than native 3D text supervision. The later paper addresses the alignment objective more rigorously, but still depends on GPT-4V quality, notes that GPT-4V is not a true medical 3D expert, and retains the one-slice supervision strategy as a deliberate but restrictive design choice. It also raises broader concerns about incorrect interpretations, sensitive medical data, and the need for greater fine-grained semantic understanding (Chen et al., 2024, Chen et al., 11 Sep 2025).

The framework’s broader significance lies in its transferability. The original paper presents it as easily integrated into existing 3D medical image understanding networks, and the later paper shows that the same conceptual core can be reformulated with stronger alignment machinery while remaining annotation-free at pretraining time. Future directions stated in the later work include refining fine-grained 3D semantic understanding and integrating with LLMs to reduce noise in generated content, implying that Med3DInsight is best understood not as a fixed architecture but as a continuing line of multimodal 3D pretraining research (Chen et al., 2024, Chen et al., 11 Sep 2025).