Medverse: Universal 3D Medical Image Analysis

Updated 4 July 2026

Medverse is a universal in-context learning model for 3D medical image analysis that unifies segmentation, transformation, and enhancement without retraining.
It employs a next-scale autoregressive framework and a Blockwise Cross-Attention Module to efficiently fuse long-range spatial context and preserve full-resolution details.
Evaluated over 27 diverse datasets, Medverse outperforms competitors in Dice score and PSNR, demonstrating robust generalizability and computational efficiency.

Medverse is a universal in-context learning model for full-resolution 3D medical image analysis that unifies segmentation, transformation, and enhancement within a single retraining-free framework. It is trained across diverse organs, imaging modalities, clinical centers, and task families, and is designed to address a specific limitation of prior medical ICL systems: the inability to preserve both full-resolution volumetric fidelity and global anatomical understanding at the same time. Its central technical contribution is a next-scale autoregressive in-context learning framework that predicts from coarse to fine, coupled with a Blockwise Cross-Attention Module for efficient long-range context–target interaction in 3D volumes (Hu et al., 11 Sep 2025).

1. Universal 3D in-context learning formulation

Medverse frames 3D medical image analysis as an in-context learning problem in which a target image is interpreted relative to a small set of context examples. The context consists of image–output pairs from other subjects, and the output type determines the task. When the context output is a binary mask, the task is segmentation. When it is another image, the task becomes transformation, such as skull stripping or arbitrary modality-to-modality transformation. When it is an improved version of the input, the task is enhancement, such as bias removal, inpainting, Gaussian noise removal, or salt-and-pepper noise removal (Hu et al., 11 Sep 2025).
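To illustrate how a context pair specifies the task, the sketch below builds enhancement-style pairs by corrupting a clean signal: the degraded version plays the role of the context image and the clean version the context output. The corruption routines, their parameters, and the 1D list used in place of a 3D volume are illustrative assumptions, not the paper's data pipeline.

```python
import random
from typing import Callable, List, Tuple

Volume = List[float]  # 1D stand-in for a 3D volume, intensities in [0, 1] (assumption)


def add_gaussian_noise(clean: Volume, sigma: float, rng: random.Random) -> Volume:
    """Add zero-mean Gaussian noise and clip the result back to [0, 1]."""
    return [min(1.0, max(0.0, v + rng.gauss(0.0, sigma))) for v in clean]


def add_salt_and_pepper(clean: Volume, prob: float, rng: random.Random) -> Volume:
    """With probability `prob`, replace a voxel by 0.0 (pepper) or 1.0 (salt)."""
    noisy = []
    for v in clean:
        if rng.random() < prob:
            noisy.append(float(rng.random() < 0.5))
        else:
            noisy.append(v)
    return noisy


def make_context_pair(clean: Volume, corrupt: Callable[[Volume], Volume]) -> Tuple[Volume, Volume]:
    """Return (context image, context output) for an enhancement task."""
    return corrupt(clean), list(clean)


rng = random.Random(0)
clean = [0.2, 0.4, 0.6, 0.8]
denoise_pair = make_context_pair(clean, lambda v: add_gaussian_noise(v, 0.05, rng))
despeckle_pair = make_context_pair(clean, lambda v: add_salt_and_pepper(v, 0.1, rng))
```

Swapping the corruption function changes the task that the pair expresses while the model interface stays the same, which is the sense in which the output type determines the task.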

This formulation is explicitly task-universal rather than segmentation-only. The same model is trained across 22 datasets covering universal image segmentation, transformation, and enhancement, and the final prediction is always a full volumetric output at the same spatial resolution as the input. The paper therefore treats task identity as something inferred from examples rather than encoded by a dedicated task token or task-specific retraining. In practical terms, Medverse is a 3D image-to-image model that uses example pairs as task prompts and coarse predictions as internal autoregressive prompts (Hu et al., 11 Sep 2025).

The model’s prediction at autoregressive step $t$ is written as

$\hat{\bm{y}}^{(t)} = F(\bm{x}^{(t)},\, S^{(t)},\, \mathcal{A}^{(t-1)}),$

where $\bm{x}^{(t)}$ is the target image at scale $t$, $S^{(t)}$ is the semantic context, and $\mathcal{A}^{(t-1)}$ is the autoregressive context from the previous scale. This formalization is important because it makes Medverse different from a prompt-conditioned segmentation network: the model conditions simultaneously on task semantics from other subjects and on the target’s own previously predicted coarse structure (Hu et al., 11 Sep 2025).
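The recursion in this formula can be sketched as a simple loop over scales. The code below is a minimal illustration under stated assumptions: volumes are 1D lists, each scale halves the resolution, the autoregressive context is reduced to the upsampled previous prediction, and `toy_step` is a hypothetical placeholder for the learned function $F$.

```python
from typing import Callable, List, Optional, Sequence, Tuple

Volume = List[float]                 # 1D stand-in for a 3D volume (assumption)
ContextPair = Tuple[Volume, Volume]  # (context image, context output)
StepFn = Callable[[Volume, List[ContextPair], Optional[Volume]], Volume]


def downsample(v: Sequence[float], factor: int) -> Volume:
    """Average-pool by an integer factor; len(v) must be divisible by it."""
    return [sum(v[i:i + factor]) / factor for i in range(0, len(v), factor)]


def upsample(v: Sequence[float], factor: int) -> Volume:
    """Nearest-neighbour upsampling by an integer factor."""
    return [value for value in v for _ in range(factor)]


def coarse_to_fine(x: Volume, context: List[ContextPair], step_fn: StepFn, num_steps: int) -> Volume:
    """Evaluate y^(t) = F(x^(t), S^(t), A^(t-1)) for t = 1..num_steps.

    len(x) must be divisible by 2 ** (num_steps - 1).
    """
    prediction: Optional[Volume] = None  # no autoregressive context at the coarsest step
    for t in range(1, num_steps + 1):
        factor = 2 ** (num_steps - t)    # coarsest scale first, factor 1 at the end
        x_t = downsample(x, factor)
        s_t = [(downsample(img, factor), downsample(out, factor)) for img, out in context]
        a_prev = upsample(prediction, 2) if prediction is not None else None
        prediction = step_fn(x_t, s_t, a_prev)
    return prediction


def toy_step(x_t: Volume, s_t: List[ContextPair], a_prev: Optional[Volume]) -> Volume:
    """Hypothetical placeholder for F: blend the target with the coarse prediction."""
    if a_prev is None:
        return list(x_t)
    return [(xi + ai) / 2 for xi, ai in zip(x_t, a_prev)]


x = [0, 0, 0, 0, 4, 4, 4, 4]
y = coarse_to_fine(x, context=[(x, x)], step_fn=toy_step, num_steps=3)
```

The final prediction has the same length as the input, mirroring the requirement that the output match the input's spatial resolution.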

2. Architecture: coarse-to-fine autoregression and blockwise fusion

Medverse is built around three 3D U-Net branches: an autoregressive context branch, a target image branch, and a semantic context branch. The semantic context branch processes standard ICL examples from other subjects. The autoregressive context branch processes the target image and its prediction from the previous, coarser scale. The autoregressive branch shares weights with the semantic context branch for parameter efficiency, while a learnable autoregressive embedding is added after the first layer to distinguish autoregressive context from semantic context (Hu et al., 11 Sep 2025).

The central architectural idea is next-scale autoregression. Instead of autoregressing over tokens or voxels, Medverse autoregresses over scales. It first predicts a coarse low-resolution output over the whole volume, then feeds that prediction back as context for a finer-scale prediction, continuing until the original resolution is reached. If the input size is $(H, W, D)$ and the patch size is $I \times I \times I$, the number of autoregressive steps is

$T=\Bigl\lceil \log_{2}\Bigl(\frac{\max\{H,W,D\}}{I}\Bigr)\Bigr\rceil + 1.$

This design allows the coarsest stage to capture whole-volume anatomy while later stages sharpen boundaries and restore fine detail (Hu et al., 11 Sep 2025).
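As a worked example of the step-count formula, the sketch below evaluates $T$ for a few input shapes and lists the resolution at each step. The example shapes, the patch size of 128, the clamp to at least one step, and the halving of resolution between consecutive scales are assumptions made for illustration.

```python
from typing import List, Tuple

Shape = Tuple[int, int, int]


def num_autoregressive_steps(shape: Shape, patch_size: int) -> int:
    """T = ceil(log2(max(H, W, D) / I)) + 1, using integers to avoid float rounding.

    Clamped so that volumes no larger than one patch still take a single step
    (an assumption for inputs smaller than the patch).
    """
    longest = max(shape)
    doublings = 0
    while patch_size * (2 ** doublings) < longest:
        doublings += 1
    return doublings + 1


def scale_schedule(shape: Shape, patch_size: int) -> List[Shape]:
    """Volume size at each step, assuming resolution doubles from step to step."""
    steps = num_autoregressive_steps(shape, patch_size)
    schedule = []
    for t in range(1, steps + 1):
        factor = 2 ** (steps - t)
        # Ceiling division keeps every axis at least one voxel long.
        schedule.append(tuple(-(-s // factor) for s in shape))
    return schedule


print(num_autoregressive_steps((256, 256, 192), 128))  # 2
print(scale_schedule((320, 256, 200), 128))
```

For a $256 \times 256 \times 192$ volume and $I = 128$, the ratio is 2, so $T = \lceil \log_2 2 \rceil + 1 = 2$: one whole-volume coarse pass followed by one full-resolution pass.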

At higher autoregressive steps, Medverse uses sliding-window processing, but unlike ordinary patchwise inference it does not treat each patch independently. For each target patch, it crops the semantic context at the same spatial location and extracts the corresponding region from the previous-scale autoregressive context, upsamples it, and uses it as conditioning input. This is the mechanism by which Medverse avoids the unstable predictions and stitching artifacts associated with direct high-resolution sliding-window inference (Hu et al., 11 Sep 2025).
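The co-located cropping described above can be sketched as index bookkeeping. In the code below, each fine-scale window is paired with the region it covers in the previous, half-resolution prediction; the stride, the handling of the last window, and the factor of 2 between scales are assumptions, since the source does not specify them.

```python
from typing import List, Tuple

Span = Tuple[int, int]  # half-open interval [start, stop) along one axis


def window_starts(length: int, patch: int, stride: int) -> List[int]:
    """Start indices of sliding windows that together cover the whole axis."""
    if length <= patch:
        return [0]
    starts = list(range(0, length - patch + 1, stride))
    if starts[-1] + patch < length:
        starts.append(length - patch)  # extra window flush with the far edge
    return starts


def coarse_span(start: int, patch: int, factor: int = 2) -> Span:
    """Region of the previous-scale prediction covered by a fine-scale window."""
    return start // factor, (start + patch + factor - 1) // factor


def patch_plan(shape: Tuple[int, int, int], patch: int, stride: int):
    """List (fine window, coarse region) pairs for every target patch.

    The same fine window would also be used to crop the semantic context.
    """
    plan = []
    for z in window_starts(shape[0], patch, stride):
        for y in window_starts(shape[1], patch, stride):
            for x in window_starts(shape[2], patch, stride):
                fine = tuple((s, s + patch) for s in (z, y, x))
                coarse = tuple(coarse_span(s, patch) for s in (z, y, x))
                plan.append((fine, coarse))
    return plan


plan = patch_plan((256, 256, 256), patch=128, stride=128)
```

Each coarse region would then be upsampled to the patch size before being passed to the autoregressive context branch, so every window sees a prediction that was made with whole-volume context.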

The second major component is the Blockwise Cross-Attention Module (BAM). Context and target structures are not guaranteed to be spatially aligned, so simple concatenation or strictly local fusion is suboptimal. BAM therefore performs sparse block-level cross-attention. It partitions the feature volume into $p \times p \times p$ non-overlapping blocks, computes pooled query and key representations at block level, retains full unpooled value features, and performs attention over blocks rather than over all voxels. For a feature volume with $N$ voxels, the number of attended tokens therefore falls from $N$ to $N/p^{3}$, which is the source of the complexity reduction relative to full voxel-level cross-attention.

The paper quantifies this saving by reporting the token count, added GFLOPs, and memory overhead per context pair for BAM alongside the larger figures that standard full cross-attention would require (Hu et al., 11 Sep 2025).
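The block-level attention pattern can be sketched in a few lines. The code below is a toy version under stated assumptions: features are scalar and one-dimensional, pooling is a mean, scores are plain products, and each target block receives a weighted sum of whole, unpooled context value blocks. It illustrates pooled queries and keys with unpooled values, not the paper's exact module.

```python
import math
from typing import List, Sequence, Tuple


def split_blocks(seq: Sequence[float], p: int) -> List[List[float]]:
    """Partition a 1D feature sequence into non-overlapping blocks of length p."""
    if len(seq) % p != 0:
        raise ValueError("sequence length must be divisible by the block size")
    return [list(seq[i:i + p]) for i in range(0, len(seq), p)]


def softmax(scores: Sequence[float]) -> List[float]:
    """Numerically stable softmax."""
    top = max(scores)
    exps = [math.exp(s - top) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]


def blockwise_cross_attention(target_q, context_k, context_v, p: int) -> List[float]:
    """Attend over blocks: pooled queries and keys, unpooled values."""
    q_pooled = [sum(b) / p for b in split_blocks(target_q, p)]
    k_pooled = [sum(b) / p for b in split_blocks(context_k, p)]
    v_blocks = split_blocks(context_v, p)
    fused = []
    for q in q_pooled:
        weights = softmax([q * k for k in k_pooled])  # one weight per context block
        for offset in range(p):                       # keep within-block detail
            fused.append(sum(w * v[offset] for w, v in zip(weights, v_blocks)))
    return fused


def token_counts(shape: Tuple[int, int, int], p: int) -> Tuple[int, int]:
    """(voxel tokens N, block tokens N / p^3) for a 3D feature volume."""
    n = shape[0] * shape[1] * shape[2]
    return n, n // p ** 3


out = blockwise_cross_attention([1, 2, 3, 4], [1, 1], [5, 7], p=2)
print(token_counts((16, 16, 16), 4))  # (4096, 64)
```

Because the score matrix is computed between blocks, its size depends on the number of blocks rather than the number of voxels, while the unpooled values keep voxel-level detail in the fused output.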

3. Data regime, tasks, and generalization protocol

The full data collection comprises 27 publicly available datasets with 40,362 3D scans in total. Of these, 22 datasets are used for training and validation with a random 9:1 split, while 5 datasets are held out entirely for generalization testing. The data span multiple imaging modalities, including T1, T2, FLAIR, MRA, DWI, ADC, PD, CT, and PET; multiple anatomical regions, including brain, abdomen, prostate, lung, and nasal structures; multiple clinical centers; and multiple species, with human data in training and mice as a held-out species condition (Hu et al., 11 Sep 2025).

The held-out evaluation is organized to test four distinct forms of generalization: unseen center, unseen organ, unseen species, and unseen modality. The specific held-out targets include unseen-center cerebral cortex, hippocampus, thalamus, liver, spleen, and left kidney; unseen-organ maxillary sinus, nasal cavity, and nasal pharynx; unseen-species mouse lung; and unseen-modality PET lateral ventricle. Each held-out dataset is split 5:5 into a meta context set for selecting semantic context examples and a test set for evaluation. Segmentation masks are binarized into foreground and background (Hu et al., 11 Sep 2025).
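A minimal sketch of the held-out protocol follows: a reproducible 5:5 split into a meta context set and a test set, plus binarization of a multi-label mask. The function names, the seed, the case identifiers, and the choice to binarize one label at a time are assumptions for illustration.

```python
import random
from typing import List, Sequence, Tuple


def split_meta_context_and_test(case_ids: Sequence[str], seed: int = 0) -> Tuple[List[str], List[str]]:
    """Shuffle deterministically, then split 5:5 into (meta context, test).

    With an odd number of cases, the extra case goes to the test set.
    """
    ids = sorted(case_ids)  # sort first so the split does not depend on input order
    random.Random(seed).shuffle(ids)
    half = len(ids) // 2
    return ids[:half], ids[half:]


def binarize_mask(label_map: Sequence[int], foreground_label: int) -> List[int]:
    """Map one label to foreground (1) and everything else to background (0)."""
    return [1 if v == foreground_label else 0 for v in label_map]


cases = [f"case_{i:03d}" for i in range(10)]  # hypothetical identifiers
meta_context, test_set = split_meta_context_and_test(cases)
left_kidney = binarize_mask([0, 2, 2, 1, 0], foreground_label=2)
```

Keeping the two sets disjoint ensures that context examples shown to the model never overlap with the subjects it is evaluated on.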

This evaluation design is stricter than ordinary random-split reporting because the model is required to infer tasks and structures from context examples under genuine domain shift. The paper’s universality claim is therefore not limited to multi-organ training. It is a claim about fine-tuning-free adaptation across centers, organs, species, modalities, and task families within a unified 3D ICL formulation (Hu et al., 11 Sep 2025).

For losses, the model follows Neuroverse3D conventions. Segmentation uses a modified segmentation loss, while enhancement and transformation apply their loss both to image intensities and to intensity differences. The backbone is a five-stage 3D U-Net with 32 initial channels, doubling at each stage; the input patch size and the BAM hyperparameters are fixed as specified in the paper (Hu et al., 11 Sep 2025).

4. Empirical performance and ablation structure

The main reported result is that Medverse achieves the best average performance among the compared ICL systems for both segmentation and image restoration families. On segmentation, its average Dice is 87.27, compared with 81.10 for Neuroverse3D, 65.01 for UniverSeg, 63.34 for Neuralizer, and 41.68 for SegGPT. On transformation and enhancement, its average PSNR is 28.30, compared with 27.08 for Neuroverse3D, 22.34 for Neuralizer, and 13.22 for Painter (Hu et al., 11 Sep 2025).

Evaluation	Medverse	Comparator
Average segmentation Dice	87.27	Neuroverse3D 81.10
Average transformation/enhancement PSNR	28.30	Neuroverse3D 27.08
Segmentation with BAM+NA-ICL ablation	87.27	BAM only 78.90
Enhancement with BAM+NA-ICL ablation	29.66	BAM only 27.93

The segmentation gains are broad rather than isolated. Reported Dice scores include 87.30 for cerebral cortex, 82.12 for hippocampus, 87.65 for thalamus, 95.90 for liver, 91.05 for spleen, 95.31 for left kidney, 92.63 for maxillary sinus, 78.15 for nasal cavity, 87.13 for nasal pharynx, 92.21 for mouse lung, and 70.48 for PET lateral ventricle. The paper notes that Medverse is especially strong relative to Neuroverse3D on unseen organ, unseen species, unseen modality, and abdominal unseen-center generalization, and that it reaches performance similar to few-shot task-specific models while requiring no fine-tuning (Hu et al., 11 Sep 2025).
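For reference, the Dice coefficient on binary masks and the average of the per-structure scores listed above can be computed as follows. The flat-list mask representation and the convention of returning 1.0 when both masks are empty are assumptions; the per-structure values are copied from the text.

```python
from typing import Sequence


def dice_score(pred: Sequence[int], target: Sequence[int]) -> float:
    """Dice = 2 |P ∩ T| / (|P| + |T|) for binary masks given as flat 0/1 lists."""
    intersection = sum(p * t for p, t in zip(pred, target))
    denom = sum(pred) + sum(target)
    if denom == 0:
        return 1.0  # both masks empty: treated as perfect agreement (assumption)
    return 2.0 * intersection / denom


# Per-structure Dice scores (in percent) as reported in the text.
REPORTED_DICE = {
    "cerebral cortex": 87.30, "hippocampus": 82.12, "thalamus": 87.65,
    "liver": 95.90, "spleen": 91.05, "left kidney": 95.31,
    "maxillary sinus": 92.63, "nasal cavity": 78.15, "nasal pharynx": 87.13,
    "mouse lung": 92.21, "PET lateral ventricle": 70.48,
}

mean_dice = sum(REPORTED_DICE.values()) / len(REPORTED_DICE)
print(round(mean_dice, 2))  # 87.27
```

The unweighted mean of the eleven listed scores reproduces the reported average of 87.27, so the headline figure appears to be a simple per-structure average.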

For restoration-type tasks, Medverse reports 26.36 PSNR for skull stripping, 24.80 for modality transform, 28.48 for bias removal, 26.08 for Gaussian noise removal, 32.57 for salt-and-pepper denoising, and 31.51 for inpainting. The average gain over Neuroverse3D is modest at $+1.22$ PSNR (28.30 versus 27.08), but the paper emphasizes that Medverse preserves full input resolution while Neuroverse3D does not (Hu et al., 11 Sep 2025).
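The sketch below defines PSNR in its standard form and checks how the per-task values listed above relate to the reported averages. The intensity range of 1.0 is an assumption; the per-task numbers and the Neuroverse3D average of 27.08 are taken from the text.

```python
import math
from typing import Sequence


def psnr(pred: Sequence[float], reference: Sequence[float], data_range: float = 1.0) -> float:
    """PSNR = 10 * log10(data_range^2 / MSE), in decibels."""
    mse = sum((p - r) ** 2 for p, r in zip(pred, reference)) / len(reference)
    if mse == 0:
        return math.inf
    return 10.0 * math.log10(data_range ** 2 / mse)


# Per-task PSNR values as reported in the text.
TRANSFORMATION = {"skull stripping": 26.36, "modality transform": 24.80}
ENHANCEMENT = {
    "bias removal": 28.48, "Gaussian noise removal": 26.08,
    "salt-and-pepper denoising": 32.57, "inpainting": 31.51,
}


def mean(values) -> float:
    values = list(values)
    return sum(values) / len(values)


overall = mean(list(TRANSFORMATION.values()) + list(ENHANCEMENT.values()))
transformation_mean = mean(TRANSFORMATION.values())
enhancement_mean = mean(ENHANCEMENT.values())
gain_over_neuroverse3d = overall - 27.08

print(round(overall, 2), round(transformation_mean, 2), round(enhancement_mean, 2))
print(round(gain_over_neuroverse3d, 2))
```

The six tasks average to 28.30, and the two transformation tasks and four enhancement tasks average to 25.58 and 29.66 respectively, which match the full-model figures in the ablation table.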

The ablation study isolates the two main innovations, BAM and the next-scale autoregressive in-context learning scheme (NA-ICL). Removing both yields 76.50 Dice, 24.53 transformation PSNR, and 27.45 enhancement PSNR. Adding BAM alone raises these to 78.90, 25.38, and 27.93, showing that long-range context–target fusion is beneficial. Adding NA-ICL on top of BAM produces the largest jumps in Dice and enhancement PSNR, to 87.27 and 29.66, while transformation PSNR rises more modestly to 25.58. The pattern indicates that coarse-to-fine autoregressive refinement is the main driver of strong full-resolution 3D performance, particularly for segmentation (Hu et al., 11 Sep 2025).
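The incremental contribution of each component follows directly from the ablation numbers. The short calculation below uses only the values quoted in the text; the dictionary keys are labels chosen for this example.

```python
from typing import Dict

# Ablation results as reported in the text.
ABLATION: Dict[str, Dict[str, float]] = {
    "neither": {"dice": 76.50, "transformation_psnr": 24.53, "enhancement_psnr": 27.45},
    "bam_only": {"dice": 78.90, "transformation_psnr": 25.38, "enhancement_psnr": 27.93},
    "bam_and_na_icl": {"dice": 87.27, "transformation_psnr": 25.58, "enhancement_psnr": 29.66},
}


def gains(before: Dict[str, float], after: Dict[str, float]) -> Dict[str, float]:
    """Per-metric improvement when moving from one configuration to another."""
    return {metric: round(after[metric] - before[metric], 2) for metric in before}


bam_gain = gains(ABLATION["neither"], ABLATION["bam_only"])
na_icl_gain = gains(ABLATION["bam_only"], ABLATION["bam_and_na_icl"])

print(bam_gain)     # dice +2.40, transformation +0.85, enhancement +0.48
print(na_icl_gain)  # dice +8.37, transformation +0.20, enhancement +1.73
```

The autoregressive stage contributes most of the Dice and enhancement improvement, whereas for transformation PSNR the larger share of the gain comes from BAM.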

Efficiency is a further empirical point rather than an afterthought. Through adaptive parallel-sequential context processing, Medverse maintains fixed inference memory usage of 9.14 GB regardless of context size. Reported inference speed is 1.16 seconds for a single patch with 8 context samples and 1 autoregressive context, which is much faster than Painter and SegGPT in the reported comparison (Hu et al., 11 Sep 2025).
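One way memory can stay bounded as the context grows is to encode context pairs in fixed-size chunks and keep only a running aggregate. The sketch below illustrates that general idea; the mean aggregation, the `encode` callback, and the chunk size are assumptions, and the source does not describe the actual mechanism at this level of detail.

```python
from typing import Callable, List, Sequence, Tuple

Feature = List[float]
ContextPair = Tuple[List[float], List[float]]  # (context image, context output)


def aggregate_context(
    pairs: Sequence[ContextPair],
    encode: Callable[[ContextPair], Feature],
    chunk_size: int,
) -> Feature:
    """Mean of encoded context features, holding at most `chunk_size` encodings at once."""
    if not pairs:
        raise ValueError("at least one context pair is required")
    running_sum: Feature = []
    count = 0
    for start in range(0, len(pairs), chunk_size):
        # Pairs inside a chunk could be encoded in parallel; chunks run sequentially.
        chunk_features = [encode(pair) for pair in pairs[start:start + chunk_size]]
        for feature in chunk_features:
            if not running_sum:
                running_sum = list(feature)
            else:
                running_sum = [a + b for a, b in zip(running_sum, feature)]
            count += 1
    return [value / count for value in running_sum]


def toy_encode(pair: ContextPair) -> Feature:
    """Hypothetical encoder: element-wise product of image and output."""
    image, output = pair
    return [i * o for i, o in zip(image, output)]


pairs = [([float(k), 1.0], [1.0, float(k)]) for k in range(1, 9)]  # 8 context samples
by_two = aggregate_context(pairs, toy_encode, chunk_size=2)
all_at_once = aggregate_context(pairs, toy_encode, chunk_size=8)
```

The aggregate is identical for any chunk size, so the chunk size can be chosen purely to fit a memory budget, trading parallelism for a fixed peak footprint.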

5. Terminological neighborhood and adjacent “Medverse-like” systems

The name “Medverse” sits within a crowded neighborhood of medical AI systems whose goals are related but not identical. Some focus on image interpretation, some on biomedical modality alignment, some on EHR fusion, some on executable medical knowledge, and some on knowledge maintenance.

System	Focus	Relation to Medverse
MedVersa	Generalist medical image interpretation	Multimodal inputs/outputs and LLM orchestration (Zhou et al., 2024)
BIOVERSE	Biomedical modality alignment to LLMs	Aligns scRNA-seq, proteins, and molecules into a shared LLM space (Tsou et al., 1 Oct 2025)
MEDFuse	EHR multimodal fusion	Fuses clinical notes with structured lab tests (Phan et al., 2024)
MedSumm	Multimodal code-mixed query summarization	Intake-layer text–image summarization for patient queries (Ghosh et al., 2024)
MedVLSynther	Synthetic medical VQA generation	Open literature–derived multimodal supervision pipeline (Huang et al., 29 Oct 2025)
The Medical Algorithms Project	Executable medical algorithms repository	Computable knowledge layer across 45 medical areas (0908.0932)
CMED	Contextualized medication event extraction	Clinical narrative event modeling along orthogonal dimensions (Mahajan et al., 2020)

MedVersa is the closest imaging counterpart in scope, but its architecture is different. It is a generalist foundation model for medical image interpretation that supports multimodal inputs, multimodal outputs, and on-the-fly task specification across 11 tasks and 3 modalities, with an LLM used as a learnable orchestrator that can either answer directly in language or invoke dedicated detection and segmentation modules (Zhou et al., 2024). Medverse, by contrast, is a universal 3D ICL model centered on coarse-to-fine volumetric prediction rather than orchestration across heterogeneous 2D and 3D tasks (Hu et al., 11 Sep 2025).

BIOVERSE addresses a different layer of the stack: representation alignment of biomedical foundation models to LLMs for scRNA-seq, proteins, and molecules. Its two-stage pipeline aligns pretrained modality encoders to a shared LLM token space and then instruction-tunes the system for zero-shot annotation, cross-modal question answering, and generative reasoning (Tsou et al., 1 Oct 2025). This suggests a molecular and omics analogue of the interoperability problem that Medverse solves in 3D imaging.

MEDFuse is relevant at the EHR layer. It combines fine-tuned clinical-text LLM embeddings with masked lab-test modeling and a disentangled transformer optimized by a mutual information loss to separate modality-specific and modality-shared information. On MIMIC-III and FEMH, it improves multi-label disease prediction over text-only LLM baselines, showing that structured and unstructured patient data require explicit fusion rather than naive concatenation (Phan et al., 2024).

The Medical Algorithms Project is an earlier but conceptually important precedent. It operationalized published medical knowledge as executable artifacts in a web-based repository of over 13,500 medical algorithms implemented mainly as Microsoft Excel spreadsheets across 45 areas of medical practice, with 106,907 registered users by March 1, 2009 (0908.0932). Although it is not an image model, it demonstrates a different sense in which a “Medverse” could function as a cross-specialty computation layer for medicine rather than a single predictive network.

CMED adds a clinical narrative perspective by formalizing medication change events along four orthogonal dimensions—Action, Temporality, Certainty, and Actor—and annotating 9,013 medication mentions over 500 clinical notes (Mahajan et al., 2020). MedSumm contributes a multilingual multimodal intake layer through MMCQS, a 3,015-sample dataset for summarizing code-mixed Hindi-English medical queries with accompanying images (Ghosh et al., 2024). MedVLSynther contributes an open literature route to multimodal supervision by generating and auditing 13,087 medical VQA questions over 14,803 images from PubMed Central Open Access content (Huang et al., 29 Oct 2025).

A separate orthographic collision is MedVersa in the MedREK paper, where the name denotes a Medical Versatile Knowledge Editing Dataset rather than an imaging or multimodal clinical system. That benchmark covers 20 medical subjects and evaluates single-edit and batch-edit medical LLM updating under locality constraints (Xia et al., 15 Oct 2025). It is terminologically close to Medverse but technically unrelated.

6. Limits, scope, and broader implications

The Medverse paper is explicit that its universality is strong but not absolute. The current model uses a relatively modest number of parameters, the number of training datasets is constrained by resources, and on transformation and enhancement tasks it still lags behind fully supervised 3D U-Net and remains below few-shot task-specific models in some settings. The authors therefore present Medverse not as a replacement for all supervised medical imaging pipelines, but as a proof that universal full-resolution 3D in-context learning can be made practical, scalable, and meaningfully generalizable (Hu et al., 11 Sep 2025).

Its main limitation is scope rather than internal inconsistency. Medverse is universal across several 3D image-to-image task families, but it is still a medical imaging model. It does not natively solve multimodal reasoning over laboratory data, clinical narratives, or molecular representations; nor does it provide the governance, provenance, or versioning machinery associated with executable medical knowledge repositories such as the Medical Algorithms Project (0908.0932). Likewise, it does not address the document-level contextualization problem studied in CMED, where medication events must be disambiguated by temporality, certainty, and actor, or the multilingual patient-intake problem studied in MedSumm (Mahajan et al., 2020, Ghosh et al., 2024).

This suggests that a broader “Medverse” in the platform sense would require several additional layers beyond the 3D imaging core. A plausible extension would combine coarse-to-fine volumetric ICL with LLM-based orchestration across imaging tasks, as in MedVersa (Zhou et al., 2024); shared-token alignment for biomolecular and omics modalities, as in BIOVERSE (Tsou et al., 1 Oct 2025); modality-aware EHR fusion, as in MEDFuse (Phan et al., 2024); executable and versioned medical knowledge objects, as exemplified by the Medical Algorithms Project (0908.0932); and open multimodal supervision pipelines of the kind demonstrated by MedVLSynther (Huang et al., 29 Oct 2025).

In that broader interpretation, Medverse is both a specific 3D medical imaging architecture and a useful focal point for thinking about medical AI universality. In the narrow technical sense, it denotes a next-scale autoregressive in-context learning model for full-resolution volumetric segmentation, transformation, and enhancement (Hu et al., 11 Sep 2025). In the wider systems sense, it suggests an eventual convergence of full-resolution image modeling, multimodal biomedical reasoning, contextual clinical NLP, executable medical algorithms, and open medical knowledge maintenance into a more integrated computational environment for medicine.