UniMedVL: Unified Medical Multimodal Model

Updated 4 July 2026

UniMedVL is a unified multimodal model that simultaneously processes medical images and texts for both understanding and generation tasks across diverse imaging modalities.
The model employs a progressive curriculum strategy with staged training (observation, knowledge, analysis) and a dual-path architecture to bridge the gap between image recognition and synthesis.
Empirical results show enhanced performance on medical benchmarks and generation metrics, highlighting the bidirectional knowledge sharing between visual and textual modalities.

UniMedVL is a medical unified multimodal foundation model introduced within the Observation-Knowledge-Analysis (OKA) paradigm as the analysis-level component of a broader framework for medical multimodal understanding and generation (Ning et al., 17 Oct 2025). It is explicitly presented as the first medical unified multimodal model for the simultaneous analysis of image understanding and generation tasks within a single architecture, addressing a fragmentation that the paper identifies across data representation, feature integration, and task-level multimodal capabilities. In the paper’s formulation, UniMedVL is designed to process multimodal medical inputs and produce both textual and visual outputs, including reports, annotations, segmentation masks, and synthesized images, so that understanding and generation are no longer handled by separate systems (Ning et al., 17 Oct 2025).

1. Conceptual definition and problem setting

UniMedVL is defined against a specific diagnosis-workflow problem: existing medical AI systems often separate image understanding from image generation, so a model that can answer questions or write reports may still be unable to generate visual outputs, perform modality translation, or produce interleaved multimodal outputs (Ning et al., 17 Oct 2025). The paper characterizes this separation as a limitation in three dimensions: the data are mostly single-modal or weakly paired, learned features are not built through a principled progression from basic perception to cross-modal reasoning, and task formulations are siloed.

The OKA paradigm provides the paper’s organizing abstraction. “Observation” refers to foundational multimodal perception from curated image-text pairs, “Knowledge” refers to progressively introduced medical multimodal competence, and “Analysis” refers to the unified model that executes both understanding and generation (Ning et al., 17 Oct 2025). UniMedVL is therefore not presented as an isolated architecture, but as the final realization of a staged framework in which data curation and curriculum are prerequisites for unified multimodal behavior.

A plausible implication is that the paper treats “unification” in a stronger sense than earlier medical vision-language systems that align images and text but still optimize only for recognition or report generation. In UniMedVL, the target is simultaneous support for image understanding and visual generation within one model, rather than a loose collection of task-specific heads.

2. Observation level: UniMed-5M and multimodal data construction

At the observation level, the paper constructs UniMed-5M, a dataset comprising over 5.6M medical samples assembled from public corpora such as PMC-OA, Quilt-1M, PubMedVision, GMAI-VL datasets, CheXpertPlus, PMC-VQA, Medical-Diff-VQA, BigBio, and several task-specific datasets (Ning et al., 17 Oct 2025). The dataset spans nine primary imaging modalities: chest X-ray, histopathology, CT, MRI, color fundus photography, OCT, endoscopy, ultrasound, and fluorescence microscopy.

The central data-engineering operation is reformatting diverse unimodal resources into uniform multimodal input-output pairs. The paper divides the curated data into understanding data and generation data, and further constructs interleaved tasks in which both input and output contain text and image components (Ning et al., 17 Oct 2025). This is important because the model is not trained only on conventional image-caption pairs; it is also exposed to tasks that structurally couple visual and textual reasoning in both directions.

The paper describes a three-step quality pipeline for dataset construction. First, it applies coarse filtering by resolution and text length. Second, it performs medical alignment using multiple captions per image generated by MedGemma-27B, combining semantic similarity from E5-large-v2 with medical-specific similarity from MedSigLIP. Third, it performs expert validation. The final alignment score is a weighted combination of textual similarity and medical alignment, with $\lambda = 0.5$, and the top 50% of pairs are retained (Ning et al., 17 Oct 2025).
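The scoring and retention step can be illustrated with a minimal sketch. It assumes the weighted combination is a convex mix, $\lambda \cdot s_{\text{text}} + (1 - \lambda) \cdot s_{\text{med}}$, which the source does not spell out. The function names and the toy similarity values are hypothetical and are not taken from the paper.

```python
import numpy as np

def alignment_scores(text_sim, med_sim, lam=0.5):
    """Convex combination of textual and medical similarity (assumed form)."""
    text_sim = np.asarray(text_sim, dtype=float)
    med_sim = np.asarray(med_sim, dtype=float)
    return lam * text_sim + (1.0 - lam) * med_sim

def keep_top_fraction(scores, fraction=0.5):
    """Return the sorted indices of the highest-scoring `fraction` of pairs."""
    scores = np.asarray(scores, dtype=float)
    n_keep = int(np.ceil(fraction * len(scores)))
    order = np.argsort(-scores, kind="stable")  # descending by score
    return np.sort(order[:n_keep])

# Toy similarities for four image-text pairs (illustrative values only).
text_sim = [0.82, 0.40, 0.75, 0.31]
med_sim = [0.78, 0.52, 0.60, 0.45]

scores = alignment_scores(text_sim, med_sim)
kept = keep_top_fraction(scores)
print(kept.tolist())  # [0, 2]
```

With $\lambda = 0.5$ the two similarity terms contribute equally, so the choice of which term carries $\lambda$ does not affect the ranking in this setting.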

A major component of UniMed-5M is interleaved task construction. The paper highlights five such tasks: medical image promptable segmentation, super-resolution, interpretable counterfactual generation, virtual staining, and cross-modal synthesis. These are built through templateization and VLLM captioning, the former converting inputs and outputs into structured image-text pairs and the latter generating richer textual descriptions of medical images, including anatomy and clinically meaningful visual details (Ning et al., 17 Oct 2025).
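As a rough illustration of templateization, the sketch below turns a task record into an interleaved input-output pair. The template strings, field names, and file names are hypothetical; the paper's actual templates are not reproduced in this summary.

```python
# Hypothetical prompt templates, one per interleaved task.
TEMPLATES = {
    "super_resolution": "Enhance the resolution of this {modality} image.",
    "virtual_staining": "Translate this {source} image into {target} staining.",
}

def templateize(task, input_image, output_image, caption=None, **fields):
    """Build a structured sample whose input and output can both mix text and image."""
    prompt = TEMPLATES[task].format(**fields)
    output = []
    if caption is not None:
        # A VLLM-generated description would be attached here.
        output.append({"type": "text", "value": caption})
    output.append({"type": "image", "value": output_image})
    return {
        "task": task,
        "input": [
            {"type": "text", "value": prompt},
            {"type": "image", "value": input_image},
        ],
        "output": output,
    }

sample = templateize(
    "virtual_staining",
    input_image="he_tile.png",
    output_image="ihc_tile.png",
    caption="Illustrative description of the stained tissue.",
    source="H&E",
    target="IHC",
)
print(sample["input"][0]["value"])
```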

The appendix-level staging reported in the paper gives the data progression concrete scale: Stage 1 includes about 4.0M understanding examples and 1.6M generation examples, Stage 2 adds about 1.9M instruction-tuning examples, and Stage 3 adds about 330K interleaved examples (Ning et al., 17 Oct 2025). This supports the paper’s claim that the dataset is not merely large, but intentionally staged for curriculum-based multimodal learning.

3. Knowledge level: Progressive Curriculum Learning

At the knowledge level, UniMedVL depends on Progressive Curriculum Learning, which the paper presents as a response to the insufficiency of naive joint training for understanding and generation (Ning et al., 17 Oct 2025). The curriculum is structured into three stages.

Stage 1 is foundation training, intended to build broad medical pattern recognition and basic image-text alignment from the large paired corpus. The emphasis is on exposure to diverse medical visual and textual signals rather than task specialization. Stage 2 is instruction tuning, where curated, higher-quality instruction data are introduced. The paper writes the instruction format as $(q, x_v, k) \rightarrow (a_t, a_v)$, where a query $q$, visual input $x_v$, and knowledge context $k$ produce a textual answer $a_t$ and/or a visual answer $a_v$. Stage 3 is unified multimodal training, where interleaved tasks are introduced so that the model jointly reasons across modalities in more complex sequences (Ning et al., 17 Oct 2025).
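The instruction format $(q, x_v, k) \rightarrow (a_t, a_v)$ can be pictured as a simple record type. The sketch below is only an illustration: the class name, field names, and example values are hypothetical, and images are represented by file paths for brevity.

```python
from dataclasses import dataclass
from typing import Optional

@dataclass
class InstructionSample:
    query: str                            # q
    visual_input: Optional[str]           # x_v (hypothetical: an image path)
    knowledge: Optional[str]              # k
    text_answer: Optional[str] = None     # a_t
    visual_answer: Optional[str] = None   # a_v

    def output_kind(self) -> str:
        """Classify the sample by which answers are present."""
        has_text = self.text_answer is not None
        has_image = self.visual_answer is not None
        if has_text and has_image:
            return "interleaved"
        if has_text:
            return "text"
        if has_image:
            return "image"
        raise ValueError("a sample needs at least one answer")

vqa = InstructionSample(
    query="Is there a pleural effusion?",
    visual_input="cxr_001.png",
    knowledge=None,
    text_answer="Illustrative answer text.",
)
counterfactual = InstructionSample(
    query="Show this scan without the finding and explain the change.",
    visual_input="cxr_001.png",
    knowledge=None,
    text_answer="Illustrative explanation.",
    visual_answer="cxr_001_cf.png",
)
print(vqa.output_kind(), counterfactual.output_kind())  # text interleaved
```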

For understanding tasks, Stage 2 adds distilled chain-of-thought data to make reasoning explicit. For generation tasks, it uses a Caption Augmented Generation pipeline to improve prompt quality by fusing structured descriptions with original captions. The purpose of Stage 3 is not simply to add more tasks, but to force joint multimodal analysis under interleaved input-output conditions (Ning et al., 17 Oct 2025).

The training schedule is specified concretely. Stage 1 runs for 85K steps at a learning rate of $5 \times 10^{-5}$, Stage 2 for 120K steps at $2.5 \times 10^{-5}$, and Stage 3 for 70K steps at $1.0 \times 10^{-5}$. The optimizer is AdamW, with a CE:MSE loss weight ratio of 0.25:1.0. The ViT encoder is trainable in Stage 1 but frozen later, while the VAE remains frozen throughout (Ning et al., 17 Oct 2025). The data mix also changes over time: Stage 1 heavily favors image-to-text tasks, Stage 2 balances image-to-text and text-to-image while introducing interleaved examples, and Stage 3 increases the interleaved proportion further.
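The schedule can be written down as a small configuration. This is a sketch under stated assumptions: the per-stage values are read as learning rates, the stages are assumed to run back to back on one global step counter, and the stage names and helper function are hypothetical.

```python
from dataclasses import dataclass

@dataclass(frozen=True)
class Stage:
    name: str            # hypothetical label
    steps: int
    learning_rate: float
    train_vit: bool      # ViT is trainable only in Stage 1
    train_vae: bool = False  # VAE stays frozen throughout

SCHEDULE = [
    Stage("foundation", 85_000, 5e-5, train_vit=True),
    Stage("instruction_tuning", 120_000, 2.5e-5, train_vit=False),
    Stage("unified_multimodal", 70_000, 1.0e-5, train_vit=False),
]

# CE:MSE loss weight ratio reported in the text.
LOSS_WEIGHTS = {"ce": 0.25, "mse": 1.0}

def stage_at(step):
    """Return the stage active at a 0-indexed global step (assumes sequential stages)."""
    start = 0
    for stage in SCHEDULE:
        if step < start + stage.steps:
            return stage
        start += stage.steps
    raise ValueError(f"step {step} is past the end of the schedule")

total_steps = sum(stage.steps for stage in SCHEDULE)
print(total_steps)                # 275000
print(stage_at(85_000).name)      # instruction_tuning
```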

This staged design is central to the paper’s thesis. Rather than treating understanding and generation as immediately coequal objectives, the curriculum first establishes observation-level alignment, then adds explicit instruction-following and reasoning, and only later emphasizes full interleaving. This suggests that the unification objective is implemented as a training trajectory, not merely as an architectural property.

4. Architecture and optimization objectives

At the analysis level, UniMedVL is implemented with dual visual encoders plus a mixture-of-transformer-experts (Ning et al., 17 Oct 2025). One encoder is an understanding-oriented ViT, producing semantic tokens $z_{\text{ViT}} = E_{\text{ViT}}(x_v)$. The other is a generation-oriented VAE encoder, producing latent tokens $z_{\text{VAE}} = E_{\text{VAE}}(x_v)$. Two projection layers, one per encoder, map these features into a common hidden space.

The transformer contains specialized experts. An understanding expert processes text plus ViT tokens, while a generation expert operates on VAE latent tokens and uses text conditioning through cross-attention. A VAE decoder reconstructs pixels from the latent trajectory (Ning et al., 17 Oct 2025). The architectural point is that understanding and generation remain distinct pathways internally, yet are realized within one connected model rather than in separate systems.
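A shape-level sketch may help make the dual-path routing concrete. Everything here is a stand-in: random matrices replace the trained encoders and projection layers, the dimensions are hypothetical, and the patchify step is deliberately crude. It only shows that both token streams land in one hidden width and are routed to different experts.

```python
import numpy as np

rng = np.random.default_rng(0)

# All sizes below are hypothetical and chosen only to keep the example small.
PATCH_DIM, VIT_DIM, VAE_DIM, HIDDEN = 64, 32, 8, 16

# Random matrices stand in for the trained encoders and projection layers.
W_vit = rng.normal(scale=0.02, size=(PATCH_DIM, VIT_DIM))
W_vae = rng.normal(scale=0.02, size=(PATCH_DIM, VAE_DIM))
P_vit = rng.normal(scale=0.02, size=(VIT_DIM, HIDDEN))
P_vae = rng.normal(scale=0.02, size=(VAE_DIM, HIDDEN))

def to_patches(image):
    """Flatten a 16x16 image into four 64-value blocks (a crude patchify)."""
    return image.reshape(4, PATCH_DIM)

def dual_path_tokens(image, text_tokens):
    patches = to_patches(image)
    z_vit = patches @ W_vit   # semantic tokens, stand-in for E_ViT(x_v)
    z_vae = patches @ W_vae   # latent tokens, stand-in for E_VAE(x_v)
    h_vit = z_vit @ P_vit     # projected into the shared hidden space
    h_vae = z_vae @ P_vae
    # Understanding expert input: text tokens followed by ViT tokens.
    understanding_input = np.concatenate([text_tokens, h_vit], axis=0)
    # Generation expert input: VAE tokens (text conditioning is applied separately).
    generation_input = h_vae
    return understanding_input, generation_input

image = rng.normal(size=(16, 16))
text_tokens = rng.normal(size=(5, HIDDEN))
und, gen = dual_path_tokens(image, text_tokens)
print(und.shape, gen.shape)  # (9, 16) (4, 16)
```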

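The training objective combines next-token prediction for understanding and flow matching for generation. The understanding loss is the autoregressive cross-entropy

$\mathcal{L}_{\text{NTP}} = -\sum_{i} \log p_\theta(y_i \mid y_{<i}, z_{\text{ViT}})$

where $y_i$ is the $i$-th text token, and the generation loss is

$\mathcal{L}_{\text{FM}} = \mathbb{E}_{t, x_0, \epsilon} \left[ \left\| v_\theta(x_t, t, c) - (\epsilon - x_0) \right\|_2^2 \right]$

with

$x_t = (1 - t)\, x_0 + t\, \epsilon$

where $x_0$ is the clean latent, $\epsilon$ is noise, $v_\theta$ is the velocity predictor, and $c$ is text conditioning. The final loss is

$\mathcal{L} = \lambda_{\text{CE}}\, \mathcal{L}_{\text{NTP}} + \lambda_{\text{MSE}}\, \mathcal{L}_{\text{FM}}$

with the two weights set in the CE:MSE ratio of 0.25:1.0 given in the training schedule.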
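The combined objective can be sketched numerically. The example assumes a linear interpolation path between clean latent and noise, with the velocity target equal to noise minus clean latent; this is a common flow-matching convention and may differ in sign or direction from the paper's exact formulation. The function names are hypothetical, and the 0.25 and 1.0 weights follow the CE:MSE ratio reported in the training schedule.

```python
import numpy as np

def cross_entropy(logits, targets):
    """Mean next-token cross-entropy. logits: (T, V), targets: (T,)."""
    logits = logits - logits.max(axis=-1, keepdims=True)  # numerical stability
    log_probs = logits - np.log(np.exp(logits).sum(axis=-1, keepdims=True))
    return -log_probs[np.arange(len(targets)), targets].mean()

def flow_matching_loss(velocity_fn, x0, eps, t, cond):
    """MSE between predicted and target velocity on a linear path (assumed convention)."""
    t_b = t.reshape(-1, *([1] * (x0.ndim - 1)))  # broadcast t over latent dims
    x_t = (1.0 - t_b) * x0 + t_b * eps           # interpolate latent toward noise
    target = eps - x0                            # d x_t / d t on this path
    pred = velocity_fn(x_t, t, cond)
    return np.mean((pred - target) ** 2)

def total_loss(ce, mse, w_ce=0.25, w_mse=1.0):
    """Weighted sum using the reported CE:MSE ratio."""
    return w_ce * ce + w_mse * mse

rng = np.random.default_rng(0)
x0 = rng.normal(size=(2, 4))    # toy clean latents
eps = rng.normal(size=(2, 4))   # toy noise
t = rng.uniform(size=2)

# An oracle predictor that returns the exact target drives the loss to zero.
oracle = lambda x_t, t, cond: eps - x0
fm = flow_matching_loss(oracle, x0, eps, t, cond=None)

# Uniform logits over 4 tokens give a cross-entropy of log(4).
ce = cross_entropy(np.zeros((3, 4)), np.array([0, 1, 2]))
loss = total_loss(ce, fm)
print(fm, ce, loss)
```

The oracle case is a quick sanity check that the interpolation and the target are mutually consistent: the target is exactly the time derivative of the interpolated latent.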

The appendix additionally states that the model uses a pretrained FLUX VAE without domain-specific fine-tuning because reconstruction experiments across eight medical modalities showed it was already competitive and because keeping it fixed helps stability (Ning et al., 17 Oct 2025).

This architecture differs from universal medical systems that unify only recognition tasks. For comparison, UMIT reuses a Qwen2-VL backbone with a vision encoder, a linear projection layer, and an LLM decoder to support visual question answering, disease detection, and report generation across multiple imaging modalities and both English and Chinese, but it is not described as jointly solving understanding and image generation within a single architecture (Yu et al., 20 Mar 2025). MedUnifier, by contrast, does explicitly combine image-text understanding and text-grounded image generation, but it does so through an image-text encoder, a text generator, and a VQ-VAE-based image generator trained on radiology image-report pairs (Zhang et al., 2 Mar 2025). UniMedVL’s distinctive claim within this landscape is simultaneous medical image understanding and generation under the OKA curriculum (Ning et al., 17 Oct 2025).

5. Evaluation and empirical behavior

The paper evaluates UniMedVL on five medical image understanding benchmarks—VQA-RAD, SLAKE, PathVQA, OmniMedVQA, and GMAI-MMBench—and on generation across eight modalities: CFP, CXR, CT, histopathology, MRI, OCT, ultrasound, and endoscopy (Ning et al., 17 Oct 2025). For generation, the reported metrics are gFID and BiomedCLIP score; for interleaved tasks, PSNR and SSIM are used, and for counterfactual generation the paper also reports AUROC, F1, BLEU-3, METEOR, and ROUGE-L.
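For readers less familiar with the interleaved-task metrics, PSNR can be computed as below. This is the standard definition rather than the paper's evaluation code, and the assumed intensity range of [0, 1] is a choice made for the example.

```python
import numpy as np

def psnr(reference, estimate, data_range=1.0):
    """Peak signal-to-noise ratio in dB for images scaled to `data_range`."""
    reference = np.asarray(reference, dtype=float)
    estimate = np.asarray(estimate, dtype=float)
    mse = np.mean((reference - estimate) ** 2)
    if mse == 0:
        return float("inf")  # identical images
    return 10.0 * np.log10(data_range ** 2 / mse)

reference = np.zeros((4, 4))
estimate = np.full((4, 4), 0.1)   # uniform error of 0.1, so MSE = 0.01
print(psnr(reference, estimate))  # approximately 20 dB
```

Higher PSNR indicates a reconstruction closer to the reference; SSIM complements it by comparing local structure rather than per-pixel error.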

On understanding benchmarks, UniMedVL scores 61.9 on VQA-RAD, 75.4 on SLAKE, 53.5 on PathVQA, 85.8 on OmniMedVQA, and 60.75 on GMAI-MMBench. The paper emphasizes that this is the best result among unified models and close to specialized performance; for example, on OmniMedVQA it trails specialized GMAI-VL by only 2.7 points, while on GMAI-MMBench it is almost tied with GMAI-VL (Ning et al., 17 Oct 2025).

On generation, the full model improves average gFID to 96.29 and average BiomedCLIP to 0.706, outperforming the generation-only variant UniMedVL-Gen, which has average gFID 108.40 and BiomedCLIP 0.699 (Ning et al., 17 Oct 2025). The paper interprets this as evidence that understanding and generation reinforce each other rather than merely compete for capacity.
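The size of the gap between the two variants can be made explicit with a short calculation on the reported averages. The helper function is hypothetical; the only domain fact it relies on is that lower gFID is better while higher BiomedCLIP score is better.

```python
def relative_change(baseline, value):
    """Signed relative change of `value` with respect to `baseline`."""
    return (value - baseline) / baseline

# Reported averages: generation-only variant versus full model.
gfid_gen_only, gfid_full = 108.40, 96.29
clip_gen_only, clip_full = 0.699, 0.706

gfid_reduction = -relative_change(gfid_gen_only, gfid_full)  # lower is better
clip_gain = relative_change(clip_gen_only, clip_full)        # higher is better

print(f"gFID reduction: {gfid_reduction:.1%}")   # gFID reduction: 11.2%
print(f"BiomedCLIP gain: {clip_gain:.1%}")       # BiomedCLIP gain: 1.0%
```

On these numbers the joint model's advantage is concentrated in gFID, with only a small change in BiomedCLIP score.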

The interleaved task results are used to argue that UniMedVL is not merely a multitask adapter. On counterfactual CXR generation, it achieves 27.17 gFID, 0.7970 AUROC, 0.8731 F1, and stronger textual explanation scores than the specialized baseline CXR-IRGen and the prior strongest counterfactual model ProgEmu. On virtual H&E-to-IHC staining, the full model reaches 20.27 PSNR and 0.456 SSIM, improving over the Stage-3-only model and beating HealthGPT-M3 by a large margin. For MRI super-resolution, UniMedVL obtains 27.29 PSNR and 0.890 SSIM. For T2↔FLAIR translation, it reaches about 25.07 average PSNR and 0.882 SSIM (Ning et al., 17 Oct 2025).

These results matter because the paper’s claim of unification is not limited to report generation or VQA. The evaluation spans recognition, generation, and interleaved analysis, which makes the reported performance central to the model’s identity rather than an auxiliary demonstration.

6. Bidirectional knowledge sharing

A central claim of the paper is that generation tasks enhance visual understanding features and that understanding supplies semantics that improve generation fidelity (Ning et al., 17 Oct 2025). The evidence comes from ablation and from comparisons among understanding-only, generation-only, and jointly trained variants.

In Stage 1, the joint model outperforms single-task variants, suggesting that the backbone learns more robust multimodal representations when both directions are present. In Stage 2, adding reasoning-heavy instructions and higher-quality captions improves both understanding and generation. In Stage 3, introducing interleaved tasks yields the largest gains, increasing understanding accuracy further and improving generation quality substantially, especially in gFID (Ning et al., 17 Oct 2025).

The paper explicitly interprets this as bidirectional knowledge sharing. Generation contributes fine-grained visual structure and semantic constraints, which help understanding; understanding contributes rich semantics that improve generation fidelity. This is the reason the full UniMedVL outperforms UniMedVL-Gen on generation quality rather than degrading from shared capacity (Ning et al., 17 Oct 2025).

This emphasis on reciprocal benefit places UniMedVL in a broader line of research attempting to unify medical vision-language systems beyond narrow task silos. Fleming-VL, for example, frames a single medical MLLM as needing to reason over heterogeneous clinical visual inputs including 2D images, 3D volumes, and video, and it uses a data-centric strategy of scaled long-context pretraining, rare-modality enrichment, and expanded evaluation to support a universal medical visual reasoning framework (Shu et al., 2 Nov 2025). UniMedVL differs in scope—its paper is centered on simultaneous understanding and generation rather than 2D/3D/video unification—but both systems are built around the premise that medical multimodal competence should not be partitioned into disconnected specialist models.

7. Relation to the universal medical vision-language literature and known limitations

Within the broader literature, UniMedVL belongs to a family of attempts to construct universal or unified medical vision-language systems. PTUnifier proposes soft prompts so that a single backbone can process image-only inputs, text-only inputs, and image-text pairs while combining dual-encoder-style and fusion-encoder-style objectives (Chen et al., 2023). UniMedI uses diagnostic reports as a common semantic space to learn unified representations for 2D X-rays and 3D CT volumes (He et al., 2023). UniMed-CLIP trains a single dual-encoder on more than 5.3 million image-text pairs across X-ray, CT, MRI, Ultrasound, Pathology, and Fundus to support unified contrastive representation learning (Khattak et al., 2024). Uni-Med uses a connector mixture-of-experts to mitigate multi-task interference in a medical generalist MLLM spanning question answering, visual question answering, report generation, referring expression tasks, and image classification (Zhu et al., 2024).

Against this background, UniMedVL is distinctive in two respects. First, it treats image understanding and image generation as coequal components of one architecture rather than adjacent tasks. Second, it grounds that unification in the OKA framework, with explicit observation-level dataset construction and knowledge-level curriculum before analysis-level modeling (Ning et al., 17 Oct 2025). This suggests that the paper views architectural unification alone as insufficient without staged multimodal data formation and staged optimization.

The paper’s limitations are less heavily emphasized than its results, but several are stated or implied. Its dataset and evaluation are broad, yet still shaped by curated public corpora and staged construction. The benchmarked success is strongest where the model has observation-level and interleaved-task support. This suggests that UniMedVL’s notion of “unified” remains dependent on data availability and task formulation. A plausible implication is that the model demonstrates unified multimodal analysis within the evaluated medical imaging regimes, rather than resolving all clinical workflow, deployment, or safety questions.

In the current literature, UniMedVL can therefore be understood as a specific realization of the broader “unified medical multimodal” program: a model that attempts to collapse the traditional separation between medical image interpretation and medical image generation by aligning data construction, curriculum, architecture, and objectives around a single multimodal system (Ning et al., 17 Oct 2025).