Hulu-Med: Transparent Multimodal Medical VLM

Updated 4 July 2026

Hulu-Med is a transparent, generalist medical vision-language model that fuses text, 2D images, 3D volumes, and video for holistic clinical understanding.
It employs a unified patch-based encoder and LLM decoder, trained progressively on 16.7 million samples to support diverse modalities seamlessly.
The model uses medical-aware token reduction and open-source practices to achieve state-of-the-art performance across 30 benchmarks and ensure reproducibility.

Hulu-Med is a transparent generalist medical vision-language model intended for holistic medical understanding across medical text, 2D images, 3D volumes, and video. It is built from a unified patch-based vision encoder and an LLM decoder, and it was progressively trained on 16.7 million samples to scale from 2D to 3D and video comprehension. The model family includes 7B, 14B, and 32B variants, uses medical-aware token reduction to control sequence length and training cost, and is reported to achieve state-of-the-art performance across 30 benchmarks spanning visual question answering, medical report generation, complex reasoning, multilingual evaluation, and rare disease scenarios (Jiang et al., 9 Oct 2025).

1. Scope, motivation, and design objectives

Hulu-Med was introduced to address a recurrent limitation in clinical AI systems: real-world decision-making is multimodal, but most deployed models remain narrow, single-modality, or task-specific. The motivating examples span radiology, pathology, dermatology, endoscopy, surgery, and free-text reasoning. In that setting, clinicians must integrate outputs from separate systems, which the authors describe as inefficient and potentially prone to missed cross-modal signals (Jiang et al., 9 Oct 2025).

The model is therefore framed as a single medical foundation model for four input regimes: medical text-only reasoning, 2D medical images, 3D volumes such as CT and MRI, and videos such as surgical or ultrasound sequences. Its stated goal is not merely multimodal fusion in the generic sense, but native support for text plus all of these visual modalities within a single stack of one vision encoder and one LLM decoder. The paper identifies two barriers in prior work: opaque development pipelines with private data and undisclosed preprocessing, and architectural rigidity caused by modality-specific encoders or incomplete modality coverage (Jiang et al., 9 Oct 2025).

This positioning is central to the term “transparent” in Hulu-Med. Transparency refers to the use of public or synthetic data built on public resources, open backbones, released code and weights, documented stage-wise training mixtures, and explicit reporting of training recipes and GPU-hour budgets. A plausible implication is that Hulu-Med was intended not only as a model checkpoint but also as a reproducible reference pipeline for medical multimodal pretraining and instruction tuning.

2. Architecture and modality unification

Hulu-Med is a decoder-only multimodal LLM with four components: a Rotary Position-Adaptive visual encoder, a text tokenizer, a multimodal projector, and an LLM decoder (Jiang et al., 9 Oct 2025). The encoder is a 27-layer SigLIP-based ViT with hidden size 1152, MLP intermediate size 4304, 16 attention heads, and a patch size of $16\times16$ pixels. The text side uses the native tokenizer of the LLM backbone, with a BPE vocabulary of 152,064.

Variant	LLM backbone	Key note
Hulu-Med-7B	Qwen2.5-7B-Instruct	28 layers, hidden size 3584
Hulu-Med-14B	Qwen3-14B-Instruct	Larger decoder variant
Hulu-Med-32B	Qwen2.5-32B-Instruct	Largest released model

The unification mechanism is architectural rather than task-specific. A 2D image is partitioned directly into non-overlapping $16\times16$ patches. A 3D volume is decomposed into 2D slices, each treated as an image plane. A video is sampled into frames, again treated as image planes. As a result, every visual input becomes a variable-length sequence of 2D patch tokens, and no modality-specific encoder is introduced (Jiang et al., 9 Oct 2025).
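The reduction of every visual input to a sequence of 2D patch tokens can be illustrated with a small token-counting sketch. It is a minimal illustration, not the authors' preprocessing code: the function names, the example resolutions, the slice and frame counts, and the requirement that plane sides be multiples of 16 are all assumptions made for the example.

```python
from typing import List, Tuple

PATCH = 16  # patch side length in pixels, as stated for the encoder


def plane_grid(height: int, width: int, patch: int = PATCH) -> Tuple[int, int]:
    """Patch-grid shape for one 2D plane (assumes sides divisible by the patch size)."""
    if height % patch or width % patch:
        raise ValueError("this sketch assumes sides divisible by the patch size")
    return height // patch, width // patch


def visual_token_count(planes: List[Tuple[int, int]]) -> int:
    """Total number of patch tokens for a list of (height, width) planes."""
    total = 0
    for h, w in planes:
        gh, gw = plane_grid(h, w)
        total += gh * gw
    return total


# An image is one plane, a volume is one plane per slice,
# and a video is one plane per sampled frame.
image = [(448, 448)]
ct_volume = [(512, 512)] * 64   # hypothetical 64-slice CT
video = [(224, 384)] * 32       # hypothetical 32 sampled frames

print(visual_token_count(image))      # 28 * 28 = 784
print(visual_token_count(ct_volume))  # 64 * 32 * 32 = 65536
print(visual_token_count(video))      # 32 * 14 * 24 = 10752
```

The same function serves all three modalities, which is the point of the design. The token counts for the volume and the video also show why sequence length becomes the limiting factor for 3D and video input.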

To avoid fixed-resolution assumptions, Hulu-Med replaces absolute position embeddings with 2D rotary position embeddings. For a patch at coordinates $(m,n)$, the feature vector is split into height and width halves,

$\mathbf{x} = [\mathbf{x}_h; \mathbf{x}_w],\qquad \mathbf{x}_h, \mathbf{x}_w \in \mathbb{R}^{d/2},$

and 1D RoPE is applied independently along each axis. For a sub-vector $\mathbf{v}$ and position $p\in\{m,n\}$,

$\begin{pmatrix} v'_{2i-1} \\ v'_{2i} \end{pmatrix} = \begin{pmatrix} \cos(p\theta_i) & -\sin(p\theta_i) \\ \sin(p\theta_i) & \cos(p\theta_i) \end{pmatrix} \begin{pmatrix} v_{2i-1} \\ v_{2i} \end{pmatrix}, \quad \theta_i = 10000^{-2i/d}.$

This encoding is applied to queries and keys in self-attention, embedding relative spatial structure directly in attention scores (Jiang et al., 9 Oct 2025).
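The relative-position property can be checked numerically with a short sketch of the 2D rotary encoding described above. It assumes that the index $i$ starts at 1 within each half and that $d$ in $\theta_i$ is the full feature dimension, as the formula is written. The function names and the toy 8-dimensional vectors are illustrative and are not taken from the released code.

```python
import math
from typing import List


def rope_1d(v: List[float], p: int, d: int) -> List[float]:
    """Rotate consecutive pairs of v by angle p * theta_i, theta_i = 10000 ** (-2 * i / d)."""
    assert len(v) % 2 == 0
    out: List[float] = []
    for i in range(1, len(v) // 2 + 1):       # i = 1, 2, ... (assumed indexing)
        theta = 10000.0 ** (-2.0 * i / d)
        a, b = v[2 * i - 2], v[2 * i - 1]     # v_{2i-1}, v_{2i} in 1-based notation
        c, s = math.cos(p * theta), math.sin(p * theta)
        out += [c * a - s * b, s * a + c * b]
    return out


def rope_2d(x: List[float], m: int, n: int) -> List[float]:
    """Split x into height and width halves and rotate each by its own coordinate."""
    d = len(x)
    assert d % 4 == 0, "each half must hold whole pairs"
    half = d // 2
    return rope_1d(x[:half], m, d) + rope_1d(x[half:], n, d)


def dot(a: List[float], b: List[float]) -> float:
    return sum(u * v for u, v in zip(a, b))


q = [0.3, -1.2, 0.7, 0.5, -0.4, 0.9, 1.1, -0.6]
k = [1.0, 0.2, -0.8, 0.4, 0.6, -0.3, 0.5, 0.7]

# Same relative offset (2, -1) between query and key at two absolute locations.
s1 = dot(rope_2d(q, 5, 3), rope_2d(k, 3, 4))
s2 = dot(rope_2d(q, 9, 7), rope_2d(k, 7, 8))
print(abs(s1 - s2) < 1e-9)
```

Because each pair is rotated by an angle proportional to its coordinate, the query-key dot product depends only on the coordinate differences. This is what lets the encoder accept patch grids of arbitrary size.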

The visual encoder output

$\mathbf{H}_v \in \mathbb{R}^{N \times 1152}$

is projected into the LLM embedding space by a two-layer MLP with GELU,

$\mathbf{H}_{\text{proj}} = W_2\cdot \text{GELU}(W_1\mathbf{H}_v + b_1) + b_2,$

where the output dimension matches the decoder hidden size. Visual and text tokens are then concatenated into a single causal sequence such as

$[\texttt{<bos>}, \text{text tokens}, \text{vision tokens}, \dots].$

The decoder predicts text autoregressively with standard causal masking, and all parameters are trainable in stages 2 and 3 of the curriculum (Jiang et al., 9 Oct 2025).
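The projection and concatenation steps can be sketched in plain Python as follows. This is a toy illustration under stated assumptions: the dimensions are small stand-ins for the real widths, the hidden width of the MLP is not given above and is chosen arbitrarily, the weights are random, and the text embeddings are placeholders rather than outputs of a real tokenizer and embedding table.

```python
import math
import random
from typing import List

Vector = List[float]
Matrix = List[List[float]]  # each row holds the weights of one output unit


def gelu(x: float) -> float:
    """Exact GELU: x * Phi(x), with Phi the standard normal CDF."""
    return 0.5 * x * (1.0 + math.erf(x / math.sqrt(2.0)))


def linear(W: Matrix, b: Vector, h: Vector) -> Vector:
    return [sum(w * x for w, x in zip(row, h)) + bias for row, bias in zip(W, b)]


def project(h: Vector, W1: Matrix, b1: Vector, W2: Matrix, b2: Vector) -> Vector:
    """W2 * GELU(W1 * h + b1) + b2 for a single visual token h."""
    return linear(W2, b2, [gelu(z) for z in linear(W1, b1, h)])


def rand_matrix(rows: int, cols: int, rng: random.Random) -> Matrix:
    return [[rng.gauss(0.0, 0.02) for _ in range(cols)] for _ in range(rows)]


rng = random.Random(0)
d_vis, d_mid, d_llm = 6, 8, 10   # toy stand-ins; d_mid is an assumed hidden width
W1, b1 = rand_matrix(d_mid, d_vis, rng), [0.0] * d_mid
W2, b2 = rand_matrix(d_llm, d_mid, rng), [0.0] * d_llm

vision_tokens = [[rng.gauss(0.0, 1.0) for _ in range(d_vis)] for _ in range(4)]
projected = [project(h, W1, b1, W2, b2) for h in vision_tokens]

# Placeholder text embeddings and projected vision tokens share one causal sequence.
text_embeddings = [[0.0] * d_llm for _ in range(3)]
sequence = text_embeddings + projected
print(len(sequence), len(sequence[0]))   # 7 10
```

After projection, a visual token is indistinguishable in shape from a text embedding, so the decoder needs no modality-specific input path.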

A common misconception is that a single encoder must be inferior to modality-specific specialist encoders. Hulu-Med’s ablation evidence runs in the opposite direction: five single-modality models trained separately on ultrasound, OCT, fundus, microscopy, and dermoscopy were outperformed by the mixed-modality Hulu-Med on each corresponding modality, which the authors interpret as a benefit of cross-modal sharing (Jiang et al., 9 Oct 2025).

3. Medical-aware token reduction, training curriculum, and data

The main systems challenge for 3D and video input is token explosion. Hulu-Med addresses this with a two-stage medical-aware token reduction mechanism (Jiang et al., 9 Oct 2025). First, for 3D and video inputs, intra-plane pooling downsamples each patch grid by a factor of 2 along height and width, effectively merging each $2\times2$ block of patch tokens and reducing the per-plane token count by a factor of 4. Second, inter-plane pruning compares corresponding patch embeddings in adjacent slices or frames and removes tokens whose distance to their counterpart falls below a threshold $\tau$:

$d_{t,i} = \lVert \mathbf{h}_{t,i} - \mathbf{h}_{t-1,i} \rVert,$

where $\mathbf{h}_{t,i}$ denotes the embedding of patch $i$ in plane $t$. If $d_{t,i} < \tau$, the token is treated as redundant (Jiang et al., 9 Oct 2025).
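The two reduction stages can be sketched on a toy volume as follows. Several choices here are assumptions rather than details confirmed above: pooling is implemented as averaging, the distance is taken to be L1, each plane is compared with the plane immediately before it, the first plane is always kept whole, and the threshold value is arbitrary.

```python
from typing import List

Token = List[float]
Plane = List[List[Token]]  # plane[row][col] -> embedding vector


def pool_2x2(plane: Plane) -> Plane:
    """Average each non-overlapping 2x2 block of patch embeddings (even grid sides assumed)."""
    rows, cols, dim = len(plane), len(plane[0]), len(plane[0][0])
    pooled: Plane = []
    for r in range(0, rows, 2):
        out_row = []
        for c in range(0, cols, 2):
            block = [plane[r][c], plane[r][c + 1], plane[r + 1][c], plane[r + 1][c + 1]]
            out_row.append([sum(v[k] for v in block) / 4.0 for k in range(dim)])
        pooled.append(out_row)
    return pooled


def l1(a: Token, b: Token) -> float:
    return sum(abs(x - y) for x, y in zip(a, b))


def prune_planes(planes: List[Plane], tau: float) -> List[List[Token]]:
    """Keep all tokens of the first plane; in later planes keep a token only if it
    differs from the same grid position in the previous plane by at least tau."""
    kept = [[tok for row in planes[0] for tok in row]]
    for prev, cur in zip(planes, planes[1:]):
        kept.append([
            cur[r][c]
            for r in range(len(cur))
            for c in range(len(cur[0]))
            if l1(cur[r][c], prev[r][c]) >= tau
        ])
    return kept


# Toy volume: three 4x4 planes of 2-d embeddings. The second plane repeats the
# first; the third changes only its top-left 2x2 block.
base = [[[float(r), float(c)] for c in range(4)] for r in range(4)]
changed = [[list(tok) for tok in row] for row in base]
for r in range(2):
    for c in range(2):
        changed[r][c] = [9.0, 9.0]

pooled = [pool_2x2(p) for p in (base, base, changed)]
kept = prune_planes(pooled, tau=0.5)   # tau is a hypothetical value
print([len(k) for k in kept])          # [4, 0, 1]
```

In this toy case, 48 raw patch tokens become 12 after pooling and 5 after pruning. A plane that repeats its neighbor contributes nothing, and a plane with a local change contributes only the changed region.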

The reported effect is substantial. Average token pruning for 3D and video is about 55%, GPU memory is reduced by about 43% on long surgical videos, and the resulting training budgets are approximately 4,100 GPU hours for Hulu-Med-7B and 38,000 GPU hours for Hulu-Med-32B, summarized in the paper as “4,000 to 40,000 GPU hours” (Jiang et al., 9 Oct 2025). The paper further states that performance degradation under this reduction is minimal on 3D tasks and “nearly identical” on surgical video benchmarks.

Training follows a three-stage progressive curriculum built on 16.7 million samples, with detailed tables summing to about 16.6 million. The corpus spans 12 anatomical systems and 14 imaging modalities, comprising more than 65 sub-modalities (Jiang et al., 9 Oct 2025).

Stage 1 performs vision-language alignment on 1.42 million short-caption image-text pairs. Representative datasets include Quilt-LLaVA-Pretrain, biomedica-clinical, MedICaT, ROCO-radiology, and MedPix 2.0. In this stage, the LLM is frozen, while the vision encoder and projector are trained using an autoregressive captioning objective, each with its own learning rate (Jiang et al., 9 Oct 2025).

Stage 2 performs medical multimodal continual pretraining on 4.85 million samples. Roughly 2.27 million are synthetic medical captions, and about 2.58 million are public data. Synthetic captions are generated either by rewriting short captions into long descriptions or by caption generation followed by a judge pipeline. All components are trainable, with separate learning rates for the ViT, the projector, and the LLM, under cosine decay. The loss is the standard autoregressive LM loss over mixed tasks:

$\mathcal{L} = -\sum_{t=1}^{T} \log p_\theta\left(y_t \mid y_{<t}, \mathbf{c}\right),$

where $y_1,\dots,y_T$ are the target text tokens and $\mathbf{c}$ denotes the multimodal context, that is, the instruction text together with the projected visual tokens.

(Jiang et al., 9 Oct 2025)

Stage 3 performs mixed-modality instruction tuning on 10.46 million samples, including 5.96 million text-only examples and 4.5 million multimodal examples. The medical text portion includes Apollo, MedQuAD, MedReason, MedThoughts-8K, Medical-o1, Medical-R1-Distill, ReasonMed, II-Medical-Reasoning-SFT, MMedC, and clinical dialogue sources such as Miriad, HealthCareMagic, and iCliniq. The multimodal portion includes 2D VQA, report generation, 3D captioning and VQA, video captioning and VQA, interleaved multi-image data, and general multimodal corpora. Additional synthetic pipelines contribute multilingual chain-of-thought reasoning, roughly 600k extra VQA pairs from long captions, and about 20k video captions via divide-and-conquer captioning (Jiang et al., 9 Oct 2025).
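The three stages can be summarized as a small configuration sketch, which also makes it easy to check that the reported stage sizes add up. The stage names, field names, and component labels are hypothetical; the sample counts and the frozen or trainable status of each component are the ones reported above.

```python
from dataclasses import dataclass
from typing import Tuple


@dataclass(frozen=True)
class Stage:
    name: str                    # hypothetical label for this sketch
    samples_millions: float      # reported stage size
    trainable: Tuple[str, ...]   # subset of ("vit", "projector", "llm")


CURRICULUM = (
    Stage("alignment", 1.42, ("vit", "projector")),   # LLM frozen in stage 1
    Stage("continual_pretraining", 4.85, ("vit", "projector", "llm")),
    Stage("instruction_tuning", 10.46, ("vit", "projector", "llm")),
)


def is_frozen(stage: Stage, component: str) -> bool:
    return component not in stage.trainable


total = round(sum(s.samples_millions for s in CURRICULUM), 2)
print(total)                             # 16.73
print(is_frozen(CURRICULUM[0], "llm"))   # True

# Sub-splits reported for stages 2 and 3.
print(round(2.27 + 2.58, 2))             # 4.85
print(round(5.96 + 4.5, 2))              # 10.46
```

The rounded stage sizes sum to 16.73 million, consistent with the headline figure of 16.7 million samples.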

The paper reports an empirically selected data mixture defined by two ratios: medical to general data, and text to multimodal data. It also reports that progressive staging from 2D to 3D to video is superior to training all modalities together from the outset (Jiang et al., 9 Oct 2025).

4. Tasks, prompting formats, and empirical performance

Hulu-Med supports medical visual question answering, medical report generation, classification, text-only medical QA and reasoning, multilingual QA, clinical dialogues, and rare disease diagnosis (Jiang et al., 9 Oct 2025). Its prompting is instruction-following and task-sensitive. Multiple-choice questions may request direct option letters or step-by-step reasoning with the final answer enclosed in \boxed{}. Judgment prompts instruct the model to output only “yes” or “no”. Close-ended prompts request a single word or phrase, while open-ended prompts request concise answers or explicit chain-of-thought (Jiang et al., 9 Oct 2025).
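The task-sensitive prompting described above can be sketched as a small set of instruction suffixes, together with a helper that recovers the final answer from a \boxed{} response. The suffix wording, the style keys, and the function names are hypothetical, since the exact prompt templates are not given here.

```python
import re
from typing import Optional

# Hypothetical instruction suffixes, one per prompt style.
SUFFIXES = {
    "mcq_direct": "Answer with the option letter only.",
    "mcq_cot": "Think step by step and put the final answer in \\boxed{}.",
    "judgment": "Answer with only 'yes' or 'no'.",
    "closed": "Answer with a single word or phrase.",
}


def build_prompt(question: str, style: str) -> str:
    """Append the style-specific instruction to the question."""
    return f"{question}\n{SUFFIXES[style]}"


def extract_boxed(response: str) -> Optional[str]:
    """Return the contents of the last \\boxed{...} in a response, or None if absent.

    Only brace-free contents are handled, which is enough for option letters.
    """
    matches = re.findall(r"\\boxed\{([^{}]*)\}", response)
    return matches[-1].strip() if matches else None


print(build_prompt("Which organ is shown? A. Liver B. Spleen", "mcq_cot"))
print(extract_boxed("The lesion is in the right lobe, so \\boxed{A}."))   # A
print(extract_boxed("No final answer given."))                            # None
```

Taking the last boxed span is a deliberate choice: step-by-step responses may box intermediate results before the final answer.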

On 2D VQA, the paper reports strong results across OmniMedVQA, PMC-VQA, VQA-RAD, SLAKE, PathVQA, MedXQA, and MMMU-Med. Hulu-Med-7B achieves 84.2 on OmniMedVQA, compared with 71.0 for Gemini-2.5-Flash and 82.9 for Lingshu-7B. Hulu-Med-32B reaches 69.4 on PMC-VQA and 81.4 on VQA-RAD. On SLAKE, Hulu-Med-7B reaches 86.8. On PathVQA, Hulu-Med-32B reaches 67.3. On MedXQA, Hulu-Med-32B reaches 34.0, which the paper identifies as best among medical VLMs, although proprietary general models such as Gemini-2.5-Flash remain higher at 52.8. On MMMU-Med, Hulu-Med-32B reaches 60.4, below InternVL3-38B at 65.2 and Gemini-2.5-Flash at 76.9 (Jiang et al., 9 Oct 2025).

For 2D medical report generation on MIMIC-CXR, CheXpert, and IU-Xray, Hulu-Med is reported to set state-of-the-art performance among open-source medical and general models. A highlighted result is MIMIC-CXR RaTEScore 57.0 for Hulu-Med-7B, compared with 51.3 for MedGemma-4B/27B. The paper notes that, in MedGemma, a RaTEScore of 51.3 corresponded to 81% of generated reports supporting equivalent or better decisions than originals; this contextualizes the reported Hulu-Med score as clinically meaningful within the cited comparison framework (Jiang et al., 9 Oct 2025).

On MedMNIST-2D, Hulu-Med achieves an average accuracy above 85% over seven tasks, well above GPT-4o at about 40% (Jiang et al., 9 Oct 2025). On 3D understanding, Hulu-Med leads on M3D and AMOS-MM and outperforms all baselines on 3D-RAD subtasks including anomaly detection, existence classification, and image observation. The most explicit gain is on longitudinal temporal diagnosis, where Hulu-Med-7B exceeds the previous best by 22.8 percentage points (Jiang et al., 9 Oct 2025).

On medical video, the model is evaluated on MedFrameQA (with breakdowns by frame count and modality), Cholec80-VQA, EndoVis18-VQA, PSI-AVA-VQA, and Surgery Video QA. Hulu-Med-14B attains MedFrameQA accuracies of 60.29% for 2 frames, 60.63% for 3 frames, 57.81% for 4 frames, and 59.85% for 5 frames, exceeding all proprietary baselines reported in the original MedFrameQA paper. Hulu-Med performs better than video foundation models on Cholec80-VQA and EndoVis18-VQA and is comparable to or slightly below them on PSI-AVA-VQA. On Surgery Video QA, an out-of-distribution (OOD) benchmark built from public educational medical videos, GPT-4o reaches 44.8%, while Hulu-Med-32B reaches 30.1%, narrowly above Lingshu-32B at 29.9% (Jiang et al., 9 Oct 2025).

Text-only performance is also a major part of the system. Hulu-Med-32B reaches 72.9 on MMLU-Pro-Med, 68.8 on MedBullets, 80.8 on PubMedQA, 72.8 on MedMCQA, 80.4 on MedQA, and 85.6 on MMLU-Med. The paper states that Hulu-Med-7B leads other 7B–8B models on 7 of 8 text benchmarks, while scaling to 32B substantially improves reasoning (Jiang et al., 9 Oct 2025).

In multilingual and clinically oriented evaluation, Hulu-Med-32B reaches 75.13% on MMedBench, exceeding GPT-4 at 74.27%. On HealthBench, a physician-designed evaluation covering Global Health, Communication, Context Seeking, Emergency Referrals, Hedging, Health Data Tasks, and Complex Responses, Hulu-Med-32B reaches 41.6, above GPT-4o at 32.0. Hulu-Med-7B reaches 38.3, more than double HuatuoGPT-Vision-34B at 17.2 and Lingshu-7B at 15.9 (Jiang et al., 9 Oct 2025).

Rare disease reasoning is treated separately. On RareBench Task 4, vanilla Hulu-Med performance is described as modest, but explicit chain-of-thought prompting improves recall substantially; under this prompting, Hulu-Med surpasses all proprietary models in 7 of 8 testing scenarios. The authors also note one counterexample in which Hulu-Med-32B underperforms Hulu-Med-7B, suggesting that larger reasoning capacity can overfit without sufficient long-CoT training (Jiang et al., 9 Oct 2025).

5. Transparency, reproducibility, and relation to adjacent systems

Transparency is not ancillary to Hulu-Med; it is one of the model’s central claims. The paper states that code and weights are open-sourced, the data sources are public or synthetic from public resources, dataset composition and licenses are documented, token reduction parameters are explicit, and GPU-hour budgets are reported (Jiang et al., 9 Oct 2025). The use of open backbones such as SigLIP and Qwen further reinforces this design choice.

This contrasts with systems whose principal contribution lies elsewhere. HuaTuo, for example, is a Chinese biomedical LLM built by supervised fine-tuning LLaMA-7B on about 8,000 Chinese medical QA pairs derived from CMeKG and ChatGPT; it addresses Chinese biomedical dialogue rather than multimodal 2D/3D/video understanding (Wang et al., 2023). HOPPR, by contrast, is described as a medical-grade platform for training, fine-tuning, hosting, and deploying proprietary LVLMs on large deidentified imaging corpora under ISO 13485-aligned quality management; it is infrastructure and deployment scaffolding rather than a single transparent open model (Slavkova et al., 2024). Med-Banana-50K addresses text-guided medical image editing with lesion addition and removal across chest X-ray, brain MRI, and fundus photography, which places it adjacent to, but not equivalent with, Hulu-Med’s focus on understanding rather than image editing (Chen et al., 2 Nov 2025). Med-NCA addresses lightweight medical image segmentation under severe compute constraints using neural cellular automata, a markedly different problem formulation from generalist multimodal reasoning (Kalkhof et al., 2023).

A common misconception is that openness necessarily implies lower performance. The results reported for Hulu-Med argue against that simplification: the model surpasses leading open-source models across many benchmarks and, on some medical tasks such as MMedBench and selected VQA and report-generation settings, competes with or exceeds proprietary systems (Jiang et al., 9 Oct 2025). The converse misconception—that a transparent open model is thereby ready for clinical deployment—is equally unsupported. The paper explicitly presents Hulu-Med as a research model rather than a regulatory-approved system.

6. Clinical relevance, limitations, and open problems

The paper situates Hulu-Med in several prospective use cases: radiology decision support, 2D and 3D report drafting, comparison of current and prior imaging, surgical assistance and education, multilingual medical training, clinical dialogue and triage, and rare disease support (Jiang et al., 9 Oct 2025). In radiology, case studies are described as showing fewer hallucinated findings and more clinically meaningful summaries than MedGemma. In surgery, the token-reduction mechanism enables processing of videos longer than one hour with about 55% token pruning and 43% lower GPU memory. In multilingual settings, the model performs strongly in Chinese and French, with more room for improvement in Spanish and Russian (Jiang et al., 9 Oct 2025).

The limitations are stated with similar clarity. Hulu-Med currently handles text and visual data only; it does not yet integrate genomics, laboratory values, or waveforms. The training pipeline uses supervised LM objectives and some chain-of-thought augmentation but no RLHF or preference optimization. The model may still hallucinate or reason incorrectly, especially in complex computations, low-resource modalities, rare disease settings without CoT prompting, and long-tail languages. Biases in public datasets and synthetic pipelines may propagate into outputs. The 14B and 32B variants remain heavy for real-time on-device deployment, despite the efficiency gains from token reduction (Jiang et al., 9 Oct 2025).

The paper also identifies future directions. These include extending beyond text and vision toward “multi-scale” models incorporating omics for precision medicine, improving reasoning with reinforcement learning on large CoT datasets, and continuing domain adaptation through continual pretraining (Jiang et al., 9 Oct 2025). This suggests that Hulu-Med is best understood as a foundation for transparent medical multimodal modeling rather than a final architecture.

In the current landscape, Hulu-Med’s principal significance lies in demonstrating that a single open model can unify text, 2D, 3D, and video under one patch-based encoder-decoder design, scale through a staged curriculum on public and synthetic data, and attain strong or state-of-the-art performance across a broad medical benchmark suite (Jiang et al., 9 Oct 2025). Its contribution is therefore both algorithmic and infrastructural: a generalist medical VLM, and a documented recipe for building one.