HealthGPT-XL32: Unified Med-VL Model

Updated 23 February 2026

HealthGPT-XL32 is a medical vision-language model that integrates hierarchical visual perception with autoregressive text and image generation for comprehensive diagnostics.
It employs modular design with Heterogeneous Low-Rank Adaptation to enable efficient multitask transfer across tasks like report generation and modality conversion.
A three-stage learning strategy, powered by the extensive VL-Health dataset, underpins marked advances in both diagnostic interpretation and image reconstruction.

HealthGPT-XL32 is a medical Large Vision-Language Model (Med-LVLM) that integrates medical visual comprehension and image generation in a single unified autoregressive architecture. It unifies understanding and generative capabilities across a broad range of medical domains and task types by combining hierarchical visual perception, heterogeneous low-rank adaptation (H-LoRA), and a three-stage learning strategy. The model is trained on the VL-Health dataset, which covers a wide set of modalities and clinical tasks, and it reports gains in both accuracy and scalability over previous approaches (Lin et al., 14 Feb 2025).

1. Model Architecture and Unified Autoregressive Paradigm

HealthGPT employs a modular architecture that couples a CLIP-L/14 ViT for hierarchical visual feature extraction with a frozen pre-trained LLM, such as Phi-3-mini or Phi-4, via multimodal adapters. The image $x^{img}$ is first passed through the Vision Transformer, yielding $L$ hidden states $\{f_1, ..., f_L\}$. Hierarchical Visual Perception (HVP) divides these features into:

Concrete-grained features $F^{con} = \{f_1, ..., f_k\}$ for fine-grained generative tasks (e.g., image synthesis);
Abstract-grained features $F^{abs} = \{f_{k+1}, ..., f_L\}$ for high-level comprehension tasks (e.g., report interpretation, VQA).
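The split above can be illustrated with a minimal sketch. The function names and the placeholder layer labels are hypothetical, and how the selected layers are aggregated before projection is not specified here, so the sketch only performs the index split at $k$ and the per-task selection.

```python
from typing import List, Sequence, Tuple


def split_hidden_states(hidden_states: Sequence, k: int) -> Tuple[List, List]:
    """Split L per-layer ViT hidden states into F^con (f_1..f_k) and F^abs (f_{k+1}..f_L)."""
    if not 0 < k < len(hidden_states):
        raise ValueError("k must satisfy 0 < k < L")
    concrete = list(hidden_states[:k])   # shallow layers, fine-grained detail
    abstract = list(hidden_states[k:])   # deep layers, high-level semantics
    return concrete, abstract


def select_for_task(hidden_states: Sequence, k: int, task: str) -> List:
    """Return the feature group that the given task type consumes."""
    concrete, abstract = split_hidden_states(hidden_states, k)
    if task == "generation":
        return concrete
    if task == "comprehension":
        return abstract
    raise ValueError(f"unknown task type: {task}")


# Placeholder labels stand in for real hidden-state tensors (L = 6, k = 2).
layers = [f"f_{i}" for i in range(1, 7)]
print(select_for_task(layers, 2, "generation"))
print(select_for_task(layers, 2, "comprehension"))
```

In a real pipeline each list element would be a tensor of patch features, and $k$ would be a hyperparameter of the model.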

The task-specific visual features $F^{img}_T$ are projected through a 2-layer MLP adapter into the LLM's token space and concatenated with the tokenized text $x^{txt}$. The resulting sequence $U = [F^{img}_T; T]$, in which the second component $T$ stands for the text tokens rather than the task index, is then processed by the transformer. At each transformer block, H-LoRA modules inject low-rank updates appropriate to the task.
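A minimal NumPy sketch of the projection and concatenation step follows. The dimensions, the random weights, and the ReLU activation are illustrative assumptions (the adapter's activation function is not stated here); only the two-layer structure and the visual-then-text ordering come from the description above.

```python
import numpy as np


def mlp_adapter(features, w1, b1, w2, b2):
    """Two-layer MLP mapping visual features into the LLM embedding width."""
    hidden = np.maximum(features @ w1 + b1, 0.0)  # ReLU is an assumed placeholder
    return hidden @ w2 + b2


def build_sequence(visual_features, text_embeddings, params):
    """Form U by placing projected visual tokens before the text tokens."""
    visual_tokens = mlp_adapter(visual_features, *params)
    return np.concatenate([visual_tokens, text_embeddings], axis=0)


rng = np.random.default_rng(0)
d_vis, d_hidden, d_model = 8, 16, 12          # hypothetical sizes
params = (
    rng.normal(size=(d_vis, d_hidden)), np.zeros(d_hidden),
    rng.normal(size=(d_hidden, d_model)), np.zeros(d_model),
)
visual = rng.normal(size=(5, d_vis))          # 5 visual patch features
text = rng.normal(size=(3, d_model))          # 3 text token embeddings
U = build_sequence(visual, text, params)
print(U.shape)
```

The sequence length of $U$ is the number of visual tokens plus the number of text tokens, and every row has the LLM's embedding width.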

Output generation follows a unified autoregressive schedule. For comprehension, the model predicts response tokens $r_1, ..., r_{N_r}$ from the text vocabulary $V_{txt}$:

$P_\theta(R|U) = \prod_{i=1}^{N_r} P_\theta(r_i | U, r_{<i}).$
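To make the factorization concrete, the sketch below computes a sequence log-probability as the sum of per-step conditional log-probabilities and the corresponding mean token-level cross-entropy. The per-step probabilities are made-up numbers for illustration, not model outputs.

```python
import math
from typing import Sequence


def sequence_log_prob(step_probs: Sequence[float]) -> float:
    """log P(R|U) = sum_i log P(r_i | U, r_<i), given each step's probability."""
    if any(p <= 0.0 or p > 1.0 for p in step_probs):
        raise ValueError("each conditional probability must lie in (0, 1]")
    return sum(math.log(p) for p in step_probs)


def mean_cross_entropy(step_probs: Sequence[float]) -> float:
    """Average negative log-likelihood per generated token."""
    return -sequence_log_prob(step_probs) / len(step_probs)


# Hypothetical conditionals for a three-token response.
probs = [0.5, 0.25, 0.8]
log_p = sequence_log_prob(probs)
print(math.exp(log_p))            # equals the product of the three conditionals
print(mean_cross_entropy(probs))
```

Summing logs rather than multiplying probabilities avoids numerical underflow for long responses.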

For image generation, the vocabulary is augmented with the VQ indices $V_{vq} = \{0, ..., 8191\}$ (the VQGAN-f8-8192 codebook) and the special tokens $\langle START\_IMG \rangle$ and $\langle END\_IMG \rangle$. Generation proceeds autoregressively over VQ indices, and the sampled index sequence is decoded into an image by the VQGAN.
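One way to picture the augmented vocabulary is as an ID offset scheme, sketched below. The ID layout (special tokens first, then the 8192 VQ codes appended after the text vocabulary) and the text vocabulary size of 32000 are assumptions for illustration; only the codebook size and the two special tokens come from the text.

```python
from typing import List, Tuple

CODEBOOK_SIZE = 8192  # VQGAN-f8-8192 codebook size, as stated above


def image_vocab_layout(text_vocab_size: int) -> Tuple[int, int, int]:
    """Return (START_IMG id, END_IMG id, offset of VQ index 0). Layout is assumed."""
    return text_vocab_size, text_vocab_size + 1, text_vocab_size + 2


def encode_image_tokens(vq_indices: List[int], text_vocab_size: int) -> List[int]:
    """Wrap VQ indices in START_IMG/END_IMG and shift them into the joint vocabulary."""
    start, end, offset = image_vocab_layout(text_vocab_size)
    for idx in vq_indices:
        if not 0 <= idx < CODEBOOK_SIZE:
            raise ValueError(f"VQ index out of range: {idx}")
    return [start] + [offset + idx for idx in vq_indices] + [end]


def decode_image_tokens(token_ids: List[int], text_vocab_size: int) -> List[int]:
    """Recover raw VQ indices, ready to be passed to a VQGAN decoder."""
    start, end, offset = image_vocab_layout(text_vocab_size)
    if len(token_ids) < 2 or token_ids[0] != start or token_ids[-1] != end:
        raise ValueError("sequence must be wrapped in START_IMG ... END_IMG")
    return [t - offset for t in token_ids[1:-1]]


TEXT_VOCAB = 32000  # hypothetical text vocabulary size
ids = encode_image_tokens([0, 8191, 42], TEXT_VOCAB)
print(ids)
print(decode_image_tokens(ids, TEXT_VOCAB))
```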

2. Heterogeneous Low-Rank Adaptation (H-LoRA)

H-LoRA is the core adaptation mechanism enabling effective multitask transfer within HealthGPT. Each task type $T \in \{\text{comprehension},~\text{generation}\}$ has its own LoRA submodule $\theta^T = \{A^T, B^T, R^{T}_{outer}\}$. For each transformer weight $W_0 \in \mathbb{R}^{d_{in} \times d_{out}}$:

$k$ low-rank pairs $\{A_i \in \mathbb{R}^{d_{in} \times r},~B_i \in \mathbb{R}^{r \times d_{out}}\}_{i=1}^k$ are defined;
These form "wide" matrices $A^{merged}$ and $B^{merged}$ by concatenation.

A lightweight routing network $R(\cdot)$ computes per-token gates $W \in \mathbb{R}^{\text{token\_num} \times k}$, which are expanded along the rank dimension:

$W^{exp} = (\alpha \cdot k / r) \cdot (W \otimes 1_r).$

The H-LoRA update is then:

$O^{\text{H-LoRA}} = (x A^{merged} \odot W^{exp}) B^{merged}, \quad O = x W_0 + O^{\text{H-LoRA}}.$

This mechanism permits adaptive subspace specialization, decoupling "comprehension" and "generation" expert knowledge, while maintaining computational efficiency. No additional regularization is applied; update magnitudes are modulated via the scaling factor $\alpha$. Because the $k$ experts are applied through the two merged matrices rather than one at a time, the approach is reported to remain cost-invariant to the expert count $k$.
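The H-LoRA update above can be sketched in NumPy as follows. The router is assumed here to be a single linear layer followed by a softmax, which is only one plausible reading of "lightweight routing network"; the shapes, seeds, and variable names are illustrative. The final check confirms that the merged-matrix form equals a gated sum over the individual experts.

```python
import numpy as np


def softmax(z, axis=-1):
    z = z - z.max(axis=axis, keepdims=True)   # subtract max for numerical stability
    e = np.exp(z)
    return e / e.sum(axis=axis, keepdims=True)


def h_lora_forward(x, w0, a_list, b_list, router_w, alpha):
    """Compute O = x W0 + (x A_merged * W_exp) B_merged for one weight matrix."""
    k = len(a_list)
    r = a_list[0].shape[1]
    a_merged = np.concatenate(a_list, axis=1)          # (d_in, k*r)
    b_merged = np.concatenate(b_list, axis=0)          # (k*r, d_out)
    gates = softmax(x @ router_w)                      # (tokens, k); assumed router form
    # Repeat each gate r times so it lines up with its expert's rank columns.
    w_exp = (alpha * k / r) * np.repeat(gates, r, axis=1)   # (tokens, k*r)
    update = ((x @ a_merged) * w_exp) @ b_merged
    return x @ w0 + update, gates


rng = np.random.default_rng(0)
n_tokens, d_in, d_out, k, r, alpha = 5, 8, 6, 4, 2, 1.0
x = rng.normal(size=(n_tokens, d_in))
w0 = rng.normal(size=(d_in, d_out))
a_list = [rng.normal(size=(d_in, r)) for _ in range(k)]
b_list = [rng.normal(size=(r, d_out)) for _ in range(k)]
router_w = rng.normal(size=(d_in, k))

out, gates = h_lora_forward(x, w0, a_list, b_list, router_w, alpha)

# Reference: explicit gated sum over experts, one pair (A_i, B_i) at a time.
reference = x @ w0
for i in range(k):
    reference = reference + (alpha * k / r) * gates[:, i:i + 1] * (x @ a_list[i] @ b_list[i])
print(np.allclose(out, reference))
```

The merged form needs two matrix multiplications per layer regardless of $k$, whereas the reference loop needs two per expert, which is the practical point of concatenating the experts.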

3. Three-Stage Learning Strategy

Training HealthGPT proceeds via an orchestrated three-stage regime:

Multi-modal Alignment: The LLM (including extended vocabulary for VQ tokens) is frozen. The comprehension branch trains "abstract-grained" adapters and H-LoRA$^{\text{Comp}}$ modules on high-quality image–text pairs, minimizing standard cross-entropy loss over text. The generation branch simultaneously trains "concrete-grained" adapters and H-LoRA$^{\text{Gen}}$ modules on image–VQ index pairs, minimizing cross-entropy over VQ tokens.
Heterogeneous H-LoRA Plugin Adaptation: All LoRA plugins $\theta^T$ are frozen. Fine-tuning targets only the shared word embedding and output head. A mixed batch of $\sim 47\,000$ samples comprising both comprehension and generation is used to harmonize output distributions.
Visual Instruction Fine-Tuning: Only the H-LoRA modules and adapters remain trainable; embeddings and output head are frozen. Supervised on downstream medical instructional datasets, this stage covers comprehension (VQA, dialogue, report generation) and generation tasks (super-resolution, denoising, modality conversion, report-to-image), with task-appropriate loss functions (text/VQ-token cross-entropy, implicit VQGAN reconstruction metrics).
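The three stages amount to a freezing schedule over four groups of parameters. The sketch below encodes that schedule as plain data; the component names (`llm_backbone`, `adapters`, `h_lora`, `embeddings`, `output_head`) are hypothetical labels, and treating the adapters as frozen in stage 2 is an inference from the statement that only the embedding and output head are tuned there.

```python
from typing import Dict, Set

COMPONENTS = ("llm_backbone", "adapters", "h_lora", "embeddings", "output_head")

# Trainable component groups per stage, following the description above.
STAGE_TRAINABLE: Dict[int, Set[str]] = {
    1: {"adapters", "h_lora"},            # Multi-modal Alignment
    2: {"embeddings", "output_head"},     # Heterogeneous H-LoRA Plugin Adaptation
    3: {"adapters", "h_lora"},            # Visual Instruction Fine-Tuning
}


def trainable_flags(stage: int) -> Dict[str, bool]:
    """Map every component to True (trainable) or False (frozen) for a stage."""
    if stage not in STAGE_TRAINABLE:
        raise ValueError(f"unknown stage: {stage}")
    return {name: name in STAGE_TRAINABLE[stage] for name in COMPONENTS}


for stage in (1, 2, 3):
    frozen = [name for name, on in trainable_flags(stage).items() if not on]
    print(stage, "frozen:", frozen)
```

In a deep-learning framework these flags would typically be applied by toggling gradient tracking on the matching parameter groups before each stage begins.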

The staged approach yields substantial gains over naïve mixed training: +15–20 points on VQA and +5–8 points on modality conversion.

4. VL-Health Dataset Construction and Task Scope

VL-Health, the dataset supporting HealthGPT, comprises approximately $1.55$ million samples spanning 11 imaging modalities (CT, MRI, X-ray, microscopy, OCT, ultrasound, fundus, and others). It unifies sources for both comprehension (765k samples from PubMedVision, LLaVA-Med, MIMIC-CXR-VQA, PathVQA, SLAKE, VQA-RAD) and generation (783k samples from LLaVA-558k, IXI, SynthRAD2023, MIMIC-CHEST-XRAY) branches.

HealthGPT supports a broad array of medical vision-language tasks:

Task Type	Example Benchmarks/Use Cases	Modalities
VQA (open/MC)	VQA-RAD, SLAKE, PathVQA	All 11
Report Generation	MIMIC-CXR	CXR, CT, MRI
Image Generation	Report→CXR, modality conversion	CT⇆MRI, CXR
Super-Resolution	IXI 4×	MRI
Modality Conversion	SynthRAD2023 (CT⇆MRI brain, pelvis)	CT, MRI

Generation tasks are evaluated with image-quality metrics (SSIM, PSNR, MSE, LPIPS), and comprehension tasks with accuracy and recall.
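Of the image-quality metrics listed, MSE and PSNR have simple closed forms and are sketched below with NumPy using the standard definitions; the toy arrays and the 8-bit `data_range` default are illustrative assumptions. SSIM and LPIPS are more involved and are normally taken from an existing library rather than reimplemented.

```python
import math

import numpy as np


def mse(reference: np.ndarray, test: np.ndarray) -> float:
    """Mean squared error between two images of identical shape."""
    diff = reference.astype(np.float64) - test.astype(np.float64)
    return float(np.mean(diff ** 2))


def psnr(reference: np.ndarray, test: np.ndarray, data_range: float = 255.0) -> float:
    """Peak signal-to-noise ratio in dB: 10 * log10(data_range^2 / MSE)."""
    err = mse(reference, test)
    if err == 0.0:
        return float("inf")  # identical images
    return 10.0 * math.log10(data_range ** 2 / err)


# Toy 4x4 8-bit images that differ by a constant offset of 5 intensity levels.
ref = np.zeros((4, 4), dtype=np.uint8)
out_img = np.full((4, 4), 5, dtype=np.uint8)
print(mse(ref, out_img))
print(psnr(ref, out_img))
```

Higher PSNR and lower MSE both indicate a reconstruction closer to the reference; the two are monotonically related for a fixed `data_range`.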

5. Empirical Performance and Ablation Analysis

HealthGPT demonstrates leading performance on both comprehension and generation benchmarks. On VQA-RAD, the 3.8B-parameter HealthGPT-M3 scores 73.7/55.9 (closed/all) compared with 66.9/53.0 for HuatuoGPT-Vision, and the 14B-parameter HealthGPT-L14 reaches 77.7/58.3. SLAKE and PathVQA show similar trends, with HealthGPT consistently exceeding baselines (e.g., OmniMedVQA average: M3=68.5, L14=74.4 vs. 63.2 for the best baseline).

For generation, modality conversion between CT and MRI (CT→MRI and MRI→CT) yields SSIM scores of 79.38/71.81/85.06/84.23 across the four evaluated settings, surpassing Pix2Pix (71.09/59.17/78.79/72.31). On IXI super-resolution, HealthGPT achieves SSIM=78.19 and PSNR=32.76, compared with SSIM=71.34 and PSNR=32.01 for SRGAN. Image reconstruction tasks exhibit substantial gains over Unified-IO 2 and SEED-X models.

Ablations indicate H-LoRA surpasses both LoRA (comprehension avg: 73.7 vs 71.3) and MoELoRA (73.7 vs 72.5) for comprehension, and maintains parity or better in generation. No additional training cost is incurred versus standard LoRA (H-LoRA 1.00× baseline, MoELoRA 1.49× slower). Hierarchical visual feature selection hastens convergence: abstract features benefit comprehension; concrete features optimize generative quality.

Clinician-driven human evaluation (>1,000 samples) confirms HealthGPT-L14 answers are preferred in >50% of cases, greatly exceeding the next best baseline (<25%). Case studies demonstrate precise control in report-to-CXR generation and robust preservation of anatomical detail in modality transfer scenarios.

6. Analysis and Broader Implications

HealthGPT's design demonstrates that joint hierarchical perception, modular low-rank adaptation, and staged multi-modal learning can address both data scarcity and scalability in complex medical VLP tasks within a single Med-LVLM. These techniques collectively enable substantial improvements over task- or modality-restricted baselines, while maintaining computational efficiency and headroom for scaling from 3.8B to 14B parameters. The approach generalizes across highly heterogeneous tasks and data, facilitating unified models in medical AI, and sets a significant benchmark for subsequent research in medical multi-modal foundation models (Lin et al., 14 Feb 2025).


References

Lin et al. (14 Feb 2025). HealthGPT: A Medical Large Vision-Language Model for Unifying Comprehension and Generation via Heterogeneous Knowledge Adaptation.
