HealthGPT-XL32: Unified Med-VL Model
- HealthGPT-XL32 is a medical vision-language model that integrates hierarchical visual perception with autoregressive text and image generation for comprehensive diagnostics.
- It employs modular design with Heterogeneous Low-Rank Adaptation to enable efficient multitask transfer across tasks like report generation and modality conversion.
- A three-stage learning strategy, powered by the extensive VL-Health dataset, underpins marked advances in both diagnostic interpretation and image reconstruction.
HealthGPT-XL32 is a medical Large Vision-Language Model (Med-LVLM) that integrates medical visual comprehension and image generation in a single unified autoregressive architecture. It is designed to unify understanding and generative capabilities across a broad spectrum of medical domains and task types by combining hierarchical visual perception, heterogeneous low-rank adaptation (H-LoRA), and a three-stage learning strategy. The model is underpinned by the VL-Health dataset, which provides extensive coverage of modalities and clinical tasks, and demonstrates marked advances in both accuracy and scalability compared to previous approaches (Lin et al., 14 Feb 2025).
1. Model Architecture and Unified Autoregressive Paradigm
HealthGPT employs a modular architecture that couples a CLIP-L/14 ViT for hierarchical visual feature extraction with a frozen pre-trained LLM, such as Phi-3-mini or Phi-4, via multimodal adapters. The image is first passed through the Vision Transformer, yielding $L$ hidden states. Hierarchical Visual Perception (HVP) divides these features into two groups:
- Concrete-grained features for fine-grained generative tasks (e.g., image synthesis);
- Abstract-grained features for high-level comprehension tasks (e.g., report interpretation, VQA).
The task-specific visual features are projected through a 2-layer MLP adapter into the LLM's token space and concatenated with the tokenized text. The resulting sequence is then processed by the transformer. At each transformer block, H-LoRA modules inject low-rank updates appropriate to the task.
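The data flow described above can be sketched in a few lines of NumPy. This is a minimal illustration under stated assumptions, not the authors' implementation: the split index, the convention that earlier layers are concrete-grained and later layers abstract-grained, the averaging over selected layers, the ReLU activation, and all dimensions are hypothetical choices that the text does not specify.

```python
import numpy as np

def select_features(hidden_states, task, split):
    """Pick a group of ViT layers by task type.

    hidden_states: array of shape (L, num_patches, d_vit).
    Assumption: layers before `split` are treated as concrete-grained,
    layers from `split` onward as abstract-grained.
    """
    if task == "generation":
        group = hidden_states[:split]
    elif task == "comprehension":
        group = hidden_states[split:]
    else:
        raise ValueError(f"unknown task: {task}")
    # Assumption: merge the selected layers by averaging them.
    return group.mean(axis=0)  # (num_patches, d_vit)

def mlp_adapter(x, w1, b1, w2, b2):
    """2-layer MLP projecting visual features into the LLM token space."""
    hidden = np.maximum(x @ w1 + b1, 0.0)  # ReLU is an illustrative choice
    return hidden @ w2 + b2

rng = np.random.default_rng(0)
L, num_patches, d_vit, d_llm, num_text = 6, 4, 8, 16, 5
hidden_states = rng.normal(size=(L, num_patches, d_vit))
w1, b1 = rng.normal(size=(d_vit, d_llm)), np.zeros(d_llm)
w2, b2 = rng.normal(size=(d_llm, d_llm)), np.zeros(d_llm)
text_embeddings = rng.normal(size=(num_text, d_llm))

# Comprehension path: abstract-grained features -> adapter -> prepend to text.
visual = select_features(hidden_states, "comprehension", split=3)
visual_tokens = mlp_adapter(visual, w1, b1, w2, b2)
sequence = np.concatenate([visual_tokens, text_embeddings], axis=0)
print(sequence.shape)
```

Switching `task` to `"generation"` routes the other layer group through the same adapter code path, which is the sense in which the visual features are task-specific.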
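Output generation follows a unified autoregressive schedule. For comprehension, the model predicts each next token from the text vocabulary $\mathcal{V}_{\text{text}}$, conditioned on the input sequence $x$ (visual and text tokens) and the tokens generated so far:
$$P(y \mid x) = \prod_{i=1}^{N} P(y_i \mid x, y_{<i}), \qquad y_i \in \mathcal{V}_{\text{text}}$$
For image generation, the vocabulary is augmented with the VQ index set $\mathcal{V}_{\text{vq}}$ (the VQGAN-f8-8192 codebook) and two special tokens that delimit the image token sequence. Generation proceeds autoregressively over VQ indices, and the sampled sequence is decoded into an image by the VQGAN decoder.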
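One way to picture the augmented vocabulary is as a single id space in which text tokens, VQ indices, and two image delimiters occupy disjoint ranges. The sketch below is only an illustration: the text vocabulary size, the id layout, and the delimiter names are hypothetical, and only the codebook size of 8,192 comes from the VQGAN-f8-8192 name.

```python
TEXT_VOCAB_SIZE = 32_000  # hypothetical; depends on the base LLM tokenizer
CODEBOOK_SIZE = 8_192     # from the VQGAN-f8-8192 codebook

# Hypothetical layout: [text ids | VQ ids | image start | image end]
VQ_OFFSET = TEXT_VOCAB_SIZE
IMG_START_ID = VQ_OFFSET + CODEBOOK_SIZE
IMG_END_ID = IMG_START_ID + 1
UNIFIED_VOCAB_SIZE = IMG_END_ID + 1

def vq_to_token(vq_index):
    """Map a VQ codebook index to its id in the unified vocabulary."""
    if not 0 <= vq_index < CODEBOOK_SIZE:
        raise ValueError(f"VQ index out of range: {vq_index}")
    return VQ_OFFSET + vq_index

def extract_vq_indices(token_ids):
    """Return the VQ codebook indices found between the image delimiters."""
    start = token_ids.index(IMG_START_ID)
    end = token_ids.index(IMG_END_ID, start + 1)
    body = token_ids[start + 1:end]
    if any(not VQ_OFFSET <= t < IMG_START_ID for t in body):
        raise ValueError("non-VQ token inside image span")
    return [t - VQ_OFFSET for t in body]

# A toy "generated" sequence: delimiter, three VQ tokens, delimiter.
generated = [IMG_START_ID] + [vq_to_token(i) for i in (5, 8191, 0)] + [IMG_END_ID]
indices = extract_vq_indices(generated)  # these would go to the VQGAN decoder
print(indices)
```

With a layout like this, the same softmax output head can serve both branches, since the target ids simply fall in different ranges depending on the task.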
2. Heterogeneous Low-Rank Adaptation (H-LoRA)
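H-LoRA is the core adaptation mechanism enabling effective multitask transfer within HealthGPT. Each task type (comprehension or generation) has its own H-LoRA submodule. For each adapted transformer weight $W_0 \in \mathbb{R}^{d_{\text{in}} \times d_{\text{out}}}$:
- $k$ low-rank pairs $\{A_i, B_i\}_{i=1}^{k}$ are defined, with $A_i \in \mathbb{R}^{d_{\text{in}} \times r}$ and $B_i \in \mathbb{R}^{r \times d_{\text{out}}}$;
- These form "wide" matrices $A^{\text{merged}} \in \mathbb{R}^{d_{\text{in}} \times kr}$ and $B^{\text{merged}} \in \mathbb{R}^{kr \times d_{\text{out}}}$ by concatenation.
A lightweight routing network computes per-token gates $g \in \mathbb{R}^{k}$, which are expanded along the rank dimension so that each gate is repeated $r$ times:
$$g^{\text{exp}} = g \otimes \mathbf{1}_r \in \mathbb{R}^{kr}$$
The H-LoRA update for a token representation $x$ is then:
$$h = x W_0 + \alpha \left( x A^{\text{merged}} \odot g^{\text{exp}} \right) B^{\text{merged}}$$
This mechanism permits adaptive subspace specialization, decoupling "comprehension" and "generation" expert knowledge, while maintaining computational efficiency. No additional regularization is applied; update magnitudes are modulated via the scaling factor $\alpha$. The approach remains cost-invariant to the expert count $k$.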
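To make the merged-matrix idea concrete, here is a minimal NumPy sketch of a gated low-rank update computed with concatenated expert matrices, checked against an explicit per-expert loop. It is an illustration under assumptions rather than the reference implementation: the softmax linear router (`w_router`), the placement of the scaling factor `alpha`, and all shapes are hypothetical.

```python
import numpy as np

def softmax(z, axis=-1):
    z = z - z.max(axis=axis, keepdims=True)  # subtract max for stability
    e = np.exp(z)
    return e / e.sum(axis=axis, keepdims=True)

def h_lora_forward(x, w0, a_list, b_list, w_router, alpha=1.0):
    """Gated low-rank update using merged ("wide") expert matrices.

    x: (tokens, d_in), w0: (d_in, d_out),
    a_list[i]: (d_in, r), b_list[i]: (r, d_out), w_router: (d_in, k).
    """
    r = a_list[0].shape[1]
    a_merged = np.concatenate(a_list, axis=1)     # (d_in, k*r)
    b_merged = np.concatenate(b_list, axis=0)     # (k*r, d_out)
    gates = softmax(x @ w_router)                 # (tokens, k); router form assumed
    gates_expanded = np.repeat(gates, r, axis=1)  # each gate repeated r times
    low_rank = ((x @ a_merged) * gates_expanded) @ b_merged
    return x @ w0 + alpha * low_rank

def per_expert_reference(x, w0, a_list, b_list, w_router, alpha=1.0):
    """Same computation written as an explicit loop over experts."""
    gates = softmax(x @ w_router)
    out = x @ w0
    for i, (a, b) in enumerate(zip(a_list, b_list)):
        out = out + alpha * gates[:, i:i + 1] * (x @ a @ b)
    return out

rng = np.random.default_rng(0)
tokens, d_in, d_out, r, k = 3, 6, 5, 2, 4
x = rng.normal(size=(tokens, d_in))
w0 = rng.normal(size=(d_in, d_out))
a_list = [rng.normal(size=(d_in, r)) for _ in range(k)]
b_list = [rng.normal(size=(r, d_out)) for _ in range(k)]
w_router = rng.normal(size=(d_in, k))

out = h_lora_forward(x, w0, a_list, b_list, w_router, alpha=0.5)
ref = per_expert_reference(x, w0, a_list, b_list, w_router, alpha=0.5)
print(np.allclose(out, ref))
```

The merged form replaces a loop of `k` small matrix products with two larger ones, which is one plausible reading of why the reported training cost stays close to that of plain LoRA.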
3. Three-Stage Learning Strategy
Training HealthGPT proceeds via an orchestrated three-stage regime:
- Multi-modal Alignment: The LLM (including extended vocabulary for VQ tokens) is frozen. The comprehension branch trains "abstract-grained" adapters and H-LoRA modules on high-quality image–text pairs, minimizing standard cross-entropy loss over text. The generation branch simultaneously trains "concrete-grained" adapters and H-LoRA modules on image–VQ index pairs, minimizing cross-entropy over VQ tokens.
- Heterogeneous H-LoRA Plugin Adaptation: All LoRA plugins are frozen. Fine-tuning targets only the shared word embedding and output head. A mixed batch of samples comprising both comprehension and generation is used to harmonize output distributions.
- Visual Instruction Fine-Tuning: Only the H-LoRA modules and adapters remain trainable; embeddings and output head are frozen. Supervised on downstream medical instructional datasets, this stage covers comprehension (VQA, dialogue, report generation) and generation tasks (super-resolution, denoising, modality conversion, report-to-image), with task-appropriate loss functions (text/VQ-token cross-entropy, implicit VQGAN reconstruction metrics).
This staged approach yields substantial gains over naive mixed training: +15–20 points on VQA and +5–8 points on modality conversion.
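The freeze and unfreeze pattern of the three stages can be summarized as a small lookup table. The sketch below is only a bookkeeping aid: the parameter-group names and stage keys are hypothetical labels chosen here, and in a real training script the resulting flags would be applied to the corresponding model parameters.

```python
# Parameter-group names are hypothetical labels, not identifiers from the source.
ALL_GROUPS = ("llm_backbone", "adapters", "h_lora", "word_embedding", "output_head")

# Which groups receive gradient updates in each stage, as described in the text.
STAGE_TRAINABLE = {
    "1_multimodal_alignment": {"adapters", "h_lora"},
    "2_plugin_adaptation": {"word_embedding", "output_head"},
    "3_visual_instruction_tuning": {"adapters", "h_lora"},
}

def trainable_flags(stage):
    """Return {group: bool} saying whether each group is trained in `stage`."""
    trainable = STAGE_TRAINABLE[stage]
    return {group: group in trainable for group in ALL_GROUPS}

for stage in STAGE_TRAINABLE:
    flags = trainable_flags(stage)
    active = [group for group, on in flags.items() if on]
    print(f"{stage}: train {active}")
```

Note that the LLM backbone is never trainable in this table, and that stages 1 and 3 update the same groups but on different data (alignment pairs versus instruction datasets).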
4. VL-Health Dataset Construction and Task Scope
VL-Health, the dataset supporting HealthGPT, comprises approximately 1.55 million samples spanning 11 imaging modalities (CT, MRI, X-ray, microscopy, OCT, ultrasound, fundus, and others). It unifies sources for both the comprehension branch (765k samples from PubMedVision, LLaVA-Med, MIMIC-CXR-VQA, PathVQA, SLAKE, VQA-RAD) and the generation branch (783k samples from LLaVA-558k, IXI, SynthRAD2023, MIMIC-CHEST-XRAY).
HealthGPT supports a broad array of medical vision-language tasks:
| Task Type | Example Benchmarks/Use Cases | Modalities |
|---|---|---|
| VQA (open/MC) | VQA-RAD, SLAKE, PathVQA | All 11 |
| Report Generation | MIMIC-CXR | CXR, CT, MRI |
| Image Generation | Report→CXR, modality conversion | CT⇆MRI, CXR |
| Super-Resolution | IXI 4× | MRI |
| Modality Conversion | SynthRAD2023 (CT⇆MRI brain, pelvis) | CT, MRI |
Generation tasks are evaluated with standard image-quality metrics (SSIM, PSNR, MSE, LPIPS), and comprehension tasks with accuracy and recall.
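For reference, the two simplest of these image metrics, MSE and PSNR, follow directly from their textbook definitions. The sketch below uses plain Python lists and assumes 8-bit intensities (`max_value=255`); the evaluation code behind the reported numbers may differ in normalization and data range. SSIM and LPIPS are omitted because they require windowed statistics and a learned network, respectively.

```python
import math

def mse(reference, test):
    """Mean squared error between two equally sized 2-D images (lists of rows)."""
    ref_flat = [p for row in reference for p in row]
    test_flat = [p for row in test for p in row]
    if len(ref_flat) != len(test_flat):
        raise ValueError("images must have the same number of pixels")
    return sum((a - b) ** 2 for a, b in zip(ref_flat, test_flat)) / len(ref_flat)

def psnr(reference, test, max_value=255.0):
    """Peak signal-to-noise ratio in dB; infinite for identical images."""
    err = mse(reference, test)
    if err == 0:
        return math.inf
    return 10.0 * math.log10(max_value ** 2 / err)

reference = [[0, 0], [0, 0]]
noisy = [[10, 10], [10, 10]]
print(mse(reference, noisy), psnr(reference, noisy))
```

Lower MSE and higher PSNR both indicate a reconstruction closer to the reference image.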
5. Empirical Performance and Ablation Analysis
HealthGPT demonstrates leading performance on both comprehension and generation benchmarks. On VQA-RAD, the 3.8B-parameter HealthGPT-M3 scores 73.7/55.9 (closed/all), compared with 66.9/53.0 for HuatuoGPT-Vision, and the 14B-parameter HealthGPT-L14 reaches 77.7/58.3. SLAKE and PathVQA show similar trends, with HealthGPT consistently exceeding baselines (e.g., OmniMedVQA average: M3 = 68.5, L14 = 74.4, versus 63.2 for the best baseline).
For generation, modality conversion between CT and MRI yields SSIM scores of 79.38/71.81/85.06/84.23 across the four evaluated settings, surpassing Pix2Pix (71.09/59.17/78.79/72.31). On IXI super-resolution, HealthGPT achieves SSIM = 78.19 and PSNR = 32.76, compared with 71.34 and 32.01 for SRGAN. Image reconstruction tasks show substantial gains over Unified-IO 2 and SEED-X.
Ablations indicate H-LoRA surpasses both LoRA (comprehension avg: 73.7 vs 71.3) and MoELoRA (73.7 vs 72.5) for comprehension, and maintains parity or better in generation. No additional training cost is incurred versus standard LoRA (H-LoRA 1.00× baseline, MoELoRA 1.49× slower). Hierarchical visual feature selection hastens convergence: abstract features benefit comprehension; concrete features optimize generative quality.
Clinician-driven human evaluation (>1,000 samples) confirms HealthGPT-L14 answers are preferred in >50% of cases, greatly exceeding the next best baseline (<25%). Case studies demonstrate precise control in report-to-CXR generation and robust preservation of anatomical detail in modality transfer scenarios.
6. Analysis and Broader Implications
HealthGPT's design demonstrates that joint hierarchical perception, modular low-rank adaptation, and staged multi-modal learning can address both data scarcity and scalability in complex medical VLP tasks within a single Med-LVLM. These techniques collectively enable substantial improvements over task- or modality-restricted baselines, while maintaining computational efficiency and headroom for scaling from 3.8B to 14B parameters. The approach generalizes across highly heterogeneous tasks and data, facilitating unified models in medical AI, and sets a significant benchmark for subsequent research in medical multi-modal foundation models (Lin et al., 14 Feb 2025).