
QoQ-Med: Multimodal Clinical AI

Updated 9 December 2025
  • QoQ-Med models are a family of multimodal clinical foundation models that integrate medical images, physiological signals, and textual records to perform comprehensive clinical reasoning.
  • They utilize a pretrained vision-language backbone with specialized encoders and interleaved token fusion to combine heterogeneous clinical data in a single LLM framework.
  • The DRPO training protocol dynamically scales rewards for underrepresented domains, significantly improving diagnostic F1 scores and segmentation IoU compared to baseline models.

QoQ-Med refers to a family of multimodal clinical foundation models capable of performing medical reasoning over heterogeneous data, including medical images, time-series physiological signals, and textual patient records. These models are constructed atop an LLM backbone and utilize novel reinforcement learning protocols to address performance imbalances arising from domain and modality heterogeneity. As the first open generalist clinical foundation model enabling joint reasoning across disparate clinical specialties and modalities, QoQ-Med represents a substantial advance in the development and deployment of clinically meaningful AI in medicine (Dai et al., 31 May 2025).

1. Model Architecture and Modalities

QoQ-Med is provided in two parameter scales: 7B and 32B, both built upon a pretrained vision-language backbone (Qwen2.5-VL). The multimodal encoder stack comprises three principal modules:

  • Vision Encoder: A patch-based transformer (such as Swin-V2) encodes each 2D image or each frame of a 3D volumetric scan into a token sequence. These are linearly projected into the LLM's token embedding domain.
  • Time-Series Encoder: For physiological signals such as electrocardiograms (ECG), the ECG-JEPA model yields embedding sequences per time window, each mapped into the unified token space.
  • Text Encoder: Clinical text inputs are tokenized and embedded through the Qwen2.5 mechanisms.

Cross-modal fusion is implemented by sequentially interleaving projected ECG, vision, and text tokens according to their inherent ordering, after which the concatenated sequence is processed by the autoregressive LLM. The transformer's self-attention heads allow continuous interaction among modalities throughout all transformer blocks. The model's output at each inference step includes a free-text chain-of-thought reasoning trace, a concise diagnosis, and bounding-box tokens localizing salient regions in images.
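The following minimal PyTorch sketch illustrates this kind of interleaved token fusion. The module names, feature dimensions, and projection layers are illustrative assumptions, not the released QoQ-Med implementation.

```python
import torch
import torch.nn as nn

class InterleavedFusion(nn.Module):
    """Illustrative fusion: project per-modality features into the LLM
    token-embedding space and interleave them in their input order.
    All names and dimensions here are hypothetical."""

    def __init__(self, d_vision=1024, d_ecg=512, d_llm=4096):
        super().__init__()
        self.vision_proj = nn.Linear(d_vision, d_llm)  # patch-transformer features -> LLM space
        self.ecg_proj = nn.Linear(d_ecg, d_llm)        # ECG-JEPA window embeddings -> LLM space

    def forward(self, segments):
        """segments: list of ("vision" | "ecg" | "text", tensor) pairs in their
        natural document order; text tensors are already LLM embeddings."""
        fused = []
        for modality, feats in segments:
            if modality == "vision":
                fused.append(self.vision_proj(feats))
            elif modality == "ecg":
                fused.append(self.ecg_proj(feats))
            else:  # text tokens already embedded by the LLM's embedding table
                fused.append(feats)
        # Concatenate along the sequence axis; the LLM's self-attention then
        # mixes information across modalities in every transformer block.
        return torch.cat(fused, dim=1)

# Example: one ECG window sequence, one image patch sequence, one text span
fusion = InterleavedFusion()
seq = fusion([
    ("ecg", torch.randn(1, 32, 512)),
    ("vision", torch.randn(1, 196, 1024)),
    ("text", torch.randn(1, 48, 4096)),
])
print(seq.shape)  # torch.Size([1, 276, 4096])
```

Because all modalities share one token sequence, no dedicated cross-attention modules are required; the LLM's existing self-attention performs the cross-modal mixing.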

2. Training Protocol: Domain-aware Relative Policy Optimization (DRPO)

QoQ-Med is trained using Domain-aware Relative Policy Optimization (DRPO), a hierarchical reinforcement learning objective that scales normalized returns to compensate for modality or specialty imbalance and domain rarity. Conventional Group Relative Policy Optimization (GRPO) normalizes rollouts by prompt but ignores domain-wise class imbalance. DRPO introduces two levels of "temperature" scaling to boost the gradient contribution from rare and challenging domains.

The reward $r_{d,m}$ for each domain-modality pair $(d,m)$ is normalized via

$$\hat R_{d,m} = \frac{r_{d,m} - \mu_{d,m}}{\sigma_{d,m} + \varepsilon}$$

with $\mu_{d,m}$ and $\sigma_{d,m}$ the group mean and standard deviation. The domain-specific temperature is computed as

$$T_{(d,t)} = \max\left(\sqrt{N_{(d,t)}\,\mu_{(d,t)}},\ \varepsilon\right)$$

and analogous scaling occurs at the cluster level within domains. The overall per-rollout weight is

$$w_{d,m} = \frac{1}{T_{(d,t)}\, T_{(c,d,t)}}$$

The DRPO loss for critic-free RLHF updates is

$$L_{\mathrm{DRPO}} = -\mathbb{E}_{\tau \sim \pi_\theta}\left[ w_{d,m}\, \hat R_{d,m}\, \log \pi_\theta(\tau) \right]$$

This hierarchical scaling gives disproportionate learning weight to underrepresented clinical specialties and complex modalities, mitigating training bias due to skewed real-world data distributions.
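A minimal NumPy sketch of this weighting is shown below, simplified to the domain level; the cluster-level factor $T_{(c,d,t)}$ is handled analogously in the full objective, and the grouping and running statistics used here are illustrative assumptions rather than the released training pipeline.

```python
import numpy as np

EPS = 1e-6

def drpo_scaled_advantages(rewards, domain_counts, domain_mean_rewards):
    """Sketch of DRPO's hierarchical reward scaling (domain level only).

    rewards: dict domain -> array of per-rollout rewards r_{d,m}
    domain_counts: dict domain -> N_(d,t), samples seen for that domain
    domain_mean_rewards: dict domain -> running mean reward mu_(d,t)
    Returns dict domain -> array of scaled advantages w_{d,m} * R_hat_{d,m}.
    """
    scaled = {}
    for d, r in rewards.items():
        r = np.asarray(r, dtype=float)
        # Group-wise normalization, as in GRPO
        r_hat = (r - r.mean()) / (r.std() + EPS)
        # Domain temperature: frequent, high-reward (easy) domains get a large
        # T and hence a small weight; rare, hard domains are boosted.
        T_d = max(np.sqrt(domain_counts[d] * domain_mean_rewards[d]), EPS)
        scaled[d] = r_hat / T_d
    return scaled

# Toy example: a common, easy domain vs. a rare, hard domain
adv = drpo_scaled_advantages(
    rewards={"chest_xray": [0.9, 0.7, 0.8], "mammography": [0.1, 0.3, 0.2]},
    domain_counts={"chest_xray": 50_000, "mammography": 2_000},
    domain_mean_rewards={"chest_xray": 0.8, "mammography": 0.2},
)
# The mammography rollouts receive a much larger effective weight, so their
# contribution to -E[w * R_hat * log pi(tau)] dominates the gradient update.
```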

3. Dataset Curation and Instruction-Tuning Pipeline

Training and evaluation leverage the CLIMB aggregation (Dai et al., ICML '25), comprising 2.61 million question-answer pairs across nine medical domains:

  • 1D Signals: ECG datasets (PTB-XL, CPSC, Georgia, Chapman-Shaoxing; 78.9K samples)
  • 2D Vision: Chest X-ray (CheXpert, MIMIC-CXR, VinDr, COVID-X), mammography, dermoscopy, fundus, histopathology (various sources)
  • 3D Vision: Ultrasound (BUSI, COVID series), MRI, CT (INSPECT, RSPECT, KiTS23, hemorrhage)

Classification entries are templated into natural-language QA pairs, e.g., "Shown above is a frontal chest radiograph..." followed by an expert-derived answer. No additional holdout data are used beyond the established train/validation splits. Raw images are size-normalized, 3D volumes are truncated to four slices, and ECGs are windowed per the ECG-JEPA protocol. Crucially, data imbalance is addressed dynamically during training via DRPO's reward scaling rather than by oversampling before training.
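As a rough illustration of this templating step, the sketch below converts a labeled chest radiograph record into a QA pair; the template wording (beyond the quoted opening phrase) and the label names are hypothetical examples, not the exact CLIMB templates.

```python
# Hypothetical templating of a classification record into a QA pair,
# illustrating the kind of instruction-tuning conversion described above.
TEMPLATES = {
    "chest_xray": (
        "Shown above is a frontal chest radiograph. "
        "Which of the following findings are present: {options}?"
    ),
}

def to_qa_pair(domain, options, positive_labels):
    """Turn a labeled image record into a (question, answer) pair."""
    question = TEMPLATES[domain].format(options=", ".join(options))
    answer = ", ".join(positive_labels) if positive_labels else "No acute findings"
    return question, answer

q, a = to_qa_pair(
    "chest_xray",
    options=["cardiomegaly", "pleural effusion", "pneumothorax"],
    positive_labels=["pleural effusion"],
)
# q: "Shown above is a frontal chest radiograph. Which of the following
#     findings are present: cardiomegaly, pleural effusion, pneumothorax?"
# a: "pleural effusion"
```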

4. Empirical Performance: Diagnostic F1 and Segmentation

Evaluation adopts macro-F1 scoring across label sets:

$$\mathrm{Macro\text{-}F1}_{\mathrm{dataset}} = \frac{1}{|\mathcal{L}|}\sum_{\ell\in\mathcal{L}} \frac{2\,P_\ell R_\ell}{P_\ell+R_\ell}$$

where $P_\ell$ and $R_\ell$ denote classwise precision and recall, respectively. DRPO achieves a +43% relative improvement in mean macro-F1 across visual domains, elevating scores, for instance, from 0.096 to 0.125 in chest X-ray, 0.058 to 0.253 in mammography, and 0.245 to 0.400 in dermoscopy. This outperforms vanilla GRPO and other critic-free RL protocols.
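A minimal sketch of the macro-F1 computation defined above follows; it restates the formula directly and is not tied to the released evaluation scripts.

```python
import numpy as np

def macro_f1(y_true, y_pred, labels):
    """Macro-F1 as defined above: unweighted mean of per-class F1,
    where each class's F1 = 2 * P * R / (P + R)."""
    f1s = []
    for c in labels:
        tp = np.sum((y_pred == c) & (y_true == c))
        fp = np.sum((y_pred == c) & (y_true != c))
        fn = np.sum((y_pred != c) & (y_true == c))
        precision = tp / (tp + fp) if (tp + fp) else 0.0
        recall = tp / (tp + fn) if (tp + fn) else 0.0
        f1s.append(2 * precision * recall / (precision + recall)
                   if (precision + recall) else 0.0)
    return float(np.mean(f1s))

y_true = np.array([0, 1, 2, 1, 0])
y_pred = np.array([0, 1, 1, 1, 2])
print(macro_f1(y_true, y_pred, labels=[0, 1, 2]))  # ~0.49
```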

Segmentation is assessed using intersection-over-union (IoU) between predicted and expert ground-truth boxes:

$$\mathrm{IoU} = \frac{\mathrm{area}(B_{\mathrm{pred}} \cap B_{\mathrm{gt}})}{\mathrm{area}(B_{\mathrm{pred}} \cup B_{\mathrm{gt}})}$$

On ultrasound, chest X-ray, and CT hemorrhage, QoQ-Med produces region labels with a mean IoU tenfold greater than open baseline models and matching the closed o4-mini system. Bounding boxes are qualitatively well aligned to diagnostic features.
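A minimal sketch of this box IoU computation is given below; the (x1, y1, x2, y2) corner format is an assumption made for illustration.

```python
def box_iou(pred, gt):
    """IoU of two axis-aligned boxes given as (x1, y1, x2, y2)."""
    ix1, iy1 = max(pred[0], gt[0]), max(pred[1], gt[1])
    ix2, iy2 = min(pred[2], gt[2]), min(pred[3], gt[3])
    inter = max(0.0, ix2 - ix1) * max(0.0, iy2 - iy1)
    area_pred = (pred[2] - pred[0]) * (pred[3] - pred[1])
    area_gt = (gt[2] - gt[0]) * (gt[3] - gt[1])
    union = area_pred + area_gt - inter
    return inter / union if union else 0.0

print(box_iou((10, 10, 50, 50), (30, 30, 70, 70)))  # ~0.143
```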

5. Reproducibility, Releases, and Resources

QoQ-Med facilitates downstream research via a comprehensive release at https://github.com/DDVD233/QoQ_Med, encompassing:

  • Full model weights for both 7B and 32B variants
  • Modular DRPO training pipeline (PyTorch + FSDP, vLLM KV-cache integration)
  • All intermediate reasoning traces and bounding-box outputs
  • Fine-tuning and evaluation scripts for macro-F1 and IoU

Released resources enable reproduction, customization for new clinical challenges, and benchmarking, fostering openness in clinical foundation model development.

6. Clinical Implications and Limitations

QoQ-Med demonstrates the feasibility of integrated multimodal reasoning within a single LLM framework, supporting transparent diagnosis and explanation in radiology, cardiology, and pathology. The model produces both predictive outputs and explicit chains of reasoning with visual localization, facilitating auditability in clinical workflows. Observed limitations include lower sample efficiency in extremely data-sparse domains, reliance on automatic QA template construction rather than clinician-led dialog design, and the need for prospective clinical validation to establish safety and efficacy in real applications. Further refinement in data curation and real-world evaluation is indicated for robust translation to clinical practice (Dai et al., 31 May 2025).
