- The paper introduces MAGEN and O-MAKE, a dual-stage system that generates and aligns rich medical image-text data using multi-agent collaboration and ontology-based techniques.
- The paper demonstrates significant improvements in zero-shot disease classification and rare disease recognition, achieving 46–54% average accuracy and robust performance across benchmarks.
- The paper validates its methodology through extensive ablation studies and qualitative analysis, showing enhanced intra-class compactness and inter-class separability in representation learning.
Multi-Aspect Knowledge-Enhanced Medical Vision-Language Pretraining via Multi-Agent Data Generation
Introduction
This work targets fundamental deficiencies in the current paradigm of medical vision-language pretraining (VLP), specifically within dermatology. While supervised deep learning has achieved notable advances in automated medical image analysis, scaling such systems is bottlenecked by the cost and scarcity of high-quality, manually annotated labels. VLP, which aligns visual and textual modalities from web-crawled image-text pairs, is a promising alternative, especially for zero-shot generalization. However, leveraging web-scale data introduces two persistent challenges: noisy, low-quality data, including misaligned and superficial captions; and suboptimal exploitation of the complex, multifaceted medical knowledge embedded in long, unstructured clinical texts.
This paper introduces an integrative pipeline with two key components. First, a Multi-Agent data GENeration (MAGEN) system systematically synthesizes high-quality, knowledge-dense captions using a tool-based agent collaboration and verification framework, rectifying the prevalent sparsity and misalignment of raw captions. Second, Ontology-based Multi-Aspect Knowledge-Enhanced (O-MAKE) pretraining enables fine-grained, aspect-wise alignment between images and knowledge-rich texts, explicitly modeling intra- and inter-disease relationships via clinical ontologies. This approach results in superior zero-shot disease classification, rare disease recognition, and cross-modal retrieval, validated across eight dermatology benchmarks.
Methodology
Multi-Agent Data Generation (MAGEN)
MAGEN operates as a multi-stage, tool-augmented agent system to generate high-fidelity image captions:
- Foundation Model Diagnostic Prior: A dermatology-specific vision-language foundation model generates the top-5 differential diagnoses per input image, providing explicit diagnostic priors.
- Captioning Agent: A multimodal captioning agent built on the LLaVA framework (with PanDermV2 and Qwen3-14B), trained on a large instruction dataset, constructs initial clinical narratives guided by the diagnostic priors.
- Summary Agent and Structured Knowledge Base: A summary agent uses automated web curation and summarization via Qwen2.5-72B-Instruct to build Disease Cards that distill critical morphological, anatomical, and discriminative features for 371 diseases.
- Verification Agent: A RAG pipeline (Qwen2.5-VL-72B) triangulates the input image evidence, the preliminary caption, and the relevant Disease Cards. Through multi-step reasoning, morphological claims are validated or corrected; if none of the candidate diagnoses matches the visual evidence, the system abstains rather than forcing a diagnosis.
MAGEN selectively processes only the lowest-quality pairs (identified via image-text semantic similarity), maximizing data-quality gains while keeping computation tractable; a sketch of this selection step follows below. Empirical ablation shows that the foundation-model diagnostic prior supplies the largest single boost, while verification yields incremental but complementary gains.
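As a rough illustration of that selection step, the sketch below scores each image-text pair with a generic CLIP-style encoder and routes only the lowest-similarity fraction of pairs to the agent pipeline. The open_clip ViT-B-32 encoder and the 30% regeneration budget are assumptions for illustration, not the paper's configuration.

```python
import torch
import open_clip

# Assumed stand-in encoder for scoring image-text alignment; the paper's
# actual similarity model may differ.
model, _, preprocess = open_clip.create_model_and_transforms(
    "ViT-B-32", pretrained="laion2b_s34b_b79k"
)
tokenizer = open_clip.get_tokenizer("ViT-B-32")
model.eval()

@torch.no_grad()
def pair_similarity(images, captions):
    """Cosine similarity between each image and its own caption."""
    img = model.encode_image(torch.stack([preprocess(im) for im in images]))
    txt = model.encode_text(tokenizer(captions))
    img = img / img.norm(dim=-1, keepdim=True)
    txt = txt / txt.norm(dim=-1, keepdim=True)
    return (img * txt).sum(dim=-1)                       # shape (N,)

def select_low_quality_pairs(images, captions, budget=0.3):
    """Indices of the lowest-similarity pairs, capped at the given fraction."""
    sims = pair_similarity(images, captions)
    k = max(1, int(budget * len(captions)))
    return torch.topk(sims, k, largest=False).indices.tolist()
```

Only the pairs returned by `select_low_quality_pairs` would then be passed through the diagnostic-prior, captioning, and verification agents.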
Ontology-Based Multi-Aspect Knowledge-Enhanced Pretraining (O-MAKE)
O-MAKE defines a pretraining framework that extracts and exploits various granular knowledge facets within image-text pairs:
- Multi-Aspect Knowledge Decomposition: Each caption is split into four aspects: the raw caption, an ontology-based disease path (hierarchical taxonomy), visual concepts (morphology-based descriptions), and sentence-level sub-captions. LLMs automate this decomposition, working around text-encoder context-length limits.
- Ontology-Guided Weighting: Adaptive importance weights are assigned to sub-captions based on their semantic similarity to ontology embeddings, prioritizing diagnostically salient text fragments (see the first sketch after this list).
- Multi-Knowledge Image Alignment: Using a multi-positive contrastive objective, the global image embedding is simultaneously aligned with all aspect-level text embeddings, weighted by their adaptive importance (also covered in the first sketch below).
- Ontology-Based Soft-Label Learning: Intra-batch semantic similarity matrices are computed from the disease taxonomy, letting the model treat ontologically proximate samples as "softer" negatives and facilitating inter-disease knowledge transfer (second sketch below).
- Fine-Grained Patch-Level Alignment: Patch representations are aggregated into text-conditioned image features, weighted by their similarity to each sentence-level sub-caption, and aligned with the corresponding sub-caption embeddings to enforce local correspondence (third sketch below).
The total loss is a weighted sum of the global multi-knowledge contrastive objective and the patch-level fine-grained alignment objective.
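The first sketch below illustrates ontology-guided weighting and multi-positive alignment under simplifying assumptions: one global image embedding per sample, a fixed number K of aspect-level text embeddings per image, and a single temperature. Names and shapes are illustrative, not the paper's exact implementation.

```python
import torch
import torch.nn.functional as F

def subcaption_weights(subcap_emb, ontology_emb, tau=0.1):
    """Weight sub-captions by similarity to the ontology (disease-path) embedding.
    subcap_emb:   (K, D) L2-normalized sub-caption embeddings
    ontology_emb: (D,)   L2-normalized disease-path embedding
    """
    sims = subcap_emb @ ontology_emb                  # (K,)
    return F.softmax(sims / tau, dim=0)               # salient fragments receive larger weights

def multi_positive_contrastive(img_emb, text_emb, weights, temperature=0.07):
    """Align each image with all K of its aspect-level texts (multi-positive InfoNCE).
    img_emb:  (B, D)    L2-normalized global image embeddings
    text_emb: (B, K, D) L2-normalized aspect-level text embeddings
    weights:  (B, K)    per-aspect weights, rows summing to 1
    """
    B, K, D = text_emb.shape
    logits = img_emb @ text_emb.reshape(B * K, D).T / temperature   # (B, B*K)
    log_prob = F.log_softmax(logits, dim=1)
    # Positives for image i are its own K texts, at columns i*K ... i*K + K - 1.
    pos_idx = (torch.arange(B, device=img_emb.device).unsqueeze(1) * K
               + torch.arange(K, device=img_emb.device))            # (B, K)
    pos_log_prob = log_prob.gather(1, pos_idx)                      # (B, K)
    return -(weights * pos_log_prob).sum(dim=1).mean()
```

In this reading, the output of `subcaption_weights`, together with weights for the remaining aspects, populates the `weights` matrix.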
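The second sketch covers the ontology-based soft-label idea: instead of treating every other in-batch sample as a hard negative, the contrastive targets are smoothed by a taxonomy-similarity matrix. Both `tax_sim` and the mixing coefficient `alpha` are assumptions standing in for however the paper derives and combines its ontology similarities.

```python
import torch
import torch.nn.functional as F

def soft_label_contrastive(img_emb, txt_emb, tax_sim, temperature=0.07, alpha=0.2):
    """Image-to-text contrastive loss with ontology-smoothed targets.
    img_emb, txt_emb: (B, D) L2-normalized embeddings (one caption per image)
    tax_sim:          (B, B) taxonomy similarity between samples' disease labels, in [0, 1]
    """
    B = img_emb.size(0)
    logits = img_emb @ txt_emb.T / temperature               # (B, B)
    hard = torch.eye(B, device=img_emb.device)
    soft = tax_sim / tax_sim.sum(dim=1, keepdim=True)        # row-normalized taxonomy similarity
    targets = (1 - alpha) * hard + alpha * soft              # ontologically close pairs become "softer" negatives
    return -(targets * F.log_softmax(logits, dim=1)).sum(dim=1).mean()
```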
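The third sketch is a simplified, within-image reading of the patch-level alignment: each sub-caption attends over its image's patches, and the attended feature is contrasted against that image's own sub-captions; cross-image negatives are omitted for brevity.

```python
import torch
import torch.nn.functional as F

def patch_level_alignment(patch_emb, subcap_emb, temperature=0.07):
    """Fine-grained alignment between patches and sentence-level sub-captions.
    patch_emb:  (B, P, D) L2-normalized patch embeddings
    subcap_emb: (B, K, D) L2-normalized sub-caption embeddings
    """
    B, P, D = patch_emb.shape
    K = subcap_emb.size(1)
    # Each sub-caption attends over the patches of its own image.
    attn = F.softmax(torch.einsum("bkd,bpd->bkp", subcap_emb, patch_emb) / temperature, dim=-1)
    # Aggregate patches into one text-conditioned image feature per sub-caption.
    agg = F.normalize(torch.einsum("bkp,bpd->bkd", attn, patch_emb), dim=-1)   # (B, K, D)
    # Contrast each aggregated feature against all sub-captions of the same image.
    logits = torch.einsum("bkd,bjd->bkj", agg, subcap_emb) / temperature       # (B, K, K)
    targets = torch.arange(K, device=patch_emb.device).repeat(B)               # correct sub-caption per row
    return F.cross_entropy(logits.reshape(B * K, K), targets)
```

A total objective in the spirit of the weighted sum above would combine these terms with scalar coefficients.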
Experimental Results
Zero-Shot Disease Classification and Retrieval
O-MAKE achieves strong numerical improvements across all benchmark tasks:
- Disease Classification: On multi-class (113–134 categories) benchmarks, O-MAKE attains 46–54% average zero-shot accuracy, outperforming all specialty and biomedical VLP baselines (e.g., DermLIP-PanDerm) by 5–9%, and its prior version (MAKE) by 6–8%.
- Long-Tail Recognition: Robustness is especially marked for rare classes, with 50.8% average accuracy and an absolute improvement of more than 10% over certain retrained baselines.
- Cross-Modal Retrieval: On SkinCAP, O-MAKE achieves recall@N metrics up to 45.6%, exceeding both open- and domain-specific models by 4–5%.
Ablation and Representation Quality Analysis
Ablation studies confirm the positive contributions of each stage. The transition from CLIP to multi-positive multi-knowledge alignment provides the largest performance leap; fine-grained and ontology-based modules offer further cumulative improvements, particularly for rare diseases. t-SNE analysis reveals substantially improved intra-class compactness and inter-class separability for O-MAKE over strong baselines, supporting claims of enhanced representation learning.
Qualitative Evaluation of MAGEN
Case studies demonstrate MAGEN's ability to:
- Add critical clinical detail to sparse textbook captions,
- Correct ambiguous, morphology-only PubMed descriptions with explicit diagnostic information,
- Realign mis-captioned YouTube frames to morphological labels that match the visual evidence.
MAGEN achieves a consistent transformation from noisy, knowledge-poor captions to integrated, comprehensive clinical narratives without manual intervention.
Implications and Future Directions
This dual-stage framework alleviates two core bottlenecks in medical VLP: scarce high-quality annotation and suboptimal knowledge utilization. Explicit multi-aspect modeling via ontologies benefits structured knowledge transfer and drives strong generalization to rare diseases and long-tail classes, as underscored by the pronounced improvements in ontologically complex domains such as dermatology.
The modularity of MAGEN and O-MAKE allows extension to other medical image-text modalities (e.g., radiology, pathology) and even incorporation of additional data streams (e.g., structured EHR, longitudinal clinical narratives). Future research may focus on dynamic ontology construction, automated adaptation to emerging diseases, and integration with foundation models for clinical decision support.
Conclusion
By combining automated, agent-driven knowledge-rich data synthesis with ontology-aware, multi-aspect contrastive learning, this work delivers a robust, scalable framework for medical vision-language pretraining. It demonstrates significant, systematic improvements in zero-shot classification, rare disease recognition, and cross-modal retrieval across diverse dermatology tasks, with empirical support for each integrated methodological innovation. The framework has potential broad applicability across medical AI, notably in specialty domains with highly imbalanced, complex label structures.