
Multi-Aspect Knowledge-Enhanced Medical Vision-Language Pretraining with Multi-Agent Data Generation (2512.03445v1)

Published 3 Dec 2025 in cs.CV and cs.AI

Abstract: Vision-language pretraining (VLP) has emerged as a powerful paradigm in medical image analysis, enabling representation learning from large-scale image-text pairs without relying on expensive manual annotations. However, existing methods often struggle with the noise inherent in web-collected data and the complexity of unstructured long medical texts. To address these challenges, we propose a novel VLP framework integrating a Multi-Agent data GENeration (MAGEN) system and Ontology-based Multi-Aspect Knowledge-Enhanced (O-MAKE) pretraining. First, MAGEN enhances data quality by synthesizing knowledge-enriched descriptions via a foundation model-assisted captioning and retrieval-based verification pipeline. Second, O-MAKE addresses the difficulty of learning from long, unstructured texts by decomposing them into distinct knowledge aspects. This facilitates fine-grained alignment at both global and patch levels, while explicitly modeling medical concept relationships through ontology-guided mechanisms. We validate our framework in the field of dermatology, where comprehensive experiments demonstrate the effectiveness of each component. Our approach achieves state-of-the-art zero-shot performance on disease classification and cross-modal retrieval tasks across eight datasets. Our code and the augmented dataset Derm1M-AgentAug, comprising over 400k skin-image-text pairs, will be released at https://github.com/SiyuanYan1/Derm1M.

Summary

  • The paper introduces MAGEN and O-MAKE, a dual-stage system that generates and aligns rich medical image-text data using multi-agent collaboration and ontology-based techniques.
  • The paper demonstrates significant improvements in zero-shot disease classification and rare disease recognition, achieving 46–54% average accuracy and robust performance across benchmarks.
  • The paper validates its methodology through extensive ablation studies and qualitative analysis, showing enhanced intra-class compactness and inter-class separability in representation learning.

Multi-Aspect Knowledge-Enhanced Medical Vision-Language Pretraining with Multi-Agent Data Generation

Introduction

This work targets fundamental deficiencies in the current paradigm of medical vision-language pretraining (VLP), specifically within dermatology. While supervised deep learning has achieved notable advances in automated medical image analysis, scaling such systems is bottlenecked by the cost and scarcity of high-quality manual annotations. VLP, which aligns visual and textual modalities from web-crawled image-text pairs, is a promising alternative, especially for zero-shot generalization. However, leveraging web-scale data introduces two persistent challenges: noisy data quality, including misaligned and superficial captions, and suboptimal exploitation of the complex, multifaceted medical knowledge embedded in long, unstructured clinical texts.

This paper introduces an integrative pipeline with two key components. First, a Multi-Agent data GENeration (MAGEN) system systematically synthesizes high-quality, knowledge-dense captions using a tool-based agent collaboration and verification framework, rectifying the prevalent sparsity and misalignment of raw captions. Second, Ontology-based Multi-Aspect Knowledge-Enhanced (O-MAKE) pretraining enables fine-grained, aspect-wise alignment between images and knowledge-rich texts, explicitly modeling intra- and inter-disease relationships via clinical ontologies. This approach results in superior zero-shot disease classification, rare disease recognition, and cross-modal retrieval, validated across eight dermatology benchmarks.

Methodology

Multi-Agent Data Generation (MAGEN)

MAGEN operates as a multi-stage, tool-augmented agent system that generates high-fidelity image captions (a minimal sketch of the full pipeline follows this list):

  • Foundation Model Diagnostic Prior: A dermatology-specific vision-language foundation model generates the top-5 differential diagnoses per input image, providing explicit diagnostic priors.
  • Captioning Agent: A multimodal captioning agent (built on the LLaVA framework with PanDermV2 and Qwen3-14B), trained on a large instruction dataset, drafts initial clinical narratives guided by the diagnostic priors.
  • Summary Agent and Structured Knowledge Base: Disease Cards distill critical morphological, anatomical, and discriminative features for 371 diseases, using automated web curation and summarization via Qwen2.5-72B-Instruct.
  • Verification Agent: A RAG pipeline (Qwen2.5-VL-72B) triangulates the input image evidence, the preliminary caption, and the relevant Disease Cards. Through multi-step reasoning, morphological claims are validated or corrected; if none of the candidate diagnoses matches the visual evidence, the system abstains rather than forcing a diagnosis.
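
A minimal sketch of how these four stages might be chained, with the foundation model, captioning agent, and verification agent injected as placeholder callables; the interface and the abstention behavior shown here are illustrative assumptions, not the authors' implementation.

```python
def magen_pipeline(image, raw_caption, disease_cards, diagnose, caption, verify, k=5):
    """Chain the MAGEN stages. `diagnose`, `caption`, and `verify` are callables
    wrapping the foundation model, captioning agent, and verification agent
    (placeholders in this sketch)."""
    priors = diagnose(image, k=k)                        # top-k differential diagnoses
    draft = caption(image, raw_caption, priors)          # knowledge-enriched draft caption
    cards = [disease_cards[d] for d in priors if d in disease_cards]  # retrieve Disease Cards
    supported, corrected = verify(image, draft, cards)   # validate / correct morphological claims
    return corrected if supported else raw_caption       # abstain: keep the original caption
```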

MAGEN selectively reprocesses the lowest-quality pairs (identified via image-text semantic similarity), maximizing data-quality gains at modest computational cost. Empirical ablation shows that the foundation model prior supplies the largest single boost, while verification yields incremental but complementary gains.
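
A minimal sketch of this similarity-based selection, assuming precomputed CLIP-style image and text embeddings; the fraction reprocessed is an illustrative choice, not the paper's exact setting.

```python
import torch
import torch.nn.functional as F

@torch.no_grad()
def select_low_quality(image_embs: torch.Tensor,
                       text_embs: torch.Tensor,
                       fraction: float = 0.3) -> torch.Tensor:
    """Return indices of the lowest image-text cosine-similarity pairs,
    i.e., the candidates MAGEN would regenerate."""
    image_embs = F.normalize(image_embs, dim=-1)
    text_embs = F.normalize(text_embs, dim=-1)
    sims = (image_embs * text_embs).sum(dim=-1)          # per-pair cosine similarity
    k = max(1, int(fraction * sims.numel()))
    return sims.topk(k, largest=False).indices           # lowest-similarity pairs first
```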

Ontology-Based Multi-Aspect Knowledge-Enhanced Pretraining (O-MAKE)

O-MAKE defines a pretraining framework that extracts and exploits various granular knowledge facets within image-text pairs:

  • Multi-Aspect Knowledge Decomposition: Each caption is decomposed into four aspects: the raw caption, an ontology-based disease path (hierarchical taxonomy), visual concepts (morphology), and sentence-level sub-captions. LLMs automate this decomposition, sidestepping text-encoder context limits.
  • Ontology-Guided Weighting: Adaptive importance is assigned to subcaptions based on their semantic similarity to ontology embeddings, prioritizing diagnostically salient text fragments.
  • Multi-Knowledge Image Alignment: Using multi-positive contrastive objectives, global image embeddings are simultaneously aligned with all aspect-level text embeddings, incorporating their adaptive weights (a sketch of the weighting and the multi-positive objective follows this list).
  • Ontology-Based Soft-Label Learning: Intra-batch semantic similarity matrices are computed from the disease taxonomy, allowing the model to treat ontologically proximate samples as "softer" negatives and facilitating inter-disease knowledge transfer (see the soft-label sketch after the loss description below).
  • Fine-Grained Patch-Level Alignment: Knowledge-enhanced global image embeddings aggregate patch representations weighted by their textual similarity; sentence-level sub-caption embeddings are aligned to these for local correspondence.
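
A minimal sketch, assuming a CLIP-style dual encoder, of the ontology-guided sub-caption weighting and a weighted multi-positive contrastive objective; the softmax weighting, temperature, and target normalization are illustrative choices rather than the paper's exact formulation.

```python
import torch
import torch.nn.functional as F

def ontology_weights(subcap_embs: torch.Tensor, onto_emb: torch.Tensor) -> torch.Tensor:
    """Weight each sub-caption embedding (S, D) by its cosine similarity to the
    ontology (disease-path) embedding (D,); diagnostically salient fragments score higher."""
    sims = F.cosine_similarity(subcap_embs, onto_emb.unsqueeze(0), dim=-1)  # (S,)
    return torch.softmax(sims, dim=0)

def multi_positive_contrastive(img_embs: torch.Tensor, text_embs: torch.Tensor,
                               pos_mask: torch.Tensor, pos_weights: torch.Tensor,
                               tau: float = 0.07) -> torch.Tensor:
    """img_embs: (B, D); text_embs: (T, D); pos_mask, pos_weights: (B, T).
    Each image is pulled toward all of its aspect-level texts at once,
    weighted by the ontology-guided importances."""
    img_embs = F.normalize(img_embs, dim=-1)
    text_embs = F.normalize(text_embs, dim=-1)
    logits = img_embs @ text_embs.T / tau                 # (B, T) similarity logits
    log_probs = F.log_softmax(logits, dim=-1)
    targets = pos_weights * pos_mask                      # keep only this image's positives
    targets = targets / targets.sum(dim=-1, keepdim=True).clamp_min(1e-8)
    return -(targets * log_probs).sum(dim=-1).mean()
```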

The total loss is a weighted sum of the global multi-knowledge contrastive objective and the patch-level fine-grained alignment objective.
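
A minimal sketch of the ontology-based soft labels and the combined objective; the shared-prefix proximity measure, temperature `sigma`, and weight `lam` are illustrative assumptions, not the paper's exact definitions.

```python
import torch

def path_similarity(p1: list, p2: list) -> float:
    """Fraction of shared root-to-leaf prefix between two hierarchical disease
    paths -- a crude ontological proximity measure (assumption)."""
    shared = 0
    for a, b in zip(p1, p2):
        if a != b:
            break
        shared += 1
    return shared / max(len(p1), len(p2))

def ontology_soft_targets(disease_paths: list, sigma: float = 1.0) -> torch.Tensor:
    """Intra-batch target matrix in which ontologically proximate samples receive
    nonzero probability mass instead of acting as hard negatives."""
    B = len(disease_paths)
    sim = torch.tensor([[path_similarity(disease_paths[i], disease_paths[j])
                         for j in range(B)] for i in range(B)])
    return torch.softmax(sim / sigma, dim=-1)

def total_loss(global_contrastive: torch.Tensor, patch_alignment: torch.Tensor,
               lam: float = 0.5) -> torch.Tensor:
    """Weighted sum of the global multi-knowledge contrastive term and the
    patch-level fine-grained alignment term."""
    return global_contrastive + lam * patch_alignment
```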

Experimental Results

Zero-Shot Disease Classification and Retrieval

O-MAKE achieves strong numerical improvements across all benchmark tasks:

  • Disease Classification: On multi-class (113–134 categories) benchmarks, O-MAKE attains 46–54% average zero-shot accuracy, outperforming all specialty and biomedical VLP baselines (e.g., DermLIP-PanDerm) by 5–9%, and its prior version (MAKE) by 6–8%.
  • Long-Tail Recognition: Robustness is especially marked for rare classes, with 50.8% average accuracy, a >10% absolute improvement over some retrained baselines.
  • Cross-Modal Retrieval: On SkinCAP, O-MAKE achieves recall@N metrics up to 45.6%, exceeding both open- and domain-specific models by 4–5%.

Ablation and Representation Quality Analysis

Ablation studies confirm the positive contributions of each stage. The transition from CLIP to multi-positive multi-knowledge alignment provides the largest performance leap; fine-grained and ontology-based modules offer further cumulative improvements, particularly for rare diseases. t-SNE analysis reveals substantially improved intra-class compactness and inter-class separability for O-MAKE over strong baselines, supporting claims of enhanced representation learning.

Qualitative Evaluation of MAGEN

Case studies demonstrate MAGEN's ability to:

  • Add critical clinical detail to sparse textbook captions,
  • Correct ambiguous, morphology-only PubMed descriptions with explicit diagnostic information,
  • Realign mis-captioned YouTube frames to correct morphological labels matching the visual evidence.

MAGEN achieves a consistent transformation from noisy, knowledge-poor captions to integrated, comprehensive clinical narratives without manual intervention.

Implications and Future Directions

This dual-stage framework alleviates two core bottlenecks in medical VLP: the scarcity of high-quality annotations and the underuse of knowledge embedded in clinical text. The strong generalization to rare and long-tail diseases indicates that explicit multi-aspect modeling via ontologies benefits structured knowledge transfer, particularly in ontologically complex domains such as dermatology.

The modularity of MAGEN and O-MAKE allows extension to other medical image-text modalities (e.g., radiology, pathology) and incorporation of additional data streams (e.g., structured EHRs, longitudinal clinical narratives). Future research may focus on dynamic ontology construction, automated adaptation to emerging diseases, and integration with foundation models for clinical decision support.

Conclusion

By combining automated, agent-driven knowledge-rich data synthesis with ontology-aware, multi-aspect contrastive learning, this work delivers a robust, scalable framework for medical vision-language pretraining. It demonstrates significant, systematic improvements in zero-shot classification, rare disease recognition, and cross-modal retrieval across diverse dermatology tasks, with empirical support for each integrated methodological innovation. The framework has potential broad applicability across medical AI, notably in specialty domains with highly imbalanced, complex label structures.
