Lingshu: A Generalist Foundation Model for Unified Multimodal Medical Understanding and Reasoning

Published 8 Jun 2025 in cs.CL, cs.AI, and cs.CV | (2506.07044v4)

Abstract: Multimodal LLMs (MLLMs) have demonstrated impressive capabilities in understanding common visual elements, largely due to their large-scale datasets and advanced training strategies. However, their effectiveness in medical applications remains limited due to the inherent discrepancies between data and tasks in medical scenarios and those in the general domain. Concretely, existing medical MLLMs face the following critical limitations: (1) limited coverage of medical knowledge beyond imaging, (2) heightened susceptibility to hallucinations due to suboptimal data curation processes, (3) lack of reasoning capabilities tailored for complex medical scenarios. To address these challenges, we first propose a comprehensive data curation procedure that (1) efficiently acquires rich medical knowledge data not only from medical imaging but also from extensive medical texts and general-domain data; and (2) synthesizes accurate medical captions, visual question answering (VQA), and reasoning samples. As a result, we build a multimodal dataset enriched with extensive medical knowledge. Building on the curated data, we introduce our medical-specialized MLLM: Lingshu. Lingshu undergoes multi-stage training to embed medical expertise and enhance its task-solving capabilities progressively. Besides, we preliminarily explore the potential of applying reinforcement learning with verifiable rewards paradigm to enhance Lingshu's medical reasoning ability. Additionally, we develop MedEvalKit, a unified evaluation framework that consolidates leading multimodal and textual medical benchmarks for standardized, fair, and efficient model assessment. We evaluate the performance of Lingshu on three fundamental medical tasks, multimodal QA, text-based QA, and medical report generation. The results show that Lingshu consistently outperforms the existing open-source multimodal models on most tasks ...

Summary

  • The paper presents Lingshu as a foundation model that unifies multimodal medical data for enhanced understanding and reasoning.
  • It employs a multi-stage training pipeline, including shallow and deep alignment, instruction tuning, and medical-oriented reinforcement learning.
  • The model demonstrates superior benchmark performance and scalability under the MedEvalKit evaluation framework, and its practical utility is illustrated through real-world case studies.

Lingshu: A Generalist Foundation Model for Unified Multimodal Medical Understanding and Reasoning

The paper presents Lingshu, a foundation model designed for comprehensive multimodal understanding and reasoning within the medical domain. The model addresses critical gaps in existing medical multimodal LLMs (MLLMs) through meticulous data curation, innovative training methodologies, and the introduction of a robust evaluation framework, MedEvalKit.

Data Curation Pipeline

Lingshu's development begins with a structured data curation process that gathers a diverse array of open-source medical data, general-domain data, and synthetic medical data (Figure 1). This multi-faceted approach aims to overcome limitations of previous models, which often suffer from restricted medical knowledge coverage and an increased risk of hallucinations due to suboptimal data curation.

Figure 1: The overall data curation pipeline of medical multimodal and textual data.

  1. Data Collection: The model integrates multimodal medical datasets, including medical image captions and medical instruction data. Additionally, medical unimodal datasets comprising raw images and text are collected to enhance domain knowledge.
  2. Data Synthesis: To address data scarcity, the paper proposes synthesizing long-form captions and visual question answering (VQA) samples. These efforts increase the diversity and depth of the training dataset, allowing the model to learn more accurately from detailed medical information (a synthesis sketch follows this list).
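
To make the synthesis step concrete, the following is a minimal sketch of turning a long-form caption into a multiple-choice VQA sample by prompting a general-purpose model through an OpenAI-compatible API. The prompt wording, model name, and JSON schema are illustrative assumptions, not the paper's actual synthesis prompts or pipeline.

```python
import json
from openai import OpenAI

client = OpenAI()

SYSTEM_PROMPT = (
    "You are a medical data annotator. Given a detailed caption of a medical "
    "image, write one multiple-choice visual question answering item with four "
    "options and mark the correct option. Return JSON with keys 'question', "
    "'options', and 'answer'."
)


def synthesize_vqa(long_caption: str) -> dict:
    """Turn one synthesized long-form caption into one multiple-choice VQA sample."""
    response = client.chat.completions.create(
        model="gpt-4o",                            # illustrative choice of teacher model
        messages=[
            {"role": "system", "content": SYSTEM_PROMPT},
            {"role": "user", "content": long_caption},
        ],
        response_format={"type": "json_object"},   # request parseable JSON output
    )
    return json.loads(response.choices[0].message.content)
```

In practice, samples generated this way would still pass through the quality filtering and deduplication steps described elsewhere in the pipeline.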

Model Training

The architecture leverages the Qwen2.5-VL model, following a multi-stage training pipeline that adapts it to the medical domain through a series of incremental learning stages (Figure 2).

Figure 2: The training pipeline of Lingshu, consisting of medical shallow alignment, medical deep alignment, medical instruction tuning, and medical-oriented reinforcement learning.

  1. Medical Shallow Alignment: This initial stage aligns the model with medical imagery using coarsely annotated medical image-text pairs.
  2. Medical Deep Alignment: Expanding beyond the foundational vision-language alignment, this stage integrates high-quality datasets and fine-grained textual information to embed deeper domain knowledge.
  3. Medical Instruction Tuning: This phase enhances the model's instruction-following capacity, critical for applying the model in real-world medical use cases.
  4. Medical-oriented Reinforcement Learning: To improve reasoning capabilities, the model is further trained with reinforcement learning using verifiable rewards, tailored specifically for medical scenarios (a minimal sketch follows this list).
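
As a rough illustration of how such a staged curriculum might be expressed, the sketch below encodes the four stages as configuration objects, with a helper that unfreezes only selected modules and an exact-match reward check for the reinforcement-learning stage. The stage names follow the paper; which modules are trainable in each stage, the data-mixture labels, and the answer-extraction logic are assumptions for illustration, not Lingshu's actual configuration.

```python
from dataclasses import dataclass


@dataclass
class Stage:
    name: str
    trainable: tuple[str, ...]   # parameter-name prefixes left unfrozen (assumed)
    data: tuple[str, ...]        # data mixtures used in this stage (assumed labels)


PIPELINE = [
    Stage("medical_shallow_alignment", ("projector",),
          ("coarse_image_caption_pairs",)),
    Stage("medical_deep_alignment", ("projector", "vision_encoder", "language_model"),
          ("high_quality_captions", "medical_text")),
    Stage("medical_instruction_tuning", ("projector", "vision_encoder", "language_model"),
          ("medical_vqa", "general_instructions", "cot_reasoning_samples")),
    Stage("medical_rl_verifiable_rewards", ("language_model",),
          ("multiple_choice_qa",)),
]


def freeze_for_stage(model, stage: Stage) -> None:
    """Unfreeze only the modules listed for this stage (works on any nn.Module)."""
    for name, param in model.named_parameters():
        param.requires_grad = any(name.startswith(prefix) for prefix in stage.trainable)


def verifiable_reward(generated: str, gold_choice: str) -> float:
    """Exact-match check on the final answer line, the RLVR-style reward signal."""
    predicted = (generated.strip().splitlines() or [""])[-1].strip().upper()
    return 1.0 if predicted == gold_choice.strip().upper() else 0.0
```

A verifiable reward of this kind only applies to answers that can be checked automatically (e.g., multiple-choice options), which is one reason the paper treats its reinforcement-learning exploration as preliminary.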

Evaluation with MedEvalKit

MedEvalKit is introduced as a benchmark suite for evaluating the model's performance across various medical tasks, providing a standardized, scalable, and reproducible evaluation framework (Figure 3). It consolidates a diverse set of medical benchmarks covering multimodal QA, textual QA, and report generation; a minimal sketch of such a benchmark-agnostic evaluation loop follows the list below.

Figure 3: Medical Multimodal Benchmark overview.

  • Performance Analysis: Lingshu demonstrates superior performance across multiple benchmarks, often outperforming existing open-source models and approaching proprietary systems such as GPT-4 on some tasks.
  • Scalability: The performance is consistent across different scales of training data, demonstrating robust generalization capabilities (Figure 4).
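
The hypothetical harness below illustrates the kind of standardized, benchmark-agnostic evaluation loop that a framework like MedEvalKit provides: the same prompting and scoring protocol applied to every benchmark. The model interface, sample format, and exact-match metric are placeholders, not MedEvalKit's actual API.

```python
from typing import Callable


def exact_match_accuracy(predictions: list[str], answers: list[str]) -> float:
    """Simple accuracy for closed-form (e.g., multiple-choice) answers."""
    pairs = list(zip(predictions, answers))
    correct = sum(p.strip().lower() == a.strip().lower() for p, a in pairs)
    return correct / max(len(pairs), 1)


def evaluate(model: Callable[[dict], str],
             benchmarks: dict[str, list[dict]]) -> dict[str, float]:
    """Apply the same prompting and scoring protocol to every benchmark."""
    scores = {}
    for name, samples in benchmarks.items():
        # Each sample is assumed to look like {"image": ..., "question": ..., "answer": ...}.
        predictions = [model(sample) for sample in samples]
        answers = [sample["answer"] for sample in samples]
        scores[name] = exact_match_accuracy(predictions, answers)
    return scores
```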

Case Studies and Real-World Applications

The paper illustrates Lingshu's potential through five detailed case studies that underscore its applicability in practical medical contexts, such as medical diagnosis and report generation (Figures 10 to 15). These examples demonstrate Lingshu's ability to interpret complex medical information and generate accurate, contextually relevant outputs.

Figure 5: Case studies on Lingshu in visual question answering across various medical imaging modalities.

Conclusion

Lingshu represents a significant step toward developing robust, generalist MLLMs tailored to the complexities of medical data. By addressing existing limitations through innovative data curation and enrichment strategies, multi-stage training paradigms, and rigorous evaluation, this study suggests pathways for future development in the field of medical AI applications. The integration of reinforcement learning for enhanced reasoning and the proposed MedEvalKit offer a comprehensive approach to evaluating and deploying these advanced models in healthcare settings.

Knowledge Gaps

The following items summarize what remains missing, uncertain, or unexplored in the paper, with concrete directions for future research and validation.

  • Clinical validation: No prospective or retrospective clinician-in-the-loop studies, reader studies, or outcome-based evaluations demonstrating utility, safety, or impact in real clinical workflows.
  • Hallucination measurement: Absence of standardized hallucination/factuality metrics (e.g., RadGraph F1, CheXpert labeler consistency, clinical entity/relationship correctness) and radiologist adjudication for both VQA and report generation.
  • Uncertainty and calibration: No mechanisms or evaluations of confidence calibration, selective abstention, or deferral-to-human policies—critical for high-stakes medical use.
  • Robustness to acquisition variability: Unassessed robustness across institutions, devices, vendors, imaging protocols, windowing/normalization for CT, and common artifacts (motion, noise, occlusions, compressed/low-resolution scans).
  • 3D volumetric and temporal reasoning: The pipeline reduces 3D volumes to 2D slices; no evaluation on volumetric tasks (lesion volumetry, sequence-level reasoning), longitudinal comparisons with priors, or progression tracking.
  • Real-world OCR representativeness: Synthetic OCR tasks based on school exam images may not reflect clinical OCR challenges (DICOM overlays, annotations, hand-written notes, scanned forms, non-Latin scripts, variable layouts).
  • Safety alignment and guardrails: No explicit refusal policies, safety filters, or evaluation on avoiding harmful recommendations, especially in patient-facing dialogue or decision support contexts.
  • Bias and fairness auditing: No analyses of performance across demographics (age, sex, skin tone), institutions, geographic regions, or modalities; no mitigation strategies or fairness reporting.
  • Data provenance, consent, and licensing: Limited discussion on data governance (consent, licensing, jurisdictional compliance), especially for scraped dialogue datasets and synthesized data derived from proprietary models.
  • Faithfulness of chain-of-thought: CoT traces are LLM-generated and “validated” by another LLM; no human/clinical verification of step-wise correctness, guideline adherence, or causal faithfulness.
  • RLVR details and efficacy: The RL component is preliminary (∼65K samples) with missing details on reward design, verifiability scope (beyond multiple-choice), sample efficiency, and ablations against SFT-only baselines.
  • Scaling RL to non-verifiable tasks: Open question on how to define verifiable rewards for free-form generation (e.g., radiology reports), multi-image reasoning, and complex differential diagnoses.
  • Multilingual coverage: Aside from inclusion of LLaVA-Med-zh, there is no systematic evaluation across languages, code-switching, or cross-lingual medical terminology and abbreviations.
  • Clinically meaningful report evaluation: Unclear whether report generation was scored with clinically grounded metrics (finding-level correctness, impression accuracy, critical findings, actionable recommendations) or accepted by radiologist review.
  • Beyond vision+text modalities: No integration or evaluation with waveforms (ECG/EEG), genomics, lab panels, temporal vitals, or audio (auscultation), which are central to comprehensive clinical reasoning.
  • Deployment constraints: No analysis of inference latency, memory/compute requirements, throughput, or on-premise feasibility (e.g., hospital hardware, privacy-preserving inference).
  • Data leakage and dedup coverage: Deduplication focuses on exact duplicates (perceptual hashing with Hamming distance 0) and known benchmarks; no assessment of near-duplicates or semantic leakage across training and evaluation sets (a minimal dedup sketch follows this list).
  • Extent of expert involvement: “Doctor preference” elicitation is described, but the number, specialties, inter-rater agreement, and validation of synthesized captions against expert ground truth are not reported.
  • Synthetic data dependency: Heavy reliance on GPT-4o/Gemini distillation and synthetic captions/VQA may propagate upstream biases or errors; no quantification of synthetic-vs-human data contributions or their differential impact.
  • Statistical rigor of SOTA claims: Lacks confidence intervals, significance testing, or consistent evaluation protocols across proprietary baselines; reproducibility of claims needs stronger methodological transparency.
  • EHR/PACS integration: No discussion on interfacing with DICOM metadata, PACS systems, EHR context, or how structured clinical information informs multimodal reasoning.
  • High-resolution pathology (WSI) handling: No approach to gigapixel tile-based processing, slide-level aggregation, or evaluation on real whole-slide imaging tasks.
  • Modality balancing and sampling strategy: Although modality distribution is shown, there is no analysis of sampling strategies, potential overfitting to dominant modalities, or performance trade-offs across long-tail modalities.
  • Multi-turn clinical dialogue: Training includes single-turn dialogues; real-world clinical interactions are multi-turn and context-rich; evaluation on multi-turn consistency, memory, and safety is missing.
  • Adversarial robustness and security: No testing against adversarial prompts, jailbreaks, or intentionally misleading medical inputs that could induce unsafe outputs.
  • General-domain data mixing ablations: Insufficient analysis of how general-domain data affects medical task performance (catastrophic forgetting vs. transfer), and whether curriculum or domain-weighting improves outcomes.
  • Pre-/post-processing standardization: No explicit handling of modality-specific preprocessing (e.g., HU windowing for CT, intensity normalization for MRI), which impacts performance and generalization.
  • Continual learning and drift: No strategy for updating the model under distribution shift (new scanners, protocols) while preventing forgetting, or for continuous safety monitoring.
  • Transparency and reproducibility: Limited release details on full training mixtures, cleaning scripts, synthesis prompts, RL reward code, and MedEvalKit evaluation configs to enable independent replication.
  • Regulatory readiness: No pathway or plan for clinical certification (e.g., FDA/CE), risk management, post-market surveillance, or alignment with medical device regulations.
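
To ground the deduplication item above, here is a minimal sketch of the two steps the paper describes: exact image deduplication via perceptual hashing (Hamming distance 0) and near-duplicate text filtering with min-hash LSH. It uses the imagehash and datasketch libraries; the tokenization and the LSH threshold are illustrative assumptions.

```python
from PIL import Image
import imagehash
from datasketch import MinHash, MinHashLSH


def dedup_images(paths: list[str]) -> list[str]:
    """Keep one image per perceptual hash; identical hashes mean Hamming distance 0."""
    seen, kept = set(), []
    for path in paths:
        h = str(imagehash.phash(Image.open(path)))
        if h not in seen:
            seen.add(h)
            kept.append(path)
    return kept


def dedup_texts(texts: list[str], threshold: float = 0.9) -> list[str]:
    """Drop texts whose token sets collide with an already-kept text in the LSH index."""
    lsh = MinHashLSH(threshold=threshold, num_perm=128)
    kept = []
    for i, text in enumerate(texts):
        m = MinHash(num_perm=128)
        for token in text.lower().split():
            m.update(token.encode("utf-8"))
        if not lsh.query(m):          # no near-duplicate kept so far
            lsh.insert(str(i), m)
            kept.append(text)
    return kept
```

Extending this to near-duplicate images (a nonzero Hamming-distance threshold) or to semantic leakage detection remains the open question raised above.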

Glossary

  • Ablation studies: Controlled experiments that remove or vary components to measure their impact on model performance. "Through comprehensive ablation studies, we further confirm the importance of data quality and medical knowledge coverage on overall performance"
  • Anatomical recognition: Identifying anatomical structures in medical images. "VQA samples for tasks such as diagnosis, anatomical recognition, and modality classification"
  • Bounding boxes: Rectangular annotations that localize objects or regions in images. "Bounding boxes are directly overlaid on the original images"
  • Chain-of-thought (CoT): Explicit step-by-step reasoning traces generated alongside answers. "chain-of-thoughts (CoTs) reasoning samples"
  • Data contamination: Overlap between training and evaluation data that can inflate reported performance. "To prevent data contamination, we perform strict image and text deduplication"
  • Dermoscopy: Imaging of skin lesions using a dermatoscope. "PAD-UFES-20~\citep{PACHECO2020106221} for Dermoscopy"
  • End-to-end fine-tuning: Updating all model components jointly during training on task data. "are unfrozen to allow end-to-end fine-tuning"
  • Fundus: Retinal imaging of the interior surface of the eye. "Fundus & BRESET~\citep{nakayama2023brazilian}"
  • Hallucinations: Confident yet incorrect or fabricated model outputs. "heightened susceptibility to hallucinations due to suboptimal data curation processes"
  • Histopathology: Microscopic examination of tissue to study disease. "and EBHI-Seg~\citep{shi2023ebhi} for Histopathology"
  • Imaging modality: A type of medical imaging (e.g., X-ray, CT, MRI) with distinct acquisition characteristics. "over 12 medical imaging modalities"
  • Min-hash locality-sensitive hashing (LSH): A technique for efficient near-duplicate detection via hash-based similarity. "using min-hash locality-sensitive hashing (LSH), retaining only the highest-quality version of each near-duplicate"
  • Modality classifier: A model that predicts the imaging modality of an input image. "we train a modality classifier based on BiomedCLIP"
  • Multimodal LLMs (MLLMs): LLMs that process and reason over multiple modalities such as text and images. "Multimodal LLMs (MLLMs) have demonstrated impressive capabilities"
  • Optical Character Recognition (OCR): Automatically extracting text from images. "OCR-based instruction datasets enhance the model's ability to interpret textual information embedded within medical images"
  • Perceptual hashing: Hashing that maps images to fingerprints reflecting visual similarity. "we apply perceptual hashing with a strict Hamming distance threshold of zero to detect and remove exact duplicate images"
  • Radiology reports: Structured narratives describing imaging findings and clinical impressions. "radiology reports that detail clinical findings and impressions"
  • Region of Interest (RoI): A specific area in an image relevant to analysis or diagnosis. "RoIs are available as either segmentation masks or bounding boxes"
  • Segmentation masks: Pixel-level labels outlining regions or structures in images. "segmentation masks are converted into bounding boxes by computing the minimal enclosing rectangles"
  • Self-instruct-based method: Generating training data by prompting models with seed examples to create new instructions or questions. "Self-instruct-based method"
  • State-of-the-art (SOTA): The best reported performance to date on a task or benchmark. "state-of-the-art (SOTA) performance"
  • Supervised Fine-Tuning (SFT): Training a model on labeled instruction data to improve capabilities. "supervised fine-tuning (SFT) on medical instruction data is effective in injecting medical knowledge into LLMs"
  • Vision encoder: The component that converts images into feature representations for the LLM. "a vision encoder"
  • Vision-language alignment: Training to map visual features and text into a shared semantic space. "for vision-language alignment and supervised fine-tuning"
  • Visual Question Answering (VQA): Answering questions about images, requiring visual understanding and reasoning. "visual question answering (VQA)"
  • Whole Slide Image (WSI): High-resolution digitized pathology slide. "A WSI image of carcinogenic DNA damage caused by ultraviolet (UV) radiation"
