MMRareBench: A Rare-Disease Multimodal and Multi-Image Medical Benchmark

Published 12 Apr 2026 in cs.CV | (2604.10755v1)

Abstract: Multimodal LLMs (MLLMs) have advanced clinical tasks for common conditions, but their performance on rare diseases remains largely untested. In rare-disease scenarios, clinicians often lack prior clinical knowledge, forcing them to rely strictly on case-level evidence for clinical judgments. Existing benchmarks predominantly evaluate common-condition, single-image settings, leaving multimodal and multi-image evidence integration under rare-disease data scarcity systematically unevaluated. We introduce MMRareBench, to our knowledge the first rare-disease benchmark jointly evaluating multimodal and multi-image clinical capability across four workflow-aligned tracks: diagnosis, treatment planning, cross-image evidence alignment, and examination suggestion. The benchmark comprises 1,756 question-answer pairs with 7,958 associated medical images curated from PMC case reports, with Orphanet-anchored ontology alignment, track-specific leakage control, evidence-grounded annotations, and a two-level evaluation protocol. A systematic evaluation of 23 MLLMs reveals fragmented capability profiles and universally low treatment-planning performance, with medical-domain models trailing general-purpose MLLMs substantially on multi-image tracks despite competitive diagnostic scores. These patterns are consistent with a capacity dilution effect: medical fine-tuning can narrow the diagnostic gap but may erode the compositional multi-image capability that rare-disease evidence integration demands.


Authors (12)

Summary

The paper introduces a novel benchmark that integrates multiple imaging modalities with longitudinal clinical narratives for rare diseases.
It uses a five-stage pipeline to extract, filter, and mask evidence from thousands of real-world case reports, with leakage control so that diagnostic cues do not reach the model input.
Experimental results reveal large performance gaps in treatment planning and cross-image evidence alignment, pointing to weak multi-image evidence integration in current MLLMs.

MMRareBench: A Benchmark for Multimodal Multi-Image Reasoning in Rare Diseases

Motivation and Context

Recent multimodal LLMs (MLLMs) have been validated mostly on clinical tasks involving common diseases and single-image inputs. Rare diseases, which collectively affect a significant global population, pose distinct challenges: severe class imbalance, phenotypic overlap, and systematic data scarcity. Clinical workflows for rare diseases demand evidence-grounded reasoning that integrates longitudinal narratives, heterogeneous data types, and multiple medical images. Existing benchmarks do not cover the intersection of multimodal and multi-image reasoning under the rare-disease constraint. "MMRareBench: A Rare-Disease Multimodal and Multi-Image Medical Benchmark" (2604.10755) introduces a benchmark aimed at this gap.

Benchmark Design and Formalism

MMRareBench comprises 1,756 QA pairs linked to 7,958 medical images and covers 403 rare diseases, each mapped to an Orphanet-anchored diagnosis. Construction follows a five-stage pipeline: evidence slices are extracted from 14,700 real-world PMC case reports, LLM-assisted curation identifies diseases and modalities, and the results pass through ontology-aligned filtering. To counter annotation leakage (i.e., situations where direct diagnostic cues could enter the model context), the authors build a cascade of masked views that remove explicit diagnostic mentions and evidence-hinting content at multiple document layers. Cases are then audited automatically for integrity and residual leakage. The rejection rate at this stage is high, so only non-trivial, evidence-grounded cases enter the final benchmark.
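The masking and residual-leakage audit described above can be pictured with a small sketch. This is not the authors' pipeline: the function names `mask_diagnosis` and `has_residual_leak`, the placeholder string, and the example disease are all assumptions chosen for illustration, and the summary does not say how masking is implemented.

```python
import re

def mask_diagnosis(text, disease_names, placeholder="[MASKED_DIAGNOSIS]"):
    """Replace case-insensitive mentions of the given names with a placeholder.

    Hypothetical helper: the real benchmark masks at several document layers;
    this sketch handles a single text string only.
    """
    # Longest names first so a full name is masked before any shorter alias.
    names = sorted({n.strip() for n in disease_names if n.strip()},
                   key=len, reverse=True)
    if not names:
        return text
    pattern = re.compile(
        r"(?<!\w)(?:" + "|".join(re.escape(n) for n in names) + r")(?!\w)",
        flags=re.IGNORECASE,
    )
    return pattern.sub(placeholder, text)

def has_residual_leak(masked_text, disease_names):
    """Conservative audit: flag any surviving substring match of a name."""
    lowered = masked_text.lower()
    return any(n.strip().lower() in lowered for n in disease_names if n.strip())

# Illustrative input, not taken from the benchmark.
aliases = ["Erdheim-Chester disease", "ECD"]
masked = mask_diagnosis("Biopsy confirmed Erdheim-Chester disease (ECD).", aliases)
```

The audit uses plain substring matching on purpose, so it can over-flag short acronyms that happen to occur inside other words. A stricter or looser check is a design choice that this summary does not specify.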

The evaluation protocol is workflow-aligned and structured into four clinical tracks:

T1: Diagnosis — models must predict the masked rare-disease diagnosis from de-identified narratives and accompanying images.
T2: Treatment Planning — given the working diagnosis, models must propose detailed, stage-wise treatment and management strategies.
T3: Cross-Image Evidence Alignment — models are tasked with (i) per-image finding extraction, (ii) evidence correlation across imaging modalities, and (iii) integrated clinical synthesis.
T4: Examination Suggestion — models must recommend the next-step diagnostic workup (with rationale), under conditions where conclusive tests are masked.
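One way to picture the four tracks is as a tagged record per QA pair. The sketch below is an assumption for illustration only: the class and field names (`Track`, `BenchmarkItem`, `image_ids`, and so on) are hypothetical and do not reflect the released data format.

```python
from dataclasses import dataclass, field
from enum import Enum
from typing import Dict, List

class Track(Enum):
    DIAGNOSIS = "T1"
    TREATMENT_PLANNING = "T2"
    CROSS_IMAGE_ALIGNMENT = "T3"
    EXAMINATION_SUGGESTION = "T4"

@dataclass
class BenchmarkItem:
    # All field names are hypothetical placeholders.
    item_id: str
    track: Track
    question: str
    image_ids: List[str] = field(default_factory=list)
    reference_answer: str = ""

    def is_multi_image(self) -> bool:
        return len(self.image_ids) > 1

def group_by_track(items: List[BenchmarkItem]) -> Dict[Track, List[BenchmarkItem]]:
    """Bucket items by track so each track can be scored with its own rubric."""
    grouped: Dict[Track, List[BenchmarkItem]] = {track: [] for track in Track}
    for item in items:
        grouped[item.track].append(item)
    return grouped

# Toy items with invented identifiers.
items = [
    BenchmarkItem("case-001", Track.DIAGNOSIS, "What is the diagnosis?", ["img-1", "img-2"]),
    BenchmarkItem("case-002", Track.EXAMINATION_SUGGESTION, "Which test comes next?", ["img-3"]),
]
grouped = group_by_track(items)
```

Grouping by track up front keeps per-track scoring separate, which matches the track-specific evaluation the benchmark describes.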

Each QA pair carries an annotation schema whose evidence chains are anchored to citable document blocks, which supports strict evaluation of model outputs. Scoring follows a two-level protocol: track-specific, rubric-based grading by a judge model (Qwen3-VL-235B) and deterministic token-level F1 (notably for diagnosis).
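For readers unfamiliar with token-level F1, a minimal sketch follows. It uses the common bag-of-tokens definition (harmonic mean of token precision and recall). The tokenizer and normalization here are assumptions; the summary does not state the exact rules the benchmark applies.

```python
import re
from collections import Counter

def tokenize(text):
    # Assumed normalization: lowercase, keep alphanumeric runs only.
    return re.findall(r"[a-z0-9]+", text.lower())

def token_f1(prediction, reference):
    """Bag-of-tokens F1 between a predicted and a reference answer."""
    pred = tokenize(prediction)
    ref = tokenize(reference)
    if not pred or not ref:
        # Both empty counts as a match; exactly one empty scores zero.
        return float(pred == ref)
    # Multiset intersection counts each shared token at most min(count) times.
    overlap = sum((Counter(pred) & Counter(ref)).values())
    if overlap == 0:
        return 0.0
    precision = overlap / len(pred)
    recall = overlap / len(ref)
    return 2 * precision * recall / (precision + recall)
```

As a worked case, a prediction of "chester disease" against the reference "erdheim chester disease" shares two tokens, giving precision 1, recall 2/3, and F1 of 0.8. Because the metric is deterministic, it complements the judge-based rubric score rather than replacing it.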

Experimental Results

Systematic benchmarking of 23 state-of-the-art MLLMs, spanning closed-source and open-source models and including nine medical-specialized models, reveals substantial capability gaps:

Universal Bottleneck in Treatment Planning: Track T2 has the lowest peak score of any track (maximum 49.2). This is attributed to the compounded requirement of synthesizing diagnosis, imaging, and clinical judgment under weak priors, since standard-of-care protocols are scarce in rare-disease corpora.
Fragmentation of Cross-Modal Competence: While medically fine-tuned models approach general-purpose MLLMs on diagnosis (T1), their performance on multi-image reasoning (T3) lags severely: by 29 points compared to leading generalist models, and by more than 43 points relative to closed models. This pattern is consistent with a capacity dilution effect, in which further specialization on medical text and single-image contexts appears to degrade compositional evidence integration, especially in rare-disease multi-image workflows.
Inconsistency Across Tracks: No model ranks first on more than two tracks, indicating that current architectures lack a unified capacity for rare-disease clinical workflows. A high score on one track does not translate into robust performance across the benchmark.
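The cross-track consistency check above amounts to counting, for each model, the tracks on which it holds the top score. The sketch below shows that computation on invented numbers. The model names and scores are placeholders made up for illustration and are not results from the paper.

```python
def first_place_counts(scores):
    """Map {model: {track: score}} to {model: number of tracks it tops (ties count)}."""
    tracks = sorted({t for per_model in scores.values() for t in per_model})
    counts = {model: 0 for model in scores}
    for track in tracks:
        best = max(per_model[track] for per_model in scores.values()
                   if track in per_model)
        for model, per_model in scores.items():
            if per_model.get(track) == best:
                counts[model] += 1
    return counts

# Placeholder scores, NOT taken from the paper.
toy_scores = {
    "model_a": {"T1": 60.0, "T2": 40.0, "T3": 70.0, "T4": 50.0},
    "model_b": {"T1": 65.0, "T2": 35.0, "T3": 55.0, "T4": 58.0},
    "model_c": {"T1": 50.0, "T2": 45.0, "T3": 60.0, "T4": 52.0},
}
counts = first_place_counts(toy_scores)
```

In this toy table no model tops more than two of the four tracks, which is the shape of result the benchmark reports for the real models.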

Detailed analysis further reveals that model scaling improves performance within medical MLLMs, but does not resolve the fundamental deficit in multi-image and multimodal evidence correlation.

Implications and Future Work

MMRareBench establishes a rigorously controlled, scalable testbed for evaluating MLLMs on clinically critical, evidence-intensive rare-disease tasks. The findings challenge the assumption that medical fine-tuning uniformly improves clinical robustness: while it narrows the diagnostic gap, it may reduce generalization in compositional, cross-modal reasoning. This signals an urgent need for research into architectures and training regimes that improve modular evidence synthesis, potentially involving parameter-efficient multimodal adapters, cross-modal attention re-targeting, or explicit compositionality supervision.

Practically, deployment of MLLMs for rare disease support should not presume universal improvements from medical specialization, particularly in workflows requiring integration across document blocks and temporal or modal boundaries.

Theoretically, MMRareBench foregrounds the challenge of capability transfer and catastrophic forgetting in high-dimensional, compositional domains. Future research directions include meta-learning strategies for improved evidence amalgamation, integration of longitudinal patient records, and simulation of real-time interactive diagnostic workflows to test adaptive model competence.

Conclusion

MMRareBench presents an evidence-grounded benchmark for rare-disease clinical workflows with joint multi-image and multimodal evaluation, which to the authors' knowledge no prior resource offers. Experiments show that existing MLLMs, both general and specialized, exhibit fragmented reasoning profiles under these conditions. The observed capacity dilution effect raises open questions for MLLM specialization strategies, especially regarding the multi-image evidence integration that rare-disease management depends on. MMRareBench offers a standardized, leakage-controlled resource for further multimodal clinical AI research (2604.10755).