TemMed-Bench: Temporal Medical Analysis

Updated 30 September 2025
  • TemMed-Bench is a benchmark for longitudinal medical image analysis that assesses LVLMs using paired historical and current images.
  • It incorporates three tasks—VQA, report generation, and image-pair selection—to quantify model accuracy and reasoning in clinical change detection.
  • Retrieval augmentation with a comprehensive knowledge corpus notably enhances performance, highlighting the potential for advanced multi-modal fusion.

TemMed-Bench is a specialized evaluation benchmark designed to measure the temporal reasoning capabilities of large vision-language models (LVLMs) in analyzing changes across medical images acquired at different clinical visits. Unlike conventional medical VQA datasets, which typically focus on single-visit image interpretation, TemMed-Bench introduces multi-image temporal tasks that reflect the longitudinal nature of real-world clinical assessment. By presenting LVLMs with paired historical and current images, alongside clinically relevant tasks and a large auxiliary knowledge corpus, TemMed-Bench establishes a foundation for rigorous benchmarking of model performance in longitudinal medical image analysis (Zhang et al., 29 Sep 2025).

1. Task Structure and Dataset Composition

TemMed-Bench consists of three core evaluative tasks, each engineered to probe distinct aspects of temporal medical reasoning:

  • Visual Question Answering (VQA): Each instance supplies a historical image $I_h$, a current image $I_c$, and a clinically grounded question about observed changes. Response candidates are binary (“yes”/“no”), focusing the model on discriminating the presence or absence of specific condition changes between visits.
  • Report Generation: LVLMs receive both historical and current images plus an instruction, and are required to generate a detailed report summarizing the differences in the patient’s medical condition. This task evaluates a model’s ability not only to detect changes but to articulate them coherently in domain-appropriate language.
  • Image-Pair Selection: Given three candidate image pairs and a textual statement about a particular condition change, the model must select the pair that best aligns with the statement. This requires multi-image comparison and fine-grained reasoning about the described change.

A supplementary knowledge corpus of over 17,000 instances (each an image pair annotated with a corresponding condition-change report) is provided. This corpus underpins retrieval augmentation experiments, enabling additional context to be leveraged by LVLMs to support answer generation.
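To make the task formats concrete, the sketch below shows one way the benchmark instances and corpus entries could be represented in code. The field names and types are illustrative assumptions for exposition, not the benchmark's actual data schema.

```python
from dataclasses import dataclass
from typing import List, Tuple

@dataclass
class VQAInstance:
    """VQA: a binary question about change between two visits."""
    historical_image: str  # path to the historical image I_h
    current_image: str     # path to the current image I_c
    question: str          # e.g., "Has the pleural effusion worsened?"
    answer: str            # "yes" or "no"

@dataclass
class ReportGenInstance:
    """Report generation: describe condition changes between visits."""
    historical_image: str
    current_image: str
    instruction: str
    reference_report: str  # ground-truth condition-change report

@dataclass
class ImagePairSelectionInstance:
    """Image-pair selection: pick the pair matching a stated change."""
    candidate_pairs: List[Tuple[str, str]]  # three (I_h, I_c) candidates
    statement: str                          # description of a condition change
    correct_index: int                      # index of the matching pair

@dataclass
class CorpusEntry:
    """Knowledge-corpus entry used for retrieval augmentation."""
    historical_image: str
    current_image: str
    change_report: str
```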

2. Evaluation Results and Quantitative Metrics

The benchmark supports comparative evaluation across diverse LVLMs, including six proprietary and six open-source models. Experimental results reveal several key trends:

  • General Performance: Most LVLMs perform near the random guessing level for temporal reasoning tasks; e.g., VQA accuracy typically falls below 60%. This indicates substantial difficulty in analyzing multi-visit progression or regression of clinical findings.
  • Top Models: Proprietary models such as GPT o4-mini attain relatively higher performance, with VQA accuracy reported at 79.15% and mean report generation scores (BLEU ≈ 20.54, ROUGE-L ≈ 15.75) that outperform open-source counterparts. Accuracy for image-pair selection remains limited (33–39%).
  • Retrieval Augmentation Impact: Multi-modal retrieval (pairing visual and textual information from the corpus) consistently boosts task performance. HealthGPT, for instance, demonstrates a greater than 10% improvement in VQA accuracy with multi-modal augmentation. Across open-source models, the average VQA accuracy improvement due to multi-modal retrieval is 2.59%.

The table below summarizes the best reported performance on each task:

Task                  Metric    Best Model Performance
VQA                   Accuracy  79.15% (GPT o4-mini)
Report Generation     BLEU      20.54 (GPT o4-mini)
Report Generation     ROUGE-L   15.75 (GPT o4-mini)
Image-Pair Selection  Accuracy  39.33% (Gemini 2.5)
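
As an illustration of how these metrics could be computed for the report-generation task, the sketch below uses the nltk and rouge-score Python packages; the reference and generated reports are placeholder strings, and the benchmark's actual evaluation pipeline may differ in tokenization and smoothing choices.

```python
# Illustrative scoring of a generated change report (pip install nltk rouge-score).
from nltk.translate.bleu_score import sentence_bleu, SmoothingFunction
from rouge_score import rouge_scorer

reference = "The right pleural effusion has decreased in size compared with the prior study."
generated = "Compared to the previous visit, the right pleural effusion is smaller."

# BLEU over whitespace tokens, with smoothing to avoid zero scores on short texts.
smooth = SmoothingFunction().method1
bleu = sentence_bleu([reference.split()], generated.split(), smoothing_function=smooth)

# ROUGE-L F-measure via the rouge_score package.
scorer = rouge_scorer.RougeScorer(["rougeL"], use_stemmer=True)
rouge_l = scorer.score(reference, generated)["rougeL"].fmeasure

print(f"BLEU: {100 * bleu:.2f}  ROUGE-L: {100 * rouge_l:.2f}")

# VQA and image-pair selection reduce to plain accuracy over predictions.
def accuracy(predictions, labels):
    return sum(p == y for p, y in zip(predictions, labels)) / len(labels)
```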

3. Limitations of Current LVLMs

The benchmark exposes several challenges:

  • Temporal Reasoning Deficits: LVLMs trained primarily on single-visit data exhibit limited capability in tracking and elucidating clinical change over time, with most systems unable to differentiate subtle progression vs. regression.
  • Random-Guess Performance: In closed-book settings (without retrieval), multiple models—including specialized medical LVLMs—operate at levels indistinguishable from random selection for key tasks.
  • Multi-Image Fusion Complexity: Image-pair selection stresses attention allocation across multiple candidate pairs and requires reconciling potentially conflicting visual cues, exposing weaknesses in current multi-image fusion mechanisms.
  • Erosion of General Reasoning Post Fine-Tuning: Fine-tuning on medical domains can degrade the broad reasoning abilities of LVLMs. In several instances, open-source medical LVLMs do not outperform general-domain models, suggesting a trade-off in domain adaptation.

4. Retrieval Augmentation Methodology

Retrieval augmentation leverages the auxiliary corpus to mitigate core reasoning deficits:

  • Formulation: For a given pair $(I_h, I_c)$ and a query, related image pairs $(I_h^*, I_c^*)$ and their reports $t^*$ are retrieved using cosine similarity between image-encoder outputs:

$$\text{Score} = \text{Sim}(\text{Enc}_i(I_h), \text{Enc}_i(I_h^*)) + \text{Sim}(\text{Enc}_i(I_c), \text{Enc}_i(I_c^*))$$

where $\text{Enc}_i$ is the image encoder and $\text{Sim}$ denotes cosine similarity (a code sketch of this retrieval step follows the list below).

  • Empirical Performance: Ablation studies confirm that pairwise image retrieval (considering both historical and current images) outperforms approaches based on single-image or text-only retrieval. Multi-modal retrieval augmentation produces the largest accuracy gains across VQA and report generation.
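
A minimal sketch of this retrieval step, assuming precomputed image embeddings and a corpus stored as a list of dictionaries; the encoder, corpus layout, top-k value, and prompt construction are illustrative assumptions rather than the paper's exact implementation.

```python
import numpy as np

def cosine_sim(a: np.ndarray, b: np.ndarray) -> float:
    """Cosine similarity between two embedding vectors."""
    return float(np.dot(a, b) / (np.linalg.norm(a) * np.linalg.norm(b) + 1e-8))

def retrieve(query_hist_emb, query_curr_emb, corpus, k=3):
    """Rank corpus entries by the paired-image score
    Sim(Enc(I_h), Enc(I_h*)) + Sim(Enc(I_c), Enc(I_c*)).

    Each corpus entry is assumed to hold precomputed embeddings and a
    condition-change report: {"hist_emb": ..., "curr_emb": ..., "report": "..."}.
    """
    scored = []
    for entry in corpus:
        score = (cosine_sim(query_hist_emb, entry["hist_emb"])
                 + cosine_sim(query_curr_emb, entry["curr_emb"]))
        scored.append((score, entry))
    scored.sort(key=lambda pair: pair[0], reverse=True)
    return [entry for _, entry in scored[:k]]

def build_augmented_prompt(question, retrieved):
    """One way to realize multi-modal augmentation: prepend the retrieved
    condition-change reports as textual context for the LVLM."""
    context = "\n".join(f"- {e['report']}" for e in retrieved)
    return (f"Reference change reports from similar cases:\n{context}\n\n"
            f"Question: {question}")
```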

5. Implications for Modeling and Clinical Decision Support

TemMed-Bench situates evaluation within the real-world clinical workflow, where assessment of disease progression or therapeutic effect relies on longitudinal imaging. LVLMs that succeed on TemMed-Bench tasks are better positioned for applications in automated follow-up, disease monitoring, and diagnostic support.

  • Longitudinal Image Analysis: A plausible implication is that incorporating temporal context and multi-modal retrieval could enable LVLMs to serve as adjuncts in clinical settings for tracking patient trajectories.
  • Modeling Directions: Enhanced fusion mechanisms, attention schemas, and longitudinal training datasets may address temporal reasoning deficits.
  • Adaptation Strategies: Hybrid pre-training regimes are suggested to balance medical domain expertise with retention of general reasoning capabilities.

6. Future Directions and Benchmark Prospects

The future development of TemMed-Bench, informed by its initial evaluation, includes:

  • Temporal Reasoning Training Regimes: Specialized datasets and objective functions that reward interpretation of patient condition evolution across visits.
  • Advanced Multi-Modal Fusion: Architectures that seamlessly integrate and resolve cross-modal, multi-instance cues for improved image-pair discrimination and change reporting.
  • Clinical System Integration: Robust deployment of LVLMs evaluated on TemMed-Bench for use in decision support systems requiring reliable temporal change detection.

TemMed-Bench establishes a rigorous, clinically relevant standard for the assessment of temporal image reasoning in LVLMs. Ongoing work aims to bridge the performance gap between current models’ capabilities and real-world requirements in longitudinal medical imaging (Zhang et al., 29 Sep 2025).
