
MedVision: Dataset and Benchmark for Quantitative Medical Image Analysis

Published 24 Nov 2025 in cs.CV and cs.AI (arXiv:2511.18676v1)

Abstract: Current vision-language models (VLMs) in medicine are primarily designed for categorical question answering (e.g., "Is this normal or abnormal?") or qualitative descriptive tasks. However, clinical decision-making often relies on quantitative assessments, such as measuring the size of a tumor or the angle of a joint, from which physicians draw their own diagnostic conclusions. This quantitative reasoning capability remains underexplored and poorly supported in existing VLMs. In this work, we introduce MedVision, a large-scale dataset and benchmark specifically designed to evaluate and improve VLMs on quantitative medical image analysis. MedVision spans 22 public datasets covering diverse anatomies and modalities, with 30.8 million image-annotation pairs. We focus on three representative quantitative tasks: (1) detection of anatomical structures and abnormalities, (2) tumor/lesion (T/L) size estimation, and (3) angle/distance (A/D) measurement. Our benchmarks show that current off-the-shelf VLMs perform poorly on these tasks. However, with supervised fine-tuning on MedVision, we significantly enhance their performance across detection, T/L estimation, and A/D measurement, demonstrating reduced error rates and improved precision. This work provides a foundation for developing VLMs with robust quantitative reasoning capabilities in medical imaging. Code and data are available at https://medvision-vlm.github.io.

Summary

  • The paper presents a comprehensive dataset and benchmark designed to evaluate quantitative tasks in anatomical detection, tumor/lesion size estimation, and angle/distance measurement.
  • It demonstrates that supervised fine-tuning significantly improves metrics such as IoU, mean relative error (MRE), and mean absolute error (MAE), enhancing the performance of vision-language models.
  • Despite these improvements, challenges remain in detecting small anatomical structures and precisely measuring minute angles, highlighting areas for future research.

MedVision: Dataset and Benchmark for Quantitative Medical Image Analysis

Introduction

Medical decision-making relies not only on qualitative but also on quantitative assessments. Despite the proliferation of vision-language models (VLMs), which seamlessly integrate visual inputs with language outputs, current models primarily excel at categorical and qualitative tasks. MedVision aims to fill the gap in quantitative reasoning capabilities by providing a comprehensive dataset and benchmarking platform. The platform focuses on three key tasks in medical image analysis: detection of anatomical structures, tumor/lesion (T/L) size estimation, and angle/distance (A/D) measurement. Figure 1

Figure 1: Overview of MedVision dataset construction and benchmark design (left), and dataset summary (right).

Dataset and Benchmark Design

MedVision aggregates 30.8 million image-annotation pairs from 22 diverse public datasets spanning various anatomies and imaging modalities. It provides structured annotations aimed explicitly at quantitative tasks. The benchmark design involves three specific tasks:

  1. Detection of Anatomical Structures: Involves localization and identification of healthy anatomical structures and abnormalities.
  2. T/L Size Estimation: Focuses on the bidimensional measurement of tumors/lesions (the longest diameter and its perpendicular), crucial for disease monitoring.
  3. A/D Measurement: Involves calculating angles and distances, underlining critical geometric relationships in medical contexts.

The dataset is split into training and testing sets for rigorous model evaluation, and supervised fine-tuning on the training split allows VLMs to calibrate their numeric reasoning abilities. Figure 2

Figure 2: Tumor/lesion size annotation. An ellipse is fitted to the tumor/lesion mask and 4 landmarks are recorded. The largest diameter and its perpendicular diameter are measured.
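The bidimensional measurement described in the caption can be approximated with a PCA-style ellipse fit to the mask pixels. The sketch below is a hypothetical illustration (`estimate_diameters` is not the paper's code); it uses the fact that, for a uniformly filled ellipse, the pixel-coordinate variance along a principal axis equals the squared semi-axis divided by four.

```python
import numpy as np

def estimate_diameters(mask: np.ndarray) -> tuple:
    """Approximate the largest diameter of a binary lesion mask and the
    diameter perpendicular to it, via a PCA-style ellipse fit.

    For a uniformly filled ellipse, the variance of pixel coordinates
    along a principal axis is (semi-axis)^2 / 4, so each full diameter
    is 4 * sqrt(eigenvalue of the coordinate covariance matrix).
    """
    ys, xs = np.nonzero(mask)
    coords = np.stack([xs, ys], axis=1).astype(float)
    cov = np.cov(coords, rowvar=False)          # 2x2 covariance of (x, y)
    eigvals = np.sort(np.linalg.eigvalsh(cov))[::-1]
    major, minor = 4.0 * np.sqrt(eigvals)       # full diameters in pixels
    return float(major), float(minor)
```

On a clean elliptical mask with semi-axes of 40 and 20 pixels, this recovers diameters close to 80 and 40; real lesion masks are irregular, so a direct ellipse fit (as in the paper's annotation pipeline) may differ.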

Results Analysis

Detection Performance

Baseline VLMs generally succeed at generating bounding-box coordinates but falter in localization precision, with IoU below 15% for healthy anatomical structures and below 10% for tumors/lesions. Supervised fine-tuning (SFT) drastically improves recall, precision, and IoU, but small objects remain challenging for these models. Figure 3

Figure 3: SFT improves VLM performance on healthy anatomical structures and tumor/lesion detection tasks, while the latter remains more challenging.
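For reference, the IoU metric behind these detection numbers can be computed directly from two axis-aligned boxes in (x1, y1, x2, y2) form. A minimal sketch, not the benchmark's evaluation code:

```python
def box_iou(a, b):
    """Intersection-over-union of two boxes given as (x1, y1, x2, y2)."""
    # Intersection rectangle (empty if the boxes do not overlap).
    ix1, iy1 = max(a[0], b[0]), max(a[1], b[1])
    ix2, iy2 = min(a[2], b[2]), min(a[3], b[3])
    inter = max(0.0, ix2 - ix1) * max(0.0, iy2 - iy1)
    area_a = (a[2] - a[0]) * (a[3] - a[1])
    area_b = (b[2] - b[0]) * (b[3] - b[1])
    return inter / (area_a + area_b - inter)
```

An IoU below 0.15 means the predicted box overlaps less than 15% of the union area with the ground truth, so even boxes in roughly the right region score poorly.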

On out-of-distribution (OOD) data, SFT models generalize better, adapting across different datasets and target abnormalities. Figure 4

Figure 4: Detection performance on plane-OOD (left) and target-OOD (right) data. Tumor/lesion (T/L) labels are shown in color. The SFT model shows better generalizability.

T/L Size Estimation

Pretrained VLMs exhibit substantial errors in T/L size estimation; after SFT, however, models reduce the mean relative error (MRE) from over 50% to approximately 30%, indicating markedly improved measurement accuracy. Figure 5

Figure 5: SFT improves T/L size estimation performance.
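MRE normalizes each size error by the ground-truth value, which makes errors on small and large lesions comparable. A minimal sketch of the metric (an illustration, not the benchmark's code):

```python
def mean_relative_error(pred, gt):
    """Mean relative error: average of |pred - gt| / gt over all pairs."""
    return sum(abs(p - g) / g for p, g in zip(pred, gt)) / len(gt)
```

Under this definition, an MRE of 0.5 means predictions are off by 50% of the true size on average, which is the regime the off-the-shelf models fall into before fine-tuning.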

A/D Measurement Performance

Angle and distance measurement tasks reveal that baseline models often mispredict these quantities, while SFT sharpens predictions, reducing the mean absolute error (MAE) to below 4.5 mm for distances and 4° for angles. Accurately predicting small angles remains a challenge. Figure 6

Figure 6: SFT improves VLM performance in A/D size measurement tasks.
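Unlike MRE, MAE is reported in the measurement's own units (mm for distances, degrees for angles), which is why small angles are hard: a fixed absolute error is a large fraction of a small ground-truth angle. A minimal sketch of the metric:

```python
def mean_absolute_error(pred, gt):
    """Mean absolute error, e.g. in mm for distances or degrees for angles."""
    return sum(abs(p - g) for p, g in zip(pred, gt)) / len(gt)
```

For instance, a 4° MAE is negligible on a 90° joint angle but dominates a 10° one.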

Conclusion

MedVision sets a foundational benchmark in quantitative reasoning for medical VLMs. Despite enhancing performance with supervised fine-tuning, challenges persist in small structure detection and measurement. By providing a large-scale dataset and structured quantitative tasks, MedVision enables a pathway forward for model developers seeking to strengthen quantitative capabilities in medical image analysis. Future research should aim at further addressing these persistent limitations, refining VLM architectures and training objectives to produce calibrated numeric outputs consistently anchored in clinical relevance.
