
SurgMLLMBench: Multimodal Surgical Benchmark

Updated 3 December 2025
  • SurgMLLMBench is a multimodal surgical benchmark that combines pixel-level instrument segmentation with structured VQA annotations under a unified taxonomy.
  • It aggregates data from six major sources covering laparoscopic, robot-assisted, and micro-surgical domains, totaling over 560,000 annotated frames.
  • It supports rigorous evaluation tasks including workflow recognition, action detection, pixel segmentation, and interactive visual-conversational reasoning.

SurgMLLMBench is a comprehensive multimodal benchmark dataset created to advance the development and evaluation of interactive multimodal LLMs in surgical scene understanding. It uniquely integrates pixel-level instrument segmentation masks with structured visual question answering (VQA) annotations, covering a wide range of surgical domains—including laparoscopic, robot-assisted, and micro-surgical procedures—under a unified taxonomy. SurgMLLMBench addresses prior limitations of surgical AI benchmarks that focused primarily on VQA with heterogeneous taxonomies and lacked dense segmentation, enabling a richer, more consistent evaluation of visual-conversational and segmentation capabilities of LLMs across surgical workflows (Choi et al., 26 Nov 2025).

1. Dataset Composition and Domains

SurgMLLMBench aggregates data from six major sources, providing unprecedented coverage of multiple surgical modalities:

  • Laparoscopic Surgery (LS): Cholec80 dataset (184,498 frames)
  • Robot-Assisted Surgery (RAS): EndoVis2018 (2,235 frames), AutoLaparo (83,243 frames), GraSP (116,515 frames)
  • Micro-Surgical Training (MST): MISAW and its supplement (164,275 frames), and the newly introduced MAVIS (Micro-surgical Artificial Vascular anastomosIS; 10,652 frames)

The MAVIS subset is notable for its capture of full vascular anastomosis sequences on 1 mm artificial vessels, annotated hierarchically according to stage, phase, and step. In total, SurgMLLMBench provides 561,418 annotated frames.

Original train/validation/test splits of the constituent datasets are preserved. MAVIS is entirely held out during cross-domain instruction tuning and only introduced for generalization evaluation or fine-tuning assessment. This ensures realistic evaluation of cross-domain transfer and benchmarking of domain generalization effects.
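
To make the composition and hold-out protocol concrete, the minimal sketch below tallies the per-dataset frame counts reported above and marks MAVIS as held out. The dictionary layout is illustrative only, not the benchmark's actual data format.

```python
# Per-dataset frame counts as reported in this section.
FRAME_COUNTS = {
    "Cholec80 (LS)": 184_498,
    "EndoVis2018 (RAS)": 2_235,
    "AutoLaparo (RAS)": 83_243,
    "GraSP (RAS)": 116_515,
    "MISAW + supplement (MST)": 164_275,
    "MAVIS (MST)": 10_652,
}

assert sum(FRAME_COUNTS.values()) == 561_418  # matches the reported total

# MAVIS is held out from cross-domain instruction tuning; it is used only for
# generalization or fine-tuning evaluation.
HELD_OUT = {"MAVIS (MST)"}
cross_domain_sources = {k: v for k, v in FRAME_COUNTS.items() if k not in HELD_OUT}
print(sorted(cross_domain_sources))
```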

2. Annotation Modalities and Unified Taxonomic Structure

To enable comprehensive training and evaluation, SurgMLLMBench provides two major annotation modalities, both aligned with a cross-dataset, canonical taxonomy:

  • Pixel-Level Instrument Segmentation:
    • Detailed polygonal COCO-format masks are provided for every identifiable instrument or scene element in each frame.
    • A universal taxonomy is enforced across all datasets, including categories such as forceps, scissors, clamps, needle holder, vessel, needle, thread, and background material.
    • In MAVIS, eight instrument classes are defined explicitly: Background material, Forceps, Scissors, Vascular clamps, Needle holder, Vessel, Needle, Thread.
  • Structured VQA Annotations:
    • Each frame is paired with one or more questions generated from five fixed templates (a generation sketch follows this list):
      1. Workflow queries (“Which stage, phase, and step are shown?”)
      2. Instrument count
      3. Instrument type identification
      4. Instrument action queries
      5. Dataset source
    • All questions are deterministic and answerable with short, consistent strings or integers. This approach eliminates label drift across domains (e.g., “Suturing” labels are identical in all domains).
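
Because the templates are fixed and the answers are short strings or integers, per-frame question generation can be expressed as a deterministic expansion. The sketch below illustrates this; the metadata field names and example values are hypothetical, not the benchmark's actual schema.

```python
# Hypothetical per-frame metadata; field names and values are illustrative only.
frame_meta = {
    "dataset": "MAVIS",
    "stage": "Anastomosis",
    "phase": "Suturing",
    "step": "Needle insertion",
    "instruments": ["Needle holder", "Forceps"],
    "actions": ["Grasp", "Idle"],
}

def build_vqa_pairs(meta):
    """Expand the five fixed templates into deterministic question-answer pairs."""
    return [
        ("Which stage, phase, and step are shown?",
         f"{meta['stage']} / {meta['phase']} / {meta['step']}"),
        ("How many instruments are visible?", str(len(meta["instruments"]))),
        ("Which instrument types are visible?", ", ".join(meta["instruments"])),
        ("What action is each instrument performing?", ", ".join(meta["actions"])),
        ("Which dataset is this frame from?", meta["dataset"]),
    ]

for question, answer in build_vqa_pairs(frame_meta):
    print(question, "->", answer)
```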

3. Benchmark Tasks and Evaluation Protocols

SurgMLLMBench supports four primary tasks for rigorous multimodal LLM assessment:

  • A. Surgical Workflow Recognition: Phase (coarse) and step (fine-grained) classification
  • B. Instrument-Centered Action Detection: Recognition of discrete actions such as “Grasp”, “Cut”, “Idle”
  • C. Pixel-Level Instrument Segmentation: Dense, per-class semantic segmentation
  • D. Interactive Visual-Conversational Reasoning: Structured VQA that goes beyond pure vision or pure language

Evaluation Metrics:

  • For text-only classification (workflow, step, action, count), exact-match accuracy is employed:

$\mathrm{Accuracy} = \frac{1}{N}\sum_{i=1}^{N}\mathbbm{1}\bigl[\hat y_i = y_i\bigr]$

  • For segmentation, mean Intersection over Union (mIoU) is computed:

$\mathrm{IoU}_c = \frac{TP_c}{TP_c + FP_c + FN_c}$

$\mathrm{mIoU} = \frac{1}{C}\sum_{c=1}^{C}\mathrm{IoU}_c$

Here, $TP_c$, $FP_c$, and $FN_c$ denote true positives, false positives, and false negatives for class $c$ over all pixels, and $C$ is the number of classes.
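
A minimal reference implementation of both metrics (not the benchmark's released evaluation script) might look as follows; how classes absent from both masks are handled is a convention choice.

```python
import numpy as np

def exact_match_accuracy(predictions, targets):
    """Exact-match accuracy over N short-string or integer answers."""
    return float(np.mean([p == t for p, t in zip(predictions, targets)]))

def mean_iou(pred_mask, gt_mask, num_classes):
    """Mean IoU from per-pixel class-index maps of identical shape."""
    ious = []
    for c in range(num_classes):
        pred_c, gt_c = (pred_mask == c), (gt_mask == c)
        tp = np.logical_and(pred_c, gt_c).sum()
        fp = np.logical_and(pred_c, ~gt_c).sum()
        fn = np.logical_and(~pred_c, gt_c).sum()
        denom = tp + fp + fn
        if denom > 0:  # skip classes absent from both masks (convention choice)
            ious.append(tp / denom)
    return float(np.mean(ious)) if ious else 0.0
```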

4. Baseline Model Architectures, Training Paradigms, and Experimental Results

Two representative multimodal LLM baselines are evaluated:

  • OMG-LLaVA [Zhang et al. '24]: Integrates a segmentation-capable visual encoder with a vision-language decoder; supports both VQA and pixel-level masks.
  • LLaVA [Liu et al. '23]: A vision-language model without a segmentation head; VQA-only.

Training proceeds in two stages:

  1. Pre-training: Core encoders/LLMs are frozen while projection heads are trained for broad image–text grounding.
  2. Instruction Tuning: Performed on SurgMLLMBench (excluding MAVIS for the cross-domain split). LoRA is used to adapt the LLM weights, with a batch size of 4 (OMG-LLaVA) or 16 (LLaVA) and one-epoch runs; a configuration sketch follows this list.
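
The following is a minimal sketch of such a LoRA adaptation step, assuming a Hugging Face PEFT-style workflow; the base checkpoint path, rank, alpha, dropout, and target module names are illustrative assumptions, as the source does not specify them.

```python
from transformers import AutoModelForCausalLM
from peft import LoraConfig, get_peft_model

# Placeholder path; the baselines wrap LLaVA / OMG-LLaVA language backbones.
base_model = AutoModelForCausalLM.from_pretrained("path/to/llm-backbone")

lora_config = LoraConfig(
    r=16,                                 # rank: illustrative, not reported in the source
    lora_alpha=32,                        # illustrative
    lora_dropout=0.05,                    # illustrative
    target_modules=["q_proj", "v_proj"],  # common attention projections; an assumption
    task_type="CAUSAL_LM",
)

model = get_peft_model(base_model, lora_config)
model.print_trainable_parameters()  # only the LoRA adapters are trainable
```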

Key Performance Findings:

Task       | Dataset/Domain | Model     | Cross-Domain (%) | Per-Dataset (%)
Phase Acc. | Cholec80 (LS)  | LLaVA     | 81.98            | 77.32
Phase Acc. | Cholec80 (LS)  | OMG-LLaVA | 76.51            | —
mIoU       | GraSP (RAS)    | OMG-LLaVA | 66.65            | 53.06

On the MAVIS subset, which is never seen during cross-domain instruction tuning, generalization is assessed:

  • LLaVA stage accuracy: 67.67% (cross-domain), improving to 75.70% when fine-tuned on MAVIS.
  • OMG-LLaVA mIoU: drops from 53.96% (cross-domain) to 36.89% (fine-tuned), indicating a residual domain gap, particularly in the segmentation decoder.

Instruction-tuned models exhibit robust cross-domain transfer, higher stability in segmentation masks, and improved detection of small or low-contrast instruments. Some label confusions arise from taxonomy imbalances (e.g., “Retraction” vs. “Pull”), but these are typically contextually appropriate.

5. Design Implications, Limitations, and Future Directions

SurgMLLMBench provides a unified, robust standard for multimodal LLM evaluation in surgery:

  • Reproducibility: All data, annotation schemas (COCO-style for masks, fixed templates for VQA), and evaluation splits are strictly defined, enabling head-to-head method comparison and benchmarking.
  • Comprehensiveness: Simultaneous supervision of workflow, action, segmentation, and VQA is supported in a single resource, advancing rich multimodal LLM development.

Extensibility Recommendations:

  • Incorporation of temporal modalities (full video, kinematics) and additional sensory data (audio, depth)
  • Expansion of tool/action taxonomies, support for instance segmentation/tracking, and topology-aware metrics
  • Evaluation of real-time reliability, calibration, and safety for clinical translation

This suggests future work could leverage SurgMLLMBench as a bedrock for domain-adaptive, temporally aware, or uncertainty-quantifying surgical AI systems.

6. Positioning in the Context of Surgical Multimodal LLM Benchmarks

Compared to prior surgical MLLM benchmarks such as SAP-Bench (Xu et al., 8 Jun 2025), which focuses on surgical action planning with temporally grounded clips and next-action prediction, SurgMLLMBench emphasizes dense scene understanding across workflow, semantic segmentation, action typing, and VQA in a much broader taxonomic and procedural landscape.

While SAP-Bench offers atomic action prediction metrics in cholecystectomy, SurgMLLMBench’s design accommodates broader surgical domains and modalities, supporting compositional multimodal tasks beyond sequence forecasting, including conversational-visual interactions and dense spatial reasoning.

7. Release and Reproducibility Considerations

SurgMLLMBench is slated for public release with COCO-format segmentation masks, VQA annotation files, normalized train/validation/test splits, and standardized evaluation scripts via Hugging Face and other open repositories. This ensures that subsequent research can directly benchmark model architectures, training protocols, and algorithmic advances using fully comparable protocols (Choi et al., 26 Nov 2025).
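
Once the release is available, the COCO-format masks could presumably be consumed with standard tooling. A sketch assuming pycocotools and a hypothetical annotation file path:

```python
from pycocotools.coco import COCO

# Hypothetical annotation file path; the actual release layout is not yet published.
coco = COCO("annotations/surgmllmbench_instances_val.json")

first_img_id = coco.getImgIds()[0]
anns = coco.loadAnns(coco.getAnnIds(imgIds=[first_img_id]))

for ann in anns:
    mask = coco.annToMask(ann)  # binary (H, W) mask for one instrument instance
    category = coco.loadCats(ann["category_id"])[0]["name"]
    print(category, int(mask.sum()), "foreground pixels")
```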

The benchmark’s structure enables transparent evaluation, robust cross-domain generalization testing, and the systematic study of surgical scene understanding using state-of-the-art and emerging multimodal LLM architectures.
