MedVidBench: Medical Video Benchmark
- MedVidBench is a unified, multi-granular benchmark that integrates expert-curated datasets and annotation pipelines to evaluate medical video classification and visual answer localization.
- It employs a robust curation pipeline featuring dual-model caption generation and semantic similarity filtering to ensure clinical relevance and spatiotemporal precision.
- The benchmark supports diverse tasks from video-level classification to frame-level visual answer localization using rigorous metrics and reinforcement learning refinements.
MedVidBench is a unified, multi-granular benchmark for medical video understanding that integrates datasets and annotation pipelines to evaluate and advance cross-modal vision-LLMs in medical domains. It is designed to rigorously assess both classification of medical instructional content and visual answer localization, spanning video, segment, and frame-level tasks. Its datasets are constructed and validated by domain experts, employing diverse data sources and a quality assurance framework to ensure clinical relevance, spatiotemporal precision, and robustness across heterogeneous medical procedures. MedVidBench thus underpins state-of-the-art supervised and reinforcement learning methodologies in medical video analysis (Gupta et al., 2022, Su et al., 6 Dec 2025).
1. Dataset Composition and Sources
MedVidBench aggregates multi-source, expert-curated medical video data, structured at several scales. The expanded benchmark (Su et al., 6 Dec 2025) draws 531,850 video-instruction pairs from eight repositories covering four main domains:
| Domain | Dataset | Videos | QA Pairs |
|---|---|---|---|
| Laparoscopic surgery | CholecT50 | 50 | 7.1K |
| | CholecTrack20 | 20 | 102.7K |
| | Cholec80-CVS | 80 | 4.4K |
| | CoPESD | 40 | 70.3K |
| Open surgery | AVOS | 25 | 62.5K |
| | EgoSurgery | 21 | 154.3K |
| Robotic surgery | JIGSAWS | 103 | 1.0K |
| Nursing procedures | NurViD | 287 | 129.5K |
Earlier releases (Gupta et al., 2022) focus on MedVidCL, a 6,117-video, tri-class video classification benchmark, and MedVidQA, a 3,010 QA-pair dataset over 899 instructional YouTube videos for visual answer localization. All instructional videos are strictly step-by-step procedure demonstrations, as validated by medical informatics experts.
2. Data Curation, Annotation, and Quality Assurance
MedVidBench employs a rigorous multi-phase curation pipeline (Su et al., 6 Dec 2025):
- Expert-Guided Prompting: For frame-annotated sources, prompts are generated per video segment/region, integrating bounding box triplets and surgical context. Web videos are enriched with Whisper-X ASR transcripts and metadata for multimodal prompting.
- Dual-Model Caption Generation: For each prompt, independent GPT-4.1 and Gemini-2.5-Flash outputs are elicited.
- Semantic Similarity Validation: Sentence-transformer-based filtering discards caption pairs with cosine similarity < 0.3, reducing hallucination and model-specific artifacts (a minimal sketch follows this list).
- Human Validation Study: Clinical experts rate prompt-driven captions, with 82.0% of judgments on CoPESD preferring outputs generated by expert prompts, confirming improved accuracy and terminology.
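The similarity filter above can be approximated with off-the-shelf sentence embeddings. Below is a minimal sketch, assuming the `sentence-transformers` package and the `all-MiniLM-L6-v2` checkpoint (the specific model used by the benchmark is not stated here); the 0.3 threshold is the one given in the pipeline description.

```python
from sentence_transformers import SentenceTransformer, util

# Illustrative caption pairs from the two captioning models (GPT-4.1, Gemini-2.5-Flash).
caption_pairs = [
    ("The surgeon grasps the gallbladder with a laparoscopic grasper.",
     "A grasper retracts the gallbladder to expose the dissection plane."),
    ("The video shows a person cooking dinner.",
     "The clip demonstrates laparoscopic clip application."),
]

# Any sentence-transformer works for this sketch; the checkpoint choice is an assumption.
model = SentenceTransformer("all-MiniLM-L6-v2")
SIM_THRESHOLD = 0.3  # pairs below this cosine similarity are discarded

kept = []
for cap_a, cap_b in caption_pairs:
    emb_a, emb_b = model.encode([cap_a, cap_b], convert_to_tensor=True)
    similarity = util.cos_sim(emb_a, emb_b).item()
    if similarity >= SIM_THRESHOLD:
        kept.append((cap_a, cap_b, similarity))

print(f"Kept {len(kept)} of {len(caption_pairs)} caption pairs")
```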
For MedVidCL and MedVidQA (Gupta et al., 2022), expert annotation involved filtering, timestamping, and inter-annotator agreement studies, e.g., Cohen’s κ ≈ 0.84 for classification and mean absolute start/end timestamp differences of ≤ 2.5 s / 3.4 s for visual answer localization.
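As a hedged illustration of how such agreement statistics can be computed, the sketch below uses scikit-learn's `cohen_kappa_score` for the classification labels and simple absolute differences for the localization timestamps; all annotator data shown is synthetic.

```python
import numpy as np
from sklearn.metrics import cohen_kappa_score

# Synthetic tri-class labels from two annotators.
annotator_1 = ["instructional", "non-medical", "instructional", "non-instructional"]
annotator_2 = ["instructional", "non-medical", "non-instructional", "non-instructional"]
kappa = cohen_kappa_score(annotator_1, annotator_2)

# Synthetic start/end timestamps (seconds) for visual answer localization.
starts_1, starts_2 = np.array([12.0, 45.5]), np.array([13.5, 44.0])
ends_1, ends_2 = np.array([30.0, 60.0]), np.array([28.0, 63.0])
mean_abs_start_diff = np.abs(starts_1 - starts_2).mean()
mean_abs_end_diff = np.abs(ends_1 - ends_2).mean()

print(f"kappa={kappa:.2f}, "
      f"mean |start diff|={mean_abs_start_diff:.1f}s, "
      f"mean |end diff|={mean_abs_end_diff:.1f}s")
```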
3. Task Definitions and Formalisms
MedVidBench supports a spectrum of medical video understanding tasks:
3.1 Video-Level Tasks
- Medical Video Classification (MVC/MedVidCL): Given a video $V$, predict the class label $\hat{y} = \arg\max_{c} p_\theta(c \mid V)$ by minimizing the multiclass cross-entropy $\mathcal{L}_{\mathrm{CE}} = -\sum_{c} y_c \log p_\theta(c \mid V)$, with $y_c$ as the ground-truth indicator (a minimal training sketch follows this list).
- Other Video-Level Tasks (expanded benchmark (Su et al., 6 Dec 2025)): Video Summarization (VS), Critical View of Safety (CVS) assessment, Next Action Prediction (NAP), Skill Assessment (SA).
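The MVC objective can be illustrated with a short PyTorch sketch; the mean-pooled feature head and dimensions below are placeholders, not the benchmark's reference architecture.

```python
import torch
import torch.nn as nn

NUM_CLASSES = 3  # tri-class MedVidCL labels

class VideoClassifier(nn.Module):
    """Placeholder classifier: pooled per-frame features -> class logits."""
    def __init__(self, feat_dim: int = 768, num_classes: int = NUM_CLASSES):
        super().__init__()
        self.head = nn.Linear(feat_dim, num_classes)

    def forward(self, clip_features):            # (batch, frames, feat_dim)
        pooled = clip_features.mean(dim=1)        # temporal mean pooling
        return self.head(pooled)                  # (batch, num_classes)

model = VideoClassifier()
criterion = nn.CrossEntropyLoss()                 # multiclass cross-entropy L_CE

features = torch.randn(4, 16, 768)                # dummy batch of frame features
labels = torch.tensor([0, 2, 1, 0])               # ground-truth class indices y

loss = criterion(model(features), labels)         # -sum_c y_c log p_theta(c | V)
loss.backward()
```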
3.2 Segment and Frame-Level Tasks
- Medical Visual Answer Localization (MVAL/MedVidQA): Given a question $Q$ and a video $V$, predict the temporal segment $\hat{s} = [\hat{t}_{\mathrm{start}}, \hat{t}_{\mathrm{end}}]$ that visually answers $Q$. Evaluation uses temporal Intersection-over-Union against the ground-truth span $s^{*}$: $\mathrm{IoU}(\hat{s}, s^{*}) = \frac{|\hat{s} \cap s^{*}|}{|\hat{s} \cup s^{*}|}$ (a computation sketch follows this list).
- Additional Fine-Grained Tasks: Temporal Action Grounding (TAG), Dense Video Captioning (DVC), Region Captioning (RC), and Spatiotemporal Grounding (STG), each requiring varying degrees of temporal or spatiotemporal reasoning.
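A minimal sketch of the temporal IoU, mIoU, and Recall@1 computations used for MVAL/TAG, assuming segments are (start, end) pairs in seconds; the example spans are illustrative.

```python
def temporal_iou(pred, gold):
    """IoU between two temporal segments given as (start, end) in seconds."""
    inter = max(0.0, min(pred[1], gold[1]) - max(pred[0], gold[0]))
    union = (pred[1] - pred[0]) + (gold[1] - gold[0]) - inter
    return inter / union if union > 0 else 0.0


def mean_iou(preds, golds):
    """mIoU over paired predicted / ground-truth segments."""
    return sum(temporal_iou(p, g) for p, g in zip(preds, golds)) / len(preds)


def recall_at_1(preds, golds, threshold=0.5):
    """Fraction of queries whose top-1 prediction reaches the IoU threshold."""
    return sum(temporal_iou(p, g) >= threshold for p, g in zip(preds, golds)) / len(preds)


# Illustrative predictions vs. ground truth (seconds).
preds = [(10.0, 32.0), (50.0, 70.0)]
golds = [(12.0, 30.0), (80.0, 95.0)]
print(mean_iou(preds, golds), recall_at_1(preds, golds, threshold=0.5))
```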
4. Evaluation Protocols and Metrics
MedVidBench pairs each task with metrics that reflect clinical precision and multimodal grounding:
- Classification:
- Accuracy, per-class and macro F1, Precision, Recall (MedVidCL).
- Localization and Grounding:
- Mean IoU (mIoU) and Recall@1 at fixed IoU thresholds (TAG, STG).
- Captioning:
- Hybrid reward: unweighted average of a normalized SentenceBERT cosine similarity and an LLM judge score, the latter derived from five clinical dimensions (terminology precision, identification, specificity, procedural context, action accuracy); a sketch follows this list.
- Standard F1 for event captioning (DVC).
- Data Partitions:
- Stratified splits by video to prevent overlap: MedVidCL (4,217/300/1,600 for train/val/test), MedVidQA (800/49/50 videos).
- Large-scale and “Standard” subsets available for multi-task evaluation (Su et al., 6 Dec 2025).
- Reward Normalization:
- Each task’s reward is remapped via logistic scaling centered on the baseline median and interquartile range to prevent RL collapse from cross-dataset reward imbalance (also sketched after this list).
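A hedged sketch of the hybrid captioning reward: the unweighted average of a normalized SentenceBERT cosine similarity and an LLM judge score. The judge is stubbed out here, and the 1-5 per-dimension scale and the [-1, 1] → [0, 1] similarity normalization are assumptions about how such a reward could be wired up, not the paper's exact recipe.

```python
from sentence_transformers import SentenceTransformer, util

_sbert = SentenceTransformer("all-MiniLM-L6-v2")  # checkpoint choice is an assumption

CLINICAL_DIMENSIONS = [
    "terminology precision", "identification", "specificity",
    "procedural context", "action accuracy",
]

def judge_scores(candidate: str, reference: str) -> dict:
    """Stub for the LLM judge; assumed to return a 1-5 score per clinical dimension."""
    return {dim: 4 for dim in CLINICAL_DIMENSIONS}  # placeholder scores

def hybrid_reward(candidate: str, reference: str) -> float:
    emb = _sbert.encode([candidate, reference], convert_to_tensor=True)
    cosine = util.cos_sim(emb[0], emb[1]).item()
    sim_norm = (cosine + 1.0) / 2.0                                  # assumed [-1,1] -> [0,1]
    scores = judge_scores(candidate, reference)
    judge_norm = (sum(scores.values()) / len(scores) - 1.0) / 4.0    # assumed 1-5 -> [0,1]
    return 0.5 * sim_norm + 0.5 * judge_norm                         # unweighted average

print(hybrid_reward("Grasper retracts the gallbladder.",
                    "The gallbladder is retracted with a grasper."))
```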
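And a minimal sketch of the per-task logistic reward remapping, centered on the baseline median and scaled by the interquartile range; the exact scaling form is an assumption, since the paper's formula is not reproduced in this summary.

```python
import numpy as np

def logistic_normalize(reward: float, baseline_rewards: np.ndarray, eps: float = 1e-6) -> float:
    """Remap a raw task reward to (0, 1) with a logistic curve centered on the
    baseline median and scaled by the baseline interquartile range (IQR), so that
    reward magnitudes stay comparable across datasets during RL training."""
    median = np.median(baseline_rewards)
    q1, q3 = np.percentile(baseline_rewards, [25, 75])
    iqr = max(q3 - q1, eps)
    return float(1.0 / (1.0 + np.exp(-(reward - median) / iqr)))

# Illustrative baselines for two tasks with very different reward scales.
grounding_baseline = np.random.default_rng(0).uniform(0.05, 0.25, size=200)
caption_baseline = np.random.default_rng(1).uniform(0.4, 0.9, size=200)

print(logistic_normalize(0.20, grounding_baseline))  # above-median grounding reward -> > 0.5
print(logistic_normalize(0.45, caption_baseline))    # below-median captioning reward -> < 0.5
```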
5. Baseline Methods, Performance, and Ablations
MedVidCL is evaluated with unimodal and multimodal baselines (Gupta et al., 2022):
- Language-only: BigBird-Base (95.7% macro F1; 94.3% F1 on the instructional class)
- Vision-only: ViT+Transformer (81.3% macro F1)
- Multimodal: subtitles+ViT+Transformer (83.6% macro F1)
- Language-only models outperform vision-only models; multimodal fusion yields only modest gains over vision alone.
MedVidQA (Gupta et al., 2022):
- VSL-Base (a BiDAF variant) far exceeds the random baseline (3.2% at IoU=0.5).
- VSL-Qgh (Query-Guided Highlighting) further boosts performance.
- Error modes: over-long predicted spans, subtle answer misalignment, confusion of narration with demonstration.
Supervised fine-tuning (SFT) of Qwen2.5-VL-7B on the extended MedVidBench outperforms GPT-4.1 and Gemini-2.5-Flash on all tasks (Su et al., 6 Dec 2025); RL training with MedGRPO further raises grounding (e.g., STG 0.177→0.202), event captioning (0.165→0.210), and summary LLM-judge scores. RL training collapse is averted only by the cross-dataset reward normalization described above, and captioning and grounding show mutual multi-task benefit.
6. Limitations and Prospective Extensions
Several challenges persist across released benchmarks:
- Fine-grained assessment: Accurate CVS scoring remains difficult due to modest data volume (4.4K samples) and model calibration issues.
- Error sensitivity: Over- or under-scoring of clinical exposures, and extraction of visually subtle steps, continue to hinder absolute accuracy.
- Scope: Coverage is currently limited to eight task archetypes and does not extend to 3D imaging or surgical tool usage prediction.
- Future directions: Opportunities exist for human-in-the-loop QA pair refinement, temporal reasoning enhancement, expansion to real-time modalities, architectural improvements incorporating audio and semantic action detectors, and further dataset scaling by mining medically certified channels (Gupta et al., 2022, Su et al., 6 Dec 2025).
7. Significance and Impact
MedVidBench is the first large-scale, expert-verified, multi-dataset benchmark for medical video QA and understanding. It establishes rigorous, clinically relevant baselines for vision-LLMs operating in high-stakes medical environments. By introducing challenging temporally and contextually grounded tasks, releasing large, validated datasets, and defining robust evaluation and training methodologies, MedVidBench catalyzes advances in multimodal learning and domain-specialized reinforcement learning for medical AI applications (Gupta et al., 2022, Su et al., 6 Dec 2025).