MedVideoCap-55K: Medical Video Dataset

Updated 2 July 2026

MedVideoCap-55K is a large-scale dataset featuring 55,803 meticulously annotated medical video clips with dual-level captions for precise video generation.
Clips are extracted and quality-checked with a pipeline that combines CLIP-based classifiers, OCR, aesthetic, and technical-quality filters, and manual spot-checks.
The MedGen model fine-tuned on this dataset scores 70.93 on Med-VBench, and its generated clips improve F1 on downstream medical video analysis tasks when used for data augmentation.

MedVideoCap-55K is a large-scale, caption-rich dataset designed specifically for medical video generation tasks. Consisting of over 55,000 meticulously curated clips, it provides granular textual annotation suited to the demands of high-fidelity, domain-precise visual synthesis in medicine. The dataset, together with the associated MedGen model, addresses the pronounced scarcity of high-quality, medically accurate resources for generative video systems in clinical, educational, and research applications (Wang et al., 8 Jul 2025).

1. Dataset Structure and Statistical Profile

MedVideoCap-55K comprises 55,803 video clips, each at a fixed 720×480 resolution. Clips average 8 seconds in duration, and approximately 80% fall between 6 and 10 seconds. The dataset is distributed across key medical modalities and use cases, including:

Domain/Use Case	Proportion	Example Types
Clinical practice	~30%	Surgeries, patient exams
Medical imaging	~25%	Ultrasound, endoscopy
Medical teaching	~20%	Classroom, demo videos
Science popularization	~15%	Animated physiology
Medical animation/simulation	~10%	3D models, procedural sims

Scenarios cover over 20 distinct procedures. Digestive-system content predominates (~40%), followed by general anatomy/animation (~25%), cardiovascular (~15%), musculoskeletal (~10%), and respiratory (~10%). Each video is annotated at two levels: a brief caption (15–20 words) for prompt-based generation, and a detailed caption (174 words on average) describing actions, tools, and anatomical context. No per-frame timestamps or pixel-level masks are provided; annotation is purely textual, with captions synthesized from 8 evenly sampled frames per video.

2. Data Collection Pipeline and Quality Assurance

Source videos were identified from the public YouTube corpus (25 million candidates) through a staged pipeline: initial medical-keyword matching, followed by a metadata classifier (snowflake-arctic-embed-m) yielding 37,000 seed videos, and subsequent channel expansion to 140,000 videos (~10,269 hours). Segment extraction criteria included:

Frame-level domain classifier using a CLIP-based model $C(x_i)$
Temporal coherence via CLIP embedding cosine similarity $S(x_i, x_{i-1}) > \tau$
Minimum segment length of 6 seconds at 1 FPS
Minimum resolution of 480p
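
The segment criteria above can be sketched as a single pass over frames sampled at 1 FPS. This is a minimal illustration, not the authors' implementation: the function names, the threshold value `tau=0.9`, and the choice to split a segment at a similarity drop are assumptions, and the per-frame classifier outputs and CLIP embeddings are taken as precomputed inputs.

```python
import math
from typing import List, Sequence, Tuple


def cosine_similarity(a: Sequence[float], b: Sequence[float]) -> float:
    """Cosine similarity between two embedding vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    na = math.sqrt(sum(x * x for x in a))
    nb = math.sqrt(sum(y * y for y in b))
    return dot / (na * nb) if na > 0 and nb > 0 else 0.0


def extract_segments(
    is_medical: Sequence[bool],            # classifier decision C(x_i) per frame
    embeddings: Sequence[Sequence[float]],  # one embedding per frame
    tau: float = 0.9,                       # assumed value; not given in the text
    min_len: int = 6,                       # 6 frames at 1 FPS, i.e. about 6 seconds
) -> List[Tuple[int, int]]:
    """Return half-open [start, end) frame ranges that satisfy all criteria."""
    segments: List[Tuple[int, int]] = []
    start = None
    for i, ok in enumerate(is_medical):
        if ok and start is None:
            start = i  # open a new segment
        elif ok:
            # S(x_i, x_{i-1}) must exceed tau, otherwise treat it as a cut
            if cosine_similarity(embeddings[i], embeddings[i - 1]) <= tau:
                if i - start >= min_len:
                    segments.append((start, i))
                start = i
        elif start is not None:
            # a non-medical frame closes the current segment
            if i - start >= min_len:
                segments.append((start, i))
            start = None
    if start is not None and len(is_medical) - start >= min_len:
        segments.append((start, len(is_medical)))
    return segments
```

Segments shorter than `min_len` frames are dropped rather than merged with their neighbours, which keeps every retained clip internally coherent.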

Post-processing filters encompassed black-border removal (OpenCV + Hough lines), subtitle/OCR rejection if more than 20 words per frame (EasyOCR), aesthetic scoring via LAION predictor (threshold ≥ 3.0), and technical quality via Dover predictor (score = 0 or joint Dover ≤ 0.3 or aesthetic ≥ 4.0).
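
As a rough illustration of how the clearly stated thresholds combine, the sketch below rejects a clip when OCR finds more than 20 words in any frame or when its aesthetic score falls below 3.0. The `ClipQuality` container and `passes_filters` function are hypothetical names, the scores are assumed to be precomputed by the respective tools, and the Dover-based rule is left out because its thresholds are not unambiguous in the description above.

```python
from dataclasses import dataclass


@dataclass
class ClipQuality:
    """Precomputed quality signals for one clip (hypothetical container)."""
    max_ocr_words_per_frame: int  # e.g. the largest word count an OCR pass found in a frame
    aesthetic_score: float        # e.g. output of an aesthetic predictor


def passes_filters(
    q: ClipQuality,
    max_words: int = 20,         # reject if more than 20 words per frame
    min_aesthetic: float = 3.0,  # keep only clips scoring at least 3.0
) -> bool:
    """Return True if the clip survives the OCR and aesthetic filters."""
    if q.max_ocr_words_per_frame > max_words:
        return False
    return q.aesthetic_score >= min_aesthetic
```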

Captions were generated using GPT-4o, given eight frames, video metadata, and transcripts, with strict prompt guidelines. Manual spot-checks (200 per stage) confirmed annotation quality of at least 95%. The license restricts use to research, and the dataset is stated to exclude identifiable patient data and sensitive clinical content.
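
The eight frames passed to the captioner are described as evenly sampled. One plausible scheme, shown here as an assumption rather than the documented method, takes the centre of each of eight equal-width bins over the clip; `sample_frame_indices` is a hypothetical helper.

```python
from typing import List


def sample_frame_indices(num_frames: int, k: int = 8) -> List[int]:
    """Pick k evenly spaced frame indices (bin centres) from a clip."""
    if num_frames <= 0 or k <= 0:
        raise ValueError("num_frames and k must be positive")
    if num_frames <= k:
        # short clip: use every frame
        return list(range(num_frames))
    return [int((i + 0.5) * num_frames / k) for i in range(k)]
```

For an 8-second clip decoded at 30 FPS (240 frames), this selects one frame per second of video, offset by half a second.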

3. Caption Schema and Annotation Practices

The annotation schema features two textual layers. The brief caption provides a high-level summary for prompt-based tasks, while the detailed caption includes procedural, anatomical, and action-oriented specifics, e.g., “The surgeon inserts a 5 mm trocar at the McBurney point (≈ $L_1$ abdominal quadrant), then introduces a grasping forceps to retract the appendix.”

Anatomical references are frequently rendered in LaTeX notation for disambiguation and to support downstream parsing, e.g., “McBurney point ( $M$ )” or “forceps axis $\vec{f}$ makes angle $\theta \approx 30^\circ$ with patient midline.” To ensure consistency and reduce ambiguities, shared prompt templates are used in multi-modal LLM settings, post-generation regex routines enforce the presence of key terminology, and any hallucinated or ambiguous descriptions are manually corrected.
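
A post-generation terminology check of the kind described could look like the sketch below. It is illustrative only: the function name, the whole-word matching rule, and the example term list are assumptions, and the actual regex routines are not specified in the text.

```python
import re
from typing import Iterable, List


def missing_terms(caption: str, required_terms: Iterable[str]) -> List[str]:
    """Return the required terms that do not occur as whole words in the caption."""
    missing = []
    for term in required_terms:
        # look-arounds stop "append" from matching inside "appendix"
        pattern = r"(?<!\w)" + re.escape(term) + r"(?!\w)"
        if not re.search(pattern, caption, flags=re.IGNORECASE):
            missing.append(term)
    return missing
```

A caption for which `missing_terms` returns a non-empty list would then be routed to manual correction.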

4. MedGen Model Architecture and Training Methodology

MedGen utilizes a latent-space video diffusion architecture adapted from HunyuanVideo, with modifications for domain transfer:

LoRA modules (rank 32) added to cross-attention layers for domain knowledge injection
Medical-domain text encoder derived from a CLIP variant fine-tuned on biomedical data
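
For reference, the standard LoRA parameterization of an adapted weight matrix is given below with the stated rank $r = 32$. The scaling factor $\alpha$ and the specific projection matrices that receive adapters are not given in the text and are left symbolic here.

$W' = W_0 + \frac{\alpha}{r}\,B A, \qquad B \in \mathbb{R}^{d_\text{out} \times r},\; A \in \mathbb{R}^{r \times d_\text{in}},\; r = 32$

Under this form the pretrained weight $W_0$ stays frozen and each adapted matrix contributes $r\,(d_\text{in} + d_\text{out})$ trainable parameters instead of $d_\text{in}\,d_\text{out}$.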

The model is pretrained on general video data, then fine-tuned on MedVideoCap-55K. Fine-tuning uses 8× NVIDIA A800 GPUs for 50,000 steps with batch size 32, LoRA rank 32, learning rate 5×10⁻⁵, bf16 precision, 93 diffusion timesteps, and a guidance scale of 1.0. The core loss is the latent-space DDPM mean squared error:

$\mathcal{L}_\text{diff} = \mathbb{E}_{z_0,\epsilon,t}\big[\|\epsilon - \epsilon_\theta(z_t,t,\text{cap})\|^2\big]$

An optional medical-fidelity term is proposed:

$\mathcal{L}_\text{med} = \lambda\,\mathbb{E}_{\text{gen}}\big[1 - \text{Sim}_\text{med}(\hat{z}, \text{cap})\big]$

$\mathcal{L} = \mathcal{L}_\text{diff} + \mathcal{L}_\text{med}$

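Generic training follows the usual latent-diffusion recipe: encode each clip to a latent $z_0$, sample a timestep $t$ and noise $\epsilon$, form the noised latent $z_t$, predict $\epsilon_\theta(z_t, t, \text{cap})$, and take a gradient step on $\mathcal{L}$.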
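
The loss terms can be illustrated with a small, framework-free sketch for a single sample. It is not the MedGen training code: the forward-noising rule is the standard DDPM one (the noise schedule is not stated in the text), latents are plain lists of floats, and `sim_med` stands in for the output of an unspecified medical similarity scorer.

```python
import math
from typing import List, Optional, Sequence


def add_noise(z0: Sequence[float], eps: Sequence[float], alpha_bar_t: float) -> List[float]:
    """Standard DDPM forward step: z_t = sqrt(a) * z_0 + sqrt(1 - a) * eps."""
    a = math.sqrt(alpha_bar_t)
    b = math.sqrt(1.0 - alpha_bar_t)
    return [a * z + b * e for z, e in zip(z0, eps)]


def diffusion_loss(eps: Sequence[float], eps_pred: Sequence[float]) -> float:
    """Squared L2 error between true and predicted noise for one sample."""
    return sum((e - p) ** 2 for e, p in zip(eps, eps_pred))


def medical_fidelity_loss(sim_med: float, lam: float) -> float:
    """Optional term lambda * (1 - Sim_med); sim_med comes from an assumed scorer."""
    return lam * (1.0 - sim_med)


def total_loss(
    eps: Sequence[float],
    eps_pred: Sequence[float],
    sim_med: Optional[float] = None,
    lam: float = 0.0,
) -> float:
    """L = L_diff + L_med, with L_med skipped when no similarity is supplied."""
    loss = diffusion_loss(eps, eps_pred)
    if sim_med is not None:
        loss += medical_fidelity_loss(sim_med, lam)
    return loss
```

In a real training loop these per-sample values would be averaged over the batch to approximate the expectations in the formulas.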

5. Evaluation Protocols and Empirical Results

Evaluation is conducted on Med-VBench (adapted from VBench) and the VideoScore suite. Metrics include:

Imaging Quality
Subject Consistency
Background Consistency
Motion Smoothness
Warping Error ( $\mathrm{WarpErr} = \frac{1}{N}\sum_{i=1}^N d(F(\hat{x}_{i-1}), \hat{x}_i)$ )
Visual Quality (VideoScore)
Temporal Consistency (VideoScore)
Text Alignment (VideoScore)
Factual Consistency (VideoScore)
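
The warping-error formula can be made concrete with a short sketch. The names are hypothetical: `warp` plays the role of $F$ (in practice an optical-flow-based warp, passed in here as a plain callable), and mean absolute difference is used for $d$ as an assumption, since the distance is not specified above.

```python
from typing import Callable, Sequence

Frame = Sequence[float]  # a frame flattened to a list of pixel values


def mean_abs_diff(a: Frame, b: Frame) -> float:
    """Assumed distance d: mean absolute pixel difference."""
    return sum(abs(x - y) for x, y in zip(a, b)) / len(a)


def warping_error(
    frames: Sequence[Frame],
    warp: Callable[[Frame], Frame],
    dist: Callable[[Frame, Frame], float] = mean_abs_diff,
) -> float:
    """Average d(F(x_{i-1}), x_i) over the N consecutive frame pairs."""
    if len(frames) < 2:
        raise ValueError("need at least two frames")
    n = len(frames) - 1
    total = 0.0
    for i in range(1, len(frames)):
        total += dist(warp(frames[i - 1]), frames[i])
    return total / n
```

With an identity warp the metric reduces to the mean frame-to-frame difference, so a perfectly static video scores 0.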

Med-VBench total scores:

Model	Med-VBench total score
MedGen (open-source)	70.93
Hailuo	69.45
Pika (proprietary)	70.29
Kling	72.32
Sora	71.92

Human evaluation by three physicians covered Text Alignment, Medical Accuracy, and Visual Quality; MedGen won the majority vote in ≥70% of matchups, with strong inter-rater reliability (Cohen’s κ ≈ 0.80). In downstream augmentation, adding MedGen-generated data improved F1 on HyperKvasir by +15.3% and on SurgVisDom by +11.7%.
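
For readers unfamiliar with the agreement statistic, the sketch below computes Cohen's κ for one pair of raters, plus a simple majority vote. Cohen's κ is defined for two raters, and how it was aggregated across the three physicians is not stated above, so averaging pairwise values is only one possible reading. The label strings are hypothetical.

```python
from collections import Counter
from typing import Hashable, Sequence


def cohens_kappa(r1: Sequence[Hashable], r2: Sequence[Hashable]) -> float:
    """Cohen's kappa: (p_o - p_e) / (1 - p_e) for two raters on the same items."""
    if len(r1) != len(r2) or not r1:
        raise ValueError("ratings must be non-empty and of equal length")
    n = len(r1)
    p_o = sum(a == b for a, b in zip(r1, r2)) / n      # observed agreement
    c1, c2 = Counter(r1), Counter(r2)
    p_e = sum(c1[k] * c2[k] for k in c1) / (n * n)      # chance agreement
    if p_e == 1.0:
        return 1.0  # both raters used a single identical label throughout
    return (p_o - p_e) / (1.0 - p_e)


def majority_vote(votes: Sequence[Hashable]) -> Hashable:
    """Return the most common vote among the raters for one matchup."""
    return Counter(votes).most_common(1)[0][0]
```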

6. Access, Usage, and Prospective Research Directions

The MedVideoCap-55K dataset is accessible via the GitHub repository https://github.com/FreedomIntelligence/MedGen and through HuggingFace Datasets.
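
Loading through the HuggingFace `datasets` library might look like the sketch below. The repository identifier and split name are assumptions and should be checked against the GitHub README; `load_medvideocap` is a hypothetical wrapper, and the import is deferred so the function can be defined without the library installed.

```python
def load_medvideocap(
    repo_id: str = "FreedomIntelligence/MedVideoCap-55K",  # assumed identifier
    split: str = "train",                                   # assumed split name
    streaming: bool = True,
):
    """Load the dataset from the HuggingFace Hub (requires `pip install datasets`)."""
    from datasets import load_dataset  # deferred import

    # streaming=True iterates over records without downloading everything first
    return load_dataset(repo_id, split=split, streaming=streaming)
```

Streaming is used by default here as a design choice, since a video dataset of this size is costly to download in full before inspection.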

For MedGen inference pipeline usage, see the GitHub repository.

Principal applications include simulation-based clinical and surgical training, patient-education video generation, and large-scale data augmentation for medical video analysis (e.g., tool detection, phase identification). Identified research directions encompass time-aligned frame-level captioning, 3D/VR extensions, interactive LLM-video agents, and richer anatomical labeling (Wang et al., 8 Jul 2025).

References (1)

Wang et al. (2025). MedGen: Unlocking Medical Video Generation by Scaling Granularly-annotated Medical Videos.
