
REAL-Colon-VQA: Temporal VideoQA Dataset

Updated 10 November 2025
  • REAL-Colon-VQA is a colonoscopic VideoQA dataset featuring eight-frame clips and fine-grained temporal segmentation for dynamic clinical event annotation.
  • It employs dual-format QA pairs—both short and long forms—to evaluate categorical accuracy and linguistic generalization in surgical contexts.
  • The dataset benchmarks AI models using metrics like BLEU-4, ROUGE-L, METEOR, and top-1 accuracy, offering scalable potential for advanced clinical tasks.

REAL-Colon-VQA is a colonoscopic video question answering (VideoQA) dataset developed to advance temporally-grounded scene understanding in surgical AI systems. Unlike prior datasets restricted to static images or lacking dynamic event annotation, REAL-Colon-VQA introduces fine-grained temporal segmentation, motion-oriented question types, and semantically varied test scenarios. The corpus is designed to benchmark models that can reason across procedural video clips, challenging current paradigms that neglect intraoperative temporal dynamics.

1. Dataset Scope and Collection

REAL-Colon-VQA sources its material from the REAL-Colon corpus, which contains high-definition (1920×1080, 30 fps) videos of routine adult diagnostic and screening colonoscopy procedures. Each lesion captured undergoes histopathological verification, ensuring diagnostic rigor. The primary dataset consists of 5,200 eight-frame video clips (≈0.93 s per clip), with each segment associated with a short and long-form QA pair. Clips are partitioned into 4,450 for training and 750 for testing, sampled at regular intervals (stride=4 frames) to maximize coverage of operative variability.
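One consistent reading of the stride-4 sampling is that each eight-frame clip takes every fourth source frame, spanning 28 frames (≈0.93 s at 30 fps) and matching the start_frame/end_frame fields in the annotation schema shown later. A minimal sketch of this clip geometry (the helper function is hypothetical, not part of the released toolkit):

```python
def clip_frame_indices(start_frame: int, stride: int = 4, num_frames: int = 8) -> list[int]:
    """Source-frame indices for one eight-frame clip sampled every `stride` frames.

    With stride=4 the clip spans (num_frames - 1) * stride = 28 frames,
    i.e. ~0.93 s at 30 fps (e.g. start_frame=1024 -> last index 1052).
    """
    return [start_frame + stride * k for k in range(num_frames)]


assert clip_frame_indices(1024) == [1024, 1028, 1032, 1036, 1040, 1044, 1048, 1052]
```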

2. Annotation Protocol

Temporal annotation in REAL-Colon-VQA relies on fixed-length segmentation: each eight-frame window is included only if the annotated event holds for the majority of frames. Frame-by-frame analysis by trained annotators marks surgical actions such as tool use (catheter, snare, forceps), scope motion (advancing, withdrawing, exiting), illumination mode, occlusion, and flushing, while lesion attributes (size, location, histopathology) are inherited from the original corpus. QA pairs are generated automatically from clinical templates, then paraphrased—rendered as "out-of-template" instances—for robustness testing. Dual expert reviews resolve annotation conflicts, and only consensus-labeled pairs are accepted.
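A minimal sketch of the majority-of-frames inclusion rule, assuming per-frame event flags are available as booleans (the label format here is purely illustrative):

```python
def window_has_event(frame_flags: list[bool]) -> bool:
    """Return True when the annotated event holds for the majority of frames
    in a fixed-length window, the stated condition for keeping the window."""
    return sum(frame_flags) > len(frame_flags) / 2


# Event present in 5 of 8 frames -> the window (and its QA pair) is retained.
print(window_has_event([True, True, True, True, True, False, False, False]))  # True
```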

3. Question Taxonomy and Distribution

REAL-Colon-VQA organizes its 5,200 clips across 18 categories spanning six reasoning domains: Instruments (presence, identity, count), Sizing (lesion dimension in mm), Diagnosis (histopathology), Positions (anatomical, on-screen location), Operation Notes (illumination, visibility, flushing, occlusion, irrigation), and Movement (scope actions, lesion drift, camera shake). The test set features an out-of-template subset (~20%), comprising semantically altered/paraphrased questions. This approach stress-tests model linguistic generalization beyond rigid templates.

The approximate distribution across categories is summarized below:

| Domain | Categories (n) | Training / Test clips (per category) |
|---|---|---|
| Instruments | 3 | ~867 / 150 |
| Sizing | 1 | ~1,150 / 200 |
| Diagnosis | 1 | ~1,150 / 200 |
| Positions | 2 | ~575 / 100 |
| Operation Notes | 5 | ~460 / 70 |
| Movement | 5 | ~460 / 70 |

Clip-level annotation structure employs the following JSON schema:

```json
{
  "video_id": "video_003",
  "clip_id": "clip_0421",
  "question_id": "q_0421_a",
  "start_frame": 1024,
  "end_frame": 1052,
  "domain": "Movement",
  "category": "scope_advancing",
  "out_of_template": false,
  "question_text": "Is the endoscope being withdrawn?",
  "answer_short": "no",
  "answer_long": "The scope is being held stationary; no withdrawal detected."
}
```
A plausible implication is that this fine granularity supports benchmarking models across both categorical and free-text QA tasks, and enables evaluation of reasoning over multi-frame procedural contexts.
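A typed container mirroring the schema can make downstream parsing less error-prone; a minimal sketch (the dataclass itself is illustrative, not part of the dataset release):

```python
from dataclasses import dataclass


@dataclass
class VQARecord:
    """One clip-level QA record, field-for-field with the JSON schema above."""
    video_id: str
    clip_id: str
    question_id: str
    start_frame: int
    end_frame: int
    domain: str
    category: str
    out_of_template: bool
    question_text: str
    answer_short: str
    answer_long: str


# Build a record directly from a parsed JSON dict.
example = {
    "video_id": "video_003", "clip_id": "clip_0421", "question_id": "q_0421_a",
    "start_frame": 1024, "end_frame": 1052, "domain": "Movement",
    "category": "scope_advancing", "out_of_template": False,
    "question_text": "Is the endoscope being withdrawn?",
    "answer_short": "no",
    "answer_long": "The scope is being held stationary; no withdrawal detected.",
}
record = VQARecord(**example)
```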

4. Data Structure and Access

The dataset is organized as follows:

  • /videos/: full-length procedure .mp4 videos, accompanied by a subdirectory /clips/ of eight-frame segments.
  • /annotations/: principal JSON file (real_colon_vqa.json) enumerating QA pairs per clip.
  • /metadata/: secondary JSON (real_colon_metadata.json) with lesion attributes, patient demographics, video resolution, and frame rate.

Distribution is under CC-BY-4.0 licensing, with access to full-resolution video gated by a Data Use Agreement to safeguard patient confidentiality. All data and code are available at https://github.com/madratak/SurgViVQA.
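A minimal loading sketch under the layout above, assuming real_colon_vqa.json is a flat list of clip-level records and real_colon_metadata.json is keyed by video_id (both file structures are assumptions beyond the description given here):

```python
import json
from pathlib import Path

root = Path("REAL-Colon-VQA")  # hypothetical local copy of the dataset

# Assumed: a flat list of QA records shaped like the schema shown earlier.
qa_records = json.loads((root / "annotations" / "real_colon_vqa.json").read_text())

# Assumed: per-video lesion attributes, demographics, resolution, frame rate.
metadata = json.loads((root / "metadata" / "real_colon_metadata.json").read_text())

# Attach parent-video metadata to every QA record for joint analysis.
joined = [{**rec, "video_meta": metadata.get(rec["video_id"], {})} for rec in qa_records]
print(f"{len(joined)} QA records loaded")
```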

5. Evaluation Metrics and Baseline Models

Benchmarking employs multiple standard QA metrics over both in-template and out-of-template conditions (a minimal scoring sketch follows the list):

  • BLEU-4: $\mathrm{BLEU\text{-}4} = \exp\left(\min\left(0,\ 1 - \tfrac{|\mathrm{ref}|}{|\mathrm{hyp}|}\right) + \tfrac{1}{4}\sum_{n=1}^{4} \log p_n\right)$, where $p_n$ is the modified $n$-gram precision.
  • ROUGE-L: longest common subsequence F-measure.
  • METEOR: harmonic mean of unigram precision/recall (synonym handling included).
  • Keyword Accuracy (K-ACC): $\mathrm{K\text{-}ACC} = \dfrac{\#\{\text{correct keywords}\}}{\#\{\text{total keywords in refs}\}}$
  • Top-1 Accuracy (categorical): $\mathrm{Acc}_{\mathrm{Top\text{-}1}} = \dfrac{1}{N}\sum_{i=1}^{N} \mathbf{1}\{\hat{y}_i = y_i\}$
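Both K-ACC and top-1 accuracy reduce to simple counting; a minimal scoring sketch (whitespace tokenization and exact keyword matching are simplifying assumptions):

```python
def keyword_accuracy(pred: str, ref_keywords: list[str]) -> float:
    """K-ACC: fraction of reference keywords that appear in the prediction."""
    pred_tokens = set(pred.lower().split())
    hits = sum(1 for kw in ref_keywords if kw.lower() in pred_tokens)
    return hits / len(ref_keywords) if ref_keywords else 0.0


def top1_accuracy(preds: list[str], labels: list[str]) -> float:
    """Categorical top-1 accuracy: mean of exact matches."""
    return sum(p == y for p, y in zip(preds, labels)) / len(labels)


print(keyword_accuracy("no withdrawal detected", ["no", "withdrawal"]))  # 1.0
print(top1_accuracy(["no", "snare"], ["no", "forceps"]))                 # 0.5
```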

Performance comparison (BLEU-4 / ROUGE-L / METEOR / K-ACC):

| Model | In-Template | Out-of-Template |
|---|---|---|
| PitVQA (Image) | 64.6 / 78.5 / 80.0 / 54.1 | 23.6 / 50.0 / 53.2 / 42.9 |
| SurgViVQA (GPT-2) | 72.0 / 82.9 / 84.1 / 63.2 | 31.2 / 53.6 / 54.9 / 47.7 |

Relative to PitVQA, SurgViVQA improves in-template K-ACC from 54.1 to 63.2 (+9.1 points) and out-of-template K-ACC from 42.9 to 47.7 (+4.8 points), indicating stronger temporal reasoning and greater linguistic robustness.

6. Practical Usage and Extension Recommendations

The temporal annotations can be leveraged for data augmentation by applying eight-frame sliding windows (stride = 4), with masked-video pretraining (e.g., VideoMAE) recommended to encode motion priors and tube masking preserved during fine-tuning. Incorporating temporal positional embeddings or relative frame indices is recommended to reinforce frame ordering, which may enhance motion and event inference; a sketch follows.
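A minimal sketch of the sliding-window augmentation with relative frame indices, assuming frames are already decoded into an array (the tensor layout and helper are assumptions, not the benchmark's reference pipeline):

```python
import numpy as np


def sliding_windows(frames: np.ndarray, window: int = 8, stride: int = 4):
    """Yield overlapping eight-frame windows plus relative frame indices.

    frames: decoded video frames of shape (T, H, W, C).
    The relative indices (0..window-1) can drive a temporal positional
    embedding so the model retains frame ordering within each clip.
    """
    for start in range(0, len(frames) - window + 1, stride):
        clip = frames[start:start + window]
        rel_idx = np.arange(window)   # relative position within the clip
        yield clip, rel_idx, start    # start gives the absolute frame offset


# Example on a dummy 32-frame video (tiny spatial size for illustration).
dummy = np.zeros((32, 8, 8, 3), dtype=np.uint8)
print(sum(1 for _ in sliding_windows(dummy)))  # 7 overlapping clips
```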

Potential dataset expansions may include new clinical question types (e.g., polyp Paris classification, resection methods), addition of multi-modal signals (e.g., instrument kinematics, irrigation pressure, audio commentary), longer temporal context windows (16–32 frames) for duration-sensitive questions, and the introduction of counting/comparative tasks (e.g., “How many polyps are visible?”, “Which lesion is larger?”). This suggests REAL-Colon-VQA can serve as a scalable foundation for future procedural VideoQA research with progressively complex diagnostic and event-based reasoning.

7. Contextual Significance and Research Directions

REAL-Colon-VQA and its associated SurgViVQA benchmark (Drago et al., 5 Nov 2025) establish a framework for temporally-aware VideoQA in surgical imaging by integrating motion events and diagnostic interpretation across systematically annotated clips. This paradigm addresses limitations inherent to static-frame VQA and tests both temporal modeling and linguistic generalizability. A plausible implication is that REAL-Colon-VQA may catalyze development of more robust AI systems capable of nuanced scene understanding in other endoscopic or intraoperative contexts, particularly those requiring dynamic event detection and clinical reasoning.
