MVQ-60: Multi-Hop VideoQG Dataset
- MVQ-60 is a large-scale dataset for multi-hop video question generation, merging zero-hop QA pairs to create two-hop questions that span distinct video segments.
- It utilizes an automated pipeline over TVQA⁺ annotations with formal segment alignment and semantic overlap criteria to ensure high-quality multi-hop questions.
- The dataset covers 21,793 clips from six TV series, supporting robust vision-language research with structured splits and validated annotation protocols.
MVQ-60 is a large-scale dataset constructed to benchmark multi-hop video question generation (VideoQG) requiring reasoning across temporally separated video segments. Developed for the evaluation and training of algorithms capable of compositional and multi-hop understanding in vision-language tasks, MVQ-60 provides structured, high-quality question–answer pairs derived from multi-step interactions with long-form TV episode content. The dataset is instantiated via an automated pipeline over TVQA⁺ annotations using formal criteria for segment alignment and semantic overlap, resulting exclusively in two-hop questions that demand integrating facts across distinct portions of the video narrative.
1. Formal Specification
MVQ-60 adopts a rigorous multi-hop construction framework. Let the base set $\mathcal{Z} = \{(q_i, a_i, m_i)\}$ consist of zero-hop question, answer, and metadata triples from TVQA⁺, with metadata $m_i = (\mathrm{ep}_i, \mathrm{seg}_i)$ denoting episode and segment. Multi-hop (MVQ-2) questions are defined by pairing two zero-hop instances $(q_1, a_1, m_1)$ and $(q_2, a_2, m_2)$ within the same episode but different segments ($\mathrm{ep}_1 = \mathrm{ep}_2$, $\mathrm{seg}_1 \neq \mathrm{seg}_2$), and ensuring that the answer to the second occurs as a substring in the first question (i.e., $a_2$ appears as a span in $q_1$):

$$Q^{(2)} = q_1[a_2 \rightarrow q_2],$$

where the span $a_2$ is replaced by the second question $q_2$ and the merged question inherits the first instance's gold answer $a_1$. The principle generalizes to MVQ-$k$ formulations via iterated substitution,

$$Q^{(k)} = Q^{(k-1)}[a_k \rightarrow q_k], \qquad Q^{(1)} = q_1.$$

MVQ-60 contains only two-hop ($k = 2$) questions following this merge procedure. The full pseudocode detailing this operation is presented in Appendix Section A.1 (Algorithm 1) of (Phukan et al., 11 Nov 2025).
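The merge can be made concrete with a short Python sketch. The `ZeroHopQA` container and `merge_two_hop` function below are illustrative names, not the released implementation (see Algorithm 1 in the paper's appendix), and the bracketed substitution format is likewise an assumption.

```python
from dataclasses import dataclass
from typing import Optional

@dataclass
class ZeroHopQA:
    """Illustrative zero-hop entry: question, answer, and episode/segment metadata."""
    question: str
    answer: str
    episode: str
    segment: str

def merge_two_hop(first: ZeroHopQA, second: ZeroHopQA) -> Optional[str]:
    """Build an MVQ-2 question by substituting the second question for its answer
    span inside the first question. Returns None if the pair violates the
    alignment criteria (same episode, different segments, answer-in-question)."""
    same_episode = first.episode == second.episode
    different_segments = first.segment != second.segment
    answer_in_question = second.answer in first.question
    if not (same_episode and different_segments and answer_in_question):
        return None
    # Replace the span a2 in q1 with q2; the merged question keeps a1 as its gold answer.
    return first.question.replace(second.answer, f"[{second.question.rstrip('?')}]", 1)
```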
2. Automated Construction Pipeline
MVQ-60 eschews manual annotation in favor of a scalable algorithmic merging strategy, closely inspired by the MUSIQUE textual multi-hop generation protocol [Trivedi et al. ’22]. Beginning with TVQA⁺’s corpus of 152,545 zero-hop QA pairs, filtering is first applied to retain only entries with concise answers and readable questions (bounded answer and question word counts), forming the filtered zero-hop base set. For each episode, every possible ordered pair of QA triples from distinct segments is examined; if the answer of the second is found in the text of the first question, the formal merge operation is executed to produce a two-hop question. Notably, no balancing for answer-type diversity is applied, leveraging TVQA⁺’s inherent variance spanning character, object, location, and action types. Approximately 60,000 unique questions result from this deterministic multi-hop aggregation.
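A minimal sketch of this enumeration loop is given below, reusing `ZeroHopQA` and `merge_two_hop` from the sketch above. The word-count thresholds are placeholders (the exact filtering limits are not reproduced in this summary), and the output record format is illustrative.

```python
from collections import defaultdict
from itertools import permutations
from typing import Dict, List

# Placeholder thresholds: the pipeline filters on answer and question word counts,
# but the exact limits are not reproduced here.
MAX_ANSWER_WORDS = 3
MAX_QUESTION_WORDS = 25

def build_mvq2(zero_hop: List[ZeroHopQA]) -> List[dict]:
    """Deterministically enumerate two-hop questions per episode (illustrative sketch)."""
    # 1. Filter for concise answers and readable questions.
    base = [
        z for z in zero_hop
        if len(z.answer.split()) <= MAX_ANSWER_WORDS
        and len(z.question.split()) <= MAX_QUESTION_WORDS
    ]
    # 2. Group by episode so merges never cross episode boundaries.
    by_episode: Dict[str, List[ZeroHopQA]] = defaultdict(list)
    for z in base:
        by_episode[z.episode].append(z)
    # 3. Examine every ordered pair from the episode; keep pairs whose merge succeeds.
    merged = []
    for entries in by_episode.values():
        for first, second in permutations(entries, 2):
            question = merge_two_hop(first, second)
            if question is not None:
                merged.append({
                    "question": question,
                    "answer": first.answer,           # gold answer inherited from q1
                    "episode": first.episode,
                    "segments": (first.segment, second.segment),
                })
    return merged
```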
3. Composition, Splits, and Statistical Profile
MVQ-60 covers 21,793 video clips (average duration 75.9 s), sampled from six disparate television series: “Friends,” “The Big Bang Theory,” “How I Met Your Mother,” “House M.D.,” “Grey’s Anatomy,” and “Castle.” Collectively, these form 460 hours of video data and supply the source for both questions and answers.
The dataset comprises:
- 60,000 two-hop (MVQ-2) questions.
- 60,000 corresponding gold-standard answers, inherited from original TVQA⁺ entries.
- Average question length: ~27 words; average answer length: ≤3 words; answer types span character names, object labels, noun phrases, and yes/no categorical forms.
Splits are administered at the episode level to give non-overlapping partitions:

| Split      | Percentage | Approximate Question Count |
|------------|------------|----------------------------|
| Training   | 80%        | ≈48,000                    |
| Validation | 10%        | ≈6,000                     |
| Test       | 10%        | ≈6,000                     |
A plausible implication is that episode-level separation mitigates information leakage between splits, thereby supporting robust generalisation evaluation.
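A hedged sketch of how episode-level partitioning can be realized is shown below; the 80/10/10 fractions follow the table above, while the shuffling scheme, the `split_by_episode` name, and the record format are assumptions for illustration.

```python
import random
from typing import Dict, List

def split_by_episode(questions: List[dict], seed: int = 0) -> Dict[str, List[dict]]:
    """Assign whole episodes (not individual questions) to train/validation/test,
    so no episode contributes questions to more than one split."""
    episodes = sorted({q["episode"] for q in questions})
    rng = random.Random(seed)          # fixed seed keeps the partition reproducible
    rng.shuffle(episodes)

    n = len(episodes)
    train_cut, val_cut = int(0.8 * n), int(0.9 * n)   # 80% / 10% / 10%
    assignment = {ep: "train" for ep in episodes[:train_cut]}
    assignment.update({ep: "validation" for ep in episodes[train_cut:val_cut]})
    assignment.update({ep: "test" for ep in episodes[val_cut:]})

    splits: Dict[str, List[dict]] = {"train": [], "validation": [], "test": []}
    for q in questions:
        splits[assignment[q["episode"]]].append(q)
    return splits
```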
4. Annotation Protocol and Quality Assessment
Quality control was carried out via expert annotation and automatic semantic validation. A random sample of 200 merged questions was scored along six axes (Fluency, Relevance, Multi-Hop Reasoning, Engagingness, Factual Correctness, Inclusiveness) by three trained annotators using a 0–3 scale. Mean scores achieved:
- Fluency: 2.92
- Multi-Hop Reasoning: 3.00
- Engagingness: 2.80
- Factual Correctness: 3.00
Cohen’s κ inter-rater agreement of 0.72 indicates substantial consistency. Additional automatic metrics, including BERTScore F1 (0.79) and GPT-5 self-evaluation (Fluency 2.88, Multi-Hop 2.82, Factual 2.74) on another 200-question subset, further substantiate grammaticality, cross-segment reasoning, and factual grounding.
This suggests MVQ-60 substantially meets its multi-hop reasoning objectives and offers reliable input for algorithmic benchmarking under vision-language paradigms.
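For reference, the agreement and semantic-similarity figures above can in principle be computed with standard tooling; the snippet below is a toy illustration using scikit-learn's `cohen_kappa_score` and the `bert-score` package, with made-up ratings and sentences rather than the actual annotation data.

```python
from sklearn.metrics import cohen_kappa_score   # pip install scikit-learn
from bert_score import score as bert_score      # pip install bert-score

# Toy annotator ratings on the 0-3 scale (pairwise agreement between two raters).
ratings_a = [3, 2, 3, 3, 2, 3]
ratings_b = [3, 3, 3, 2, 2, 3]
print(f"Cohen's kappa: {cohen_kappa_score(ratings_a, ratings_b):.2f}")

# Toy BERTScore comparison between a generated question and a reference phrasing.
candidates = ["What did the character who lost the bet order at the coffee shop?"]
references = ["What did the character who lost the bet order at the cafe?"]
precision, recall, f1 = bert_score(candidates, references, lang="en")
print(f"BERTScore F1: {f1.mean().item():.2f}")
```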
5. Limitations and Domain Constraints
Despite its scale and procedural rigor, MVQ-60 carries several acknowledged limitations:
- Domain bias: Exclusivity to six mainstream English-language TV shows; other informal, documentary, educational, or non-scripted video genres are not represented.
- Reasoning depth: Only two-hop chains are included; deeper compositional reasoning (≥3 hops) is untested.
- Input modalities: Dataset input is restricted to visual frames and transcript; features such as raw audio, object detection, and external knowledge sources are omitted.
- Language: All content is English-only, so cross-lingual generalisation cannot be evaluated with this dataset.
- Error propagation: Any inaccuracies in zero-hop extraction potentially manifest in merged questions.
A plausible implication is that transferability and cross-modal robustness may be restricted, emphasizing the need for interpretive caution when generalizing results.
6. Research Applications and Prospective Extensions
MVQ-60 underpins several research trajectories:
- End-to-end multi-hop VideoQA model training, where the objective is answering as well as generating multi-hop questions over video.
- Benchmarking vision-language transformers on tasks requiring compositional, cross-segment reasoning.
- Pretraining or adaptation for downstream tasks encompassing video narrative understanding, retrieval, and dialogue.
Recommended directions for future dataset evolution include:
- Multilingual question generation to probe cross-lingual reasoning over video.
- Construction of higher-hop (k ≥ 3) chains via iterative merge logic, as sketched after this list.
- Expansion to diverse domains (e.g., lecture, sports, surveillance footage).
- Tri-modal input representations integrating video, transcript, and audio features.
- Explicit balancing by question type to support fair and granular evaluation.
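As a hedged sketch of the iterative merge logic mentioned above, the function below folds an ordered chain of zero-hop entries into a single k-hop question by repeated span substitution, reusing the `ZeroHopQA` container from the earlier sketch; applying the same per-hop alignment checks as the two-hop case is an assumption about how deeper chains would be constrained.

```python
from typing import List, Optional

def merge_k_hop(chain: List[ZeroHopQA]) -> Optional[str]:
    """Fold an ordered chain of zero-hop entries into a single k-hop question by
    repeatedly substituting the next question for its answer span, mirroring the
    two-hop merge."""
    question = chain[0].question
    for previous, nxt in zip(chain, chain[1:]):
        if previous.episode != nxt.episode or previous.segment == nxt.segment:
            return None                 # violates the assumed alignment criteria
        if nxt.answer not in question:
            return None                 # no span available for substitution
        question = question.replace(nxt.answer, f"[{nxt.question.rstrip('?')}]", 1)
    return question
```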
MVQ-60 is publicly accessible at https://github.com/AnupamPandey199949/VideoChain and represents, to date, the most rigorously constructed, large-scale benchmark for multi-hop question generation in the vision-language research field. Its algorithmic pipeline, annotation protocol, and resultant dataset properties collectively facilitate progress in multi-hop video reasoning and generation, while highlighting methodological considerations for future multimodal, multistep QA benchmarks.