Multi-hop Video Question Generation
- Multi-hop Video Question Generation (MVQG) is a task that creates open-ended questions requiring reasoning across multiple video segments using temporal and relational inference.
- It employs multimodal architectures, including transformer-based and modular models, to fuse visual embeddings with narrative summaries for enhanced context understanding.
- Evaluation through metrics like BLEU, METEOR, and human ratings demonstrates MVQG's effectiveness while highlighting future work in tri-modal fusion and extended temporal reasoning.
Multi-hop Video Question Generation (MVQG) is the task of generating open-ended, contextually grounded questions that require reasoning across multiple temporally separated frames or video segments. MVQG extends the scope of traditional question generation—where questions are based on a single image or a short video segment—to settings where questions must synthesize information that is distributed across longer sequences or distinct video events. MVQG necessitates not only multimodal understanding but also temporal, relational, and narrative inference, often utilizing intermediate representations such as "story summaries" or dialog context.
1. Dataset Construction and Annotation Protocol
Early MVQG work adapted multi-image settings to video, notably using the Multi-VQG (MVQG) dataset built atop VIST’s 5-image photo albums (Yeh et al., 2022). In this protocol, each video instance is modeled as a sequence of sampled consecutive frames. For video-based MVQG, a typical approach is to sample keyframes per segment or use a lightweight video encoder to extract segment-level embeddings.
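As a concrete illustration of the keyframe-sampling step, the sketch below uniformly spreads a small number of frame indices over a segment; the uniform strategy and the `frames_per_segment` value are illustrative assumptions rather than the protocol of either paper.

```python
import numpy as np

def sample_keyframes(num_frames: int, frames_per_segment: int = 5) -> np.ndarray:
    """Uniformly sample keyframe indices from a segment of `num_frames` frames.

    Mirrors the common MVQG preprocessing step of reducing each video
    segment to a small, temporally spread set of frames before encoding.
    """
    if num_frames <= frames_per_segment:
        return np.arange(num_frames)
    # Evenly spaced indices across the segment (endpoints included).
    return np.linspace(0, num_frames - 1, frames_per_segment).round().astype(int)

# Example: a 300-frame segment reduced to 5 keyframes.
print(sample_keyframes(300))   # [  0  75 150 224 299]
```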
A more scalable dataset is MVQ-60, constructed automatically from TVQA+, which contains 152,000 zero-hop QA pairs across 21,793 clips from six TV shows (Phukan et al., 11 Nov 2025). The MVQ-60 construction pipeline filters for brevity (questions ≤ 15 words, answers ≤ 3 words), merges QA pairs sharing episodes but distinct segments, and syntactically merges question templates to enforce multi-hop reasoning requirements. The MVQ-60 split is 80% train, 10% validation, and 10% test, with no episode overlap. Human evaluation yields high quality: fluency (2.92/3), reasoning (3.00/3), engagingness (2.80/3), and factuality (3.00/3).
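The construction steps described above can be pictured with a schematic pipeline like the following; the field names, the pairing rule, and the splicing template are hypothetical stand-ins for the actual MVQ-60 procedure.

```python
from itertools import combinations

def brevity_filter(qa_pairs, max_q_words=15, max_a_words=3):
    """Keep only short zero-hop QA pairs, per the MVQ-60 brevity criterion."""
    return [qa for qa in qa_pairs
            if len(qa["question"].split()) <= max_q_words
            and len(qa["answer"].split()) <= max_a_words]

def merge_multi_hop(qa_pairs):
    """Pair QA items from the same episode but different segments and splice
    them into a single multi-hop question (the template here is illustrative)."""
    merged = []
    for a, b in combinations(qa_pairs, 2):
        if a["episode"] == b["episode"] and a["segment"] != b["segment"]:
            q2 = b["question"].rstrip("?")
            question = f"{a['question'].rstrip('?')} when {q2[:1].lower() + q2[1:]}?"
            merged.append({"episode": a["episode"],
                           "segments": [a["segment"], b["segment"]],
                           "question": question,
                           "answer": a["answer"]})
    return merged
```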
Annotation protocols for multi-modal MVQG generally involve the following steps (a possible record layout is sketched after the list):
- Listing salient objects/events per sequence
- Writing a concise summary or story of observed events
- Generating one or more engaging, open-ended questions targeting events spanning multiple frames or segments
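A hypothetical record layout for one annotated instance, reflecting the three-step protocol above; all field names are illustrative.

```python
from dataclasses import dataclass, field

@dataclass
class MVQGAnnotation:
    """One annotated instance following the three-step protocol above."""
    segment_ids: list[str]                                    # frames/segments shown to the annotator
    salient_items: list[str] = field(default_factory=list)    # step 1: objects/events per sequence
    story_summary: str = ""                                    # step 2: concise narrative of events
    questions: list[str] = field(default_factory=list)         # step 3: open-ended multi-hop questions
```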
On the multi-image MVQG dataset, questions have a mean length of 10 tokens and a vocabulary size of 608, versus 360 for single-image VQG.
2. Formal Task Definition and Desired Properties
The MVQG task, given a sequence of frames or segments $V = (v_1, \dots, v_N)$, seeks to produce an engaging question $q$. The problem is formalized as:

$$q^{*} = \arg\max_{q} \; P(q \mid v_1, \dots, v_N)$$

and typically, $q$ is generated autoregressively:

$$P(q \mid V) = \prod_{t=1}^{|q|} P(q_t \mid q_{<t}, v_1, \dots, v_N)$$

Optionally, an intermediate summary or story $s$ is constructed and conditioned on:

$$s = f_{\text{story}}(V), \qquad q \sim P(\cdot \mid s, V)$$
The following constraints are imposed:
- The question $q$ must be open-ended, not a mere factoid
- $q$ must draw on cross-segment or cross-frame relations and high-level events
- For multi-hop, $q$ must not be answerable from a single segment alone
MVQ-60 questions average 27 words, substantially longer and structurally more complex than zero-hop QA pairs.
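The autoregressive factorization above can be read as a simple decoding loop over the whole segment sequence; the sketch below uses a placeholder step function in place of a trained decoder.

```python
def generate_question(segments, step_fn, max_len=30, eos="</s>"):
    """Greedy autoregressive decoding of q conditioned on ALL segments,
    realizing q_t ~ P(q_t | q_{<t}, v_1..v_N) from the factorization above.

    `step_fn(prefix, segments)` stands in for any trained decoder step and
    returns the next token (a placeholder for an actual model call).
    """
    prefix: list[str] = []
    for _ in range(max_len):
        token = step_fn(prefix, segments)
        if token == eos:
            break
        prefix.append(token)
    return " ".join(prefix)

# Toy stand-in "model" that ignores its inputs and emits a fixed question.
canned = iter("Why did the group decide to celebrate after the game ? </s>".split())
print(generate_question(["seg1", "seg2"], lambda p, s: next(canned)))
```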
3. Model Architectures and Multi-hop Reasoning
Two principal architecture paradigms dominate MVQG research.
End-to-end Transformer-based Models
MVQG adapts the VL-T5 backbone, inputting a task prompt and per-frame visual embeddings $e_1, \dots, e_N$ alongside semantic grounding tokens. Visual embeddings concatenate ROI features, positional embeddings, and object identifiers, followed by LayerNorm. The encoder employs self-attention across all frames:

$$H = \mathrm{SelfAttn}\big([\text{prompt};\, e_1; \dots; e_N]\big)$$

The decoder is a T5-style transformer that generates the question $q$ token by token.
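A minimal PyTorch sketch of the visual-token construction described above; dimensions are illustrative, and the three components are summed rather than concatenated for brevity, so this is not the exact VL-T5 implementation.

```python
import torch
import torch.nn as nn

class FrameVisualEmbedder(nn.Module):
    """Builds per-frame visual tokens from ROI features, bounding-box positions,
    and object identifiers, followed by LayerNorm (dimensions are illustrative)."""
    def __init__(self, roi_dim=2048, d_model=768, n_obj_classes=1600):
        super().__init__()
        self.roi_proj = nn.Linear(roi_dim, d_model)
        self.box_proj = nn.Linear(4, d_model)          # (x1, y1, x2, y2) box geometry
        self.obj_id = nn.Embedding(n_obj_classes, d_model)
        self.norm = nn.LayerNorm(d_model)

    def forward(self, roi_feats, boxes, obj_ids):
        # roi_feats: (B, frames, rois, roi_dim); boxes: (B, frames, rois, 4);
        # obj_ids: (B, frames, rois) integer object-class ids
        vis = self.roi_proj(roi_feats) + self.box_proj(boxes) + self.obj_id(obj_ids)
        return self.norm(vis)                          # (B, frames, rois, d_model)

embedder = FrameVisualEmbedder()
tokens = embedder(torch.randn(2, 5, 36, 2048),
                  torch.rand(2, 5, 36, 4),
                  torch.randint(0, 1600, (2, 5, 36)))
print(tokens.shape)   # torch.Size([2, 5, 36, 768])
```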
Modular/Multi-Stage Architectures
The VideoChain framework (Phukan et al., 11 Nov 2025) uses a modified BART-large backbone fused with a parallel video stream:
- Video Stream: VideoMAE encodes each segment into segment-level visual features
- Text Stream: BART’s token embeddings encode the transcript context
- Dual-Stream Encoder: each layer updates the visual and textual streams in parallel
- Cross-modal Fusion: cross-modal attention merges the two streams into a joint representation
Module 1 (zero-hop): generates a question about a single segment. Module 2 (multi-hop): takes the current segment, its transcript, and the previously generated question as input, and outputs a question linking the previous segment’s question to the current segment. This recursive design enables arbitrary hop lengths.
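The recursive Module 1 / Module 2 chaining can be summarized as follows; the two generator callables are placeholders for the trained modules, so this is a structural sketch rather than the VideoChain implementation.

```python
def chain_questions(segments, transcripts, gen_zero_hop, gen_multi_hop, hops):
    """Recursive question chaining in the spirit of the modular design above.

    gen_zero_hop(segment, transcript)           -> question about one segment
    gen_multi_hop(segment, transcript, prev_q)  -> question linking prev_q
                                                   to the new segment
    Both generators are placeholders for trained modules.
    """
    question = gen_zero_hop(segments[0], transcripts[0])
    chain = [question]
    for i in range(1, min(hops, len(segments))):
        question = gen_multi_hop(segments[i], transcripts[i], question)
        chain.append(question)
    return chain
```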
Multi-hop Reasoning Mechanisms
Both MVQG and VideoChain utilize stacked self-attention layers for temporal and relational reasoning ("story arcs"). In the dual-stage MVQG, a story-builder and question-generator (both T5-based) are chained. In VideoChain, cross-modal attention integrates visual and textual cues, while a modular decomposition ensures that each question step is grounded in both preceding context and new segment information.
Adapters can be inserted into pre-trained layers to allow lightweight domain adaptation. Memory-augmented layers (e.g., GNNs, recurrent modules) or explicit temporal position embeddings further capture extended temporal dependencies.
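A standard bottleneck-adapter sketch of the kind referred to here (Houlsby-style down-project / up-project with a residual connection); the hidden width follows the bottleneck range quoted later for adapter tuning and is otherwise an assumption.

```python
import torch
import torch.nn as nn

class BottleneckAdapter(nn.Module):
    """Bottleneck adapter: down-project, nonlinearity, up-project, residual add.
    Inserted after frozen pre-trained sublayers for lightweight domain adaptation."""
    def __init__(self, d_model=768, bottleneck=384):
        super().__init__()
        self.down = nn.Linear(d_model, bottleneck)
        self.up = nn.Linear(bottleneck, d_model)
        self.act = nn.GELU()

    def forward(self, hidden):
        return hidden + self.up(self.act(self.down(hidden)))
```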
4. Training Protocols and Losses
MVQG models typically employ the AdamW optimizer, batch sizes of 4–8, 50 epochs, and decoding via nucleus sampling or beam search (beam size 5) to favor fluency and diversity; MVQG and VideoChain use different learning rates.
Pretraining and adaptation strategies include:
- Pretraining: VL-T5 on VIST story completion, VQG, and VCR prior to MVQG fine-tuning
- Adapter tuning: Lightweight adapters at bottleneck dimensions (≈384–768) enable continual adaptation without catastrophic forgetting
VideoChain’s loss structure includes:
- Standard cross-entropy for the zero-hop module (Module 1): $\mathcal{L}_{\text{zero}} = -\sum_{t} \log P(q_t \mid q_{<t}, \text{segment}, \text{transcript})$
- Composite cross-entropy plus an alignment term for the multi-hop module (Module 2): $\mathcal{L}_{\text{multi}} = \mathcal{L}_{\text{CE}} + \lambda\,\mathcal{L}_{\text{align}}$
Hyperparameters for VideoChain: 2 × Tesla T4 GPUs, 8 hours total training, and FP16 mixed precision.
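A hedged PyTorch sketch of the composite multi-hop objective, token-level cross-entropy plus a weighted cross-modal alignment term; the cosine-based alignment and the weight `lam` are assumptions, since the exact formulation is not spelled out here.

```python
import torch
import torch.nn.functional as F

def multi_hop_loss(logits, target_ids, text_repr, video_repr, lam=0.5, pad_id=0):
    """Composite loss: token-level cross-entropy plus a cross-modal alignment
    term (cosine distance between pooled text and video representations).
    The alignment formulation and the `lam` weight are illustrative assumptions."""
    # logits: (B, T, vocab); target_ids: (B, T); text/video_repr: (B, len, D)
    ce = F.cross_entropy(logits.transpose(1, 2), target_ids, ignore_index=pad_id)
    align = 1.0 - F.cosine_similarity(text_repr.mean(dim=1),
                                      video_repr.mean(dim=1)).mean()
    return ce + lam * align
```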
5. Automatic and Human Evaluation Metrics
MVQG and VideoChain both report comprehensive metric suites. For MVQG:
- BLEU-1 ≈ 42.7, BLEU-4 ≈ 4.8, METEOR ≈ 41.8, BLEURT ≈ –42.2 (VL-T5)
- Human evaluation (5 criteria): the top model (VL-T5) wins the “rank-1” position in 35–51% of comparisons.
VideoChain’s metrics on MVQ-60:
- BERTScore-F1 = 0.7967
- Semantic similarity = 0.8110
- ROUGE-1 = 0.6854, ROUGE-L = 0.6454
- BLEU-1 = 0.6711
- Distinct-1 = 0.7911, Distinct-2 = 0.9850
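For reference, a few of the automatic metrics listed above can be computed with standard tooling (nltk, rouge-score), with Distinct-n implemented directly; this is a generic sketch, not the evaluation script of either paper.

```python
from nltk.translate.bleu_score import sentence_bleu   # pip install nltk
from rouge_score import rouge_scorer                  # pip install rouge-score

def distinct_n(questions, n):
    """Distinct-n: ratio of unique n-grams to total n-grams over the generated set."""
    grams = [tuple(toks[i:i + n])
             for q in questions
             for toks in [q.split()]
             for i in range(len(toks) - n + 1)]
    return len(set(grams)) / max(len(grams), 1)

hyp = "what is monica cooking for the party"
ref = "what dish is monica cooking for the party"
bleu1 = sentence_bleu([ref.split()], hyp.split(), weights=(1, 0, 0, 0))
rouge = rouge_scorer.RougeScorer(["rouge1", "rougeL"]).score(ref, hyp)
print(round(bleu1, 3), round(rouge["rougeL"].fmeasure, 3),
      round(distinct_n([hyp, ref], 1), 3))
```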
Human evaluation and GPT-5 Nano agree closely; VideoChain outperforms all zero-shot and ECIS baseline models on fluency, relevance, multi-hop reasoning, and engagingness.
Ablation studies in VideoChain highlight the necessity of both visual grounding (video-relatedness) and modular decomposition. Removing either results in significant drops in multi-hop question quality and overall evaluation scores.
| Model Variant | Fluency | Video-Relatedness | Multi-Hop | Factuality | Overall Score |
|---|---|---|---|---|---|
| Full model | 2.81 | 2.91 | 2.85 | 2.92 | 1.00 |
| Text-only | 2.66 | 2.09 | 1.54 | 2.36 | 0.31 |
| Single-component | 2.31 | 2.39 | 1.24 | 1.98 | 0.24 |
Qualitative examples include generations where only the full multi-hop model correctly links entities and events across segments, confirming the necessity of temporal cross-segment inference.
6. Qualitative Examples and Analysis
- Case: "Friends S02E01 seg02" (“Monica is cooking.”). Module 1: "What is Monica cooking?" Case: “seg04” (“Chandler asks about Monica’s cooking.”). Module 2: "What is Chandler’s wife cooking?"
On the MVQG image-based task, explicit use of “story summaries” in dual-stage models (STY2Q) yields more specific, event-centric questions. Multi-frame ablation demonstrates that removing supposedly “most relevant” frames causes performance drops, indicating genuine multi-hop use.
Error analyses in VideoChain reveal that continued prompt refinement and post-processing rules reduce multi-hop reasoning failures from 24% to 6% and knowledge leaks from 32% to 8%.
7. Limitations and Future Directions
MVQG datasets often use MTurk annotations, which may be more formal and less spontaneous than real-world dialog, potentially affecting model engagement scores. Human evaluation may not fully capture in-the-wild user engagement.
- Extension to video: Sample keyframes or employ video encoders (e.g., VideoMAE, 3D CNNs, pretrained video transformers). Incorporate explicit temporal position embeddings and memory-augmented layers (GNNs), or recurrent multi-hop modules for long-event reasoning.
- Tri-modal input: Integrate non-dialog audio through audio embeddings.
- Joint training: End-to-end optimization of modular components to reduce error propagation.
- Multilinguality: Employ mBART/T5 for non-English question generation.
- Domain adaptation: Transfer learning for educational, surveillance, and instructional content.
- Vision-LLMs: Incorporate advanced architectures (Flamingo, VideoBERT, Vid2Seq) for richer temporal grounding.
- Evaluation: Combine automatic metrics (BLEU, METEOR, ROUGE, BERTScore) with human studies measuring real-world reply rates.
Practical recommendations for future MVQG systems include the use of story summaries (either generated or from subtitles/ASR), pretraining on large-scale video caption/story datasets, adapter layers for image-to-video transfer, and a multi-pronged evaluation suite.
Conclusion
Multi-hop Video Question Generation provides a rigorous testbed for temporal and relational reasoning in multimodal AI systems. Both the Multi-VQG and VideoChain frameworks demonstrate that constructing internal narrative representations across frame or segment sequences substantially improves the specificity, coherence, and engagingness of generated questions. Current research emphasizes story-centric architectures, modular multi-hop reasoning, and comprehensive evaluation, with ongoing work exploring tri-modal fusion, domain and language adaptation, and improved user engagement assessment (Yeh et al., 2022, Phukan et al., 11 Nov 2025).