Multi-hop Video Question Generation
- Multi-hop Video Question Generation (MVQG) is a task that creates open-ended questions requiring reasoning across multiple video segments using temporal and relational inference.
- It employs multimodal architectures, including transformer-based and modular models, to fuse visual embeddings with narrative summaries for enhanced context understanding.
- Evaluation through metrics like BLEU, METEOR, and human ratings demonstrates MVQG's effectiveness while highlighting future work in tri-modal fusion and extended temporal reasoning.
Multi-hop Video Question Generation (MVQG) is the task of generating open-ended, contextually grounded questions that require reasoning across multiple temporally separated frames or video segments. MVQG extends the scope of traditional question generation—where questions are based on a single image or a short video segment—to settings where questions must synthesize information that is distributed across longer sequences or distinct video events. MVQG necessitates not only multimodal understanding but also temporal, relational, and narrative inference, often utilizing intermediate representations such as "story summaries" or dialog context.
1. Dataset Construction and Annotation Protocol
Early MVQG work adapted multi-image settings to video, notably using the Multi-VQG (MVQG) dataset built atop VIST’s 5-image photo albums (Yeh et al., 2022). In this protocol, each video instance is modeled as a sequence of sampled consecutive frames. For video-based MVQG, a typical approach is to sample keyframes per segment or use a lightweight video encoder to extract segment-level embeddings.
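As a concrete illustration of the keyframe-sampling step, the sketch below uniformly spreads a small number of frame indices over a segment; the uniform strategy and the `frames_per_segment` value are illustrative assumptions rather than the protocol of either paper.

```python
import numpy as np

def sample_keyframes(num_frames: int, frames_per_segment: int = 5) -> np.ndarray:
    """Uniformly sample keyframe indices from a segment of `num_frames` frames.

    Mirrors the common MVQG preprocessing step of reducing each video
    segment to a small, temporally spread set of frames before encoding.
    """
    if num_frames <= frames_per_segment:
        return np.arange(num_frames)
    # Evenly spaced indices across the segment (endpoints included).
    return np.linspace(0, num_frames - 1, frames_per_segment).round().astype(int)

# Example: a 300-frame segment reduced to 5 keyframes.
print(sample_keyframes(300))   # [  0  75 150 224 299]
```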
A more scalable dataset is MVQ-60, constructed automatically from TVQA+, which contains 152,000 zero-hop QA pairs across 21,793 clips from six TV shows (Phukan et al., 11 Nov 2025). The MVQ-60 construction pipeline filters for brevity (questions ≤ 15 words, answers ≤ 3 words), merges QA pairs sharing episodes but distinct segments, and syntactically merges question templates to enforce multi-hop reasoning requirements. The MVQ-60 split is 80% train, 10% validation, and 10% test, with no episode overlap. Human evaluation yields high quality: fluency (2.92/3), reasoning (3.00/3), engagingness (2.80/3), and factuality (3.00/3).
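The construction steps described above can be pictured with a schematic pipeline like the following; the field names, the pairing rule, and the splicing template are hypothetical stand-ins for the actual MVQ-60 procedure.

```python
from itertools import combinations

def brevity_filter(qa_pairs, max_q_words=15, max_a_words=3):
    """Keep only short zero-hop QA pairs, per the MVQ-60 brevity criterion."""
    return [qa for qa in qa_pairs
            if len(qa["question"].split()) <= max_q_words
            and len(qa["answer"].split()) <= max_a_words]

def merge_multi_hop(qa_pairs):
    """Pair QA items from the same episode but different segments and splice
    them into a single multi-hop question (the template here is illustrative)."""
    merged = []
    for a, b in combinations(qa_pairs, 2):
        if a["episode"] == b["episode"] and a["segment"] != b["segment"]:
            q2 = b["question"].rstrip("?")
            question = f"{a['question'].rstrip('?')} when {q2[:1].lower() + q2[1:]}?"
            merged.append({"episode": a["episode"],
                           "segments": [a["segment"], b["segment"]],
                           "question": question,
                           "answer": a["answer"]})
    return merged
```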
Annotation protocols for multi-modal MVQG generally involve the following steps (a possible record layout is sketched after the list):
- Listing salient objects/events per sequence
- Writing a concise summary or story of observed events
- Generating one or more engaging, open-ended questions targeting events spanning multiple frames or segments
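A hypothetical record layout for one annotated instance, reflecting the three-step protocol above; all field names are illustrative.

```python
from dataclasses import dataclass, field

@dataclass
class MVQGAnnotation:
    """One annotated instance following the three-step protocol above."""
    segment_ids: list[str]                                    # frames/segments shown to the annotator
    salient_items: list[str] = field(default_factory=list)    # step 1: objects/events per sequence
    story_summary: str = ""                                    # step 2: concise narrative of events
    questions: list[str] = field(default_factory=list)         # step 3: open-ended multi-hop questions
```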
On the multi-image MVQG dataset, questions have a mean length of 10 tokens and a vocabulary size of 608, versus 360 for single-image VQG.
2. Formal Task Definition and Desired Properties
The MVQG task, given a sequence of frames or segments $V = (v_1, \dots, v_N)$, seeks to produce an engaging question $q$. The problem is formalized as:

$$q^{*} = \arg\max_{q} \; P(q \mid v_1, \dots, v_N)$$

and typically, $q$ is generated autoregressively:

$$P(q \mid V) = \prod_{t=1}^{|q|} P(q_t \mid q_{<t}, v_1, \dots, v_N)$$

Optionally, an intermediate summary or story $s$ is constructed and conditioned on:

$$s = f_{\text{story}}(V), \qquad q \sim P(\cdot \mid s, V)$$
The following constraints are imposed:
- The question $q$ must be open-ended, not a mere factoid
- $q$ must draw on cross-segment or cross-frame relations and high-level events
- For multi-hop, $q$ must not be answerable from a single segment alone
MVQ-60 questions average 27 words, substantially longer and structurally more complex than zero-hop QA pairs.
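The autoregressive factorization above can be read as a simple decoding loop over the whole segment sequence; the sketch below uses a placeholder step function in place of a trained decoder.

```python
def generate_question(segments, step_fn, max_len=30, eos="</s>"):
    """Greedy autoregressive decoding of q conditioned on ALL segments,
    realizing q_t ~ P(q_t | q_{<t}, v_1..v_N) from the factorization above.

    `step_fn(prefix, segments)` stands in for any trained decoder step and
    returns the next token (a placeholder for an actual model call).
    """
    prefix: list[str] = []
    for _ in range(max_len):
        token = step_fn(prefix, segments)
        if token == eos:
            break
        prefix.append(token)
    return " ".join(prefix)

# Toy stand-in "model" that ignores its inputs and emits a fixed question.
canned = iter("Why did the group decide to celebrate after the game ? </s>".split())
print(generate_question(["seg1", "seg2"], lambda p, s: next(canned)))
```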
3. Model Architectures and Multi-hop Reasoning
Two principal architecture paradigms dominate MVQG research.
End-to-end Transformer-based Models
MVQG adapts the VL-T5 backbone, inputting a task prompt and per-frame visual embeddings $e_1, \dots, e_N$ alongside semantic grounding tokens. Visual embeddings concatenate ROI features, positional embeddings, and object identifiers, followed by LayerNorm. The encoder employs self-attention across all frames:

$$H = \mathrm{SelfAttn}\big([\text{prompt};\, e_1; \dots; e_N]\big)$$

The decoder is a T5-style transformer that generates the question $q$ token by token.
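A minimal PyTorch sketch of the visual-token construction described above; dimensions are illustrative, and the three components are summed rather than concatenated for brevity, so this is not the exact VL-T5 implementation.

```python
import torch
import torch.nn as nn

class FrameVisualEmbedder(nn.Module):
    """Builds per-frame visual tokens from ROI features, bounding-box positions,
    and object identifiers, followed by LayerNorm (dimensions are illustrative)."""
    def __init__(self, roi_dim=2048, d_model=768, n_obj_classes=1600):
        super().__init__()
        self.roi_proj = nn.Linear(roi_dim, d_model)
        self.box_proj = nn.Linear(4, d_model)          # (x1, y1, x2, y2) box geometry
        self.obj_id = nn.Embedding(n_obj_classes, d_model)
        self.norm = nn.LayerNorm(d_model)

    def forward(self, roi_feats, boxes, obj_ids):
        # roi_feats: (B, frames, rois, roi_dim); boxes: (B, frames, rois, 4);
        # obj_ids: (B, frames, rois) integer object-class ids
        vis = self.roi_proj(roi_feats) + self.box_proj(boxes) + self.obj_id(obj_ids)
        return self.norm(vis)                          # (B, frames, rois, d_model)

embedder = FrameVisualEmbedder()
tokens = embedder(torch.randn(2, 5, 36, 2048),
                  torch.rand(2, 5, 36, 4),
                  torch.randint(0, 1600, (2, 5, 36)))
print(tokens.shape)   # torch.Size([2, 5, 36, 768])
```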
Modular/Multi-Stage Architectures
The VideoChain framework (Phukan et al., 11 Nov 2025) uses a modified BART-large backbone fused with a parallel video stream:
- Video Stream: VideoMAE encodes each segment into segment-level visual features
- Text Stream: BART’s token embeddings encode the transcript context
- Dual-Stream Encoder: each layer updates the visual and textual streams in parallel
- Cross-modal Fusion: cross-modal attention merges the two streams into a joint representation
Module 1 (zero-hop): generates a question about a single segment. Module 2 (multi-hop): takes the current segment, its transcript, and the previously generated question as input, and outputs a question linking the previous segment’s question to the current segment. This recursive design enables arbitrary hop lengths.
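The recursive Module 1 / Module 2 chaining can be summarized as follows; the two generator callables are placeholders for the trained modules, so this is a structural sketch rather than the VideoChain implementation.

```python
def chain_questions(segments, transcripts, gen_zero_hop, gen_multi_hop, hops):
    """Recursive question chaining in the spirit of the modular design above.

    gen_zero_hop(segment, transcript)           -> question about one segment
    gen_multi_hop(segment, transcript, prev_q)  -> question linking prev_q
                                                   to the new segment
    Both generators are placeholders for trained modules.
    """
    question = gen_zero_hop(segments[0], transcripts[0])
    chain = [question]
    for i in range(1, min(hops, len(segments))):
        question = gen_multi_hop(segments[i], transcripts[i], question)
        chain.append(question)
    return chain
```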
Multi-hop Reasoning Mechanisms
Both MVQG and VideoChain utilize stacked self-attention layers for temporal and relational reasoning ("story arcs"). In the dual-stage MVQG, a story-builder and question-generator (both T5-based) are chained. In VideoChain, cross-modal attention integrates visual and textual cues, while a modular decomposition ensures that each question step is grounded in both preceding context and new segment information.
Adapters can be inserted into pre-trained layers to allow lightweight domain adaptation. Memory-augmented layers (e.g., GNNs, recurrent modules) or explicit temporal position embeddings further capture extended temporal dependencies.
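A standard bottleneck-adapter sketch of the kind referred to here (Houlsby-style down-project / up-project with a residual connection); the hidden width follows the bottleneck range quoted later for adapter tuning and is otherwise an assumption.

```python
import torch
import torch.nn as nn

class BottleneckAdapter(nn.Module):
    """Bottleneck adapter: down-project, nonlinearity, up-project, residual add.
    Inserted after frozen pre-trained sublayers for lightweight domain adaptation."""
    def __init__(self, d_model=768, bottleneck=384):
        super().__init__()
        self.down = nn.Linear(d_model, bottleneck)
        self.up = nn.Linear(bottleneck, d_model)
        self.act = nn.GELU()

    def forward(self, hidden):
        return hidden + self.up(self.act(self.down(hidden)))
```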
4. Training Protocols and Losses
MVQG models typically employ the AdamW optimizer, batch sizes of 4–8, 50 epochs, and decoding via nucleus sampling or beam search (beam size 5) to favor fluency and diversity; MVQG and VideoChain use different learning rates.
Pretraining and adaptation strategies include:
- Pretraining: VL-T5 on VIST story completion, VQG, and VCR prior to MVQG fine-tuning
- Adapter tuning: Lightweight adapters at bottleneck dimensions (≈384–768) enable continual adaptation without catastrophic forgetting
VideoChain’s loss structure includes:
- Standard cross-entropy for the zero-hop module (Module 1): $\mathcal{L}_{\text{zero}} = -\sum_{t} \log P(q_t \mid q_{<t}, \text{segment}, \text{transcript})$
- Composite cross-entropy plus an alignment term for the multi-hop module (Module 2): $\mathcal{L}_{\text{multi}} = \mathcal{L}_{\text{CE}} + \lambda\,\mathcal{L}_{\text{align}}$
Hyperparameters for VideoChain: 2 × Tesla T4 GPUs, 8 hours total training, and FP16 mixed precision.
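A hedged PyTorch sketch of the composite multi-hop objective, token-level cross-entropy plus a weighted cross-modal alignment term; the cosine-based alignment and the weight `lam` are assumptions, since the exact formulation is not spelled out here.

```python
import torch
import torch.nn.functional as F

def multi_hop_loss(logits, target_ids, text_repr, video_repr, lam=0.5, pad_id=0):
    """Composite loss: token-level cross-entropy plus a cross-modal alignment
    term (cosine distance between pooled text and video representations).
    The alignment formulation and the `lam` weight are illustrative assumptions."""
    # logits: (B, T, vocab); target_ids: (B, T); text/video_repr: (B, len, D)
    ce = F.cross_entropy(logits.transpose(1, 2), target_ids, ignore_index=pad_id)
    align = 1.0 - F.cosine_similarity(text_repr.mean(dim=1),
                                      video_repr.mean(dim=1)).mean()
    return ce + lam * align
```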
5. Automatic and Human Evaluation Metrics
MVQG and VideoChain both report comprehensive metric suites. For MVQG:
- BLEU-1 ≈ 42.7, BLEU-4 ≈ 4.8, METEOR ≈ 41.8, BLEURT ≈ –42.2 (VL-T5)
- Human evaluation (5 criteria): the top model (VL-T5) wins the “rank-1” position in 35–51% of comparisons.
VideoChain’s metrics on MVQ-60:
- BERTScore-F1 = 0.7967
- Semantic similarity = 0.8110
- ROUGE-1 = 0.6854, ROUGE-L = 0.6454
- BLEU-1 = 0.6711
- Distinct-1 = 0.7911, Distinct-2 = 0.9850
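For reference, a few of the automatic metrics listed above can be computed with standard tooling (nltk, rouge-score), with Distinct-n implemented directly; this is a generic sketch, not the evaluation script of either paper.

```python
from nltk.translate.bleu_score import sentence_bleu   # pip install nltk
from rouge_score import rouge_scorer                  # pip install rouge-score

def distinct_n(questions, n):
    """Distinct-n: ratio of unique n-grams to total n-grams over the generated set."""
    grams = [tuple(toks[i:i + n])
             for q in questions
             for toks in [q.split()]
             for i in range(len(toks) - n + 1)]
    return len(set(grams)) / max(len(grams), 1)

hyp = "what is monica cooking for the party"
ref = "what dish is monica cooking for the party"
bleu1 = sentence_bleu([ref.split()], hyp.split(), weights=(1, 0, 0, 0))
rouge = rouge_scorer.RougeScorer(["rouge1", "rougeL"]).score(ref, hyp)
print(round(bleu1, 3), round(rouge["rougeL"].fmeasure, 3),
      round(distinct_n([hyp, ref], 1), 3))
```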
Human evaluation and GPT-5 Nano agree closely; VideoChain outperforms all zero-shot and ECIS baseline models on fluency, relevance, multi-hop reasoning, and engagingness.
Ablation studies in VideoChain highlight the necessity of both visual grounding (video-relatedness) and modular decomposition. Removing either results in significant drops in multi-hop question quality and overall evaluation scores.
| Model Variant | Fluency | Video-Relatedness | Multi-Hop | Factuality | Overall Score |
|---|---|---|---|---|---|
| Full model | 2.81 | 2.91 | 2.85 | 2.92 | 1.00 |
| Text-only | 2.66 | 2.09 | 1.54 | 2.36 | 0.31 |
| Single-component | 2.31 | 2.39 | 1.24 | 1.98 | 0.24 |
Qualitative examples include generations where only the full multi-hop model correctly links entities and events across segments, confirming the necessity of temporal cross-segment inference.
6. Qualitative Examples and Analysis
- Case: "Friends S02E01 seg02" (“Monica is cooking.”). Module 1: "What is Monica cooking?" Case: “seg04” (“Chandler asks about Monica’s cooking.”). Module 2: "What is Chandler’s wife cooking?"
On the MVQG image-based task, explicit use of “story summaries” in dual-stage models (STY2Q) yields more specific, event-centric questions. Multi-frame ablation demonstrates that removing supposedly “most relevant” frames causes performance drops, indicating genuine multi-hop use.
Error analyses in VideoChain reveal that continued prompt refinement and post-processing rules reduce multi-hop reasoning failures from 24% to 6% and knowledge leaks from 32% to 8%.
7. Limitations and Future Directions
MVQG datasets often use MTurk annotations, which may be more formal and less spontaneous than real-world dialog, potentially affecting model engagement scores. Human evaluation may not fully capture in-the-wild user engagement.
- Extension to video: Sample keyframes or employ video encoders (e.g., VideoMAE, 3D CNNs, pretrained video transformers). Incorporate explicit temporal position embeddings and memory-augmented layers (GNNs), or recurrent multi-hop modules for long-event reasoning.
- Tri-modal input: Integrate non-dialog audio through audio embeddings.
- Joint training: End-to-end optimization of modular components to reduce error propagation.
- Multilinguality: Employ mBART/T5 for non-English question generation.
- Domain adaptation: Transfer learning for educational, surveillance, and instructional content.
- Vision-LLMs: Incorporate advanced architectures (Flamingo, VideoBERT, Vid2Seq) for richer temporal grounding.
- Evaluation: Combine automatic metrics (BLEU, METEOR, ROUGE, BERTScore) with human studies measuring real-world reply rates.
Practical recommendations for future MVQG systems include the use of story summaries (either generated or from subtitles/ASR), pretraining on large-scale video caption/story datasets, adapter layers for image-to-video transfer, and a multi-pronged evaluation suite.
Conclusion
Multi-hop Video Question Generation provides a rigorous testbed for temporal and relational reasoning in multimodal AI systems. Both the Multi-VQG and VideoChain frameworks demonstrate that constructing internal narrative representations across frame or segment sequences substantially improves the specificity, coherence, and engagingness of generated questions. Current research emphasizes story-centric architectures, modular multi-hop reasoning, and comprehensive evaluation, with ongoing work exploring tri-modal fusion, domain and language adaptation, and improved user engagement assessment (Yeh et al., 2022, Phukan et al., 11 Nov 2025).