Interactive Video Corpus Retrieval
- Interactive Video Corpus Retrieval (IVCR) is a dynamic paradigm that retrieves video segments using multi-turn dialogue and multimodal feedback.
- It employs ensemble scoring, adaptive search heuristics, and LLM-driven explanations to refine queries and achieve higher retrieval accuracy.
- IVCR systems integrate user feedback and explainable rationales, enabling faster, personalized searches with measurable performance gains.
Interactive Video Corpus Retrieval (IVCR) is the paradigm of retrieving video segments or complete videos from large-scale corpora via dynamic, multi-turn user-system interactions. IVCR generalizes traditional one-shot video and moment retrieval by embedding ongoing dialog, iterative refinement, and multimodal feedback within the search process. Recent work defines IVCR as a setting where systems not only process complex or evolving natural language queries but also adapt to user clarification, provide explainable retrieval rationales, and operate at both video- and moment-localization granularity (Han et al., 1 Dec 2025). This field addresses growing user demand for conversational, personalized, and deeply interactive search over video databases.
1. Task Formulation and Problem Setting
The formal IVCR task couples multi-turn dialog with both corpus-level and segment-level retrieval. At turn $t$, the system receives the dialog history $H_t$ and returns either a ranked video list $\mathcal{V}_t$ or a timestamped moment $(t_s, t_e)$ within a selected video. The key requirements are:
- Multi-turn context: All system outputs must condition on the full dialog history.
- Mode switching: Systems must support transitions between whole-video retrieval and moment-level localization within videos, as well as “analogous” queries (e.g., “find something like this scene”).
- Explanation: Each output must include a human-readable explanation of relevance or retrieval rationale.
- User feedback integration: IVCR supports natural-language clarifications and corrections from users over the course of a session, allowing iterative query refinement.
Mathematically, if $H_{t-1}$ is the user history prior to turn $t$, the system infers an intent $a_t$ (video, moment, or dialogue), selects or ranks candidate videos by a fusion function $f(q_t, H_{t-1}, v)$, and, for moment retrieval, additionally predicts start and end times $(t_s, t_e)$ via a localization score $g(q_t, H_{t-1}, v, t_s, t_e)$ (Han et al., 1 Dec 2025, Li et al., 2022). This joint modeling distinguishes IVCR from both static retrieval and pipeline-based video QA.
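For concreteness, the per-turn decision process implied by this formulation can be sketched as follows. This is a hypothetical schematic, not the pipeline of any cited system: classify_intent, score_video, and score_moment are placeholder callables standing in for the intent classifier, the fusion function $f$, and the localization score $g$, and the "keyframe_times" field is an assumed video attribute.

```python
def ivcr_turn(query, history, corpus, classify_intent, score_video, score_moment, top_k=10):
    """One IVCR turn: infer intent from the dialog history, then retrieve or localize.

    Placeholder callables (assumptions, not a cited implementation):
      classify_intent(q, H)            -> "video" | "moment" | "dialogue"
      score_video(q, H, v)             ~  fusion function f(q_t, H_{t-1}, v)
      score_moment(q, H, v, t_s, t_e)  ~  localization score g(q_t, H_{t-1}, v, t_s, t_e)
    """
    intent = classify_intent(query, history)
    if intent == "dialogue":
        return {"intent": intent}  # clarification / chit-chat handled by the dialog module

    # Rank the corpus with the history-conditioned fusion score.
    ranked = sorted(corpus, key=lambda v: score_video(query, history, v), reverse=True)
    if intent == "video":
        return {"intent": intent, "videos": ranked[:top_k]}

    # Moment retrieval: choose the best (start, end) span inside the top-ranked video.
    video = ranked[0]
    times = video["keyframe_times"]  # assumed field: candidate boundary timestamps
    spans = [(s, e) for s in times for e in times if e > s]
    t_s, t_e = max(spans, key=lambda span: score_moment(query, history, video, *span))
    return {"intent": intent, "video": video, "moment": (t_s, t_e)}
```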
2. Core Methodological Advances
A range of architectures has been introduced for IVCR, with modern systems relying on multi-modal encoders, adaptive search heuristics, ensemble scoring, and retrieval-augmented generation:
- Multi-Granularity Ensemble Models: The combination of coarse-grained (e.g., CLIP) and fine-grained (e.g., BEiT-3) vision-text models enables robust handling of both global scene semantics and local visual details. Ensemble scoring is achieved via weighted sums of normalized per-frame similarity scores, where weights are typically tuned via grid search on held-out data; late-fusion approaches support modular extensibility (Tran et al., 11 Apr 2025). An ensemble-scoring sketch follows this list.
- Storage and Efficiency Optimization: Scene detection (TransNetV2) and keyframe deduplication (cosine similarity or pHash) aggressively reduce index size with only negligible retrieval accuracy loss, substantially increasing search speed (Tran et al., 11 Apr 2025, Nguyen-Nhu et al., 12 Apr 2025). A deduplication sketch also follows this list.
- Contextual and Temporal Reranking: Neighbor score aggregation, global descriptor pooling (SuperGlobal, GeM), and moment-boundary refinement via adaptive bidirectional search (ABTS) or dual-query localization stabilize retrieval performance and support precise, interpretable segment selection (Nguyen-Nhu et al., 12 Apr 2025, Tran et al., 11 Apr 2025).
- Interactive Question Generation: Multimodal or purely textual question generators (BART, T0++) trained with information-guided supervision (IGS) are critical for dialog-based disambiguation, maximizing expected ranking gain after each answer (Madasu et al., 2022, Liang et al., 2023, Maeoki et al., 2019).
- Reinforcement and Relevance Feedback: Some systems directly optimize the sequential policy via reinforcement learning (A2C or MCTS), planning search graph traversals based on user-provided or simulated feedback (Ma et al., 2023). Classic Rocchio-style relevance feedback is also widely implemented for iterative query updating (Halima et al., 2013, Duan et al., 21 Mar 2025).
- Multi-turn LLM Integration: Recent frameworks (InterLLaVA) leverage LLaMA-based text encoders, CLIP/EVA-based visual encoders, and cross-attention fusion modules to generate not only accurate retrievals but also detailed explanations and dialog responses across multi-turn sessions (Han et al., 1 Dec 2025). Retrieval-augmented generation (RAG) approaches are also extended to the video domain, directly feeding retrieved frames and subtitles into generative LVLMs (Jeong et al., 10 Jan 2025).
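As a concrete illustration of the late-fusion ensemble scoring noted above, the sketch below combines per-frame similarities from two embedding models via min-max normalization and a weighted sum. The two-model setup mirrors the coarse+fine (e.g., CLIP + BEiT-3) recipe only schematically; the variable names, example weights, and normalization choice are illustrative assumptions, not the exact procedure of (Tran et al., 11 Apr 2025).

```python
import numpy as np

def minmax(x, eps=1e-8):
    """Normalize a score vector to [0, 1] so the two models are comparable."""
    x = np.asarray(x, dtype=np.float32)
    return (x - x.min()) / (x.max() - x.min() + eps)

def ensemble_frame_scores(coarse_scores, fine_scores, w_coarse=0.5, w_fine=0.5):
    """Late fusion: weighted sum of normalized per-frame similarity scores.

    coarse_scores, fine_scores: query-to-frame cosine similarities from the
    coarse-grained and fine-grained encoders (same frame ordering).
    Weights would typically be tuned by grid search on held-out data.
    """
    return w_coarse * minmax(coarse_scores) + w_fine * minmax(fine_scores)

# Example: rank frames of one video under a 0.6/0.4 weighting.
coarse = [0.21, 0.35, 0.30, 0.28]   # e.g., CLIP similarities
fine   = [0.55, 0.48, 0.61, 0.40]   # e.g., BEiT-3 similarities
fused = ensemble_frame_scores(coarse, fine, w_coarse=0.6, w_fine=0.4)
print(np.argsort(-fused))           # frame indices, best first
```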
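The keyframe-deduplication step from the storage-optimization bullet can likewise be sketched as a greedy cosine-similarity filter over frame embeddings. The 0.95 threshold and the greedy keep-first strategy are illustrative assumptions; the cited systems may instead use pHash or different cutoffs.

```python
import numpy as np

def deduplicate_keyframes(embeddings, threshold=0.95):
    """Greedy keyframe deduplication: keep a frame only if its cosine
    similarity to every already-kept frame is below the threshold.

    embeddings: (n_frames, d) array of frame embeddings in temporal order.
    Returns the indices of the retained keyframes.
    """
    emb = np.asarray(embeddings, dtype=np.float32)
    emb = emb / (np.linalg.norm(emb, axis=1, keepdims=True) + 1e-8)
    kept = []
    for i, e in enumerate(emb):
        if all(float(e @ emb[j]) < threshold for j in kept):
            kept.append(i)
    return kept

# Example: a near-duplicate consecutive frame collapses into its predecessor.
frames = np.random.randn(100, 256)
frames[1] = frames[0] + 0.01 * np.random.randn(256)   # near-duplicate of frame 0
print(len(deduplicate_keyframes(frames)))              # ~99 keyframes retained
```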
3. Dataset Resources and Benchmarking
The IVCR research community has articulated the need for large-scale, multi-turn, and semantically rich datasets to benchmark interactive retrieval:
- IVCR-200K (Han et al., 1 Dec 2025): Contains 12,516 videos, 201,631 user turns, and bilingual (English/Chinese) conversational data with both whole-video and moment-retrieval annotations. Dialogs cover Long2Short, Short2Long, and analogous multi-turn search paradigms. Explanations and intent labels are included at every turn.
- MedVidCQA (Li et al., 2022): Suited for visual answer localization in instructional corpora; supports precise evaluation of segment-level retrieval and span localization.
- AVSD, MSR-VTT, MSVD, DiDeMo, TVR: Earlier datasets focused on dialog (AVSD), single-shot retrieval, or moment annotation; these have been used as source corpora or for cross-domain validation in interactive retrieval studies (Madasu et al., 2022, Liang et al., 2023, Maeoki et al., 2019, Ma et al., 2023).
Evaluation metrics include Recall@K for video and segment retrieval, mean/median rank, BLEU-4 and relevance scoring for dialog quality, and intersection-over-union (IoU) for moment localization (Han et al., 1 Dec 2025, Li et al., 2022). Protocols increasingly stress the importance of multi-turn improvement curves (e.g., Recall@1 rising sharply over 3–7 dialog rounds) and the efficacy of explanation generation.
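For reference, the snippet below shows the standard definitions of Recall@K and temporal IoU used in these protocols; it is a generic sketch rather than the exact evaluation code of any cited benchmark.

```python
def recall_at_k(ranked_ids, relevant_id, k):
    """1 if the ground-truth item appears in the top-k ranked list, else 0;
    averaged over queries this yields Recall@K."""
    return int(relevant_id in ranked_ids[:k])

def temporal_iou(pred, gt):
    """Intersection-over-union of two time spans (start, end) in seconds."""
    inter = max(0.0, min(pred[1], gt[1]) - max(pred[0], gt[0]))
    union = max(pred[1], gt[1]) - min(pred[0], gt[0])
    return inter / union if union > 0 else 0.0

# Example: a predicted moment of 12-30 s against a 15-35 s ground-truth span.
print(temporal_iou((12.0, 30.0), (15.0, 35.0)))    # ~0.652
print(recall_at_k(["v7", "v3", "v9"], "v3", k=1))  # 0
```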
4. Interaction Strategies and System Architectures
IVCR system designs converge on the following recurring architectural principles and algorithms:
- Dialog-State Tracking and Fusion: Systems encode all prior dialog turns (either via hierarchical LSTMs, transformer stacks, or Q-former cross-attention blocks), ensuring ranking, answer, and explanation generation are history-dependent (Han et al., 1 Dec 2025, Maeoki et al., 2019, Madasu et al., 2022).
- Moment Boundary Localization: For moment retrieval, boundary search is guided either by weak supervision (thresholded cosine similarity) or reinforced by stability measures (variance over neighboring frames), frequently with dual-start/end subqueries from users (Tran et al., 11 Apr 2025, Nguyen-Nhu et al., 12 Apr 2025).
- User Feedback Loop: Relevance feedback is integrated via weighted query updating (e.g., Rocchio updates with weighting parameters $\alpha$, $\beta$, $\gamma$) (Duan et al., 21 Mar 2025, Halima et al., 2013); interactive RL agents refine recommendations based on user- or simulator-provided concept-centric feedback (Ma et al., 2023). A Rocchio update sketch follows this list.
- Adaptive Indexing and Scalability: Systems targeting web-scale deployment require efficient frame and segment indexing (vector stores such as FAISS or Pinecone), graph-based context modeling, and deduplication to meet latency and memory constraints (Nguyen-Nhu et al., 12 Apr 2025, Duan et al., 21 Mar 2025). A vector-indexing sketch also follows this list.
- Explainability and Human Factors: Output generation via LLMs includes not only result localization but also natural language explanations. Recent experimental designs stress the cognitive realism of retrieval tasks (task-hint transformations, memory-based filtering) and their impact on human search efficacy (Willis et al., 7 May 2024).
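The classic Rocchio update referenced above has the well-known form $q' = \alpha q + \beta\,\overline{d^{+}} - \gamma\,\overline{d^{-}}$, where $\overline{d^{\pm}}$ are the mean embeddings of relevant and non-relevant items. The sketch below applies it to embedding-space queries; the default weights are illustrative, not values reported by the cited systems.

```python
import numpy as np

def rocchio_update(query, relevant, non_relevant, alpha=1.0, beta=0.75, gamma=0.15):
    """Classic Rocchio relevance-feedback update on embedding vectors.

    query:         current query embedding, shape (d,)
    relevant:      embeddings of items the user marked relevant, shape (n+, d)
    non_relevant:  embeddings of items the user marked non-relevant, shape (n-, d)
    """
    q = alpha * np.asarray(query, dtype=np.float32)
    if len(relevant):
        q = q + beta * np.mean(relevant, axis=0)
    if len(non_relevant):
        q = q - gamma * np.mean(non_relevant, axis=0)
    return q / (np.linalg.norm(q) + 1e-8)  # renormalize for cosine retrieval

# Example: pull the query toward two liked clips and away from one rejected clip.
q0 = np.random.randn(512).astype(np.float32)
liked = np.random.randn(2, 512).astype(np.float32)
rejected = np.random.randn(1, 512).astype(np.float32)
q1 = rocchio_update(q0, liked, rejected)
```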
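A minimal vector-indexing sketch using FAISS is shown below, under the assumption of L2-normalized embeddings so that inner product equals cosine similarity; the exact index type, scale, and embedding dimension are illustrative, not those of the cited systems.

```python
import numpy as np
import faiss  # assumes faiss-cpu is installed; Pinecone or similar stores are analogous

d = 512                        # embedding dimension (illustrative)
index = faiss.IndexFlatIP(d)   # exact inner-product index; IVF/HNSW variants scale further

# Index L2-normalized frame/segment embeddings so inner product equals cosine similarity.
frame_embeddings = np.random.randn(10_000, d).astype("float32")
faiss.normalize_L2(frame_embeddings)
index.add(frame_embeddings)

# Query with a (normalized) text-query embedding; retrieve the top-5 frames.
query = np.random.randn(1, d).astype("float32")
faiss.normalize_L2(query)
scores, frame_ids = index.search(query, 5)
print(frame_ids[0], scores[0])
```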
5. Empirical Results and Quantitative Analysis
Experimental studies highlight IVCR’s superiority over one-shot or static retrieval baselines:
- Retrieval Accuracy: Multi-turn interaction yields marked improvements—Recall@1 increases by 20–40 points over one-shot models, with additional gains achieved by ensemble reranking and neighborhood aggregation (Tran et al., 11 Apr 2025, Han et al., 1 Dec 2025, Liang et al., 2023, Madasu et al., 2022).
- Efficiency: Storage optimization and modular fusion substantially reduce retrieval latency without accuracy loss (Tran et al., 11 Apr 2025, Nguyen-Nhu et al., 12 Apr 2025).
- User Studies: Success rates in known-item search degrade only modestly under realistic (memory-based or filtered) task presentations, but collapse with fully synthetic hints, confirming the need for perceptually and semantically salient cues (Willis et al., 7 May 2024).
- Reinforcement Learning Approaches: Interactive RL agents outperform static re-rankers and classic graph traversal, especially in retrieving deep-ranked or hard-to-find moments (recall gains of 5–14% on hard sets) (Ma et al., 2023).
- Ablations: Removal of ensemble, reranking, or dialog history fusion modules causes significant regression in performance, confirming their necessity for robust IVCR (Tran et al., 11 Apr 2025, Han et al., 1 Dec 2025).
6. Challenges, Limitations, and Future Directions
Current IVCR deployments face several open research questions:
- Hyperparameter and Threshold Selection: Many decision points (gap size, similarity thresholds, reranking weights) are fixed via heuristic tuning; end-to-end learning of these parameters, possibly via differentiable boundary prediction, remains a challenge (Tran et al., 11 Apr 2025, Nguyen-Nhu et al., 12 Apr 2025).
- Scalability and Efficiency: Maintaining sub-second response for dynamic, large-scale corpora is non-trivial; ongoing work integrates subgraph sampling and parallelized vector search (Duan et al., 21 Mar 2025, Nguyen-Nhu et al., 12 Apr 2025).
- Generalization and Domain Adaptation: IVCR models trained on blended datasets (e.g., IVCR-200K) can struggle with domain transfer, particularly under free-form or out-of-vocabulary queries (Han et al., 1 Dec 2025).
- Explainability and Human-Centric Evaluation: Integrating transparent, LLM-driven justifications for retrieval decisions is an active area; future benchmarks may mix filtered-visual, synthetic, and textual cues to emulate real-world memory and perception constraints (Willis et al., 7 May 2024).
- Richer Feedback Modalities: There is increasing emphasis on broadening feedback beyond binary relevance, incorporating graded ratings, free-form clarifications, and multimodal (e.g., sketch or point) interactions (Ma et al., 2023, Han et al., 1 Dec 2025).
- End-to-End Systems: Full end-to-end trainable LLM architectures for IVCR remain computationally expensive and are often reliant on two-stage or cascade paradigms for tractability (Han et al., 1 Dec 2025).
7. Historical Perspective and Summary Table
Interactive video retrieval originated with concept-based, ontology-grounded indexing and relevance feedback mechanisms in multilingual systems (Halima et al., 2013). The field has since evolved through dialog-driven retrieval (Maeoki et al., 2019), reinforcement-guided search (Ma et al., 2023), and, most recently, large-scale multi-turn datasets and LLM-based reranking (Han et al., 1 Dec 2025). The following table summarizes key methodological features across representative IVCR systems:
| System / Paper | Dialog / QA Loop | Moment Retrieval | Ensemble / Fusion | Reranking | Explanations / LLMs |
|---|---|---|---|---|---|
| (Han et al., 1 Dec 2025) | Yes | Yes | Yes (multi-modal) | Yes | Yes (LLM) |
| (Tran et al., 11 Apr 2025) | - | Yes (dual-query) | Yes (coarse+fine) | Yes | Potential |
| (Ma et al., 2023) | Yes (RL-driven) | Yes | No | RL-graph | - |
| (Li et al., 2022) | QA-oriented | Yes | Cross-modal | - | - |
| (Halima et al., 2013) | Feedback | - | Concept-weighted | Rocchio | - |
This progression underscores IVCR’s transition from static, concept-driven pipelines to conversational, explainable, and multimodal architectures capable of addressing evolving user intents and segment-level localization at scale.