Moment Retrieval in Videos

Updated 7 May 2026

Moment Retrieval is defined as the localization of semantically relevant temporal segments in untrimmed videos based on natural language or video queries.
State-of-the-art approaches use transformer-based encoders with query-aware representations and multi-modal fusion to achieve precise temporal alignment.
Ongoing research focuses on improving boundary precision, scalability for long videos, and reducing evaluation biases for robust cross-modal retrieval.

Moment Retrieval (MR) is the task of localizing one or more semantically relevant temporal segments—termed "moments"—in an untrimmed video, conditioned on either a natural-language query or, in recent variants, a query video. MR sits at the intersection of video understanding, cross-modal retrieval, and temporal grounding, demanding aligned modeling of both video content and textual queries over long temporal horizons. This article surveys MR from a technical, model-centric, and evaluative perspective, consolidating state-of-the-art methods, benchmarking practices, and ongoing research questions.

1. Formal Problem Definition and Evaluation Protocols

Moment Retrieval seeks, given an untrimmed video $V$ and a query $Q$ (text or video), to predict a temporal segment $[s, e]$ such that the video subclip $V[s:e]$ is maximally semantically aligned to $Q$ (Otani et al., 2020). In mathematical terms, this is typically expressed as: $(s^*, e^*) = \arg\max_{0 \leq s < e < |V|} S(Q, V[s:e])$ where $S(\cdot, \cdot)$ is a learned cross-modal similarity model. When $Q$ is video (Vid2VidMR), both the query and the candidate segments are sequences of high-dimensional frame or clip embeddings (Kumar et al., 21 Aug 2025).
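
The arg-max above can be illustrated with a brute-force search. The sketch below is a minimal illustration, not any published method: it assumes that $S$ is the cosine similarity between the query embedding and the mean-pooled clip embeddings of a segment, and it uses half-open clip indices $[s, e)$. The function name `best_segment` and the `min_len` parameter are hypothetical.

```python
import numpy as np

def best_segment(clip_feats, query_feat, min_len=1):
    """Score every segment [s, e) of clip indices and return the best one.

    Assumption: S(Q, V[s:e]) is the cosine similarity between the query
    embedding and the mean of the clip embeddings inside the segment.
    """
    n, d = clip_feats.shape
    # Prefix sums allow mean-pooling any segment in O(1).
    prefix = np.vstack([np.zeros((1, d)), np.cumsum(clip_feats, axis=0)])
    q = query_feat / np.linalg.norm(query_feat)
    best_score, best_s, best_e = -np.inf, 0, 0
    for s in range(n):
        for e in range(s + min_len, n + 1):
            seg = (prefix[e] - prefix[s]) / (e - s)
            norm = np.linalg.norm(seg)
            score = float(seg @ q / norm) if norm > 0 else -1.0
            if score > best_score:  # ties keep the earliest segment found
                best_score, best_s, best_e = score, s, e
    return best_s, best_e, best_score
```

The search is quadratic in the number of clips, which is one reason practical systems predict boundaries directly instead of enumerating all candidates.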

Standard evaluation metrics include:

Recall@k (R@k) at IoU $\geq m$: The fraction of queries for which at least one of the top-$k$ predicted moments achieves Intersection over Union (IoU) $\geq m$ with the ground-truth interval(s) (Otani et al., 2020).
mean IoU (mIoU): The average maximum IoU achieved by any proposal per query.
mean Average Precision (mAP) at varied IoU thresholds, especially in multi-moment retrieval scenarios (Cao et al., 20 Oct 2025). Some works further partition metrics by moment durations (short, medium, long) or moment multiplicity (single vs. multi-instance queries).
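
The first two metrics above can be computed in a few lines. The following is a minimal sketch under stated assumptions: intervals are `(start, end)` pairs in seconds, predictions are already ranked by confidence, and mIoU takes the best IoU over all supplied proposals as described above (some codebases use only the top-1 prediction instead). The function names are illustrative and not taken from any benchmark toolkit.

```python
def temporal_iou(pred, gt):
    """IoU between two intervals given as (start, end)."""
    inter = max(0.0, min(pred[1], gt[1]) - max(pred[0], gt[0]))
    union = (pred[1] - pred[0]) + (gt[1] - gt[0]) - inter
    return inter / union if union > 0 else 0.0

def recall_at_k(predictions, ground_truths, k=1, iou_threshold=0.5):
    """Fraction of queries with a top-k prediction at IoU >= threshold.

    predictions: per query, a list of (start, end) ranked by confidence.
    ground_truths: per query, a list of annotated (start, end) intervals.
    """
    hits = 0
    for preds, gts in zip(predictions, ground_truths):
        if any(temporal_iou(p, g) >= iou_threshold
               for p in preds[:k] for g in gts):
            hits += 1
    return hits / len(predictions)

def mean_iou(predictions, ground_truths):
    """Average over queries of the best IoU of any proposal with any ground truth."""
    best = [max(temporal_iou(p, g) for p in preds for g in gts)
            for preds, gts in zip(predictions, ground_truths)]
    return sum(best) / len(best)
```

For example, a prediction of `(0, 10)` against a ground truth of `(0, 8)` has an intersection of 8 s and a union of 10 s, so it counts as a hit at IoU $\geq 0.5$ and $\geq 0.7$ but not at $\geq 0.9$.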

2. Model Architectures and Methodological Foundations

Early MR methods mainly followed proposal-based (two-stage) or sliding window (one-stage) paradigms. With the maturation of Transformer and set-prediction architectures, contemporary MR is dominated by end-to-end encoder–decoder frameworks, especially Detection Transformer (DETR)-style models and their derivatives (Sun et al., 2024, Lu et al., 2024, Park et al., 2024, Moon et al., 2023, Xiao et al., 2023, Nishimura et al., 2024, Paul et al., 2024, Xu et al., 18 Jan 2025).

Key advances include:

Query-aware representations: Most SOTA architectures inject the query embedding into the video encoding pathway at the earliest possible stage, using cross-attention or query-guided convolutions (Moon et al., 2023, Park et al., 2024, He et al., 2024). Models such as QD-DETR explicitly perform initial cross-attention from query tokens into all video clips to maximize cross-modal conditioning (Moon et al., 2023).
DETR-style decoders: Instead of candidate proposal enumeration, MR models typically use learnable queries—often parametrizing moment center and width—that are refined through transformer layers and anchor-based mechanisms derived from DAB-DETR (Park et al., 2024, Sun et al., 2024, Paul et al., 2024).
Granularity and local-global integration: Architectural blocks such as UVCOM’s comprehensive integration module (Xiao et al., 2023) and MGPN’s coarse-to-fine ‘reading’ pipeline (Sun et al., 2022) explicitly integrate low-level locality (essential for precise boundary marking) with high-level global context aggregation, often via EM-inspired attention, random walks, or group-convolutions over dense candidate grids.
Multi-modality fusion: Modern approaches increasingly utilize heterogeneous input cues including audio (ASR), motion (optical flow), depth, and automatically extracted dense captions (BLIP, MiniGPT-4). Some models fuse these signals dynamically using gating, learned coefficients, or specialized cross-modal conv/attention modules (Xu et al., 18 Jan 2025, Lu et al., 2024, Moon et al., 2023, Park et al., 2024).
MomentMix and Data Augmentation: To enrich feature diversity, especially for short moments, augmentation strategies such as ForegroundMix and BackgroundMix shuffle foreground/background features within and across videos (Park et al., 2024). Denoising objectives and synthetic queries further improve robustness to data sparsity (Ma et al., 16 Jul 2025).
Multi-task Joint Learning: Several frameworks, e.g., TR-DETR, UVCOM, VideoLights, and MS-DETR, implement explicit bidirectional feedback between moment retrieval (MR) and highlight detection (HD), exploiting reciprocal inductive biases (Xiao et al., 2023, Sun et al., 2024, Paul et al., 2024, Ma et al., 16 Jul 2025).
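
Two of the ideas above, query-aware clip representations and center/width moment parametrization, can be sketched in NumPy. This is a simplified illustration rather than the implementation of any cited model: the cross-attention is single-head and omits the learned projection matrices a real layer would have, and the function names are hypothetical.

```python
import numpy as np

def softmax(x, axis=-1):
    """Numerically stable softmax."""
    x = x - x.max(axis=axis, keepdims=True)
    e = np.exp(x)
    return e / e.sum(axis=axis, keepdims=True)

def query_aware_clips(clip_feats, query_tokens):
    """Single-head cross-attention: each clip attends over the query tokens.

    clip_feats: (num_clips, d); query_tokens: (num_tokens, d).
    Assumption: learned W_q, W_k, W_v projections are left out for brevity.
    """
    d = clip_feats.shape[1]
    attn = softmax(clip_feats @ query_tokens.T / np.sqrt(d), axis=-1)
    # Residual connection keeps the original clip content.
    return clip_feats + attn @ query_tokens

def spans_from_center_width(cw, duration):
    """Convert normalized (center, width) rows to (start, end) in seconds."""
    center, width = cw[:, 0], cw[:, 1]
    start = np.clip(center - width / 2, 0.0, 1.0) * duration
    end = np.clip(center + width / 2, 0.0, 1.0) * duration
    return np.stack([start, end], axis=1)
```

Predicting a normalized center and width instead of raw boundaries keeps the decoder outputs in a bounded range regardless of video length; the conversion back to seconds happens only at the end.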

3. Recent Innovations: Multimodal LLMs, Extended Retrieval, and Corpus-wide Tasks

The rise of Multimodal LLMs (MLLMs) and improved video-text pretraining have catalyzed new MR paradigms:

LLM-driven MR: Architectures such as LLaVA-MR (Lu et al., 2024) and GPTSee (Sun et al., 2024) deploy MLLMs (or their outputs) in frame description, token compression, or entire generative pipelines. Dense frame/time encoding, informative frame selection, and dynamic token compression pipelines enable direct sequence-level moment prediction within LLM context bounds (Lu et al., 2024).
Video-to-Video MR (Vid2VidMR): MATR (Kumar et al., 21 Aug 2025) localizes moments in unlabeled target videos using video queries, relying on bi-level sequence alignment via soft-DTW both pre- and post-fusion, with transformer-based joint representations and a self-supervised pretraining regime.
Multi-moment Retrieval (MMR): Datasets such as QV-M$^2$ and frameworks like FlashMMR (Cao et al., 20 Oct 2025) address retrieval of all relevant intervals per query. Novel verification modules, temporal adjustment, and dedicated evaluation metrics (G-mAP, mIoU@K, mR@K) are proposed to handle the increased complexity and to reward both precision and coverage.
Unsupervised/self-supervised MR: MPGN (Jung et al., 2022) dispenses with manual queries, instead generating pseudo queries from video subtitles and visual captions, and achieves competitive performance in purely self-supervised settings.
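
The soft-DTW alignment mentioned for video-to-video retrieval can be sketched as follows. This is the generic soft-DTW recursion with a squared Euclidean frame distance, given only as an assumption-laden illustration: how MATR applies it before and after fusion is not specified here, and the function names and the default `gamma` are hypothetical choices.

```python
import numpy as np

def softmin(values, gamma):
    """Smooth minimum: -gamma * log(sum(exp(-v / gamma))), computed stably."""
    v = np.asarray(values, dtype=float) / -gamma
    m = v.max()
    return -gamma * (m + np.log(np.exp(v - m).sum()))

def soft_dtw(x, y, gamma=0.1):
    """Soft-DTW alignment cost between sequences x (n, d) and y (m, d)."""
    n, m = len(x), len(y)
    # Pairwise squared Euclidean distances between frame embeddings.
    dist = ((x[:, None, :] - y[None, :, :]) ** 2).sum(axis=-1)
    r = np.full((n + 1, m + 1), np.inf)
    r[0, 0] = 0.0
    for i in range(1, n + 1):
        for j in range(1, m + 1):
            # Replace the hard min of classic DTW with a differentiable softmin.
            r[i, j] = dist[i - 1, j - 1] + softmin(
                [r[i - 1, j], r[i, j - 1], r[i - 1, j - 1]], gamma)
    return r[n, m]
```

As `gamma` approaches zero the value approaches the classic DTW cost; larger values smooth the objective, which is what makes it usable as a training loss.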

4. Benchmarks, Datasets, and Evaluation Biases

Dominant MR benchmarks include QVHighlights (YouTube vlogs/news), Charades-STA (indoor actions), TACoS (cooking), ActivityNet Captions, TVSum, and new datasets for multi-moment MR (QV-M$^2$), all standardized in codebases like Lighthouse (Nishimura et al., 2024). Each benchmark defines official splits and metrics (typically R@1 at IoU 0.5/0.7 and mAP at varied IoU thresholds), and supports modular evaluation pipelines across architectures.

Statistical and evaluation considerations:

Temporal priors and verb biases: Some datasets, particularly Charades-STA and ActivityNet Captions, exhibit strong priors on the temporal placement and verb-conditioned likelihood of moments, enabling off-content “blind” models to approach or exceed learned baselines (Otani et al., 2020).
Single-reference annotation limitation: MR benchmarks typically penalize predictions that match a valid, but unannotated, occurrence of the query event. This results in underreported model performance, particularly in repeated or ambiguous events, and low upper-bound human agreement (Cao et al., 20 Oct 2025, Otani et al., 2020).
Sanity checks and failure modes: Permuting video clips at inference often leaves model predictions unchanged, revealing overreliance on temporal priors or query-linguistic artifacts rather than genuine audiovisual grounding (Otani et al., 2020).
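
Both diagnostics above are simple to set up. The sketch below is a hypothetical illustration, not the protocol of Otani et al.: the "blind" baseline ignores video and query and predicts the median normalized start and end of the training annotations, and the shuffling helper permutes clip features so that a model's predictions can be compared before and after.

```python
import numpy as np

def fit_blind_prior(train_segments, train_durations):
    """Median normalized (start, end) of the training annotations."""
    seg = np.asarray(train_segments, dtype=float)
    dur = np.asarray(train_durations, dtype=float)[:, None]
    return np.median(seg / dur, axis=0)

def predict_blind(prior, duration):
    """Scale the normalized prior to a video of the given duration (seconds)."""
    return prior[0] * duration, prior[1] * duration

def shuffle_clips(clip_feats, seed=0):
    """Permute clips along the time axis for the permutation sanity check."""
    rng = np.random.default_rng(seed)
    return clip_feats[rng.permutation(len(clip_feats))]
```

If a learned model barely beats `predict_blind`, or if its output does not change when it is fed `shuffle_clips(features)`, its benchmark score likely reflects dataset priors more than visual grounding.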

5. Empirical Advances and Quantitative Results

Across common benchmarks, current SOTA architectures consistently outperform earlier proposal-based and non-attention methods:

Table: QVHighlights (R@1 at IoU 0.5 / Avg mAP)

| Method | R@1, IoU=0.5 | Avg mAP |
|------------------|---------|---------|
| QD-DETR (Moon et al., 2023) | 62.40 | 39.86 |
| UVCOM (Xiao et al., 2023) | 63.55 | 43.18 |
| SG-DETR (Gordeev et al., 2024) | 74.20 | 58.80 |
| VideoLights-B-pt (Paul et al., 2024) | 70.36 | 47.94 |
| MRNet (Xu et al., 18 Jan 2025) | 61.54 | 39.53 |
| LLaVA-MR (Lu et al., 2024) | 76.59 | 69.41 |
| FlashMMR (Cao et al., 20 Oct 2025), G-mAP | 35.14 | — |

Table: Charades-STA (R@1 at IoU 0.5 / R@1 at IoU 0.7)

| Method | R@1, IoU=0.5 | R@1, IoU=0.7 |
|------------------|---------|---------|
| UVCOM (Xiao et al., 2023) | 59.25 | 36.64 |
| SG-DETR (Gordeev et al., 2024) | 71.10 | 52.80 |
| VideoLights-B-pt (Paul et al., 2024) | 61.96 | 41.05 |
| MRNet (Xu et al., 18 Jan 2025) | 55.84 | — |
| LLaVA-MR (Lu et al., 2024) | 70.65 | 49.58 |

Notably, models such as SG-DETR and LLaVA-MR gain substantial improvements by leveraging foundation model video/text encoders, saliency-guided cross-attention, and longer context window handling.

6. Open Problems, Limitations, and Future Directions

Critical research frontiers and unresolved issues include:

Boundary Precision and Short Moments: Methods struggle with fine localization precision for short-duration moments. Length-aware decoders and specific data augmentation (ForegroundMix, BackgroundMix) have made progress, yet mAP for short moments lags substantially (Park et al., 2024).
Temporal Reasoning and Long-range Dependencies: Most transformer-based approaches scale poorly to hour-long videos. Solutions such as sparse attention, hierarchical pooling, and context-aware re-ranking have improved scalability (Tran et al., 11 Apr 2025).
Cross-modal Generalization: Robust retrieval when queries are highly abstract, compositional, or context-dependent remains challenging, both for text-based and video-query MR (Paul et al., 2024, Cao et al., 20 Oct 2025).
Benchmark Bias and Fairness: Strong temporal and linguistic priors in existing datasets can mask the true cross-modal alignment capability of MR models. Multiple reference annotations, new evaluation protocols, and dataset expansion to mitigate biases are active directions (Otani et al., 2020, Cao et al., 20 Oct 2025).
Unified Multitask and Multimodal Pipelines: Integrating video summarization, temporal action detection, and moment retrieval into unified frameworks (e.g., UniMD) demonstrates complementary task synergies and improves overall video understanding (Zeng et al., 2024, Xiao et al., 2023).

7. Software, Reproducibility, and Evaluation Infrastructure

Reproducibility frameworks such as Lighthouse (Nishimura et al., 2024) have standardized MR evaluation by consolidating diverse architectures, feature extractors, and datasets under unified configuration and API schemas. This modularity has exposed algorithmic bottlenecks, metric limitations, and model generalization patterns, enabling more robust ablation analysis and baseline benchmarking.

Lighthouse supports six MR/HD models (Moment-DETR, QD-DETR, EaTR, TR-DETR, UVCOM, CG-DETR), three feature pipelines (CLIP, CLIP+SlowFast, ResNet+GloVe), and five standard datasets, enabling controlled, apples-to-apples comparisons. Remaining challenges include handling cross-domain generalization, scaling to millions of video hours, and facilitating human-in-the-loop or interactive retrieval.

In summary, Moment Retrieval has rapidly evolved from basic sliding-window and proposal-based approaches to sophisticated cross-modal, query-aware, transformer-driven frameworks that exploit context-aware alignment, multi-resolution fusion, and large-scale pretraining. While empirical gains are robust—especially with the integration of foundation models—a set of open methodological and dataset challenges motivates ongoing research in high-precision temporal localization, bias-free evaluation, and scalable multimodal reasoning.