Audio Moment Retrieval
- Audio Moment Retrieval is a task that locates precise temporal intervals in long audio recordings based on queries such as text or audio snippets.
- It leverages transformer-based architectures and cross-modal attention to fuse audio, text, and visual modalities for accurate moment localization.
- Evaluations with metrics such as Recall@K and mAP show clear gains from pre-training and fine-tuning, though fine temporal precision remains challenging, especially for short events.
Audio Moment Retrieval (AMR) is the task of temporally localizing one or more relevant intervals ("moments") within long audio recordings given an external query—which may be free-form text, an audio snippet, or another modality such as video. The AMR formulation is characterized by its explicit prediction of the temporal boundaries of segments in untrimmed audio streams that are relevant to the query, requiring models to perform fine-grained grounding rather than coarse-level clip selection. AMR subsumes and generalizes previous tasks such as query-by-example matching and segment-level retrieval, and exhibits direct analogies with video moment retrieval and temporal grounding in vision and multimodal research (Munakata et al., 19 Nov 2025, Munakata et al., 24 Sep 2024, Xin et al., 30 Aug 2024).
1. Formal Problem Definition
Let $x$ denote a continuous audio signal of duration $T$ seconds and $q$ a query (commonly free-form text, but in some settings audio or multimedia). The AMR task is to predict a set $M = \{(t_s^{(i)}, t_e^{(i)})\}_{i=1}^{K}$, where each $(t_s^{(i)}, t_e^{(i)}) \subseteq [0, T]$ is a subinterval best matched to $q$. The model computes relevance for discretized candidate segments (e.g., generated by a sliding window of width $w$ and hop $h$), ranking and selecting the top-$K$ predictions.
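As an illustration of the candidate-generation step just described, the following Python sketch enumerates fixed-width windows over an untrimmed recording; the function name and the window width, hop, and duration values are illustrative, and many models instead regress boundaries directly rather than scoring enumerated candidates.

```python
def candidate_segments(duration_s: float, width_s: float, hop_s: float):
    """Enumerate (start, end) candidate moments with a sliding window.

    duration_s: total length of the untrimmed recording in seconds
    width_s:    window width w
    hop_s:      hop size h between consecutive window starts
    """
    segments = []
    start = 0.0
    while start < duration_s:
        end = min(start + width_s, duration_s)
        segments.append((start, end))
        start += hop_s
    return segments

# Example: a 5-minute recording with 10 s windows and 5 s hop (hypothetical values).
print(len(candidate_segments(300.0, 10.0, 5.0)))  # 60 candidate moments
```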
Alignment of predictions to annotated boundaries is evaluated using Intersection over Union (IoU), $\mathrm{IoU}(a, b) = |a \cap b| \, / \, |a \cup b|$ for temporal intervals $a$ and $b$. Metrics include Recall@$K$ at IoU threshold $\theta$ (the fraction of queries for which at least one of the top-$K$ predicted moments achieves $\mathrm{IoU} \ge \theta$ with any ground-truth moment) and mean Average Precision (mAP) averaged over IoU thresholds (Munakata et al., 19 Nov 2025, Munakata et al., 24 Sep 2024, Xin et al., 30 Aug 2024).
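A minimal sketch of this evaluation protocol, assuming moments are represented as (start, end) pairs in seconds; the helper names and toy values are illustrative only.

```python
def temporal_iou(a, b):
    """IoU of two 1-D intervals a = (a_start, a_end), b = (b_start, b_end)."""
    inter = max(0.0, min(a[1], b[1]) - max(a[0], b[0]))
    union = (a[1] - a[0]) + (b[1] - b[0]) - inter
    return inter / union if union > 0 else 0.0

def recall_at_k(predictions_per_query, ground_truths_per_query, k=1, iou_thr=0.7):
    """Fraction of queries whose top-k predictions hit any ground truth at IoU >= iou_thr.

    predictions_per_query:   per-query lists of (start, end) tuples, ranked by confidence
    ground_truths_per_query: per-query lists of annotated (start, end) moments
    """
    hits = 0
    for preds, gts in zip(predictions_per_query, ground_truths_per_query):
        if any(temporal_iou(p, g) >= iou_thr for p in preds[:k] for g in gts):
            hits += 1
    return hits / len(predictions_per_query)

# Toy example: one query, one predicted moment vs. one annotated moment (IoU = 5/7).
print(recall_at_k([[(12.0, 18.0)]], [[(13.0, 19.0)]], k=1, iou_thr=0.5))  # 1.0
```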
2. Dataset Construction and Annotation
Initial progress in AMR relied on synthetic datasets due to the lack of real-world benchmarks. "Clotho-Moment" (Munakata et al., 24 Sep 2024) overlays captioned sound events onto long urban-soundscape background recordings, yielding large-scale simulated recordings with moment labels. Manual evaluation was initially limited to 100 examples.
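A simplified sketch of this kind of synthetic overlay, assuming numpy arrays for waveforms; the gain handling and mixing here are much cruder than the actual Clotho-Moment simulation pipeline.

```python
import numpy as np

def overlay_event(background: np.ndarray, event: np.ndarray, sr: int,
                  onset_s: float, event_gain: float = 1.0):
    """Overlay a captioned sound event onto a long background recording.

    Returns the mixed waveform and the (start, end) moment label in seconds.
    This is a simplified stand-in for the Clotho-Moment simulation pipeline.
    """
    start = int(onset_s * sr)
    end = min(start + len(event), len(background))
    mixed = background.copy()
    mixed[start:end] += event_gain * event[: end - start]
    return mixed, (start / sr, end / sr)

# Toy usage with random signals at 16 kHz (illustrative only).
sr = 16000
bg = 0.01 * np.random.randn(60 * sr)   # 60 s of quiet background
ev = 0.1 * np.random.randn(3 * sr)     # 3 s event with an associated caption
mix, moment = overlay_event(bg, ev, sr, onset_s=20.0)
print(moment)  # (20.0, 23.0)
```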
The emergence of CASTELLA (Munakata et al., 19 Nov 2025), a human-annotated corpus targeting AMR, has established a real-world benchmark. CASTELLA comprises 1,862 audio recordings (train: 1,009; validation: 213; test: 640) of 1–5 minutes each, annotated through a three-stage crowdsourcing process: (a) selection of up to five local moments per recording, (b) captioning (local and global; written in Japanese, then translated into English and reviewed), and (c) temporal boundary marking at 1 s resolution, permitting overlaps. Statistics include 3,881 local captions, 11,308 timestamps, an average of 2.1 local captions per recording, and a moment-length distribution biased towards short events (<5 s).
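The resulting annotation structure can be pictured with a hypothetical record; the field names below are illustrative and do not reflect CASTELLA's actual file format.

```python
# Hypothetical annotation record mirroring the CASTELLA description above
# (field names and values are illustrative, not the dataset's actual schema).
example_record = {
    "recording_id": "castella_example_0001",
    "duration_s": 182,                        # recordings are 1-5 minutes long
    "global_caption": "A busy street market with vendors and passing traffic.",
    "local_moments": [                        # up to five local moments per recording
        {"caption": "A dog barks twice nearby.", "start_s": 41, "end_s": 44},
        {"caption": "A bell rings in the distance.", "start_s": 43, "end_s": 47},
    ],                                        # 1 s boundary resolution, overlaps allowed
}
print(len(example_record["local_moments"]))   # 2
```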
In the context of audio-visual and cross-modal grounding, datasets such as HIREST (Tu et al., 18 Dec 2024) and MGSV-EC ("Ad-Moment") (Xin et al., 30 Aug 2024) advance the study of grounding queries (e.g., video snippets, text) to precise moments within long audio/music tracks, often using pseudo-labels generated by audio alignment for large-scale annotation.
3. Model Architectures and Methodologies
DETR-Based Approaches
Most state-of-the-art AMR models adopt the detection-transformer (DETR) paradigm to handle set-based output and many-to-one query-to-moment associations. Notable instantiations include:
- AM-DETR (Munakata et al., 24 Sep 2024): Sliding-window audio and text features are fused via cross-attention encoders, followed by transformer decoders with learnable queries. Each decoder head outputs a confidence score and predicted center-width parameters, mapped to segment boundaries. Supervision is provided via Hungarian matching between predicted and ground-truth segments, using regression, generalized IoU, and classification (foreground/background) losses; a minimal sketch of this matching follows the table below.
- CASTELLA Baseline (Munakata et al., 19 Nov 2025): Utilizes MS-CLAP for per-second audio embeddings and the CLAP text encoder, fuses the modalities, and applies a DETR-style decoder. Pre-training on synthetic data (Clotho-Moment) is essential, with observed gains of 10.4 points in Recall@1 at a fixed IoU threshold after subsequent fine-tuning on human-annotated data. UVCOM architectures demonstrate superior performance, whereas Moment-DETR underperforms.
| Architecture | Pre-trained | Fine-tuned | Recall@1 (fixed IoU thr.) |
|---|---|---|---|
| QD-DETR | synthetic | ✗ | 5.8 |
| QD-DETR | ✗ | CASTELLA | 9.7 |
| QD-DETR | synthetic | CASTELLA | 16.2 |
| UVCOM | synthetic | CASTELLA | 20.3 |
| Moment-DETR | synthetic | CASTELLA | 10.8 |
Performance degrades significantly for short events (<5 s), signifying continued limitations in temporal precision.
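The set-prediction supervision used by these DETR-style models can be sketched as follows: predicted (center, width) parameters are mapped to boundaries and paired with ground-truth moments by Hungarian matching over a regression-plus-generalized-IoU cost; the matched cost stands in for the regression terms of the loss, and the classification term is omitted for brevity. The sketch assumes numpy and scipy and simplifies the actual AM-DETR objective.

```python
import numpy as np
from scipy.optimize import linear_sum_assignment

def cw_to_se(cw):
    """Map (center, width) predictions to (start, end) boundaries."""
    c, w = cw[:, 0], cw[:, 1]
    return np.stack([c - w / 2, c + w / 2], axis=1)

def giou_1d(a, b):
    """Generalized IoU between 1-D segments a = (start, end) and b = (start, end)."""
    inter = max(0.0, min(a[1], b[1]) - max(a[0], b[0]))
    union = (a[1] - a[0]) + (b[1] - b[0]) - inter
    iou = inter / union if union > 0 else 0.0
    hull = max(a[1], b[1]) - min(a[0], b[0])
    return iou - (hull - union) / hull if hull > 0 else iou

def match_and_cost(pred_cw, gt_se, l1_weight=1.0, giou_weight=1.0):
    """Hungarian matching between predicted and ground-truth segments.

    pred_cw: (N, 2) array of predicted (center, width), normalized to [0, 1]
    gt_se:   (M, 2) array of ground-truth (start, end), normalized to [0, 1]
    Returns matched index pairs and the total matched cost (L1 + gIoU terms only).
    """
    pred_se = cw_to_se(pred_cw)
    cost = np.zeros((len(pred_se), len(gt_se)))
    for i, p in enumerate(pred_se):
        for j, g in enumerate(gt_se):
            l1 = np.abs(p - g).sum()
            cost[i, j] = l1_weight * l1 + giou_weight * (1.0 - giou_1d(p, g))
    rows, cols = linear_sum_assignment(cost)
    return list(zip(rows, cols)), cost[rows, cols].sum()

# Toy example: two decoder queries, one annotated moment (values are illustrative).
preds = np.array([[0.30, 0.10], [0.70, 0.20]])   # (center, width)
gts = np.array([[0.65, 0.85]])                   # (start, end)
print(match_and_cost(preds, gts))                # the second query is matched
```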
Cross-modal and Self-Supervised Approaches
Systems such as QUAG (Tu et al., 18 Dec 2024) build on modality-synergistic perception (MSP), which uses a symmetric InfoNCE loss to achieve global alignment between audio and other modalities (e.g., vision). Local fine-grained audio-visual interactions are modeled via cross-attention, followed by query-centric cognition (QC²), in which deep query representations filter both the temporal and channel dimensions, dynamically masking irrelevant content. Ablations confirm that global cross-modal alignment and query-selective filtering each improve retrieval recall, individually and jointly. QUAG achieves state-of-the-art recall on HIREST, reporting 72.54% and 38.86% at its two standard IoU thresholds.
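The global-alignment component relies on a symmetric InfoNCE objective; the following is a generic numpy sketch of that loss, not QUAG's exact implementation, and the temperature value is an arbitrary illustrative choice.

```python
import numpy as np

def symmetric_info_nce(audio_emb: np.ndarray, other_emb: np.ndarray, temperature: float = 0.07):
    """Symmetric InfoNCE over a batch of paired embeddings.

    audio_emb, other_emb: (B, D) embeddings of paired items (e.g., audio and
    visual/text features); row i of each matrix forms a positive pair.
    Returns the average of the audio-to-other and other-to-audio cross-entropy terms.
    """
    a = audio_emb / np.linalg.norm(audio_emb, axis=1, keepdims=True)
    o = other_emb / np.linalg.norm(other_emb, axis=1, keepdims=True)
    logits = a @ o.T / temperature                       # (B, B) similarity matrix

    def xent(lg):
        lg = lg - lg.max(axis=1, keepdims=True)          # numerical stability
        log_probs = lg - np.log(np.exp(lg).sum(axis=1, keepdims=True))
        return -np.mean(np.diag(log_probs))              # positives lie on the diagonal

    return 0.5 * (xent(logits) + xent(logits.T))

# Toy usage with random 4-sample batches of 8-dimensional embeddings.
rng = np.random.default_rng(0)
print(symmetric_info_nce(rng.normal(size=(4, 8)), rng.normal(size=(4, 8))))
```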
In music and audio-visual grounding, methods employ two-stage pipelines (retrieval, then localization). For example, MaDe (ReaL) (Xin et al., 30 Aug 2024) first retrieves top candidate tracks using InfoNCE contrastive learning, and then predicts precise moment boundaries with a DETR-style decoder and alignment loss.
Query-by-example AMR has been addressed via robust audio fingerprinting systems (Singh et al., 2022), where CNNs with channel-wise spectral-temporal attention produce subfingerprints for windowed segments. Retrieval leverages locality-sensitive hashing for fast matching, followed by offset-consistency analysis for timestamp prediction, yielding localization accuracy within ±50 ms even under severe noise and reverb.
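The offset-consistency step can be pictured as histogram voting: each matched subfingerprint pair votes for a query-to-reference window offset, and the modal offset gives the predicted timestamp. The sketch below assumes an illustrative 50 ms hop and is not the exact procedure of Singh et al.

```python
from collections import Counter

def estimate_offset(matches, hop_s: float = 0.05):
    """Estimate the query's position in the reference via offset voting.

    matches: list of (query_window_idx, reference_window_idx) pairs returned by
             approximate (e.g., LSH-based) subfingerprint lookup
    hop_s:   hop between consecutive windows, in seconds (illustrative value)
    Returns the most consistent offset in seconds, i.e., the predicted start time
    of the query within the reference recording.
    """
    votes = Counter(ref_idx - q_idx for q_idx, ref_idx in matches)
    best_offset_windows, _ = votes.most_common(1)[0]
    return best_offset_windows * hop_s

# Toy example: most matches agree on an offset of 240 windows (12.0 s at 50 ms hop),
# while one spurious match votes elsewhere and is outvoted.
matches = [(0, 240), (1, 241), (2, 242), (3, 500), (4, 244)]
print(estimate_offset(matches))  # 12.0
```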
4. Training, Optimization, and Evaluation Protocols
Training procedures commonly use the AdamW optimizer with learning rates on the order of 1e-4, batch sizes around 32 (AMR) or 512 (music retrieval), and schedules of up to 100 epochs with early stopping or learning-rate decay (Munakata et al., 19 Nov 2025, Xin et al., 30 Aug 2024, Munakata et al., 24 Sep 2024). Pre-training on synthetic data or with contrastive objectives is critical; models trained solely on real data underperform, as shown in empirical comparisons on CASTELLA. Fine-tuning with human-annotated data yields substantial performance gains.
Evaluation metrics for AMR include Recall@$K$ at IoU threshold $\theta$, with $\theta$ typically set to 0.5 or 0.7, and mean Average Precision computed over a range of IoU thresholds. Comparisons on CASTELLA and HIREST demonstrate that real-data fine-tuning improves Recall@1 by over 10 points compared to synthetic-only training (Munakata et al., 19 Nov 2025).
For query-by-example localization, estimators are evaluated by the proportion of predictions falling within ±50 ms of ground-truth boundaries, with accuracy remaining above 80% in noisy and reverberant conditions (Singh et al., 2022).
5. Variants, Modalities, and Generalizations
AMR extends naturally to text-audio, audio-audio, and multimodal queries, and is closely related to video moment retrieval, sound event detection (SED), and query-based video summarization. Approaches developed for video (e.g., detection transformers, attention-based grounding) are frequently adapted for AMR, with shared methodology including cross-modal contrastive pretraining, attention-based fusion, and temporal boundary regression (Munakata et al., 19 Nov 2025, Tu et al., 18 Dec 2024, Munakata et al., 24 Sep 2024).
Music moment retrieval, as in VMMR (Xin et al., 30 Aug 2024), formulates retrieval and localization for video queries aligned to background music, and shares architectural components with AMR. Extensions also consider environmental sound, speech, and audiobook retrieval, replacing encoders appropriately (e.g., wav2vec for speech).
6. Limitations, Challenges, and Future Directions
Despite rapid progress, several challenges remain. Datasets may exhibit language limitations (CASTELLA: English-only captions translated from Japanese), coarse temporal granularity (1 s resolution), and bias toward short moments (Munakata et al., 19 Nov 2025). Very brief event localization (<5 s) remains particularly challenging across architectures. Improving metric robustness, leveraging multilingual and multimodal annotations, and incorporating global scene context are noted directions for future benchmarks.
Research trajectories emphasize:
- Combining global captions and context with local retrieval (Munakata et al., 19 Nov 2025).
- Integrating SED architectures and loss functions (e.g., contrastive triplet loss) for enhanced alignment.
- Improving transformer-based detectors with iterative refinement and learned temporal suppression (Munakata et al., 24 Sep 2024).
- Scaling annotated datasets and supporting fine-grained, overlapping, and multi-query retrieval.
- Exploring domain transfer among AMR, video, music, and environmental sound retrieval (Xin et al., 30 Aug 2024, Tu et al., 18 Dec 2024).
7. Summary Table of Notable AMR Datasets
| Dataset | Size | Data Type | Annotation | Key Usage |
|---|---|---|---|---|
| Clotho-Moment | >40k clips | Synthetic | Simulated overlays | Pre-training, ablation |
| CASTELLA | 1,862 recordings | Real, human-labeled | Multi-moment, 1 s res. | Real-world benchmark |
| UnAV-100 subset | 77 recordings / 100 queries | Real, manual | Single-moment, 1 s res. | Early manual eval |
| MGSV-EC | 4,050 tracks | Music, video | Pseudo-aligned | Music/VMMR grounding |
| HIREST | — | Video+Audio/Text | Multi-modal, manual | AV moment grounding |
All evaluations and methodological advances underscore the centrality of transformer-based architectures with cross-modal attention, robust pre-training, and fine-grained annotations in advancing the state of Audio Moment Retrieval (Munakata et al., 19 Nov 2025, Munakata et al., 24 Sep 2024, Xin et al., 30 Aug 2024, Tu et al., 18 Dec 2024, Singh et al., 2022).