Audio-Based Grounding Dataset

Updated 2 December 2025
  • Audio-based grounding datasets are curated resources that link raw or processed audio signals to visual or spatial data, enabling cross-modal localization and segmentation.
  • They employ diverse modalities and annotation protocols—including pixel-level, temporal, and class-level labels—to support tasks like spoken video grounding and 3D point cloud segmentation.
  • These datasets drive advances in multimodal systems through rigorous benchmarking metrics such as mIoU, recall, and accuracy, despite challenges like noisy conditions and synthetic speech.

An audio-based grounding dataset is a curated or constructed resource that explicitly links raw or processed audio signals to spatial, temporal, or semantic elements in another modality, most commonly vision (images, video, or 3D point clouds). These datasets are central to the development, evaluation, and benchmarking of models that localize, match, or segment visual elements using sound—especially when the grounding cue is given as speech or non-speech audio rather than canonical text. The field now encompasses tasks including temporal localization in video from spoken utterances, segmentation of 2D/3D visual regions from environmental sounds, spatial grounding of impacts or events given audio, and even affordance segmentation cued by action sounds.

1. Key Dataset Types and Scope

Audio-based grounding datasets are heterogeneous in their technical design, modalities, and target tasks. The major axes of differentiation include the following (an illustrative sample record follows this list):

  • Modality pairing: Audio–video (spoken video grounding, temporal alignment), audio–image (segmentation, object localization), audio–3D (audio-conditioned 3D point cloud grounding), audio–robotics (robotic manipulation via multimodal input), and audio–question-answer (audio-visual instruction and QA).
  • Annotation granularity: Pixel-level (segmentation masks), bounding box, temporal interval, or class-level labels.
  • Audio type: Raw environmental sounds, impact sounds, speech (human or TTS), or synthetic mixtures (speech plus sound).
  • Dataset scale: Varies from thousands to hundreds of thousands of examples.

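These axes can be summarized as the fields of a single example record. The sketch below is purely illustrative: the field names and schema are hypothetical, and each dataset defines its own format.

```python
from dataclasses import dataclass
from typing import Optional, Tuple

@dataclass
class GroundingSample:
    """One audio-grounding example covering the axes above (hypothetical schema)."""
    audio_path: str                                  # raw waveform: speech, impact, or environmental sound
    audio_type: str                                  # e.g. "speech", "impact", "mixture"
    visual_path: str                                 # image, video, or 3D point cloud file
    visual_modality: str                             # e.g. "image", "video", "pointcloud"
    mask_path: Optional[str] = None                  # pixel-level mask for segmentation-style grounding
    bbox: Optional[Tuple[float, ...]] = None         # 2D xyxy or 3D box for detection-style grounding
    interval: Optional[Tuple[float, float]] = None   # (start, end) in seconds for temporal grounding
    label: Optional[str] = None                      # class-level label, when provided
```
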
For large-scale spoken video grounding, the ActivityNet Speech dataset exemplifies direct audio–video query alignment, offering over 70,000 spoken queries temporally aligned to unconstrained ActivityNet videos with synthetic yet challenging noise augmentation (Xia et al., 2022). For spatial segmentation grounded by single-word speech, the single-word audio-guided image segmentation dataset includes 66,202 image–audio pairs covering 35 object-class keywords spoken in 35 accents, supporting robust, cross-accent grounding (Santos et al., 27 Nov 2025). In 3D, Audio-3DVG generated 172,995 synthetic audio descriptions for 2,714 3D indoor scenes, with referential utterances mapped to ScanNet-derived real-world environments (Cao-Dinh et al., 1 Jul 2025). Task coverage further includes pixel-level affordance mask prediction from action sounds (AV-AG/AVAGD, 12,768 images/masks, 5,203 sounds, 97 objects, 55 affordances) (Lu et al., 1 Dec 2025), grounding of real pointwise impact sounds to object mesh locations (RealImpact, 150,000 multichannel samples, 50 objects, 600 microphone poses each) (Clarke et al., 2023), and simultaneous speech/non-speech grounding in images with controlled mixtures (Extended-IS3, 6,840 mixtures) (Ryu et al., 24 Mar 2025).

2. Annotation Protocols and Technical Construction

Annotation schemes and construction methodology are highly task-specific but share several common points of rigor:

  • Temporal alignment: For activity grounding from speech, segment times are inherited or mapped from existing text-captioned datasets, e.g., ActivityNet or movie audio description corpora (e.g., MAD, 384,000 temporally anchored sentences in 1,207 h video) (Soldan et al., 2021).
  • Spatial supervision: Segmentation- and detection-oriented datasets propagate or manually annotate per-pixel or bounding-box regions (e.g., IS3/Extended-IS3, AV-AG).
  • Sound-event labeling: In text-to-audio grounding, sound events are identified via automated phrase chunking (NP/VP extraction), then manually merged and temporally segmented in the audio (Xu et al., 2021); a toy chunking sketch follows this list.
  • Audio sourcing and diversity: Audio may be crowdsourced (spontaneous speech or actors/readers, e.g., ActivityNet Speech), synthesized via TTS (Audio-3DVG, the single-word dataset), or curated from environmental recordings (RealImpact, ASPED v.b (Kim et al., 23 Sep 2025)).
  • Multimodal QA: Benchmarks such as audio-visual QA datasets involve human-authored audio-attentive questions with dual annotator passes and inter-annotator agreement metrics (κ=0.82 for AudioVisQA (Sagare et al., 21 Jul 2024)).

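As a toy illustration of the automated phrase-chunking step in the sound-event labeling bullet above, the sketch below extracts noun-phrase candidates from an audio caption with spaCy; the pipeline of Xu et al. (2021) additionally extracts verb phrases and relies on manual merging and temporal segmentation, which are not reproduced here.

```python
import spacy  # assumes spaCy and the en_core_web_sm model are installed

nlp = spacy.load("en_core_web_sm")

def candidate_sound_event_phrases(caption: str) -> list[str]:
    """Return noun-phrase candidates for sound events in a caption.

    Only the automated NP step is shown; VP extraction, manual merging,
    and temporal segmentation of the audio happen downstream.
    """
    doc = nlp(caption)
    return [chunk.text for chunk in doc.noun_chunks]

print(candidate_sound_event_phrases("a dog barks while a car engine idles"))
# e.g. ['a dog', 'a car engine']
```
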
Realistic background noise is often injected (ActivityNet Speech mixes ESC-50 noise at α ∈ [0.5, 0.7]), and close attention is paid to class balance, coverage, and split protocols, with many resources providing stratified “seen/unseen” splits (AV-AG zero-shot setting, single-word dataset with held-out images/utterances).
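
The noise-injection step can be sketched as follows; the linear mixing form and the renormalization safeguard are assumptions for illustration, not the exact ActivityNet Speech recipe.

```python
import numpy as np

def mix_background_noise(speech: np.ndarray, noise: np.ndarray, alpha: float) -> np.ndarray:
    """Mix an ESC-50-style noise clip into a spoken query at coefficient alpha.

    The linear form speech + alpha * noise is an illustrative assumption;
    alpha is sampled from [0.5, 0.7] as in the ActivityNet Speech protocol.
    """
    # Tile or trim the noise so it covers the full utterance.
    reps = int(np.ceil(len(speech) / len(noise)))
    noise = np.tile(noise, reps)[: len(speech)]
    mixed = speech + alpha * noise
    # Renormalize if the mixture clips (assumed safeguard).
    peak = np.max(np.abs(mixed))
    return mixed / peak if peak > 1.0 else mixed

alpha = np.random.default_rng(0).uniform(0.5, 0.7)
```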

3. Data Formats, Splits, and Statistical Properties

Audio-based grounding datasets typically standardize input modalities and splits:

  • Audio: 16 kHz PCM WAV for speech and environmental sounds, occasionally 48 kHz for multichannel impact capture (RealImpact).
  • Visual: Untrimmed videos, still images, or 3D point clouds (ScanNet-derived PCDs in Audio-3DVG), all mapped to their audio referents.
  • Masks/Labels: Pixel-accurate masks (PNG), bounding boxes (xyxy), temporal intervals, or class indicators.
  • Splits: Standard train/validation/test, stratified by utterance, scene, or object-action pair, or a single held-out evaluation set for benchmarking (Extended-IS3).
  • Examples (a minimal loading sketch follows the table below):

| Dataset                   | #Audio    | #Visuals    | Split           | Annotation              |
|---------------------------|----------:|------------:|:----------------|:-------------------------|
| ActivityNet Speech        | 70,000    | 20,000      | 37k/18k/17k     | Temporal intervals       |
| AV-AG (AVAGD)             | 5,203     | 12,768      | seen/unseen     | Pixel masks (dual)       |
| Audio-3DVG                | 173,000   | 2,714       | per dataset     | 3D bbox, class label     |
| RealImpact                | 150,000   | 50          | per object      | Impact position, RGBD    |
| Single-word Segmentation  | ~2,300    | 66,202      | 50k/8k/8k       | Pixel mask               |
| Extended-IS3              | 6,840     | 3,420       | eval only       | Pixel masks, transcript  |
| ASPED v.b                 | N/A       | N/A         | 1,056/132/132 h | Presence flag, 1 fps     |
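
Under the format assumptions listed above (16 kHz PCM WAV audio, PNG masks), a single audio-mask pair might be loaded as in the following sketch; the paths and helper name are hypothetical.

```python
import numpy as np
import soundfile as sf   # pip install soundfile
from PIL import Image

def load_audio_mask_pair(wav_path: str, mask_path: str):
    """Load one 16 kHz waveform and its pixel-accurate PNG mask (hypothetical helper)."""
    waveform, sample_rate = sf.read(wav_path)       # float array; sample_rate expected to be 16000
    assert sample_rate == 16000, f"expected 16 kHz audio, got {sample_rate} Hz"
    mask = np.array(Image.open(mask_path))          # H x W (or H x W x C) integer mask
    return waveform, mask
```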

Common notational conventions include $V = \{v_i\}_{i=1}^{N_v}$ for visual objects, $Q = \{q_j\}$ for queries, $A$ or $F$ for audio-derived features, and $[\tau_s^{(j)}, \tau_e^{(j)}]$ for temporal localization.

4. Supported Tasks and Benchmarking Protocols

Audio-based grounding datasets are usually designed to support one or more of the following tasks:

  • Temporal video grounding: Localization of the video fragment corresponding to a spoken or audio query, with evaluation via $R@K$ (recall at $K$ proposals) at IoU thresholds $m \in \{0.3, 0.5, 0.7\}$ and mean IoU (mIoU) (Xia et al., 2022, Soldan et al., 2021); a metric sketch follows this list.
  • Spatial segmentation or detection: Pixel-level or region-level mask prediction given an audio cue, evaluated via mIoU, F-score, or object-level mAP (Santos et al., 27 Nov 2025, Lu et al., 1 Dec 2025).
  • Simultaneous mixed-audio grounding: Two or more overlapping sources (e.g., speech plus sound), with separate segmentation heads and disentanglement loss (Extended-IS3) (Ryu et al., 24 Mar 2025).
  • 3D grounding/localization: Audio-conditioned retrieval of 3D bounding boxes or proposals, evaluated by accuracy at IoU thresholds of 0.25 and 0.5 and by top-1 accuracy (Cao-Dinh et al., 1 Jul 2025).
  • Object recognition and weakly supervised grounding: Classification of the object mentioned in paired speech and image, and multimodal retrieval (Moriya et al., 2019).
  • Audio-visual question answering: Open-ended QA with audio-aware questions, measured by accuracy, exact match, and F1 metrics (Sagare et al., 21 Jul 2024).
  • Pedestrian or event detection: Audio-based presence prediction aligned with visual frames, with balanced accuracy in noise-dominant environments (Kim et al., 23 Sep 2025).
  • Affordance segmentation: Given an action sound, segment out the functional and dependency regions in the corresponding object image; performance is measured by mIoU and F-score (Lu et al., 1 Dec 2025).
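
For reference, the temporal-grounding metrics in the first bullet (temporal IoU, $R@K$ at threshold $m$, and mIoU) can be computed as in the sketch below; proposals and ground truths are assumed to be (start, end) pairs in seconds, ranked by model confidence.

```python
from typing import List, Tuple

Interval = Tuple[float, float]  # (start, end) in seconds

def temporal_iou(pred: Interval, gt: Interval) -> float:
    """IoU between two temporal intervals."""
    inter = max(0.0, min(pred[1], gt[1]) - max(pred[0], gt[0]))
    union = (pred[1] - pred[0]) + (gt[1] - gt[0]) - inter
    return inter / union if union > 0 else 0.0

def recall_at_k(ranked_proposals: List[List[Interval]], gts: List[Interval],
                k: int, m: float) -> float:
    """R@K at IoU threshold m: fraction of queries whose top-K proposals
    include at least one interval with IoU >= m against the ground truth."""
    hits = sum(any(temporal_iou(p, gt) >= m for p in props[:k])
               for props, gt in zip(ranked_proposals, gts))
    return hits / len(gts)

def mean_iou(ranked_proposals: List[List[Interval]], gts: List[Interval]) -> float:
    """mIoU of the top-ranked proposal for each query."""
    return sum(temporal_iou(props[0], gt)
               for props, gt in zip(ranked_proposals, gts)) / len(gts)
```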

Loss functions include binary cross-entropy (for masks or framewise activity), $L_{IoU}$, cross-entropy for classification, ranking or alignment losses, and specific disentanglement objectives for mixed-audio domains.
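
Two of these objectives, binary cross-entropy over a predicted mask and one common form of $L_{IoU}$ (a soft IoU loss), are sketched below in NumPy for clarity; actual systems implement them in their training frameworks and combine them with dataset-specific weights.

```python
import numpy as np

def bce_loss(pred: np.ndarray, target: np.ndarray, eps: float = 1e-7) -> float:
    """Binary cross-entropy: pred holds probabilities in (0, 1), target holds {0, 1} labels."""
    pred = np.clip(pred, eps, 1.0 - eps)
    return float(-np.mean(target * np.log(pred) + (1 - target) * np.log(1 - pred)))

def soft_iou_loss(pred: np.ndarray, target: np.ndarray, eps: float = 1e-7) -> float:
    """One common form of L_IoU: 1 minus the soft intersection-over-union of the masks."""
    inter = np.sum(pred * target)
    union = np.sum(pred) + np.sum(target) - inter
    return float(1.0 - (inter + eps) / (union + eps))
```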

5. Data Access, Licensing, and Limitations

Access protocols and licensing regimes vary across these resources.

6. Significance, Research Impact, and Open Challenges

Audio-based grounding datasets have catalyzed advances in:

  • End-to-end learning: They enabled direct training from raw audio without textual intermediates, increasing the robustness of multimodal systems to linguistic and acoustic variability (Santos et al., 27 Nov 2025).
  • Sim-to-real evaluation: Standardized impact/affordance datasets (RealImpact, AV-AG) provide calibration testbeds bridging simulation and real-world sound spatialization (Clarke et al., 2023).
  • Zero-shot and generalization analysis: Explicit “unseen” splits facilitate probing of model extrapolation to novel objects or affordances, which is rarely covered in canonical AVS or 3DVG settings (Lu et al., 1 Dec 2025).
  • New tasks: Datasets such as Extended-IS3 support joint and disentangled visual grounding from mixed audio, and Audio-VisQA introduces multi-turn, audio-attentive, open-ended question answering (Ryu et al., 24 Mar 2025, Sagare et al., 21 Jul 2024).

Outstanding challenges include scaling multimodal coverage, refining annotation for real-world and noisy conditions, and extending beyond segmentation or bounding box formats to richer scene graphs or narrative-based grounding. The convergence of large-scale datasets, audio-visual foundation models, and nuanced benchmarking will continue to drive the development of robust, generalizable audio-based grounding systems.
