
SpokenCOCO2Mix and SpokenCOCO3Mix Datasets

Updated 13 September 2025
  • The paper introduces SpokenCOCO2Mix and SpokenCOCO3Mix, offering controlled evaluation of target speaker retrieval in overlapping speech environments.
  • The methodology integrates a TSRE module with speaker-aware extraction to significantly improve Recall@1 performance over non-conditioned baselines.
  • The findings underscore the datasets' impact on real-world applications such as assistive robotics, smart devices, and secure access control.

SpokenCOCO2Mix and SpokenCOCO3Mix are large-scale, synthetic multimodal datasets specifically constructed for the evaluation of speech-image retrieval systems in multi-speaker environments. Both datasets build on the foundational SpokenCOCO corpus, augmenting it with LibriMix-derived overlapped speech from two or three speakers per utterance, respectively. They are central testbeds in the development of target-speaker-aware retrieval frameworks, enabling rigorous assessment of models that integrate target speaker extraction, such as those employing the Target Speaker Retrieval Extractor (TSRE) module, in scenarios involving overlapping speech.

1. Dataset Construction and Structure

SpokenCOCO2Mix and SpokenCOCO3Mix are engineered to simulate realistic speech mixtures where multiple voices overlap in the audio channel, closely mirroring naturalistic "cocktail party" conditions as encountered in public, home, or meeting environments. Both datasets are derived from the original SpokenCOCO collection, which associates images with clean, single-speaker utterances. The multi-speaker variants employ the LibriMix codebase for the systematic generation of audio mixtures.
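The exact overlap configuration follows the LibriMix codebase; as a rough illustration of the underlying procedure only, the sketch below sums two single-speaker caption waveforms into one mixture. The gain offsets, padding policy, and function names are illustrative assumptions rather than the datasets' actual recipe.

```python
import numpy as np

def mix_utterances(sources, gains_db=None, mode="max"):
    """Overlap single-speaker caption waveforms into one mixture.

    sources  : list of 1-D float arrays (same sample rate assumed)
    gains_db : optional per-source gain offsets in dB (relative loudness)
    mode     : "max" pads every source to the longest one (LibriMix-style),
               anything else truncates to the shortest
    """
    if gains_db is None:
        gains_db = [0.0] * len(sources)

    length = max(len(s) for s in sources) if mode == "max" else min(len(s) for s in sources)
    mixture = np.zeros(length, dtype=np.float64)

    for src, gain_db in zip(sources, gains_db):
        src = src[:length]
        padded = np.zeros(length, dtype=np.float64)
        padded[: len(src)] = src * (10.0 ** (gain_db / 20.0))  # apply relative gain
        mixture += padded

    # Rescale to avoid clipping after summation
    peak = np.max(np.abs(mixture))
    if peak > 1.0:
        mixture /= peak
    return mixture

# Example: two captions read by different speakers, the second 3 dB quieter
rng = np.random.default_rng(0)
caption_a = rng.standard_normal(16000) * 0.1   # stand-in for a real waveform
caption_b = rng.standard_normal(24000) * 0.1
two_mix = mix_utterances([caption_a, caption_b], gains_db=[0.0, -3.0])
```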

| Dataset | Speakers per Utterance | Utterances | Images | Total Hours |
|---|---|---|---|---|
| SpokenCOCO2Mix | 2 | 254,200 | 57,830 | ~368 |
| SpokenCOCO3Mix | 3 | 254,200 | 57,830 | ~397 |

Each sample in these datasets comprises an audio mixture containing two (SpokenCOCO2Mix) or three (SpokenCOCO3Mix) speakers reading independent image captions, paired with a corresponding target image. This setup challenges retrieval models to attend to a specified target speaker, ignoring interfering voices.
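One way to picture a single evaluation item is the record below; the field names are hypothetical, chosen only to make the pairing between mixture, target speaker, and image explicit, and do not reflect the datasets' actual file layout.

```python
from dataclasses import dataclass
from typing import List

@dataclass
class TargetSpeakerRetrievalSample:
    """One hypothetical SpokenCOCO2Mix/3Mix evaluation item."""
    mixture_audio: str               # path to the 2- or 3-speaker overlapped waveform
    target_speaker_id: str           # identity of the speaker the system must attend to
    enrollment_audio: str            # clean utterance used to derive the speaker embedding
    target_image: str                # image paired with the target speaker's caption
    interfering_speakers: List[str]  # speakers whose captions act as interference
```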

2. Role in Target Speaker Speech-Image Retrieval

The principal motivation for SpokenCOCO2Mix and SpokenCOCO3Mix is to provide robust, semantically-indexed evaluation scenarios for the Target Speaker Speech-Image Retrieval task. This task requires systems to associate a given image with a spoken query produced by a target speaker, amidst interfering speech from others. The datasets enable quantitative benchmarking under controlled increases in auditory scene complexity, critical for measuring the efficacy of target speaker extraction mechanisms.

Unlike traditional single-speaker datasets, these mixes ensure that evaluation necessarily requires isolating, or conditioning on, the designated speaker, exposing the limitations of classical and single-speaker retrieval models in the presence of speech overlap.

3. Experimental Protocol and Baseline Performance

Experimental setups on these datasets follow a speaker-aware retrieval protocol. Given a pre-enrolled target speaker embedding (e.g., from ECAPA-TDNN), systems process the audio mixture, aiming to extract and align the target's utterance semantics with corresponding images via a shared embedding space.
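A minimal sketch of this enroll-then-retrieve flow is given below, assuming a pre-computed matrix of image embeddings and a placeholder `speech_encoder` callable that stands in for the speaker-conditioned audio encoder; only the overall protocol is taken from the description above, and the interfaces are assumptions.

```python
import torch
import torch.nn.functional as F

def retrieve_images(mixture, speaker_embedding, speech_encoder, image_embeddings, top_k=1):
    """Speaker-aware speech-to-image retrieval (protocol sketch).

    mixture           : (1, T) waveform tensor containing overlapped speech
    speaker_embedding : (1, D_spk) pre-enrolled target speaker embedding
                        (e.g. from ECAPA-TDNN on a clean enrollment utterance)
    speech_encoder    : callable mapping (mixture, speaker_embedding) -> (1, D);
                        stands in for the speaker-conditioned audio encoder
    image_embeddings  : (N, D) embeddings of candidate images in the shared space
    """
    with torch.no_grad():
        speech_emb = speech_encoder(mixture, speaker_embedding)   # (1, D)
        speech_emb = F.normalize(speech_emb, dim=-1)
        gallery = F.normalize(image_embeddings, dim=-1)
        similarities = speech_emb @ gallery.T                     # (1, N) cosine scores
    return similarities.topk(top_k, dim=-1).indices.squeeze(0)    # indices of best images
```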

Performance metrics focus on Recall@1 and related retrieval measures in both directions (speech-to-image and image-to-speech). Notably, baseline models that lack explicit target speaker conditioning (e.g., using only the mixture waveform) experience severe performance degradation in multi-speaker scenarios:

  • On SpokenCOCO2Mix, a WavLM baseline achieves only 12.6% Recall@1 (speech-to-image).
  • On SpokenCOCO3Mix, the same baseline drops to 4.8% Recall@1.

By contrast, systems integrating the TSRE module significantly improve performance under identical conditions.
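For reference, Recall@1 numbers of this kind are typically computed from a query-by-candidate similarity matrix as sketched below; the routine is generic and not tied to any particular model from the paper.

```python
import torch

def recall_at_k(similarity, k=1):
    """Recall@K for a (num_queries, num_candidates) similarity matrix.

    Assumes the ground-truth candidate for query i sits at column i,
    i.e. queries and candidates are index-aligned.
    """
    num_queries = similarity.size(0)
    topk = similarity.topk(k, dim=1).indices              # (num_queries, k)
    targets = torch.arange(num_queries).unsqueeze(1)      # (num_queries, 1)
    hits = (topk == targets).any(dim=1).float()
    return hits.mean().item()

# Example with random scores; real evaluations use model-produced embeddings
sim = torch.randn(100, 100)
print(f"Recall@1 = {recall_at_k(sim, k=1):.3f}")
```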

4. Integration with Target Speaker-Aware Modeling

SpokenCOCO2Mix and SpokenCOCO3Mix serve as the critical evaluation backbone for models incorporating the TSRE design. The TSRE module employs target speaker conditioning at multiple abstraction levels within the audio encoding pipeline:

  • Speaker-Conditional LayerNorm (SCL): Applies FiLM-style modulation of normalization parameters determined by the target speaker embedding (see the sketch after this list).
  • Speaker-Conditional Convolution (SCC) and SCC-B: Introduce short-term extraction and parameter reduction mechanisms dynamically modulated by speaker cues.
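A minimal sketch of FiLM-style speaker conditioning of normalization parameters, in the spirit of the SCL bullet above, is shown here; the projection layers, dimensions, and class name are assumptions for illustration, not the paper's exact SCL parameterization.

```python
import torch
import torch.nn as nn

class SpeakerConditionalLayerNorm(nn.Module):
    """FiLM-style LayerNorm whose scale/shift are predicted from a speaker embedding."""

    def __init__(self, feature_dim: int, speaker_dim: int):
        super().__init__()
        # Elementwise affine is disabled: gamma/beta come from the speaker instead
        self.norm = nn.LayerNorm(feature_dim, elementwise_affine=False)
        self.to_gamma = nn.Linear(speaker_dim, feature_dim)
        self.to_beta = nn.Linear(speaker_dim, feature_dim)

    def forward(self, hidden, speaker_embedding):
        # hidden: (batch, time, feature_dim); speaker_embedding: (batch, speaker_dim)
        gamma = self.to_gamma(speaker_embedding).unsqueeze(1)   # (batch, 1, feature_dim)
        beta = self.to_beta(speaker_embedding).unsqueeze(1)
        return gamma * self.norm(hidden) + beta

# Example: condition a 768-dim acoustic feature sequence on a 192-dim speaker embedding
scl = SpeakerConditionalLayerNorm(feature_dim=768, speaker_dim=192)
frames = torch.randn(2, 50, 768)
spk = torch.randn(2, 192)
out = scl(frames, spk)   # same shape as frames: (2, 50, 768)
```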

Loss formulations are adapted accordingly. For a mixture $x^K$ and target speaker $u^p$, the conditioned speech embedding is

$$e_{s|u^p} = E'_s(x^K, u^p),$$

with a bidirectional contrastive loss promoting alignment between $e_{s|u^p}$ and the image embedding $e_i$. This setup ensures that retrieval accuracy is specifically attributed to extraction of the correct speaker within the simulated polyphonic context.
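The bidirectional contrastive objective can be sketched as a symmetric InfoNCE loss over in-batch matched pairs of conditioned speech embeddings and image embeddings; the temperature value and batch construction below are illustrative assumptions, not the paper's exact hyperparameters.

```python
import torch
import torch.nn.functional as F

def bidirectional_contrastive_loss(speech_emb, image_emb, temperature=0.07):
    """Symmetric InfoNCE over a batch of matched (speech, image) pairs.

    speech_emb : (B, D) speaker-conditioned speech embeddings e_{s|u^p}
    image_emb  : (B, D) image embeddings e_i; row i of each tensor is a matched pair
    """
    speech_emb = F.normalize(speech_emb, dim=-1)
    image_emb = F.normalize(image_emb, dim=-1)
    logits = speech_emb @ image_emb.T / temperature     # (B, B) similarity matrix
    targets = torch.arange(speech_emb.size(0))
    loss_s2i = F.cross_entropy(logits, targets)         # speech-to-image direction
    loss_i2s = F.cross_entropy(logits.T, targets)       # image-to-speech direction
    return 0.5 * (loss_s2i + loss_i2s)
```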

5. Significance for Multimodal and Real-world Applications

The introduction of SpokenCOCO2Mix and SpokenCOCO3Mix enables rigorous assessment of models for human-computer interaction scenarios where simultaneous speakers are the norm. Real-world applications supported by these datasets include:

  • Assistive Robotics: Robust command following for specific users in conversational or noisy environments.
  • Multimodal Kiosks and Smart Devices: Reliable operation by filtering background or interfering speech.
  • Access Control and Security: Ensuring action is triggered only by authorized speakers amidst competing speech.

By systematically increasing mixture complexity, these datasets allow for ablation studies and controlled scaling, highlighting the limits and benefits of target speaker-aware approaches.

6. Comparative Results and Analysis

Systematic performance evaluation on SpokenCOCO2Mix and SpokenCOCO3Mix demonstrates the necessity and effectiveness of advanced extraction modules. TSRE-equipped models achieve:

  • 36.3% Recall@1 (speech-to-image) on SpokenCOCO2Mix,
  • 29.9% Recall@1 on SpokenCOCO3Mix,

outperforming prior state-of-the-art and CLN-based methods by up to 7.8% in Recall@1, and substantially mitigating multi-speaker degradation relative to single-speaker-trained systems.

7. Technical and Methodological Implications

The design and adoption of SpokenCOCO2Mix and SpokenCOCO3Mix delineate a shift toward evaluation standards that emphasize environmental robustness over idealized, noise-free setups. Their effective utilization demands architectures capable of integrating speaker conditioning, dynamic attention, and contrastive alignment, directly reflecting the requirements of complex real-world audio-visual retrieval. The datasets represent a foundational benchmark for ongoing research in speech-image retrieval under realistic, non-stationary auditory scenes.
