SPEECH-COCO: Multimodal Speech Dataset

Updated 17 June 2026

SPEECH-COCO is a large-scale multimodal dataset that pairs MSCOCO images with over 600,000 spoken captions generated via text-to-speech, complete with detailed temporal annotations.
The dataset is produced by a deterministic, scripted TTS pipeline: disfluency injection and speed perturbation simulate natural speech variability, while the scripted generation keeps the corpus reproducible.
It enables practical applications such as spoken image captioning and multimodal ASR, demonstrating performance gains through integrated vision-speech alignment.

SPEECH-COCO is a large-scale corpus augmenting the original MSCOCO image dataset with over 600,000 visually grounded spoken captions synthesized via text-to-speech. Its design fosters research in multimodal environments by providing aligned audio, text, and vision data with extensive temporal annotations. The dataset has become a canonical resource for language and vision (LaVi) tasks, multimodal machine learning, and speech pattern discovery.

1. Composition and Scale

SPEECH-COCO extends the train+val splits of MSCOCO, encompassing 123,287 images, each paired with at least five human-authored textual captions. Each caption is synthesized into English speech using commercial concatenative TTS voices, resulting in 616,767 unique spoken captions and approximately 604 hours of audio. Both US and UK English variants are represented across eight speaker voices (Paul, Elizabeth, Judith, Bronwen, Phil, Bruce, Amanda, Jenny), each derived from independent 3,000-sentence training corpora to capture inter-speaker variability (Havard et al., 2017).

Key characteristics include:

Mean caption length: 10.79 tokens (words plus punctuation)
Mean audio duration per caption: 3.52 seconds
Caption length distribution: Approximately normal, centered at 8–14 tokens, with a tail extending beyond 20
Audio duration distribution: Peaked at around 3 seconds, spanning 1–7 seconds
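As a quick sanity check, the mean duration can be recovered from the corpus totals quoted above. The snippet below is only arithmetic on those two figures; because the total of 604 hours is itself approximate, a small rounding gap against the reported 3.52 seconds is expected.

```python
# Figures quoted in the text above.
TOTAL_CAPTIONS = 616_767
TOTAL_HOURS = 604  # stated as approximate in the text

# Mean duration implied by the totals, in seconds per caption.
implied_mean_seconds = TOTAL_HOURS * 3600 / TOTAL_CAPTIONS
print(f"implied mean duration: {implied_mean_seconds:.2f} s")
```

The implied value is about 3.53 seconds, in line with the reported mean.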

2. Speech Synthesis and Variability Pipeline

The audio generation pipeline is designed to emulate the naturalness and diversity of human speech:

Synthesis: Speech is generated by the Voxygen concatenative TTS engine, with one of the eight available voices randomly assigned to each caption.
Disfluency Injection: To simulate spontaneous speech, one filler from {“um”, “uh”, “er”, “huh”, “oh”, “ah”} is inserted per caption in 30% of cases, with the position uniformly sampled from start, middle, or end. The expected number of fillers per caption is therefore 0.3, which corresponds to an expected per-token disfluency rate of $\text{DisfluencyRate} = 0.3 / \text{caption\_length}$.
Speed Perturbation: Each synthesized utterance is transformed using sox’s tempo function (pitch preserving) with $f \in \{0.9, 1.0, 1.1\}$ applied with equal probability, modulating playback speed and correspondingly scaling all timecodes:

$\text{duration}_{\text{new}} = \frac{\text{duration}_{\text{orig}}}{f}, \qquad \text{timecode}_{\text{new}} = \frac{\text{timecode}_{\text{orig}}}{f}$
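The rescaling rule can be sketched in a few lines. The helper name `apply_tempo` and the two-word example are hypothetical, and the segment layout (dicts with `start` and `end` in seconds) is assumed for illustration. Only the timecode arithmetic is shown; the audio itself would be processed separately, for example with a command along the lines of `sox in.wav out.wav tempo 1.1`.

```python
def apply_tempo(segments, duration, f):
    """Scale a duration and its timecodes for a tempo factor f (f > 1 is faster)."""
    if f <= 0:
        raise ValueError("tempo factor must be positive")
    scaled = [
        {**seg, "start": seg["start"] / f, "end": seg["end"] / f}
        for seg in segments
    ]
    return scaled, duration / f


# Hypothetical two-word utterance lasting 1.1 s at the original speed.
segments = [
    {"type": "word", "label": "fire", "start": 0.0, "end": 0.55},
    {"type": "word", "label": "hydrant", "start": 0.55, "end": 1.1},
]
faster, new_duration = apply_tempo(segments, 1.1, f=1.1)
print(new_duration, [(s["label"], s["start"], s["end"]) for s in faster])
```

With f = 1.1 the utterance shortens from 1.1 s to about 1.0 s, and every boundary moves earlier by the same factor, so word order and relative proportions are preserved.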

The generation algorithm is deterministic and fully scripted, ensuring reproducibility across the corpus.
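The text does not say how the random choices are made repeatable. One common approach, assumed here purely for illustration, is to seed a random generator from the caption identifier so that the same caption always receives the same voice, speed, and filler. The function `plan_utterance`, the `base_seed` parameter, and the way "middle" is mapped to a word index are all hypothetical.

```python
import random

FILLERS = ["um", "uh", "er", "huh", "oh", "ah"]
SPEEDS = [0.9, 1.0, 1.1]
VOICES = ["Paul", "Elizabeth", "Judith", "Bronwen",
          "Phil", "Bruce", "Amanda", "Jenny"]


def plan_utterance(caption, caption_id, base_seed=0):
    """Choose voice, speed, and an optional filler reproducibly for one caption."""
    # Assumption: a per-caption seed; string seeds are stable across runs.
    rng = random.Random(f"{base_seed}:{caption_id}")
    words = caption.split()
    voice = rng.choice(VOICES)
    speed = rng.choice(SPEEDS)
    disfluency = []
    if rng.random() < 0.3:  # a filler is inserted in 30% of captions
        token = rng.choice(FILLERS)
        position = rng.choice(["start", "middle", "end"])
        index = {"start": 0, "middle": len(words) // 2, "end": len(words)}[position]
        words.insert(index, token)
        disfluency.append({"token": token, "position": position})
    return {
        "speaker": voice,
        "speed": speed,
        "disfluency": disfluency,
        "text": " ".join(words),
    }


plan = plan_utterance("a man riding a wave on a surfboard", caption_id=42)
print(plan)
```

Calling `plan_utterance` twice with the same arguments returns identical plans, which is the property a scripted, reproducible pipeline needs.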

3. Time Alignment and Data Format

For each utterance, a WAV audio file is paired with a structured JSON annotation, with fields:

duration (float)
speaker (string)
synthesisedCaption (string)
wavFilename (string)
imgID, captionID (integers)
speed (float)
disfluency (list of {token, position})
timecode (per-segment with {type: word/syllable/phoneme, label, start, end} in floating-point seconds)

Time alignment is intrinsic: all timecodes are directly output by the TTS synthesizer, without recourse to external forced alignment.
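A minimal sketch of reading one annotation follows. The field names come from the list above, but the values, the filename, and the exact nesting of `timecode` as a flat list of segment objects are assumptions made for illustration; the released files may be laid out differently.

```python
import json

# Hypothetical annotation using the field names listed above.
example = """
{
  "duration": 1.0,
  "speaker": "Judith",
  "synthesisedCaption": "fire hydrant",
  "wavFilename": "example.wav",
  "imgID": 1,
  "captionID": 1,
  "speed": 1.0,
  "disfluency": [],
  "timecode": [
    {"type": "word", "label": "fire", "start": 0.0, "end": 0.4},
    {"type": "syllable", "label": "fire", "start": 0.0, "end": 0.4},
    {"type": "word", "label": "hydrant", "start": 0.4, "end": 1.0}
  ]
}
"""


def segments_of_type(annotation, kind):
    """Return (label, start, end) tuples for one level: word, syllable, or phoneme."""
    return [
        (seg["label"], seg["start"], seg["end"])
        for seg in annotation["timecode"]
        if seg["type"] == kind
    ]


annotation = json.loads(example)
words = segments_of_type(annotation, "word")
print(words)
```

Filtering by `type` keeps the word, syllable, and phoneme tiers separate, which is convenient when exporting to tiered formats.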

4. Availability, Splits, and Licensing

Data splits mirror MSCOCO, subdivided as:

Training: 82,783 images × 5 → 413,915 spoken captions (nominal, at five captions per image)
Validation: 40,504 images × 5 → 202,520 spoken captions (nominal, at five captions per image)
Test: Not released (due to MSCOCO protocol)
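The split arithmetic can be checked against the corpus total of 616,767 captions reported in Section 1. The snippet uses only figures from this document; the interpretation of the difference is an inference, not something the source states.

```python
TRAIN_IMAGES = 82_783
VAL_IMAGES = 40_504
CAPTIONS_PER_IMAGE = 5
REPORTED_TOTAL = 616_767  # total spoken captions quoted in Section 1

total_images = TRAIN_IMAGES + VAL_IMAGES
nominal_total = total_images * CAPTIONS_PER_IMAGE
surplus = REPORTED_TOTAL - nominal_total

print(total_images)   # 123287
print(nominal_total)  # 616435
print(surplus)        # 332
```

The image counts add up to the 123,287 images cited earlier, while the reported caption total exceeds the five-per-image figure by 332. That gap suggests a small number of images carry more than five captions.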

An additional subset of 10,000 captions (~10 hours) exists for benchmarking and preliminary experiments, notably in unsupervised term discovery tasks. Download and licensing information is specified on Zenodo: https://zenodo.org/record/4282267 (Havard et al., 2017).

5. Research Applications and Empirical Use

SPEECH-COCO facilitates a spectrum of LaVi and multimodal research including:

Spoken image captioning and image-to-speech systems
Visually grounded unsupervised term discovery (UTD)
Speech-to-image and text-to-speech retrieval
Joint speech–image representation learning
Speech-based visual question answering
Linguistic documentation simulations (speech elicitation from images)
Augmentation of low-resource speech corpora with systematically generated multimodal data

A preliminary UTD study (10,000 captions, about 10 hours, using ZRTools and TDE) reported, at a DTW threshold of 0.86, a Normalized Edit Distance (NED) of 24.7, coverage of 8.9%, matching precision of 33.2% (recall 0.1%), and clustering F1 of 24.0%. These figures indicate that the synthetic speech is challenging for UTD: clusters show strong speaker specificity, low recall, and limited coverage, although common n-grams (e.g., “a man riding”, “fire hydrant”) are discoverable (Havard et al., 2017).
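For readers unfamiliar with NED, the sketch below shows the usual idea: the Levenshtein distance between the phoneme transcriptions of two matched fragments, divided by the length of the longer one, then averaged over discovered pairs. This is a generic illustration, not the TDE implementation, and the phoneme sequences are hypothetical.

```python
def edit_distance(a, b):
    """Levenshtein distance between two sequences (insert, delete, substitute)."""
    prev = list(range(len(b) + 1))
    for i, x in enumerate(a, start=1):
        curr = [i]
        for j, y in enumerate(b, start=1):
            cost = 0 if x == y else 1
            curr.append(min(prev[j] + 1,          # deletion
                            curr[j - 1] + 1,      # insertion
                            prev[j - 1] + cost))  # substitution or match
        prev = curr
    return prev[-1]


def normalized_edit_distance(a, b):
    """Edit distance divided by the longer sequence length (0.0 for two empty inputs)."""
    longest = max(len(a), len(b))
    return edit_distance(a, b) / longest if longest else 0.0


# Hypothetical phoneme transcriptions of two discovered fragments.
pair = (["f", "ay", "er"], ["f", "ay", "r"])
print(normalized_edit_distance(*pair))
```

A value of 0 means the two fragments have identical transcriptions, and 1 means they share nothing, so lower NED indicates purer matches.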

6. Dataset Integration, Preprocessing, and Best Practices

Researchers are advised to:

Exploit the alignment between speech, text, and vision via MSCOCO image IDs and provided splits
Use available Python scripts for data querying, filtering, and export (SQLite, Praat TextGrids)
Apply standard speech and audio toolkits (e.g., Kaldi or HTK for ASR pipelines, librosa for audio analysis) for speech feature extraction
For multimodal embedding, process speech features with CNN/seq2seq/transformer models co-trained with visual features
Preprocess audio by resampling (e.g., to 16 kHz), loudness normalization, and silence trimming
Optionally, reverse speed perturbations when precise TTS timecodes are necessary
Extract MFCC or log-Mel filterbank (FBANK) features for downstream tasks, and use the JSON timecodes as a reference segmentation when evaluating forced aligners
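Two of these steps reduce to simple timecode arithmetic: mapping perturbed timecodes back to the original tempo, and converting segment boundaries into sample indices for slicing audio. The sketch below assumes 16 kHz audio and segments stored as dicts with `start` and `end` in seconds; the helper names are hypothetical.

```python
SAMPLE_RATE = 16_000  # assumed target rate after resampling


def undo_tempo(segments, speed):
    """Map timecodes of a speed-perturbed utterance back to the original tempo."""
    return [
        {**seg, "start": seg["start"] * speed, "end": seg["end"] * speed}
        for seg in segments
    ]


def to_sample_ranges(segments, sample_rate=SAMPLE_RATE):
    """Convert segment boundaries in seconds to (label, first, last) sample indices."""
    return [
        (seg["label"],
         round(seg["start"] * sample_rate),
         round(seg["end"] * sample_rate))
        for seg in segments
    ]


# Hypothetical word segments from an utterance stored with speed = 1.1.
perturbed = [
    {"type": "word", "label": "fire", "start": 0.0, "end": 0.5},
    {"type": "word", "label": "hydrant", "start": 0.5, "end": 1.0},
]
print(to_sample_ranges(perturbed))
print(undo_tempo(perturbed, speed=1.1))
```

The sample ranges can be used to slice a waveform array directly, while `undo_tempo` applies the inverse of the scaling rule from Section 2, multiplying by the speed factor instead of dividing.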

7. Role and Outcomes in Downstream Multimodal ASR

SPEECH-COCO, also referred to as SpokenCOCO, is employed in recent multimodal speech recognition systems such as VHASR (Hu et al., 2024), which uses the synthetic utterances as the primary resource for audio–caption pairing. In VHASR, each MSCOCO image is directly paired with its five aligned speech captions without additional re-recording or synthesis. The TTS-generated (16 kHz) WAV files are used as-is for both training and evaluation.

VHASR's results on COCO show that leveraging the vision–speech alignment in SPEECH-COCO yields performance gains over unimodal ASR: merging its dual-stream outputs achieves a Word Error Rate (WER) of 9.59%–9.61% versus 10.44% for the unimodal baseline. In corruption experiments, VHASR retains higher recovery rates under noise due to the robust multimodal structure. These results confirm SPEECH-COCO's viability as a benchmark for advancing multimodal ASR architectures by supporting reproducible large-scale experiments with well-defined label and time-marked supervision (Hu et al., 2024).
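To put the reported gap in proportion, the absolute WER figures can be turned into a relative error reduction. The snippet below is arithmetic on the numbers quoted above and adds no new results.

```python
def relative_reduction(baseline, system):
    """Fraction of the baseline error removed by the system."""
    return (baseline - system) / baseline


BASELINE_WER = 10.44  # unimodal baseline quoted above, in percent
for wer in (9.59, 9.61):
    gain = 100 * relative_reduction(BASELINE_WER, wer)
    print(f"WER {wer:.2f}% -> {gain:.1f}% relative reduction")
```

The improvement of roughly 0.8 absolute points corresponds to about an 8% relative reduction in word errors.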

References

Havard et al. (2017). SPEECH-COCO: 600k Visually Grounded Spoken Captions Aligned to MSCOCO Data Set.

Hu et al. (2024). VHASR: A Multimodal Speech Recognition System With Vision Hotwords.
