
Google Speech Commands v2 Corpus

Updated 11 November 2025
  • Google Speech Commands v2 is a large-scale dataset for limited-vocabulary speech recognition comprising 105,829 utterances recorded under diverse acoustic conditions.
  • It features rigorous data collection, quality control, standardized splits, and balanced classes including core commands, digits, unknown words, and background noise.
  • The corpus supports multiple feature extraction methods like MFCCs and spectrograms and provides baseline evaluations for both closed-set and streaming tasks.

The Google Speech Commands version 2 (GSC v2) corpus is a large-scale, publicly available dataset designed for training and evaluation of limited-vocabulary speech recognition and keyword spotting systems. Originating from the work of Warden (2018), GSC v2 defines rigorous protocols for data collection, speaker diversity, annotation, data splits, feature extraction, and baseline evaluation to facilitate reproducible and comparable research in command word recognition and robust deployment in real-world, noisy environments.

1. Dataset Composition and Classes

GSC v2 comprises 105,829 labeled utterances spanning multiple keyword categories, as well as dedicated background noise audio assets. The vocabulary is partitioned as follows:

  • Numeric Digits (10 Words): zero, one, two, three, four, five, six, seven, eight, nine
  • Core Commands (15 Words): yes, no, up, down, left, right, on, off, stop, go, backward, forward, follow, learn, visual
  • Unknown/Auxiliary Words (10 Words): bed, bird, cat, dog, happy, house, marvin, sheila, tree, wow
  • Background Noise ("silence"): six long-form WAV files (each roughly one minute or longer) in a dedicated _background_noise_ folder

Each labeled utterance is a short audio clip (≤1 s) stored as a single-channel, 16-bit PCM WAV sampled at 16,000 Hz. The dataset occupies roughly 3.8 GB uncompressed and roughly 2.7 GB as a gzip-compressed tar archive. Speaker diversity was achieved by recruiting 2,618 unique speakers, each contributing at most one recording session, captured with consumer-grade laptop or phone microphones through the WebAudio API (Chrome, Firefox, Android). Environmental diversity comes from recording in real-world acoustic contexts (typically rooms with doors closed, including ambient natural noise) and from the supplied background noise files (e.g., running water, machinery, synthesized white/pink noise).
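As an illustration of the per-clip format, the following standard-library sketch reads a WAV header and checks it against the documented properties; the example path is a placeholder, not a guaranteed filename.

```python
import wave


def describe_clip(path: str) -> dict:
    """Read a corpus WAV header and report the documented format fields."""
    with wave.open(path, "rb") as wav:
        return {
            "sample_rate_hz": wav.getframerate(),                 # expected: 16000
            "channels": wav.getnchannels(),                       # expected: 1 (mono)
            "bits_per_sample": wav.getsampwidth() * 8,            # expected: 16
            "duration_s": wav.getnframes() / wav.getframerate(),  # expected: <= 1.0
        }


# Example (placeholder path inside the extracted archive):
# describe_clip("path/to/yes/0a7c2a8d_nohash_0.wav")
```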

The following table summarizes utterance counts for representative classes:

Word     | #Utterances | Word   | #Utterances
---------|-------------|--------|------------
backward | 1,664       | five   | 4,052
bed      | 2,014       | follow | 1,579
bird     | 2,064       | four   | 3,728
cat      | 2,031       | go     | 3,880
dog      | 2,128       | learn  | 1,575
yes      | 4,044       | zero   | 4,052

This table is excerpted; full per-word statistics appear in the corpus documentation.

2. Data Collection and Quality Control

All utterances follow a standard protocol: sampling rate 16,000 Hz, 16-bit encoding, with a target duration of 1 second or less. Each recording session presents 135 prompts to the speaker (core words repeated 5 times, auxiliary keywords once).

Automatic quality control includes:

  1. Deletion of OGG files smaller than 5 KB (too silent/short)
  2. Transcoding to 16 kHz WAV via ffmpeg
  3. Mean-amplitude filtering: clips whose mean absolute sample amplitude (on a [–1, 1] scale) falls below 0.004 are discarded
  4. Extraction and centering of the loudest contiguous 1 s segment (steps 3 and 4 are sketched below)
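Steps 3 and 4 can be illustrated with the following NumPy sketch; the 0.004 threshold and the 1 s target length come from the list above, while the function names and the cumulative-sum energy scan are illustrative choices rather than the corpus's exact tooling.

```python
import numpy as np

SAMPLE_RATE = 16_000          # corpus-wide sampling rate
AMPLITUDE_THRESHOLD = 0.004   # mean |x| cutoff from the QC description


def passes_amplitude_filter(x: np.ndarray) -> bool:
    """Step 3: keep a clip only if its mean absolute amplitude (on a
    [-1, 1] scale) reaches the 0.004 threshold."""
    return float(np.mean(np.abs(x))) >= AMPLITUDE_THRESHOLD


def loudest_one_second(x: np.ndarray, sr: int = SAMPLE_RATE) -> np.ndarray:
    """Step 4 (illustrative): slide a 1 s window over the clip and return
    the window with the highest total energy, zero-padding short clips."""
    win = sr  # one second of samples
    if len(x) <= win:
        pad = win - len(x)
        return np.pad(x, (pad // 2, pad - pad // 2))
    energy = np.cumsum(np.square(x.astype(np.float64)))
    # Total energy of each length-`win` window, for every start offset.
    window_energy = energy[win - 1:] - np.concatenate(([0.0], energy[:-win]))
    start = int(np.argmax(window_energy))
    return x[start:start + win]
```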

Manual verification is implemented as a single-worker, crowdsourced transcription task, with any utterance showing a label-transcription mismatch excluded from the release.

3. Organization, Naming, and Versioning

Data organization is structured for scalable reproducibility:

  • Folders: Each word label corresponds to a subfolder; background noise in _background_noise_
  • Files: Each WAV is named {speakerID}_nohash_{instanceID}.wav, where speakerID is an 8-digit hexadecimal hash of the contributor and instanceID distinguishes repeated utterances of the same word by that speaker
  • Splits: validation_list.txt and testing_list.txt explicitly index validation/test data
  • Speaker-independence: All of a given speaker’s utterances reside in the same data split by construction, because the split is determined by hashing the speakerID portion of the filename (everything before _nohash_), as sketched below
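To make the speaker-independence guarantee concrete, the sketch below assigns a file to a partition by hashing its speakerID prefix, in the style of the split tooling distributed with the corpus; the 10%/10% percentages and the bucket constant are illustrative defaults.

```python
import hashlib
import os
import re

MAX_NUM_WAVS_PER_CLASS = 2 ** 27 - 1  # large bucket count for stable percentages


def which_set(filename: str, validation_percentage: float = 10.0,
              testing_percentage: float = 10.0) -> str:
    """Assign a clip to 'training', 'validation', or 'testing'.

    The speaker ID (everything before '_nohash_') is hashed, so every
    utterance from one speaker lands in the same partition regardless of
    how many clips they contributed or which words they spoke.
    """
    base_name = os.path.basename(filename)
    speaker_id = re.sub(r"_nohash_.*$", "", base_name)
    hash_hex = hashlib.sha1(speaker_id.encode("utf-8")).hexdigest()
    percentage_hash = ((int(hash_hex, 16) % (MAX_NUM_WAVS_PER_CLASS + 1))
                       * (100.0 / MAX_NUM_WAVS_PER_CLASS))
    if percentage_hash < validation_percentage:
        return "validation"
    elif percentage_hash < validation_percentage + testing_percentage:
        return "testing"
    return "training"


# Example: which_set("yes/0a7c2a8d_nohash_0.wav") -> one of the three partitions
```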

Key differences from version 1 to version 2:

  • Five new command words: backward, forward, follow, learn, visual
  • Utterances increased from 64,727 to 105,829
  • Speaker count increased from 1,881 to 2,618
  • Introduction of a standard streaming audio test set (1 hour, mixed commands plus noise)
  • Enhanced quality control and deduplication through browser cookie tracking

4. Preprocessing and Feature Extraction

The preprocessing pipeline enforces acoustical and statistical consistency:

  • Resampling: All audio normalized to 16 kHz, 16-bit PCM WAV
  • Silence Removal: Clips with low mean amplitude are excluded
  • Alignment: The loudest 1 s window is extracted per utterance
  • Optional Augmentation: Random time-shifts (±100 ms), mixing with background noise files, and volume scaling (sketched below)
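The augmentation step could look like the following NumPy sketch, assuming clip and noise have already been loaded as float arrays scaled to [–1, 1]; the ±100 ms shift comes from the list above, while the noise gain and volume range are illustrative defaults, not values prescribed by the corpus.

```python
import numpy as np

SAMPLE_RATE = 16_000
rng = np.random.default_rng(0)


def augment(clip: np.ndarray, noise: np.ndarray,
            max_shift_ms: int = 100,        # ±100 ms, as described above
            noise_gain: float = 0.1,        # illustrative mixing level
            volume_range=(0.8, 1.2)) -> np.ndarray:
    """Apply random time-shift, background-noise mixing, and volume scaling."""
    out = np.copy(clip)

    # 1. Random time-shift within ±max_shift_ms, padding the gap with zeros.
    shift = int(rng.integers(-max_shift_ms, max_shift_ms + 1)) * SAMPLE_RATE // 1000
    out = np.roll(out, shift)
    if shift > 0:
        out[:shift] = 0.0
    elif shift < 0:
        out[shift:] = 0.0

    # 2. Mix in a random crop of a (long) background-noise file.
    start = int(rng.integers(0, max(1, len(noise) - len(out))))
    out = out + noise_gain * rng.random() * noise[start:start + len(out)]

    # 3. Random volume scaling, then clip back to the [-1, 1] range.
    out = out * rng.uniform(*volume_range)
    return np.clip(out, -1.0, 1.0)
```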

For feature extraction, two canonical representations are used in the baseline benchmarks:

  • Spectrograms: 30 ms Hann window, 10 ms stride
  • MFCCs: 40 bands, with optional delta and delta-delta coefficients

In the recommended pipeline, all examples are normalized per utterance to zero mean and unit variance (see the sketch below).
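A brief sketch of the MFCC front end under the 30 ms window / 10 ms stride / 40-coefficient configuration described above; the use of librosa (and its 512-point FFT) is an illustrative choice, not something mandated by the corpus.

```python
import librosa
import numpy as np

SAMPLE_RATE = 16_000
WIN_LENGTH = int(0.030 * SAMPLE_RATE)   # 30 ms Hann window -> 480 samples
HOP_LENGTH = int(0.010 * SAMPLE_RATE)   # 10 ms stride      -> 160 samples


def mfcc_features(path: str) -> np.ndarray:
    """Return per-utterance normalized 40-dimensional MFCCs, shape (frames, 40)."""
    y, _ = librosa.load(path, sr=SAMPLE_RATE, mono=True)
    mfcc = librosa.feature.mfcc(
        y=y, sr=SAMPLE_RATE, n_mfcc=40,
        n_fft=512, hop_length=HOP_LENGTH,
        win_length=WIN_LENGTH, window="hann",
    )
    # Optional: append librosa.feature.delta(mfcc) and its second order for
    # the delta / delta-delta variant mentioned above.
    # Per-utterance mean/variance normalization, as recommended above.
    mfcc = (mfcc - mfcc.mean(axis=1, keepdims=True)) / (
        mfcc.std(axis=1, keepdims=True) + 1e-8)
    return mfcc.T
```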

5. Training, Validation, and Test Set Construction

Dataset partitioning ensures strict speaker-independence and reproducibility:

  • Training Set: All files not indexed in the validation or test lists (≈84,843 utterances)
  • Validation Set: 9,981 utterances for hyperparameter optimization
  • Test Set: 11,005 utterances for final reporting

Hashing the speaker-ID portion of the filename ensures consistent assignment across versions and releases. All samples from a given speaker reside within a single partition.
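A minimal sketch of materializing the three partitions from the released list files, assuming the folder layout described in Section 3; entries in validation_list.txt and testing_list.txt are relative paths of the form label/filename.wav.

```python
import os


def load_splits(root: str) -> dict:
    """Partition every labeled WAV under `root` into train/val/test using
    the released validation_list.txt and testing_list.txt indices."""
    def read_list(name: str) -> set:
        with open(os.path.join(root, name)) as f:
            return {line.strip() for line in f if line.strip()}

    validation = read_list("validation_list.txt")
    testing = read_list("testing_list.txt")

    splits = {"training": [], "validation": [], "testing": []}
    for dirpath, _, filenames in os.walk(root):
        label = os.path.basename(dirpath)
        # Skip the corpus root and the _background_noise_ folder.
        if dirpath == root or label.startswith("_"):
            continue
        for fn in filenames:
            if not fn.endswith(".wav"):
                continue
            rel = f"{label}/{fn}"
            if rel in validation:
                splits["validation"].append(rel)
            elif rel in testing:
                splits["testing"].append(rel)
            else:
                splits["training"].append(rel)
    return splits
```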

6. Evaluation Protocols and Baselines

GSC v2 prescribes a closed-set 12-class classification task for system evaluation:

  • Task Classes: the 10 core command words (yes, no, up, down, left, right, on, off, stop, go), one "unknown" class (random samples drawn from the remaining non-target words), and one "silence" class (clips extracted from the background-noise recordings).
  • Distribution: Balanced across classes (≈8.3% per class).

The primary metric is Top-One Accuracy:

\text{Accuracy} = \frac{1}{N} \sum_{i=1}^{N} \mathbf{1}(\hat{y}_i = y_i)

where $\hat{y}_i$ is the system’s prediction and $y_i$ is the ground-truth label. For streaming tasks (continuous 10 min/1 h audio), metrics include:

  • Matched-percentage: Proportion of correct detections within ±750 ms of the ground-truth time
  • Wrong-percentage: False alarms where the predicted class is incorrect
  • False-positive percentage: Detections in non-speech segments

Baseline model architecture (termed "conv") consists of two convolutional layers (first: 64 filters, 20×8; second: 64 filters, 10×4; both ReLU), a 128-unit fully connected layer with 0.5 dropout, and a softmax output over 12 classes. The feature input is a 40-dimensional MFCC vector per frame (≈100 frames per 1 s utterance). Optimization employed stochastic gradient descent (initial learning rate 0.001, decayed to 0.0001), batch size 100, and 18,000 training steps.
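A minimal Keras sketch in the spirit of the conv baseline described above; the filter shapes, dropout rate, hidden width, and class count follow the text, while the input shape (98×40 MFCC frames), "same" padding, and single 2×2 max-pool are assumptions rather than a verbatim copy of the reference implementation.

```python
import tensorflow as tf

NUM_CLASSES = 12          # 10 commands + "unknown" + "silence"
FRAMES, N_MFCC = 98, 40   # ~1 s of 10 ms-stride frames, 40 MFCCs per frame


def build_conv_baseline() -> tf.keras.Model:
    """Small convolutional keyword-spotting model following the description above."""
    return tf.keras.Sequential([
        tf.keras.Input(shape=(FRAMES, N_MFCC, 1)),
        tf.keras.layers.Conv2D(64, (20, 8), activation="relu", padding="same"),
        tf.keras.layers.MaxPooling2D((2, 2)),   # pooling placement is an assumption
        tf.keras.layers.Conv2D(64, (10, 4), activation="relu", padding="same"),
        tf.keras.layers.Flatten(),
        tf.keras.layers.Dense(128, activation="relu"),
        tf.keras.layers.Dropout(0.5),
        tf.keras.layers.Dense(NUM_CLASSES, activation="softmax"),
    ])


model = build_conv_baseline()
model.compile(
    optimizer=tf.keras.optimizers.SGD(learning_rate=1e-3),  # later decayed to 1e-4
    loss="sparse_categorical_crossentropy",
    metrics=["accuracy"],
)
```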

Top-one accuracy of the conv baseline on the 12-class task:

Training Data | V1 Test Accuracy (%) | V2 Test Accuracy (%)
--------------|----------------------|---------------------
V1 training   | 85.4                 | 82.7
V2 training   | 89.7                 | 88.2

For streaming evaluation on the V2 streaming set:

  • Matched: 49.0%
  • Correct (within-class): 46.0%
  • Wrong: 3.0%
  • False positives: 0.0%

7. Recommendations, Licensing, and Resources

Data augmentation via volume perturbation, time-shifting, and noise mixing (using the supplied _background_noise_ files) is strongly advised for robust training. The baseline feature pipeline uses 40-band MFCC or mel-spectrogram extraction with per-utterance normalization. Training should employ balanced sampling of "unknown" and "silence" examples, and validation performance should be monitored on the provided validation split. Researchers are encouraged to use the provided validation/testing lists and the streaming test audio to facilitate direct comparison.

The corpus is released under the Creative Commons Attribution 4.0 (CC BY 4.0) license. Citation instructions and official resources are given in Warden (2018). All baseline code, split indices, and streaming evaluation tools are distributed via the TensorFlow Speech Commands repository.

References

Warden, P. (2018). Speech Commands: A Dataset for Limited-Vocabulary Speech Recognition. arXiv:1804.03209.
