
Google Speech Commands v2 Benchmark

Updated 27 December 2025
  • Google Speech Commands v2 is a large-scale benchmark for spoken keyword recognition, featuring 35 command classes and fixed train, validation, and test splits.
  • It supports augmentation methods such as SpecAugment and ImportantAug to improve model robustness, and serves as a standard testbed for energy-efficient models across diverse deployment scenarios.
  • Evaluations on GSC v2 cover TCNs, spiking neural networks, and attention-based models, with reported accuracies up to 96.92% and inference optimized for edge devices.

Google Speech Commands Version 2 (GSC v2) is a large-scale, open benchmark for spoken keyword recognition, widely used for research in keyword spotting (KWS), speech command classification, data augmentation strategies, efficient deep models, and spiking neural networks. The corpus and its fixed splits are commonly employed as a reference testbed for small-footprint and energy-efficient speech recognition models, supporting the development and rigorous evaluation of novel architectures, augmentation methods, and robust inference protocols.

1. Dataset Composition and Preprocessing

GSC v2 comprises 35 spoken-word command classes such as "yes", "no", "up", "down", "left", "right", and additional common words, sampled at 16 kHz with each utterance capped at 1 s duration. The dataset provides fixed splits: approximately 84,843 training, 9,981 validation (“development”), and 11,005 test utterances, for a total of 105,829 labeled examples. Each class typically contains about 3,000–3,200 examples in the training split. Audio is distributed as 1 s 16-bit PCM mono WAV files.
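
For reference, the fixed splits can be loaded directly through torchaudio's dataset wrapper; the sketch below is one common way to do this (the root directory and download flag are illustrative choices, not part of the benchmark specification).

```python
# Sketch: loading the fixed GSC v2 splits via torchaudio's built-in dataset.
from torchaudio.datasets import SPEECHCOMMANDS

# subset can be "training", "validation", or "testing"; the official
# validation_list.txt / testing_list.txt files define the fixed splits.
train_set = SPEECHCOMMANDS(root="./data", url="speech_commands_v0.02",
                           subset="training", download=True)
val_set = SPEECHCOMMANDS(root="./data", url="speech_commands_v0.02",
                         subset="validation", download=True)
test_set = SPEECHCOMMANDS(root="./data", url="speech_commands_v0.02",
                          subset="testing", download=True)

waveform, sample_rate, label, speaker_id, utterance_number = train_set[0]
print(len(train_set), len(val_set), len(test_set))  # expected ≈ 84,843 / 9,981 / 11,005
print(label, sample_rate, waveform.shape)           # 16 kHz mono, at most 16,000 samples
```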

Feature extraction for neural models typically involves Mel spectrograms (e.g., 140-bin Mel-spectrograms, Hamming/Hann window 25–32 ms, hop 8–10 ms, yielding 100–126 time frames) or log-amplitude STFT spectrograms (257 bins × 126 frames). MFCCs (40 bins, frame shift 10 ms) are also used in KWS; feature pipelines are standardized for reproducibility (Trinh et al., 2021, Wang et al., 11 Nov 2025, Wang et al., 17 Dec 2024, Wang et al., 2021).
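
As an illustration, a front end in this style can be assembled from standard torchaudio transforms; the parameter values below follow the ranges quoted above and are not tied to any single cited paper.

```python
# Sketch: typical GSC v2 front ends (values follow the ranges quoted above).
import torch
import torchaudio

SAMPLE_RATE = 16000

# 140-bin Mel spectrogram, 32 ms Hann window, 10 ms hop -> ~100 frames per 1 s clip.
mel_frontend = torchaudio.transforms.MelSpectrogram(
    sample_rate=SAMPLE_RATE,
    n_fft=512,        # 32 ms window at 16 kHz
    win_length=512,
    hop_length=160,   # 10 ms hop
    n_mels=140,
)

# 40-dimensional MFCCs with a 10 ms frame shift, as commonly used in KWS.
mfcc_frontend = torchaudio.transforms.MFCC(
    sample_rate=SAMPLE_RATE,
    n_mfcc=40,
    melkwargs={"n_fft": 512, "hop_length": 160, "n_mels": 64},
)

waveform = torch.randn(1, SAMPLE_RATE)                # stand-in for a 1 s utterance
log_mel = torch.log(mel_frontend(waveform) + 1e-6)    # log-amplitude features
mfcc = mfcc_frontend(waveform)
print(log_mel.shape, mfcc.shape)                      # (1, 140, ~101), (1, 40, ~101)
```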

Noisy or “playback” conditions are often simulated through additive mixing with reverberant speech (using LibriTTS) or music (MUSAN), with reverberation modeling performed by gpuRIR using sampled room sizes (10–50 m²), reverberation times (T60 of 0.2–0.6 s), and a fixed loudspeaker-microphone geometry (e.g., 5 cm separation) to create realistic device overlap scenarios (Cornell et al., 2021).

2. Data Augmentation Methodologies

Multiple augmentation regimes are supported and have been essential for advancing state-of-the-art accuracy and robustness:

  • SpecAugment: Frequency masking (1 mask, 10 bins) and time masking (1 mask, 25% of frames) are widely used for regularization, both in spectral and Mel domains (see the masking sketch after this list) (Wang et al., 11 Nov 2025, Wang et al., 17 Dec 2024).
  • ImportantAug: A learned importance mask $M_\theta(f,t)$ is predicted and rolled, with noise added only to low-importance spectro-temporal regions, demonstrated to reduce error by 23.3% vs. conventional noise injection at fixed SNR, and by 25.4% vs. the no-augmentation baseline (Trinh et al., 2021).
  • On-the-fly Mixing: Device playback is mimicked by randomly pairing utterances, applying random time shifts (15–20 frames), and mixing at a signal-to-interference ratio (SIR) drawn uniformly from −20 dB to +3 dB (Cornell et al., 2021).
  • No Augmentation: Some studies, especially those focused on architectural ablations or metric learning, avoid augmentation to isolate the effects of architecture and loss functions (Wang et al., 2021, Andrade et al., 2018).
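
A minimal SpecAugment-style masking sketch using torchaudio's masking transforms is shown below; the mask sizes follow the SpecAugment bullet above (one frequency mask of up to 10 bins, one time mask of up to ~25 frames), and the function name is illustrative rather than taken from any cited codebase.

```python
# Sketch: SpecAugment-style masking on a log-Mel feature map.
import torch
import torchaudio

freq_mask = torchaudio.transforms.FrequencyMasking(freq_mask_param=10)  # one mask, up to 10 bins
time_mask = torchaudio.transforms.TimeMasking(time_mask_param=25)       # one mask, up to ~25 of ~100 frames

def spec_augment(log_mel: torch.Tensor) -> torch.Tensor:
    """Apply one frequency mask and one time mask to a (batch, mels, frames) tensor."""
    return time_mask(freq_mask(log_mel))

features = torch.randn(8, 140, 101)   # batch of log-Mel features
augmented = spec_augment(features)
```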

Additional artificially noisy test sets, such as GSC-MUSAN and GSC-QUT, are constructed by injecting in-domain/out-of-domain noise at various SNRs for stress-testing model generalization (Trinh et al., 2021).
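
Constructing such noisy test sets reduces to scaling a noise clip to a target SNR before mixing; the following generic numpy sketch illustrates the idea (it is not the exact GSC-MUSAN/GSC-QUT recipe).

```python
# Sketch: mixing a clean utterance with noise at a target SNR (generic recipe).
import numpy as np

def mix_at_snr(clean: np.ndarray, noise: np.ndarray, snr_db: float) -> np.ndarray:
    """Scale `noise` so the clean/noise power ratio equals `snr_db`, then mix."""
    noise = noise[: len(clean)]                      # cropping/looping of long noise omitted
    clean_power = np.mean(clean ** 2) + 1e-12
    noise_power = np.mean(noise ** 2) + 1e-12
    scale = np.sqrt(clean_power / (noise_power * 10 ** (snr_db / 10)))
    return clean + scale * noise

rng = np.random.default_rng(0)
utterance = rng.standard_normal(16000).astype(np.float32)  # stand-in for a 1 s clip
noise = rng.standard_normal(16000).astype(np.float32)      # stand-in for MUSAN/QUT noise
noisy = mix_at_snr(utterance, noise, snr_db=5.0)
```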

3. Model Architectures and Training Protocols

GSC v2 is the de facto benchmark for evaluating both conventional and neuromorphic speech command models, including:

  • TCN-based KWS: Small-footprint temporal convolutional networks (131 k parameters) with log-Mel features and receptive fields of 117 frames (1.17 s). Reference-aware variants, iAEC-M (latent masking) and iAEC-C (feature concatenation), achieve implicit acoustic echo cancellation in playback scenarios (Cornell et al., 2021).
  • Spiking Neural Networks (SNNs): Models such as SpikeSCR and SpikCommander convert Mel-spectrogram inputs into spike embeddings via leaky integrate-and-fire (LIF) neurons, leveraging global-local hybrid attention or multi-view spiking self-attention modules. These architectures deliver parameter-efficient, energy-aware KWS with 1.12–3.30 M parameters, reach up to 96.92% accuracy, and can leverage curriculum-based knowledge distillation to halve power/latency without significant accuracy loss (Wang et al., 11 Nov 2025, Wang et al., 17 Dec 2024).
  • Attention-based RNNs/ConvNets: Convolutional layers extract frame-level features, which are passed through bidirectional LSTMs and pooled by a simple dot-product attention mechanism (a minimal sketch appears after this list). Such models, even with only ≈202 k parameters, reach 94.5% accuracy on 20-command tasks (Andrade et al., 2018).
  • Metric Learning with Text Anchors: LG-Net explicitly models long- and short-term temporal structure by alternating 1D convolution and multi-head self-attention (LG block), with BERT-derived linguistic anchors improving generalization and reducing false-reject rates, reaching 96.79% top-1 accuracy with only 313 k parameters. No data augmentation is used, so results isolate loss and architecture contributions (Wang et al., 2021).
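
As the concrete illustration promised above for the attention-based family, the sketch below follows the generic conv → bidirectional LSTM → dot-product attention → classifier pattern; layer sizes and the mid-sequence query choice are placeholders and do not reproduce the exact architecture of Andrade et al. (2018).

```python
# Sketch: conv + BiLSTM + dot-product attention KWS classifier (placeholder sizes).
import torch
import torch.nn as nn

class AttentionRNNKWS(nn.Module):
    def __init__(self, n_mels: int = 80, n_classes: int = 35, hidden: int = 64):
        super().__init__()
        # Frame-level feature extractor over (batch, 1, mels, frames).
        self.conv = nn.Sequential(
            nn.Conv2d(1, 10, kernel_size=(5, 1), padding=(2, 0)), nn.BatchNorm2d(10), nn.ReLU(),
            nn.Conv2d(10, 1, kernel_size=(5, 1), padding=(2, 0)), nn.BatchNorm2d(1), nn.ReLU(),
        )
        self.lstm = nn.LSTM(n_mels, hidden, num_layers=2,
                            batch_first=True, bidirectional=True)
        self.query = nn.Linear(2 * hidden, 2 * hidden)
        self.classifier = nn.Linear(2 * hidden, n_classes)

    def forward(self, mel: torch.Tensor) -> torch.Tensor:
        # mel: (batch, mels, frames)
        x = self.conv(mel.unsqueeze(1)).squeeze(1)            # (batch, mels, frames)
        x = x.transpose(1, 2)                                 # (batch, frames, mels)
        h, _ = self.lstm(x)                                   # (batch, frames, 2*hidden)
        q = self.query(h[:, h.size(1) // 2])                  # middle frame as attention query
        scores = torch.softmax((h @ q.unsqueeze(-1)).squeeze(-1), dim=1)  # (batch, frames)
        context = (scores.unsqueeze(-1) * h).sum(dim=1)       # attention-weighted pooling
        return self.classifier(context)

logits = AttentionRNNKWS()(torch.randn(4, 80, 101))            # -> (4, 35)
```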

Training is consistently performed with cross-entropy or multi-task objectives (e.g., metric + classification loss, or CE + KL divergence for distillation), AdamW optimizer, batch sizes in the 64–256 range, and learning rate schedules such as cosine annealing. Early stopping on validation/“development” set performance is standard (Wang et al., 17 Dec 2024, Wang et al., 2021, Andrade et al., 2018).
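
A representative (not paper-specific) training setup matching this description, with AdamW, cosine annealing, cross-entropy, and early stopping on the development set, might look like the following sketch.

```python
# Sketch: representative GSC v2 training loop (AdamW + cosine annealing + early stopping).
import torch
import torch.nn as nn

def train(model, train_loader, val_loader, epochs: int = 100, patience: int = 10):
    criterion = nn.CrossEntropyLoss()
    optimizer = torch.optim.AdamW(model.parameters(), lr=1e-3, weight_decay=1e-2)
    scheduler = torch.optim.lr_scheduler.CosineAnnealingLR(optimizer, T_max=epochs)
    best_acc, stale = 0.0, 0

    for epoch in range(epochs):
        model.train()
        for features, labels in train_loader:
            optimizer.zero_grad()
            loss = criterion(model(features), labels)
            loss.backward()
            optimizer.step()
        scheduler.step()

        # Early stopping on validation ("development") accuracy.
        model.eval()
        correct = total = 0
        with torch.no_grad():
            for features, labels in val_loader:
                correct += (model(features).argmax(dim=1) == labels).sum().item()
                total += labels.numel()
        acc = correct / total
        if acc > best_acc:
            best_acc, stale = acc, 0
            torch.save(model.state_dict(), "best.pt")
        else:
            stale += 1
            if stale >= patience:
                break
    return best_acc
```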

4. Results and Comparative Evaluation

Performance of recent models on GSC v2 is summarized in the table below:

| Model | Params (M) | Accuracy (%) | Test Time Steps / Frames | Augmentation |
|---|---|---|---|---|
| SpikCommander (2L–16–256) | 2.13 | 96.92 | 100 | SpecAugment |
| SpikeSCR (2L–16–256) | 3.15 | 96.08 | 100 | SpecAugment + KDCL |
| LG-Net6 + CE + Text Anchor | 0.313 | 96.79 | ≈98 frames | None |
| TC-ResNet14-1.5 + CE + TT | 0.313 | 96.27 | ≈98 frames | None |
| iAEC-M (best, 35-word) | 0.131 | 94.97 | 117 frames | SpecAug/Playback |
| Attention-RNN (20-word) | 0.202 | 94.5 | 124 frames | None |
| ImportantAug (NN baseline) | – | 95.00 | – | Learned masking |

For playback/device-overlap conditions, iAEC-M models retain 83.8%–84.2% accuracy under strong interference, compared to a drastic drop to 61.7% for the baseline TCN. Masking-based fusion (iAEC-M) consistently outperforms concatenation (iAEC-C) and heavy AEC cascades (15 M FLOPs), at only 1.5× the baseline cost in playback conditions and no overhead in non-playback conditions (Cornell et al., 2021).

Spiking Transformer models (SpikCommander) and hybrid SNNs (SpikeSCR) surpass prior SNNs and match or exceed low-footprint ANN baselines (e.g., LMUFormer: 96.53%). Using KDCL, energy consumption can be reduced by 54.8% with <0.5% accuracy loss (Wang et al., 11 Nov 2025, Wang et al., 17 Dec 2024).

For 16-class tasks, text-anchor metric learning boosts LG-Net6 from 96.45% to 96.79%, while the false-reject rate (FRR) at 0.5% false-accept rate (FAR) reaches 3.56% (vs. 4.69% for CE only) (Wang et al., 2021).

5. Ablation Analyses and Architectural Insights

Ablations across models highlight critical contributions:

  • Removal of multi-view attention in SpikCommander (V-branch or SWA-STASA) reduces top-1 GSC v2 accuracy by 0.7–1.0%; removing SCR-MLP or the spike embedding extractor causes larger drops (2.0–4.0%), underscoring the importance of context fusion and spike-optimized feature interfaces (Wang et al., 11 Nov 2025).
  • In SpikeSCR, omitting SpecAugment (SAM) or rotary positional encoding (RoPE) reduces accuracy by 0.91% and 1.25%, respectively; replacing the global-local hybrid with ordinary convs degrades accuracy by over 3% (Wang et al., 17 Dec 2024).
  • iAEC-M achieves the lowest FRR and best compute-accuracy tradeoff by applying masking at intermediate feature depths (e.g., ResBlock D2, FOV=28 frames). Reference-aware masking is especially beneficial in playback, but is switched off in non-playback for compute efficiency (Cornell et al., 2021).
  • LG-Net consistently outperforms state-of-the-art CNNs (e.g., TC-ResNet) even with only CE loss and no data augmentation; text anchor triplet loss is reliably superior to speech anchor loss for KWS, particularly under adverse conditions (Wang et al., 2021).

6. Robustness, Efficiency, and Deployment Considerations

The GSC v2 benchmark underpins research into robust KWS, energy-efficient deployment, and generalization:

  • Energy per inference for state-of-the-art SNNs (SpikeSCR, SpikCommander) can reach sub-0.04 mJ per clip at 40–100 time steps, crucial for edge deployment (Wang et al., 17 Dec 2024, Wang et al., 11 Nov 2025).
  • Curriculum learning-based distillation enables substantial reductions in time steps (100→40, or 500→100), lowering latency and energy while keeping the accuracy drop below 0.5% (a generic distillation-loss sketch follows this list) (Wang et al., 17 Dec 2024).
  • Masked data augmentation (ImportantAug) effectively withstands both in-domain (MUSAN) and out-of-domain (QUT) noise, outperforming fixed SNR or null-masking baselines (Trinh et al., 2021).
  • Text-based metric learning is less sensitive to speaker and noise variation, as shown in t-SNE projections and evaluated by FRR metrics (Wang et al., 2021).
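
The CE + KL distillation objective referenced in Sections 3 and 6 can be written generically as below; this is a standard temperature-scaled formulation, not the specific KDCL curriculum schedule of the cited papers.

```python
# Sketch: generic CE + KL knowledge-distillation loss (temperature-scaled soft targets).
import torch
import torch.nn.functional as F

def distillation_loss(student_logits, teacher_logits, labels,
                      temperature: float = 4.0, alpha: float = 0.5):
    """Weighted sum of hard-label cross-entropy and soft-target KL divergence."""
    ce = F.cross_entropy(student_logits, labels)
    kl = F.kl_div(
        F.log_softmax(student_logits / temperature, dim=1),
        F.softmax(teacher_logits / temperature, dim=1),
        reduction="batchmean",
    ) * temperature ** 2
    return alpha * ce + (1.0 - alpha) * kl
```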

Spiking and efficient-ANN KWS models demonstrated on GSC v2 are now a standard reference point for neuromorphic and embedded speech AI, with code and reproducible splits widely available.

7. Current Limitations and Areas for Further Research

The GSC v2 corpus, while standardized and rich, is limited to short, isolated utterances and may not fully reflect open-set, far-field, or continuous command recognition challenges. Noisy and playback-augmented variants help address robustness, but further work is needed to assess real-world generalization. Many leading architectures restrict augmentation or regularization in order to clarify architectural contributions; future work is required to unify best practices from augmentation, metric learning, and neuromorphic paradigms. The absence of speaker metadata for many GSC v2 tasks and the limited phrase set (mainly 35 words) present additional barriers to extensibility and scaling, motivating larger, more varied datasets for transferability and zero-shot KWS research (Cornell et al., 2021, Wang et al., 11 Nov 2025, Wang et al., 17 Dec 2024, Wang et al., 2021, Trinh et al., 2021, Andrade et al., 2018).
