GigaSpeech: Multi-Domain ASR Dataset
- GigaSpeech is a comprehensive, multi-domain English speech recognition corpus with 10,000 hours of high-quality labeled audio and 40,000 hours of recorded speech.
- It employs a scalable forced alignment and segmentation pipeline that yields quality-filtered, sentence-level transcriptions suitable for ASR training.
- The dataset features rigorously curated training subsets and baseline implementations across major toolkits, providing a robust foundation for supervised and semi-supervised ASR research.
GigaSpeech is an evolving, multi-domain English speech recognition corpus comprising approximately 10,000 hours of high-quality labeled audio and a total of 40,000 hours of recorded speech, facilitating both supervised and semi-supervised training for automatic speech recognition (ASR) research (Chen et al., 2021). The dataset integrates audio from diverse sources (audiobooks, podcasts, and YouTube), spanning a broad spectrum of speaking styles and subject domains, and introduces a scalable forced alignment and segmentation pipeline to ensure quality and consistency in its transcribed sentence-level segments. With granular training subsets, rigorously curated development and test sets, and baseline implementations in the Athena, ESPnet, Kaldi, and Pika toolkits, GigaSpeech provides a critical foundation for benchmarking and advancing ASR systems.
1. Corpus Composition and Data Sources
GigaSpeech is constructed to maximize domain diversity and linguistic coverage. The full dataset consists of approximately 40,000 hours of recorded audio, with 10,000 hours of these having high-quality transcripts suitable for supervised ASR training. Its three primary sources yield distinct audio characteristics:
- Audiobooks: Featuring clear, well-articulated read speech.
- Podcasts: Offering both read and spontaneous conversational speech.
- YouTube: Providing unscripted, informal, and sometimes noisy conditions.
The corpus spans 24 manually curated topical categories—including arts, science, sports, business, education, and entertainment—systematically capturing variation in accent, speaking style, and vocabulary distribution.
2. Segmentation, Alignment, and Text Normalization Pipeline
GigaSpeech employs a multi-stage, automated segmentation pipeline explicitly designed to create sentence-level segments for training state-of-the-art ASR models. The key steps are:
- Text Normalization: All transcripts are normalized for casing, symbol removal, and standardized rendering of numbers and dates.
- Forced Alignment: A Kaldi-based aligner processes audio and text chunks. The forced alignment leverages a biased language model (LM) and a modified Smith–Waterman algorithm to map transcript segments to decoded hypotheses; the modification robustly handles punctuation and silences to yield precise segment boundaries.
- Audio Segmentation: Segments are cut at silences longer than 1 second, or at punctuation marks (comma, period, question mark, exclamation point) followed by a pause of at least 0.2 seconds. Segments exceeding 20 seconds, or whose alignment word error rate (WER) exceeds the applicable subset threshold, are discarded. Silences at segment boundaries are trimmed to at most 0.15 seconds.
- Segment Validation: A specialized forced alignment graph introduces “leaky” arcs and garbage word loops, allowing insertions, deletions, and substitutions to filter low-quality transcriptions.
- Reference Rewriting: For the XL subset, the system employs a filler loop and disfluency detector to further refine transcriptions, ensuring common speech phenomena do not result in discarding valid segments.
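The silence- and punctuation-based cutting rules above can be sketched concretely. The following assumes word-level timings from the forced alignment; the `Word` class, threshold names, and `segment` function are illustrative, and boundary-silence trimming is omitted for brevity:

```python
from dataclasses import dataclass

# Thresholds taken from the pipeline description; names are illustrative.
MAX_SEGMENT_SEC = 20.0   # longer segments are discarded
SPLIT_SILENCE_SEC = 1.0  # always cut at silences longer than this
PUNCT_PAUSE_SEC = 0.2    # cut at punctuation if followed by at least this pause
PUNCT = {",", ".", "?", "!"}

@dataclass
class Word:
    text: str
    start: float  # seconds, from forced alignment
    end: float

def cut_indices(words):
    """Yield indices i such that a new segment starts at word i + 1."""
    for i in range(len(words) - 1):
        gap = words[i + 1].start - words[i].end
        if gap > SPLIT_SILENCE_SEC:
            yield i
        elif words[i].text[-1] in PUNCT and gap >= PUNCT_PAUSE_SEC:
            yield i

def segment(words):
    """Cut an aligned word sequence into candidate training segments."""
    bounds = [0] + [i + 1 for i in cut_indices(words)] + [len(words)]
    segments = []
    for a, b in zip(bounds, bounds[1:]):
        chunk = words[a:b]
        # over-long segments are discarded rather than force-split
        if chunk and chunk[-1].end - chunk[0].start <= MAX_SEGMENT_SEC:
            segments.append(chunk)
    return segments
```

Applied to a four-word alignment with a punctuation pause after "world." and a long silence before "one", this yields three segments.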
Notably, punctuation is encoded using explicit text tokens (e.g., <COMMA>, <PERIOD>), which supports both enriched punctuation tagging in end-to-end ASR models and high-precision endpoint detection.
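A minimal encoder for this convention might look as follows; the exact token spellings and the function name are illustrative rather than the corpus's definitive inventory:

```python
import re

# Punctuation-to-token mapping in the spirit of the corpus convention;
# exact token spellings here are illustrative.
PUNCT_TOKENS = {
    ",": "<COMMA>",
    ".": "<PERIOD>",
    "?": "<QUESTIONMARK>",
    "!": "<EXCLAMATIONPOINT>",
}

def encode_punctuation(text: str) -> str:
    """Replace sentence punctuation with explicit text tokens."""
    for mark, token in PUNCT_TOKENS.items():
        text = text.replace(mark, f" {token}")
    # collapse doubled whitespace introduced by the substitutions
    return re.sub(r"\s+", " ", text).strip()
```

For example, `encode_punctuation("hello, world.")` produces `hello <COMMA> world <PERIOD>`, which an end-to-end model can learn to emit like any other output token.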
3. Training Subsets and Validation Criteria
To accommodate different experimental paradigms, GigaSpeech is split into five training subsets by size:
| Subset | Approximate Size (hours) | Maximum Alignment WER |
|---|---|---|
| XS | 10 | 0% |
| S | 250 | 0% |
| M | 1,000 | 0% |
| L | 2,500 | 0% |
| XL | 10,000 | 4% (podcast/YouTube); 0% (audiobooks) |
For subset creation, the XL subset (10,000 h) includes podcast and YouTube segments under a lenient filtering criterion (maximum alignment WER of 4%), while all smaller subsets, as well as the audiobook portion of XL, impose a strict zero-WER threshold, so that only perfectly aligned transcripts are included. Reference rewriting is applied to XL only; all smaller subsets retain the rigid validation.
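The validation rule can be stated compactly; the following sketch uses my own naming for subset labels, source strings, and the function:

```python
# Per-subset validation thresholds as described above; the XL subset relaxes
# the alignment-WER cutoff for podcast/YouTube segments only.
SUBSET_MAX_WER = {"XS": 0.0, "S": 0.0, "M": 0.0, "L": 0.0, "XL": 0.04}

def keep_segment(subset: str, source: str, alignment_wer: float) -> bool:
    """Decide whether an aligned segment passes a subset's validation rule."""
    if subset == "XL" and source in {"podcast", "youtube"}:
        return alignment_wer <= SUBSET_MAX_WER["XL"]
    # All smaller subsets, and audiobook segments in XL, require a
    # perfect (zero-WER) alignment.
    return alignment_wer == 0.0
```

Under this rule a YouTube segment with 3% alignment WER survives in XL but is rejected from every smaller subset.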
4. Evaluation Sets and Transcription Quality
The evaluation framework comprises two dedicated sets:
- DEV: ~12.5 hours, designed for system development and hyperparameter tuning.
- TEST: ~40.3 hours, reserved for final benchmarking.
Both sets are sourced from podcasts and YouTube audio to maintain challenging conditions, but their transcriptions are manually crafted by professional transcribers and reprocessed to guarantee very high accuracy. This strict curation supports reliable, reproducible ASR evaluations and bolsters comparability with established evaluation corpora such as LibriSpeech.
5. Baseline Implementations and System Benchmarks
GigaSpeech offers baseline recipes across four major ASR toolkits, enabling direct benchmarking and methodological comparison:
- Athena: Encoder–decoder transformer, trained with joint CTC and RNNLM-based beam search.
- ESPnet: Conformer (CNN + Transformer) architectures; SentencePiece tokenization; advanced optimization/augmentation.
- Kaldi: Chain model leveraging GMM-HMM alignment, i-vector extraction, volume and speed perturbation, and DNN trained with cross-entropy and LF-MMI; decoding with 4-gram LM and RNNLM rescoring.
- Pika: RNN-T with a convolutional + transformer encoder and a multi-layer transformer decoder; MBR training and transformer-based forward/backward rescorers.
These systems are designed for widespread reproducibility and rapid experimentation, establishing strong baselines for further research.
6. Technical Notation and Model Design Implications
The paper introduces technical notation and segmentation criteria, critical for model developers:
- Alignment WER Filtering: Segments whose alignment WER exceeds the subset's maximum allowed value are discarded.
- Special Token Usage: Punctuation marks are rendered as explicit tokens (e.g., <COMMA>, <PERIOD>), supporting both the learning of punctuation and endpoint detection in modern end-to-end ASR models.
- Alignment Graphs: Leaky arcs and garbage word loops in the forced alignment graph accommodate wide phonetic variability by allowing transitions to lower n-gram states or insertion/deletion operations.
These design decisions have direct implications for modeling approaches, facilitating effective integration of various linguistic cues (e.g., punctuation restoration) and enabling the training of robust punctuation-aware speech models.
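While the actual filtering runs through Kaldi alignment graphs with leaky arcs and garbage loops, the alignment WER those graphs ultimately score is a word-level edit distance counting substitutions, insertions, and deletions. A plain dynamic-programming sketch (function names mine) makes the criterion concrete:

```python
def edit_distance(ref, hyp):
    """Minimal number of substitutions, insertions, and deletions needed to
    turn the reference word list into the hypothesis word list."""
    m, n = len(ref), len(hyp)
    dp = [[0] * (n + 1) for _ in range(m + 1)]
    for i in range(m + 1):
        dp[i][0] = i          # i deletions
    for j in range(n + 1):
        dp[0][j] = j          # j insertions
    for i in range(1, m + 1):
        for j in range(1, n + 1):
            sub = dp[i - 1][j - 1] + (ref[i - 1] != hyp[j - 1])
            dp[i][j] = min(sub, dp[i - 1][j] + 1, dp[i][j - 1] + 1)
    return dp[m][n]

def alignment_wer(ref, hyp):
    """Edit distance normalized by reference length; this is the per-segment
    quality score thresholded during validation (0% for small subsets,
    up to 4% for podcast/YouTube segments in XL)."""
    return edit_distance(ref, hyp) / max(len(ref), 1)
```

For instance, a three-word reference with one substituted word scores an alignment WER of 1/3 and would be rejected by every subset threshold.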
7. Significance and Impact on ASR Research
GigaSpeech advances the state of ASR corpora by coupling large-scale, multi-domain audio with a principled, quality-assured pipeline for transcription and segmentation. Its flexible subset design makes it suitable for both resource-constrained experiments and large-scale training, while carefully crafted evaluation sets support rigorous benchmarking. Baseline systems in major toolkits further enhance its utility and reproducibility. These features collectively foster progress in supervised and semi-supervised learning scenarios, support robust development of punctuation-aware and domain-adaptive ASR systems, and remain foundational for subsequent extensions such as GigaST (Ye et al., 2022) and GigaSpeech 2 (Yang et al., 2024). The dataset's structure and methodologies have become widely cited in the field, enabling scalable advances in state-of-the-art ASR technologies.