GigaSpeech: An Evolving, Multi-domain ASR Corpus with 10,000 Hours of Transcribed Audio (2106.06909v1)

Published 13 Jun 2021 in cs.SD, cs.CL, and eess.AS

Abstract: This paper introduces GigaSpeech, an evolving, multi-domain English speech recognition corpus with 10,000 hours of high quality labeled audio suitable for supervised training, and 40,000 hours of total audio suitable for semi-supervised and unsupervised training. Around 40,000 hours of transcribed audio is first collected from audiobooks, podcasts and YouTube, covering both read and spontaneous speaking styles, and a variety of topics, such as arts, science, sports, etc. A new forced alignment and segmentation pipeline is proposed to create sentence segments suitable for speech recognition training, and to filter out segments with low-quality transcription. For system training, GigaSpeech provides five subsets of different sizes, 10h, 250h, 1000h, 2500h, and 10000h. For our 10,000-hour XL training subset, we cap the word error rate at 4% during the filtering/validation stage, and for all our other smaller training subsets, we cap it at 0%. The DEV and TEST evaluation sets, on the other hand, are re-processed by professional human transcribers to ensure high transcription quality. Baseline systems are provided for popular speech recognition toolkits, namely Athena, ESPnet, Kaldi and Pika.

Citations (305)

Summary

  • The paper presents a novel 10,000-hour ASR dataset that overcomes limitations of traditional corpora by incorporating diverse audio sources.
  • It leverages a robust pipeline featuring forced alignment and segmentation to ensure high transcription accuracy across various domains.
  • The dataset enables scalable experimentation and sets new benchmarks for ASR performance, paving the way for advanced neural models.

GigaSpeech: An Extensive and Versatile ASR Dataset

The paper "GigaSpeech: An Evolving, Multi-domain ASR Corpus with 10,000 Hours of Transcribed Audio" introduces an ambitious dataset for automatic speech recognition (ASR), addressing the growing need for extensive and diverse training data. The authors present a comprehensive corpus amounting to 10,000 hours of annotated audio, sourced from varied platforms such as audiobooks, podcasts, and YouTube. This corpus offers significant potential for both supervised and semi-supervised learning approaches in ASR.

Motivation and Context

The stagnation observed in speech recognition corpora is a critical bottleneck in the advancement of ASR technologies. The field's reliance on aging datasets such as the Wall Street Journal or Switchboard, which offer only limited hours of read or conversational telephone speech, has led to performance saturation. Notably, even TED-LIUM and SPGISpeech are insufficient for developing highly versatile systems, owing to their constrained domains or limited sizes. GigaSpeech endeavors to fill these gaps by providing data that is not only voluminous but also rich in acoustic and topical diversity.

Data Composition and Features

The corpus comprises 40,000 hours of audio in total, of which 10,000 hours have been rigorously transcribed and validated for supervised training. GigaSpeech sets itself apart with its inclusivity across several dimensions:

  • Multi-source and Multi-style: Captures both read and spontaneous speech from numerous sources.
  • Richly Multi-topic: Encompasses a wide array of subjects including arts, science, and sports, thereby reflecting natural linguistic variability.
  • Scalable Subsets: Divided into multiple training subsets (XS, S, M, L, and XL) ranging from 10 to 10,000 hours, allowing for scalable experimentation.
  • Enhanced Transcriptions: Incorporates original and normalized transcript pairs to support end-to-end system training with text post-processing.
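The subset layout above can be sketched in a few lines. The subset names and hour counts come from the paper; the helper function `pick_subset` is purely illustrative, a minimal way to choose the largest subset fitting a compute budget:

```python
# Training subsets of GigaSpeech and their sizes in hours (from the paper).
SUBSETS = {
    "XS": 10,
    "S": 250,
    "M": 1000,
    "L": 2500,
    "XL": 10000,
}

def pick_subset(budget_hours: int) -> str:
    """Return the largest subset that fits within a given hour budget.

    Illustrative helper, not part of the GigaSpeech tooling.
    """
    best = "XS"
    for name, hours in SUBSETS.items():
        if hours <= budget_hours and hours >= SUBSETS[best]:
            best = name
    return best
```

Because smaller subsets are nested in spirit (each scale up to the full XL set), experiments can be run at increasing sizes without changing the data pipeline.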

Methodological Contributions

A notable contribution of this work is a robust pipeline for producing clean, coherent speech recognition data. The pipeline combines forced alignment with a segmentation stage that yields sentence-level segments suitable for training, followed by a validation stage that filters out segments whose transcriptions appear unreliable. The corpus additionally incorporates text normalization and alignment-error detection mechanisms, which help maintain high transcription quality.
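The validation idea can be illustrated with a small sketch. Per the abstract, segments in the XL subset are kept only if their estimated word error rate stays under a 4% cap; the snippet below implements that check with a standard word-level Levenshtein distance. The function names (`edit_distance`, `keep_segment`) and the exact filtering logic are assumptions for illustration, not the paper's actual implementation:

```python
def edit_distance(ref, hyp):
    """Word-level Levenshtein distance via a one-row dynamic program."""
    d = list(range(len(hyp) + 1))
    for i, r in enumerate(ref, 1):
        prev, d[0] = d[0], i
        for j, h in enumerate(hyp, 1):
            cur = min(d[j] + 1,          # deletion
                      d[j - 1] + 1,      # insertion
                      prev + (r != h))   # substitution (or match)
            prev, d[j] = d[j], cur
    return d[-1]

def keep_segment(ref_words, hyp_words, cap=0.04):
    """Keep a segment if the WER between the reference transcript and a
    decoded hypothesis is at or below the cap (4% for the XL subset)."""
    wer = edit_distance(ref_words, hyp_words) / max(len(ref_words), 1)
    return wer <= cap
```

In practice the hypothesis would come from decoding the audio with an existing acoustic model, so the filter discards segments where transcript and audio disagree too strongly.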

Baseline Systems and Evaluations

The paper details baseline experiments conducted using popular ASR toolkits such as Athena, ESPnet, Kaldi, and Pika. The performance metrics, especially word error rates (WER), indicate promising results for each toolkit on the XL subset, demonstrating the utility and effectiveness of GigaSpeech's comprehensive dataset. While these results serve as baselines, they pave the way for future advancements in ASR models by providing a versatile training ground.
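When comparing such baselines, WER is conventionally pooled over the whole test set rather than averaged per utterance, so that long utterances carry proportionally more weight. The small sketch below makes that distinction concrete; the function name and input format are illustrative, not taken from the paper:

```python
def corpus_wer(per_utt):
    """Corpus-level WER from per-utterance (word_errors, reference_words)
    pairs: total errors divided by total reference words.

    This differs from the mean of per-utterance rates, which would
    overweight short utterances.
    """
    errors = sum(e for e, _ in per_utt)
    ref_words = sum(n for _, n in per_utt)
    return errors / ref_words
```

For example, one error in a 10-word utterance plus a perfect 90-word utterance gives a corpus WER of 1%, whereas naively averaging the two per-utterance rates would report 5%.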

Implications and Future Directions

GigaSpeech opens new avenues for researchers to explore diversified ASR models capable of handling extensive variability in speech. With its evolving nature, the corpus promises continual updates and expansions, potentially incorporating more metadata for tasks like speaker identification. The presented pipeline model can serve as an archetype for generating other large-scale datasets across different languages and domains. This approach could significantly aid the next generation of ASR systems, particularly those leveraging deep learning paradigms which thrive on large volumes of high-quality data.

In conclusion, GigaSpeech stands as an invaluable resource propelling both practical and theoretical developments in ASR. Its introduction marks a dedicated effort towards overcoming the limitations of existing datasets, thus supporting the field’s progression toward more sophisticated and generalized ASR systems.