- The paper introduces KT-Speech-Crawler, a system that automatically constructs ASR datasets from YouTube using multiple heuristics.
- It uses the YouTube Search API with heuristic filtering and forced alignment to refine closed captions into reliable speech-text pairs.
- Adding 200 hours of YouTube-derived data to the Wall Street Journal training set reduced WER from 27.4% to 15.8%, demonstrating the crawler's strong impact on ASR performance.
Automatic Dataset Construction for Speech Recognition via YouTube Videos
The paper "KT-Speech-Crawler: Automatic Dataset Construction for Speech Recognition from YouTube Videos" introduces a novel approach to augmenting annotated datasets in the domain of Automatic Speech Recognition (ASR). The presented system, KT-Speech-Crawler, leverages the vast repository of YouTube videos, particularly those accompanied by user-generated closed captions, to automatically construct large-scale datasets for training end-to-end neural speech recognition systems.
Crawler Architecture and Data Collection
KT-Speech-Crawler is designed to efficiently identify videos with useful speech data. The crawler employs the YouTube Search API, using common English words to maximize the retrieval of videos with English closed captions. It addresses the challenge of filtering high-quality samples through a systematic process comprising multiple heuristic-based filtering steps aimed at ensuring the integrity and alignment of extracted text with corresponding speech.
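The search strategy can be sketched as follows. This is a minimal illustration, not the authors' implementation: the seed word list, `build_search_params`, and `sample_queries` are assumptions, though the `videoCaption=closedCaption` filter is a real parameter of the YouTube Data API v3 `search.list` endpoint.

```python
import random

# Hypothetical seed list; the paper queries with common English words
# to surface videos likely to carry English closed captions.
COMMON_WORDS = ["the", "about", "people", "because", "really", "think"]

def build_search_params(query_word, api_key="YOUR_API_KEY"):
    """Request parameters for YouTube Data API v3 search.list,
    restricted to videos that have closed captions."""
    return {
        "part": "snippet",
        "q": query_word,
        "type": "video",               # videoCaption filter requires type=video
        "videoCaption": "closedCaption",
        "maxResults": 50,
        "key": api_key,
    }

def sample_queries(n, seed=0):
    """Draw n seed words at random to diversify the crawl."""
    rng = random.Random(seed)
    return [rng.choice(COMMON_WORDS) for _ in range(n)]
```

Each parameter dict would then be sent to the `search.list` endpoint; the caption restriction filters out most videos without usable transcripts before any audio is downloaded.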
Filtering involves discarding captions that overlap in time, removing music content, stripping non-ASCII characters, and verifying language consistency by cross-checking against Google ASR API transcriptions. Subsequent post-processing further refines these excerpts by grouping temporally proximate captions and correcting misaligned boundaries with forced alignment.
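A few of the filtering and grouping heuristics described above can be sketched in plain Python. The caption representation `(start, end, text)`, the music markers, and the `max_gap` threshold are assumptions for illustration, not values taken from the paper:

```python
def is_clean_text(text):
    """Heuristic filter: discard captions with music markers
    or non-ASCII characters (assumed markers, per the sketch)."""
    if "[music]" in text.lower() or "♪" in text:
        return False
    return text.isascii()

def overlaps(a, b):
    """True if two (start, end) caption intervals overlap in time."""
    return a[0] < b[1] and b[0] < a[1]

def group_captions(captions, max_gap=0.5):
    """Merge consecutive (start, end, text) captions whose silence
    gap is below max_gap seconds into a single excerpt."""
    groups = []
    for start, end, text in captions:
        if groups and start - groups[-1][1] < max_gap:
            prev_start, _, prev_text = groups[-1]
            groups[-1] = (prev_start, end, prev_text + " " + text)
        else:
            groups.append((start, end, text))
    return groups
```

For example, `group_captions([(0.0, 1.0, "hello"), (1.2, 2.0, "world"), (5.0, 6.0, "bye")])` merges the first two captions into one excerpt and keeps the third separate. The actual system additionally refines excerpt boundaries with forced alignment, which this sketch omits.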
Impact on ASR Model Performance
Empirical evaluations highlight the positive impact of integrating samples obtained by the KT-Speech-Crawler into existing datasets. Notably, adding 200 hours of YouTube-derived samples to the Wall Street Journal dataset reduces the word error rate (WER) from 27.4% to 15.8%. Similar gains are observed when augmenting the TED-LIUM v2 dataset, with both word and character error rates showing marked improvements.
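The WER figures quoted above are the standard metric: word-level edit distance between reference and hypothesis transcripts, normalized by the reference length. A minimal reference implementation:

```python
def wer(reference, hypothesis):
    """Word error rate: Levenshtein distance over word sequences,
    divided by the number of words in the reference."""
    ref, hyp = reference.split(), hypothesis.split()
    # d[i][j] = edit distance between ref[:i] and hyp[:j]
    d = [[0] * (len(hyp) + 1) for _ in range(len(ref) + 1)]
    for i in range(len(ref) + 1):
        d[i][0] = i                      # deletions
    for j in range(len(hyp) + 1):
        d[0][j] = j                      # insertions
    for i in range(1, len(ref) + 1):
        for j in range(1, len(hyp) + 1):
            cost = 0 if ref[i - 1] == hyp[j - 1] else 1
            d[i][j] = min(d[i - 1][j] + 1,        # deletion
                          d[i][j - 1] + 1,        # insertion
                          d[i - 1][j - 1] + cost) # substitution
    return d[len(ref)][len(hyp)] / len(ref)
```

For instance, `wer("the cat sat", "the cat")` yields one deletion over three reference words, i.e. roughly 0.33. The paper's reported 27.4% → 15.8% reduction corresponds to this metric computed over the full test set.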
The paper showcases the robustness and adaptability of the KT-Speech-Crawler samples across different ASR benchmarks. It underscores their utility even in isolation: training solely on YouTube-sourced data achieves a competitive WER, though the largest gains come from combining it with domain-specific datasets.
Quality and Limitations
A crucial aspect of the paper is the manual assessment of transcription accuracy, which estimates a 3.5% WER on sampled subsets, with errors concentrated at excerpt boundaries and in abbreviations. Such insights are critical for future refinements of the transcription verification process.
The system's limitations include variability introduced by accents, environmental noise, and the presence of text-to-speech generated audio, underscoring the need for continued refinement of the filtering heuristics. The authors suggest future enhancements may involve neural modules for improved filtering accuracy, speaker recognition to exclude synthetic voices, and domain-specific dataset generation via metadata analysis.
Theoretical and Practical Implications
KT-Speech-Crawler paves the way for more accessible and cost-effective approaches to constructing large, diverse, and freely available speech datasets, a task that has traditionally been resource-intensive. By addressing data scarcity, especially for researchers without access to proprietary datasets, it fosters an environment for experimental ASR advancements and diversified application across domains.
Future Directions
The potential extension of this system to other languages, alongside the integration of more sophisticated neural networks to handle transcription imprecisions, represents promising avenues for future research. Ensuring the availability of extensive, high-quality datasets can accelerate progress in ASR, supporting the development of more generalized and robust speech recognition systems applicable to a multitude of real-world scenarios.
In conclusion, KT-Speech-Crawler exemplifies an efficient paradigm for the scalable collection and integration of speech data from publicly accessible platforms like YouTube. It stands to significantly impact future innovations in speech recognition by democratizing access to extensive and valid datasets for the research community.