- The paper introduces Libri-Light, a 60,000-hour benchmark for evaluating ASR in zero-resource, semi-supervised, and distant supervision settings.
- It employs unsupervised techniques like Contrastive Predictive Coding to extract robust phonetic features, outperforming traditional MFCC baselines in ABX evaluations.
- Findings suggest that pre-training and pseudo-labeling substantially improve ASR performance, paving the way for innovations in low-resource language research.
Libri-Light: A Benchmark for ASR with Limited or No Supervision
This paper introduces Libri-Light, a substantial corpus derived from LibriVox audio books, designed to advance the development of Automatic Speech Recognition (ASR) systems in settings with limited or no supervision. With over 60,000 hours of audio, it constitutes one of the largest openly available speech datasets, making it a significant resource for researching weakly supervised learning techniques in ASR.
Motivation and Challenges
Traditional ASR methodologies rely heavily on extensive annotated datasets, which are not feasible for many low-resource languages due to the prohibitive costs of manual transcription. Consequently, the research focuses on developing effective models that operate with minimal labeled data. The paper addresses this challenge by proposing benchmarks that cater to zero-resource, semi-supervised, and distant supervision settings.
Dataset and Metrics
The Libri-Light corpus is organized into several components: unlabelled speech training sets (spanning 600 to 60,000 hours), limited-resource training subsets, and unaligned text for language model (LM) training. This structure facilitates several experimental setups:
- Zero-Resource/Unsupervised (ABX): Evaluates models that discover phonetic units without supervision.
- Semi-Supervised (PER, CER): Analyzes models trained on minimal annotated data.
- Distant Supervision (WER): Involves pre-training on untranscribed speech and leveraging large textual corpora.
Each setup aligns with established benchmarks such as LibriSpeech, allowing for direct comparisons with supervised learning models.
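The metrics above (PER, CER, WER) are all the same quantity at different granularities: a normalized edit distance over phones, characters, or words. As a minimal sketch (the standard definition, not code from the paper), word error rate can be computed with word-level Levenshtein distance:

```python
def wer(reference: str, hypothesis: str) -> float:
    """Word error rate: word-level Levenshtein distance divided by
    the number of reference words."""
    ref, hyp = reference.split(), hypothesis.split()
    # d[i][j] = edit distance between ref[:i] and hyp[:j]
    d = [[0] * (len(hyp) + 1) for _ in range(len(ref) + 1)]
    for i in range(len(ref) + 1):
        d[i][0] = i
    for j in range(len(hyp) + 1):
        d[0][j] = j
    for i in range(1, len(ref) + 1):
        for j in range(1, len(hyp) + 1):
            sub = d[i - 1][j - 1] + (ref[i - 1] != hyp[j - 1])
            d[i][j] = min(sub, d[i - 1][j] + 1, d[i][j - 1] + 1)
    return d[len(ref)][len(hyp)] / max(len(ref), 1)
```

Swapping `split()` for character-level or phone-level tokenization yields CER and PER, respectively.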
Methodology and Baseline Systems
The paper provides baseline systems employing unsupervised training techniques such as Contrastive Predictive Coding (CPC), demonstrating their effectiveness over traditional MFCC features. The baseline architecture comprises convolutional layers and recurrent units optimized for extracting robust phonetic representations. The baselines also explore the potential benefits of pseudo-labeling and language model integration in the distant supervision setting.
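The core of CPC is a contrastive objective: a context representation must score the true future frame embedding above sampled negatives. A simplified single-prediction-step version of that InfoNCE-style loss in NumPy (an illustrative sketch under assumed names, not the paper's actual implementation):

```python
import numpy as np

def info_nce_loss(context, future, negatives):
    """Contrastive loss for one prediction step: the true future
    embedding (placed at index 0) must win a softmax over dot-product
    scores against the negatives.

    context, future: (d,) vectors; negatives: (n, d) distractor matrix.
    """
    logits = np.concatenate([[context @ future], negatives @ context])
    logits -= logits.max()                       # numerical stability
    probs = np.exp(logits) / np.exp(logits).sum()
    return -np.log(probs[0])                     # cross-entropy, true target first
```

In the full model, the context vector comes from a recurrent network running over convolutional frame encodings, and the loss is averaged over several future steps.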
Results and Analysis
The unsupervised results show competitive ABX scores, suggesting that CPC-based embeddings are more phonetically informative than standard baselines. In the semi-supervised context, pre-training notably enhances performance on limited-label datasets. The findings imply that unsupervised feature learning holds substantial promise for ASR systems where annotated data is scarce.
In the distant supervision setting, despite a remaining performance gap with fully supervised systems, the results underscore the efficacy of increased unsupervised pre-training and pseudo-labeling. Both approaches reduce WER, showcasing the potential of leveraging extensive unlabelled corpora for ASR advancements.
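The pseudo-labeling loop can be illustrated with a toy example. Every function here is a hypothetical stand-in (the real pipeline trains an acoustic model, decodes with a language model, and filters hypotheses, rather than memorizing coarse signatures); only the loop structure, transcribe unlabeled audio and keep confident hypotheses as extra training pairs, reflects the technique:

```python
def train(pairs):
    """Toy 'model': map a coarse audio signature (first character of
    the utterance id) to its transcript."""
    return {audio[0]: text for audio, text in pairs}

def transcribe(model, audio):
    """Return (hypothesis, confidence); confident only for signatures
    seen during training."""
    sig = audio[0]
    return (model[sig], 1.0) if sig in model else ("", 0.0)

def pseudo_label(labeled, unlabeled, rounds=2, threshold=0.9):
    """Iterative pseudo-labeling: transcribe the unlabeled pool, keep
    hypotheses above a confidence threshold, and retrain on the union."""
    model = train(labeled)
    for _ in range(rounds):
        pseudo = [(a, h) for a in unlabeled
                  for h, c in [transcribe(model, a)] if c >= threshold]
        model = train(labeled + pseudo)
    return model
```

The confidence threshold is what keeps the loop from amplifying its own errors: low-confidence hypotheses never enter the training set.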
Implications and Future Directions
The introduction of Libri-Light has vital implications for the ASR community, providing a comprehensive benchmark for evaluating novel methodologies in resource-constrained environments. This work opens the door to further exploration of large, unlabelled datasets coupled with innovative training techniques like adversarial learning and cross-lingual transfer.
The paper hints at numerous pathways for future research:
- Optimization of model architectures for better integration of learned phonetic representations.
- Exploration of broader linguistic domains using pseudo-labeling techniques.
- Application of domain adaptation strategies to enhance model generalization across varied acoustic conditions.
Libri-Light offers a robust foundation for fostering continued innovations in unsupervised and semi-supervised ASR research, paving the way for practical deployment in diverse linguistic settings. The open licensing of the dataset and baseline implementations ensures accessibility for researchers worldwide, enhancing the collaborative progression of ASR capabilities.