
Unsupervised Speech Recognition (2105.11084v3)

Published 24 May 2021 in cs.CL, cs.SD, and eess.AS

Abstract: Despite rapid progress in the recent past, current speech recognition systems still require labeled training data which limits this technology to a small fraction of the languages spoken around the globe. This paper describes wav2vec-U, short for wav2vec Unsupervised, a method to train speech recognition models without any labeled data. We leverage self-supervised speech representations to segment unlabeled audio and learn a mapping from these representations to phonemes via adversarial training. The right representations are key to the success of our method. Compared to the best previous unsupervised work, wav2vec-U reduces the phoneme error rate on the TIMIT benchmark from 26.1 to 11.3. On the larger English Librispeech benchmark, wav2vec-U achieves a word error rate of 5.9 on test-other, rivaling some of the best published systems trained on 960 hours of labeled data from only two years ago. We also experiment on nine other languages, including low-resource languages such as Kyrgyz, Swahili and Tatar.

Overview of "Unsupervised Speech Recognition"

This paper introduces "wav2vec-U," a method for training speech recognition models without any labeled data. It leverages self-supervised representations to segment unlabeled audio and map it to phonemes via adversarial training, advancing unsupervised automatic speech recognition (ASR) substantially beyond previous methods.

Core Contributions

  1. Self-supervised Representations: The research utilizes wav2vec 2.0 for self-supervised speech audio representation, significantly impacting the segmentation and mapping of speech to phonemes.
  2. Adversarial Training: The model learns the mapping from segment representations to phonemes adversarially, a novel application of GANs to unsupervised ASR (a minimal sketch follows this list).
  3. Unsupervised Metric for Model Validation: A cross-validation metric based on language-model fluency and vocabulary usage enables model selection and hyperparameter tuning without any labeled data.
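
As a concrete illustration of the adversarial setup, the sketch below pairs a generator that maps frozen speech segment representations to phoneme distributions with a discriminator that tries to tell generated phoneme sequences from phonemized real text. This is a minimal PyTorch sketch under assumed layer sizes and kernel widths, not the paper's exact configuration; the paper's full objective also adds terms such as a gradient penalty and smoothness and diversity penalties.

```python
# Minimal sketch of the adversarial setup, assuming PyTorch.
# Layer sizes, kernel widths, and the training step are illustrative.
import torch
import torch.nn as nn

class Generator(nn.Module):
    """Maps frozen speech segment representations to phoneme distributions."""
    def __init__(self, feat_dim=512, n_phonemes=40):
        super().__init__()
        # wav2vec-U keeps the generator small: a single convolutional layer.
        self.conv = nn.Conv1d(feat_dim, n_phonemes, kernel_size=3, padding=1)

    def forward(self, x):                      # x: (batch, time, feat_dim)
        logits = self.conv(x.transpose(1, 2)).transpose(1, 2)
        return torch.softmax(logits, dim=-1)   # (batch, time, n_phonemes)

class Discriminator(nn.Module):
    """Scores a phoneme sequence as real (phonemized text) or generated."""
    def __init__(self, n_phonemes=40, hidden=256):
        super().__init__()
        self.net = nn.Sequential(
            nn.Conv1d(n_phonemes, hidden, kernel_size=3, padding=1),
            nn.GELU(),
            nn.Conv1d(hidden, 1, kernel_size=3, padding=1),
        )

    def forward(self, p):                      # p: (batch, time, n_phonemes)
        return self.net(p.transpose(1, 2)).mean(dim=-1)  # (batch, 1) score

# One hypothetical adversarial step (optimizer updates omitted):
# `segments` stands in for pooled wav2vec 2.0 features, `real_text`
# for one-hot phonemized sentences from unpaired text.
G, D = Generator(), Discriminator()
bce = nn.BCEWithLogitsLoss()
segments = torch.randn(8, 50, 512)
real_text = torch.eye(40)[torch.randint(40, (8, 60))]

fake = G(segments)
d_loss = bce(D(real_text), torch.ones(8, 1)) + bce(D(fake.detach()), torch.zeros(8, 1))
g_loss = bce(D(fake), torch.ones(8, 1))        # generator tries to fool D
```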

Numerical Achievements

The paper reports a reduction in phoneme error rate (PER) on the TIMIT dataset from 26.1 to 11.3 relative to the best previous unsupervised system. On the Librispeech benchmark, wav2vec-U achieves a word error rate (WER) of 5.9 on test-other, rivaling some of the best supervised systems published two years earlier, which were trained on 960 hours of labeled data.
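
For reference, phoneme error rate (and word error rate at the word level) is the Levenshtein edit distance between hypothesis and reference sequences, normalized by total reference length. Below is a minimal, self-contained sketch of this standard computation; it is not the paper's evaluation code.

```python
def edit_distance(ref, hyp):
    """Levenshtein distance between two token sequences (rolling-row DP)."""
    d = list(range(len(hyp) + 1))
    for i, r in enumerate(ref, 1):
        prev, d[0] = d[0], i
        for j, h in enumerate(hyp, 1):
            prev, d[j] = d[j], min(d[j] + 1,         # deletion
                                   d[j - 1] + 1,     # insertion
                                   prev + (r != h))  # substitution
    return d[-1]

def error_rate(refs, hyps):
    """PER/WER in percent over a corpus of (reference, hypothesis) pairs."""
    errors = sum(edit_distance(r, h) for r, h in zip(refs, hyps))
    return 100.0 * errors / sum(len(r) for r in refs)

# e.g. error_rate([["hh", "ah", "l", "ow"]], [["hh", "ax", "l", "ow"]]) == 25.0
```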

Methodological Details

  • Speech Segmentation: Applies k-means clustering to wav2vec 2.0 representations and places segment boundaries where the cluster ID changes, then mean-pools PCA-reduced features within each segment (see the first sketch after this list).
  • Text Pre-processing: Phonemizes the unpaired text and inserts silence tokens so that it better mirrors the silences present in the audio (see the second sketch after this list).
  • Model Architecture: The generator is a single lightweight convolutional layer that maps frozen wav2vec 2.0 segment representations to phoneme distributions, keeping the number of trainable parameters small.
  • Performance Across Languages: The method is effective across multiple languages, covering European languages from the MLS dataset as well as low-resource languages such as Swahili, Kyrgyz, and Tatar.
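
The segmentation and text pre-processing bullets above are easier to see in code. First, a minimal sketch of the segmentation step, assuming frame-level wav2vec 2.0 features have already been extracted; the cluster count and PCA dimensionality are illustrative, and in practice both k-means and PCA are fit on a large corpus rather than a single utterance.

```python
import numpy as np
from sklearn.cluster import KMeans
from sklearn.decomposition import PCA

def segment_and_pool(frames, n_clusters=128, pca_dim=512):
    """frames: (n_frames, feat_dim) array of wav2vec 2.0 representations."""
    # A segment boundary is placed wherever the k-means cluster ID changes
    # between adjacent frames.
    ids = KMeans(n_clusters=n_clusters, n_init=10).fit_predict(frames)
    cuts = [0] + [t for t in range(1, len(ids)) if ids[t] != ids[t - 1]] + [len(ids)]

    # Reduce dimensionality with PCA, then mean-pool within each segment.
    # (Fitting PCA per utterance here only keeps the sketch self-contained;
    # it requires n_frames >= pca_dim.)
    reduced = PCA(n_components=pca_dim).fit_transform(frames)
    return np.stack([reduced[s:e].mean(axis=0) for s, e in zip(cuts[:-1], cuts[1:])])
```

Second, a sketch of silence-token insertion during phonemization. The `phonemize` argument stands in for any grapheme-to-phoneme tool, and the insertion probability is an assumption for illustration; the idea is that the unpaired text should mimic the silences that naturally occur in speech.

```python
import random

SIL = "<SIL>"

def phonemize_with_silence(words, phonemize, p_sil=0.25, seed=0):
    """Phonemize a word list, inserting <SIL> between words at random."""
    rng = random.Random(seed)
    phones = [SIL]                       # utterances begin with silence
    for word in words:
        phones.extend(phonemize(word))
        if rng.random() < p_sil:         # optional silence between words
            phones.append(SIL)
    if phones[-1] != SIL:
        phones.append(SIL)               # and end with silence
    return phones
```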

Implications and Future Directions

The findings suggest potential for expanding speech recognition capabilities to a vast number of world languages, currently underserved due to reliance on labeled datasets. Future research could explore:

  • Cross-lingual Phonemization Strategies: Addressing the dependence on language-specific phonemizers by developing universal phonemization techniques.
  • Segmentation Optimization: Refining the segmentation could further improve phoneme mapping, for example by learning variable-length segment representations directly rather than relying on fixed k-means boundaries.
  • Enhanced Self-training: Further iterations and refinements in self-training strategies could yield improvements, particularly in low-resource settings.

This research represents a significant stride towards democratizing speech recognition technology, emphasizing the role of unsupervised methods in advancing AI models for global linguistic diversity.

Authors (4)
  1. Alexei Baevski (39 papers)
  2. Wei-Ning Hsu (76 papers)
  3. Alexis Conneau (33 papers)
  4. Michael Auli (73 papers)
Citations (255)