Scaling A Simple Approach to Zero-Shot Speech Recognition
The paper "Scaling A Simple Approach to Zero-Shot Speech Recognition" by Jinming Zhao, Vineel Pratap, and Michael Auli introduces a scalable method for automatic speech recognition (ASR) that eliminates the need for labeled data, thereby facilitating support for a vast number of the world's 7,000+ languages.
Methodology
Universal Acoustic Model
A novel aspect of the proposed approach is its departure from the allophone-to-phoneme mappings used in prior work such as ASR-2K. Instead, the authors adopt a direct text representation via romanization: using the uroman tool, they map text from different languages into a common Latin script. This uniformity enables training a single universal acoustic model, a pre-trained wav2vec 2.0 model fine-tuned on labeled data from 1,078 languages. The model's output is likewise romanized text, which simplifies downstream processing.
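To make the romanization step concrete, here is a minimal sketch, assuming the uroman command-line tool (uroman.pl) is installed and on PATH; the invocation and the example outputs are illustrative, not the authors' pipeline code.

```python
import subprocess

def romanize(lines, uroman_cmd="uroman.pl"):
    """Romanize a list of strings with the uroman CLI (assumed on PATH).

    uroman acts as a filter: it reads UTF-8 text on stdin and writes one
    romanized line per input line on stdout.
    """
    result = subprocess.run(
        [uroman_cmd],
        input="\n".join(lines),
        capture_output=True,
        text=True,
        check=True,
    )
    return result.stdout.splitlines()

# Texts in different scripts collapse into a single Latin alphabet,
# so one output layer can cover all training languages.
print(romanize(["Здравствуйте", "こんにちは", "नमस्ते"]))
# e.g. -> ['Zdravstvuytye', 'konnichiha', 'namaste']  (illustrative output)
```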
Zero-Shot Decoding
For decoding, the method requires only a word list in the target language. uroman is applied to each word to generate <word, uroman_text> pairs, which form a lexicon for beam-search decoding. Decoding can be further improved with an n-gram language model estimated from publicly available text sources such as Panlex and Crúbadán, enabling broader language support.
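As a hypothetical illustration of the lexicon construction, the sketch below pairs each target-language word with its space-separated romanized characters, the entry format commonly consumed by lexicon-based CTC beam-search decoders (e.g. in flashlight or torchaudio); the `romanize` helper from the previous sketch and the formatting details are assumptions, not the authors' code.

```python
def build_lexicon(words, romanize_fn):
    """Map each surface word to its space-separated romanized characters,
    i.e. the token sequence the acoustic model is expected to emit."""
    romanized = romanize_fn(words)
    lexicon = {}
    for word, uroman_text in zip(words, romanized):
        # Drop internal spaces, then space-separate the characters.
        chars = " ".join(uroman_text.lower().replace(" ", ""))
        lexicon[word] = chars
    return lexicon

# Word list in the target language; no transcribed audio is needed.
words = ["хорошо", "спасибо"]
for word, spelling in build_lexicon(words, romanize).items():
    print(f"{word}\t{spelling}")
# хорошо   k h o r o s h o   (illustrative)
# спасибо  s p a s i b o     (illustrative)
```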
Experimental Results
Comparison to ASR-2K
Experimental results demonstrate that the authors' approach significantly outperforms ASR-2K. In particular, a zero-shot acoustic model trained on a large multilingual dataset (MMS-lab plus CommonVoice) reduces the character error rate (CER) by an average of 46% relative. A model trained only on CommonVoice (CV-only) performs substantially worse than the full multilingual model, highlighting the importance of a large, diverse training set.
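For reference, character error rate is the Levenshtein edit distance between hypothesis and reference characters, normalized by the reference length; below is a minimal implementation of this standard metric (not the authors' evaluation code).

```python
def cer(reference: str, hypothesis: str) -> float:
    """Character error rate: edit distance divided by reference length."""
    ref, hyp = list(reference), list(hypothesis)
    # Standard dynamic-programming Levenshtein distance, row by row.
    prev = list(range(len(hyp) + 1))
    for i, r in enumerate(ref, start=1):
        curr = [i]
        for j, h in enumerate(hyp, start=1):
            curr.append(min(
                prev[j] + 1,              # deletion
                curr[j - 1] + 1,          # insertion
                prev[j - 1] + (r != h),   # substitution
            ))
        prev = curr
    return prev[-1] / max(len(ref), 1)

print(cer("zero shot", "zero short"))  # 0.111... (1 edit / 9 characters)
```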
Supervised Models
To gauge the zero-shot approach against supervised systems, the authors trained monolingual models on the MMS-lab and FLEURS datasets. The zero-shot system's error rate is roughly 2.5 times that of the supervised monolingual systems, a strong result given that it uses no labeled data in the target language and applies far more broadly.
Text Databases Utilization
By leveraging external databases such as Panlex and Crúbadán, the method remains viable for languages with little available text. Crúbadán achieves competitive CER even when only a lexicon is used, whereas Panlex falls short because its data is smaller and noisier.
Amount of Text Data for Lexicons
The paper also investigates the minimum amount of text required to construct effective lexicons and n-gram language models. Results indicate that even a modest corpus of 5,000 utterances yields a CER close to that achieved with in-domain FLEURS text, validating the practicality of the method in low-resource settings.
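As a toy illustration of estimating a language model from such a small corpus, the sketch below builds an add-one-smoothed bigram model; the authors rely on standard n-gram LM tooling, so this is only a conceptual stand-in.

```python
from collections import Counter
import math

def train_bigram_lm(sentences):
    """Estimate add-one-smoothed bigram log-probabilities from a
    (possibly very small) corpus of whitespace-tokenized sentences."""
    unigrams, bigrams = Counter(), Counter()
    for sentence in sentences:
        tokens = ["<s>"] + sentence.split() + ["</s>"]
        unigrams.update(tokens)
        bigrams.update(zip(tokens, tokens[1:]))
    vocab_size = len(unigrams)

    def log_prob(prev, word):
        # Add-one smoothing keeps unseen pairs at a small nonzero mass.
        return math.log(
            (bigrams[(prev, word)] + 1) / (unigrams[prev] + vocab_size)
        )
    return log_prob

# In the paper even a few thousand out-of-domain sentences suffice for a
# usable decoding LM; here we use a trivial two-sentence corpus.
lm = train_bigram_lm(["the cat sat", "the cat ran"])
print(lm("the", "cat"))  # higher than an unseen pair such as lm("cat", "the")
```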
Implications and Future Work
The proposed uroman-based, zero-shot approach effectively democratizes ASR technology, making it accessible for many underrepresented languages without extensive labeled datasets. This has significant implications for inclusion, especially in technology accessibility for indigenous and low-resource language communities.
Future research could enhance this framework by integrating transformer-based sequence-to-sequence models or by leveraging improved unsupervised methods to further reduce error rates. Expanding and refining the text corpora behind existing databases could likewise improve robustness and accuracy.
In conclusion, this work provides a practical and scalable solution for zero-shot ASR, substantially extending the inclusivity and reach of speech recognition systems. The promising results validate the potential of uroman-based encoding coupled with large-scale multilingual training to overcome traditional limitations in ASR.