Unsupervised ASR via Cross-Lingual Pseudo-Labeling (2305.13330v3)

Published 19 May 2023 in eess.AS, cs.CL, cs.LG, and cs.SD

Abstract: Recent work has shown that it is possible to train an $\textit{unsupervised}$ automatic speech recognition (ASR) system using only unpaired audio and text. Existing unsupervised ASR methods assume that no labeled data can be used for training. We argue that even if one does not have any labeled audio for a given language, there is $\textit{always}$ labeled data available for other languages. We show that it is possible to use character-level acoustic models (AMs) from other languages to bootstrap an $\textit{unsupervised}$ AM in a new language. Here, "unsupervised" means no labeled audio is available for the $\textit{target}$ language. Our approach is based on two key ingredients: (i) generating pseudo-labels (PLs) of the $\textit{target}$ language using some $\textit{other}$ language AM and (ii) constraining these PLs with a $\textit{target language model}$. Our approach is effective on Common Voice: e.g. transfer of English AM to Swahili achieves 18% WER. It also outperforms character-based wav2vec-U 2.0 by 15% absolute WER on LJSpeech with 800h of labeled German data instead of 60k hours of unlabeled English data.
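
The two ingredients named in the abstract, pseudo-labels produced by an other-language acoustic model and a target-language LM that constrains them, can be illustrated with a toy decoding sketch. The snippet below is a minimal, self-contained illustration and not the paper's implementation: `VOCAB`, `LM_WEIGHT`, `char_lm`, and the random emissions are hypothetical stand-ins. In the paper's setting the emissions would come from a character AM trained on another language, and the LM would be an n-gram model (e.g. KenLM) trained on target-language text.

```python
# Minimal sketch (assumptions labeled): cross-lingual pseudo-labeling where a
# target-language character LM constrains hypotheses from a source-language AM.
import numpy as np

VOCAB = ["<blank>", "a", "b", "c", " "]  # toy character set; blank = index 0
LM_WEIGHT = 0.5                           # shallow-fusion weight (assumed value)


def char_lm(prefix: str, next_char: str) -> float:
    """Toy target-language character LM: log P(next_char | prefix).
    In the paper's setting this would be an n-gram LM over target-language text."""
    return float(np.log(1.0 / (len(VOCAB) - 1)))  # uniform toy distribution


def beam_search_with_lm(logprobs, beam_size=4):
    """Simplified CTC-style beam search with character-LM shallow fusion.
    logprobs: [T, V] per-frame log-probabilities from the source-language AM."""
    beams = {"": 0.0}       # prefix -> score
    last_token = {"": 0}    # last emitted vocab index per prefix (for CTC collapse)
    for t in range(logprobs.shape[0]):
        candidates, cand_last = {}, {}
        for prefix, score in beams.items():
            for v in range(len(VOCAB)):
                new_score = score + logprobs[t, v]
                if v == 0 or v == last_token[prefix]:
                    # blank or repeated token: prefix unchanged (CTC collapse)
                    new_prefix, new_last = prefix, (0 if v == 0 else v)
                else:
                    new_prefix, new_last = prefix + VOCAB[v], v
                    # constrain the pseudo-label with the target-language LM
                    new_score += LM_WEIGHT * char_lm(prefix, VOCAB[v])
                if new_score > candidates.get(new_prefix, -np.inf):
                    candidates[new_prefix] = new_score
                    cand_last[new_prefix] = new_last
        top = sorted(candidates.items(), key=lambda kv: -kv[1])[:beam_size]
        beams = dict(top)
        last_token = {p: cand_last[p] for p, _ in top}
    return max(beams.items(), key=lambda kv: kv[1])


if __name__ == "__main__":
    rng = np.random.default_rng(0)
    # Stand-in for source-language AM emissions on unlabeled target-language audio.
    am_logprobs = np.log(rng.dirichlet(np.ones(len(VOCAB)), size=20))
    pseudo_label, score = beam_search_with_lm(am_logprobs)
    print("pseudo-label:", repr(pseudo_label), "score:", round(float(score), 2))
```

The LM term is added only when a new character is emitted, which is how shallow fusion lets a target-language LM steer an other-language AM toward plausible target spellings; the resulting pseudo-labels would then be paired with their audio to train a target-language AM, in the spirit of the approach described above.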

