
Self-supervised Adaptive Pre-training of Multilingual Speech Models for Language and Dialect Identification (2312.07338v1)

Published 12 Dec 2023 in cs.CL, cs.SD, and eess.AS

Abstract: Pre-trained Transformer-based speech models have shown striking performance when fine-tuned on various downstream tasks such as automatic speech recognition and spoken language identification (SLID). However, the problem of domain mismatch remains a challenge in this area, where the domain of the pre-training data might differ from that of the downstream labeled data used for fine-tuning. In multilingual tasks such as SLID, the pre-trained speech model may not support all the languages in the downstream task. To address this challenge, we propose self-supervised adaptive pre-training (SAPT) to adapt the pre-trained model to the target domain and languages of the downstream task. We apply SAPT to the XLSR-128 model and investigate the effectiveness of this approach for the SLID task. First, we demonstrate that SAPT improves XLSR performance on the FLEURS benchmark with substantial gains up to 40.1% for under-represented languages. Second, we apply SAPT on four different datasets in a few-shot learning setting, showing that our approach improves the sample efficiency of XLSR during fine-tuning. Our experiments provide strong empirical evidence that continual adaptation via self-supervision improves downstream performance for multilingual speech models.
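The abstract describes SAPT only at a high level: continue the self-supervised pre-training of XLSR-128 on unlabeled audio from the target domain and languages, then fine-tune the adapted model for SLID. As a rough illustration of that recipe (not the authors' implementation), the sketch below runs one continued-pretraining step with the wav2vec 2.0 contrastive objective via the HuggingFace Transformers API; the checkpoint name, masking hyperparameters, optimizer settings, and the random stand-in audio are all assumptions.

import torch
from transformers import AutoFeatureExtractor, Wav2Vec2ForPreTraining
from transformers.models.wav2vec2.modeling_wav2vec2 import (
    _compute_mask_indices,
    _sample_negative_indices,
)

# Assumed checkpoint: XLS-R 300M, pre-trained on 128 languages ("XLSR-128").
checkpoint = "facebook/wav2vec2-xls-r-300m"
feature_extractor = AutoFeatureExtractor.from_pretrained(checkpoint)
model = Wav2Vec2ForPreTraining.from_pretrained(checkpoint).train()
optimizer = torch.optim.AdamW(model.parameters(), lr=1e-5)

# Stand-in for a batch of unlabeled 16 kHz audio from the target domain/languages.
waveforms = [torch.randn(16000 * 5).numpy() for _ in range(2)]
input_values = feature_extractor(
    waveforms, sampling_rate=16000, return_tensors="pt", padding=True
).input_values

batch_size, raw_len = input_values.shape
seq_len = model._get_feat_extract_output_lengths(raw_len).item()

# Sample masked time steps and distractors for the contrastive objective
# (mask_prob/mask_length follow common wav2vec 2.0 defaults, not the paper).
mask_time_indices = _compute_mask_indices(
    shape=(batch_size, seq_len), mask_prob=0.65, mask_length=10
)
sampled_negative_indices = _sample_negative_indices(
    features_shape=(batch_size, seq_len),
    num_negatives=model.config.num_negatives,
    mask_time_indices=mask_time_indices,
)
mask_time_indices = torch.tensor(mask_time_indices, dtype=torch.long)
sampled_negative_indices = torch.tensor(sampled_negative_indices, dtype=torch.long)

# One adaptation step: contrastive + codebook-diversity loss on in-domain audio.
loss = model(
    input_values,
    mask_time_indices=mask_time_indices,
    sampled_negative_indices=sampled_negative_indices,
).loss
loss.backward()
optimizer.step()
optimizer.zero_grad()

# The adapted encoder would then be fine-tuned on labeled SLID data, e.g. with
# Wav2Vec2ForSequenceClassification and one class per language or dialect.

The appeal of this recipe is that adaptation reuses the model's own pre-training objective, so it needs only unlabeled in-domain audio and is far cheaper than pre-training from scratch, while the subsequent supervised fine-tuning step is unchanged.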
