Multilingual DistilWhisper: Efficient Distillation of Multi-task Speech Models via Language-Specific Experts (2311.01070v3)
Abstract: Whisper is a multitask and multilingual speech model covering 99 languages. It yields commendable automatic speech recognition (ASR) results on a subset of its covered languages, but the model still underperforms on a non-negligible number of under-represented languages, a problem that is exacerbated in the smaller model versions. In this work, we propose DistilWhisper, an approach that bridges the ASR performance gap for these languages while retaining the advantages of multitask and multilingual capabilities. Our approach involves two key strategies: lightweight modular ASR fine-tuning of whisper-small using language-specific experts, and knowledge distillation from whisper-large-v2. This dual approach allows us to effectively boost ASR performance while keeping the robustness inherited from the multitask and multilingual pre-training. Results demonstrate that our approach is more effective than standard fine-tuning or LoRA adapters, boosting performance in the targeted languages on both in- and out-of-domain test sets, while introducing only a negligible parameter overhead at inference.
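The abstract describes a dual objective: supervised ASR fine-tuning of the small student (through its language-specific experts) combined with knowledge distillation from whisper-large-v2. The sketch below is not the authors' implementation; it only illustrates one standard way such a joint loss could be written. The function name `distil_asr_loss`, the weighting `alpha`, the `temperature`, and the CE + KL formulation are assumptions for illustration; the paper's exact expert routing and loss details are not given in the abstract.

```python
# Minimal sketch (assumed, not the authors' code) of a joint ASR + distillation
# objective: cross-entropy on reference transcripts plus a temperature-smoothed
# KL term pulling the student's token distribution toward the frozen teacher's.
import torch
import torch.nn.functional as F

def distil_asr_loss(student_logits, teacher_logits, labels,
                    alpha=0.5, temperature=2.0, pad_token_id=-100):
    """student_logits / teacher_logits: (batch, seq_len, vocab); labels: (batch, seq_len)."""
    vocab = student_logits.size(-1)

    # Standard ASR cross-entropy against the ground-truth tokens.
    ce = F.cross_entropy(student_logits.view(-1, vocab),
                         labels.view(-1),
                         ignore_index=pad_token_id)

    # KL divergence between teacher and student distributions at temperature T.
    t_probs = F.softmax(teacher_logits / temperature, dim=-1)
    s_logprobs = F.log_softmax(student_logits / temperature, dim=-1)
    kd = F.kl_div(s_logprobs, t_probs, reduction="batchmean") * temperature ** 2

    # Weighted sum; only the small student (and its language-specific experts)
    # would receive gradients, while the large teacher stays frozen.
    return alpha * ce + (1.0 - alpha) * kd
```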
- Thomas Palmeira Ferraz
- Marcely Zanon Boito
- Caroline Brun
- Vassilina Nikoulina