Efficient Adapter Finetuning for Tail Languages in Streaming Multilingual ASR (2401.08992v1)
Abstract: The end-to-end ASR model is often preferred in the streaming multilingual scenario because it is easier to deploy and can benefit from pre-trained speech models such as powerful foundation models. Meanwhile, the heterogeneous nature and imbalanced data abundance of different languages may cause performance degradation, leading to asynchronous peak performance across languages during training, especially for tail ones. Sometimes the data itself may even become unavailable as a result of enhanced privacy protection. Existing work tends to significantly increase the model size or learn language-specific decoders to accommodate each language separately. In this study, we explore simple yet effective Language-Dependent Adapter (LDA) finetuning under a cascaded Conformer transducer framework enhanced by teacher pseudo-labeling for tail languages in streaming multilingual ASR. The adapter accounts for only 0.4% of the full model per language. It is plugged into the frozen foundation model and is the only trainable module during finetuning with noisy student training. The final model merges the adapter parameters from different checkpoints for different languages. The model performance is validated on a challenging multilingual dictation dataset, which includes 39 tail languages across Latin, Greek, Arabic, etc. Our proposed method brings a 12.2% word error rate reduction on average and up to 37.5% on a single locale. Furthermore, we show that our parameter-efficient LDA can match the quality of full-model finetuning, thus greatly alleviating the asynchronous peak performance issue.
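To make the adapter idea concrete, below is a minimal sketch (PyTorch) of a residual bottleneck adapter and of freezing everything except the adapters, as the abstract describes. The names (`LanguageAdapter`, `bottleneck_dim`, `freeze_all_but_adapters`) and the exact bottleneck size are illustrative assumptions, not the paper's implementation, which is built on Lingvo with a cascaded Conformer transducer.

```python
# Sketch only: a per-language residual bottleneck adapter and a helper that
# freezes the backbone so adapters are the sole trainable parameters.
import torch
import torch.nn as nn


class LanguageAdapter(nn.Module):
    """Small residual bottleneck module inserted into a frozen encoder layer."""

    def __init__(self, d_model: int, bottleneck_dim: int = 64):
        super().__init__()
        self.norm = nn.LayerNorm(d_model)      # normalize before projecting down
        self.down = nn.Linear(d_model, bottleneck_dim)
        self.act = nn.ReLU()
        self.up = nn.Linear(bottleneck_dim, d_model)

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        # Residual connection keeps the frozen layer's output as the default path,
        # so an untrained adapter starts close to an identity perturbation.
        return x + self.up(self.act(self.down(self.norm(x))))


def freeze_all_but_adapters(model: nn.Module) -> None:
    """Freeze the foundation model; leave only adapter parameters trainable."""
    for name, param in model.named_parameters():
        param.requires_grad = "adapter" in name
```

Under this setup, each tail language is finetuned (with noisy student training on teacher pseudo-labels) to its own adapter checkpoint, and the deployed model keeps the shared frozen backbone while storing only the small per-language adapter weights, which is what keeps the per-language overhead at roughly 0.4% of the full model.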