From English to More Languages: Parameter-Efficient Model Reprogramming for Cross-Lingual Speech Recognition (2301.07851v1)
Abstract: In this work, we propose a new parameter-efficient learning framework based on neural model reprogramming for cross-lingual speech recognition, which can re-purpose well-trained English automatic speech recognition (ASR) models to recognize other languages. We design different auxiliary neural architectures focused on learnable pre-trained feature enhancement that, for the first time, empower model reprogramming on ASR. Specifically, we investigate how to select trainable components (i.e., the encoder) of a conformer-based RNN-Transducer used as a frozen pre-trained backbone. Experiments on the seven-language Multilingual LibriSpeech (MLS) task show that model reprogramming requires only 4.2% (11M of 270M) to 6.8% (45M of 660M) of the trainable parameters of a full ASR model to achieve competitive results, with average WERs across languages ranging from 11.9% to 8.1%. In addition, we identify setups that make large-scale pre-trained ASR models succeed in both monolingual and multilingual speech recognition. Our methods outperform existing ASR tuning architectures and their extensions with self-supervised losses (e.g., w2v-bert) in terms of lower WER and better training efficiency.
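To make the reprogramming recipe above concrete, here is a minimal sketch in PyTorch: a small trainable feature-enhancement frontend is placed before a frozen pre-trained English backbone, so gradients update only the frontend (and, in the paper, selected backbone components such as parts of the conformer encoder). The names (`ReprogrammingFrontend`, `reprogram`) and the two-layer convolutional enhancement are illustrative assumptions, not the paper's actual Lingvo/conformer implementation.

```python
# Minimal sketch of input-level model reprogramming for a frozen ASR backbone.
# Assumptions (not from the paper): PyTorch, a generic pre-trained encoder that
# takes (batch, time, feat_dim) features, and a small convolutional
# "feature enhancement" module as the trainable part.
import torch
import torch.nn as nn


class ReprogrammingFrontend(nn.Module):
    """Trainable module that transforms input features for the frozen backbone."""

    def __init__(self, feat_dim: int = 80):
        super().__init__()
        # Lightweight trainable layers; the paper's architectures differ.
        self.enhance = nn.Sequential(
            nn.Conv1d(feat_dim, feat_dim, kernel_size=3, padding=1),
            nn.ReLU(),
            nn.Conv1d(feat_dim, feat_dim, kernel_size=3, padding=1),
        )

    def forward(self, feats: torch.Tensor) -> torch.Tensor:
        # feats: (batch, time, feat_dim). Conv1d expects (batch, channels, time),
        # so transpose in and out; the residual keeps the original features
        # recoverable when the learned enhancement is small.
        x = feats.transpose(1, 2)
        return feats + self.enhance(x).transpose(1, 2)


def reprogram(backbone: nn.Module, feat_dim: int = 80) -> nn.Module:
    """Freeze the pre-trained backbone and expose only the frontend to training."""
    for p in backbone.parameters():
        p.requires_grad = False  # frozen English ASR backbone

    frontend = ReprogrammingFrontend(feat_dim)

    class Reprogrammed(nn.Module):
        def __init__(self):
            super().__init__()
            self.frontend, self.backbone = frontend, backbone

        def forward(self, feats: torch.Tensor) -> torch.Tensor:
            return self.backbone(self.frontend(feats))

    return Reprogrammed()
```

Under this setup only the frontend parameters (plus any deliberately unfrozen backbone components, such as the encoder selection studied in the paper) enter the optimizer, which is how the trainable budget stays at 11M of 270M (4.2%) to 45M of 660M (6.8%) parameters.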
- Y. He, T. N. Sainath, R. Prabhavalkar, I. McGraw, R. Alvarez, D. Zhao, D. Rybach, A. Kannan, Y. Wu, R. Pang, et al., “Streaming end-to-end speech recognition for mobile devices,” in Proc. ICASSP. IEEE, 2019, pp. 6381–6385.
- T. N. Sainath, B. Kingsbury, G. Saon, H. Soltau, A.-r. Mohamed, G. Dahl, and B. Ramabhadran, “Deep convolutional neural networks for large-scale speech tasks,” Neural Networks, vol. 64, pp. 39–48, 2015.
- C.-C. Chiu, T. N. Sainath, Y. Wu, R. Prabhavalkar, P. Nguyen, Z. Chen, A. Kannan, R. J. Weiss, K. Rao, E. Gonina, et al., “State-of-the-art speech recognition with sequence-to-sequence models,” in Proc. ICASSP. IEEE, 2018, pp. 4774–4778.
- B. Li, R. Pang, T. N. Sainath, A. Gulati, Y. Zhang, J. Qin, P. Haghani, W. R. Huang, M. Ma, and J. Bai, “Scaling end-to-end models for large-scale multilingual ASR,” in 2021 IEEE Automatic Speech Recognition and Understanding Workshop (ASRU). IEEE, 2021, pp. 1011–1018.
- B. Li, R. Pang, Y. Zhang, T. N. Sainath, T. Strohman, P. Haghani, Y. Zhu, B. Farris, N. Gaur, and M. Prasad, “Massively multilingual ASR: A lifelong learning solution,” in Proc. ICASSP. IEEE, 2022, pp. 6397–6401.
- B. Li, S.-y. Chang, T. N. Sainath, R. Pang, Y. He, T. Strohman, and Y. Wu, “Towards fast and accurate streaming end-to-end ASR,” in Proc. ICASSP. IEEE, 2020, pp. 6069–6073.
- A. Baevski, Y. Zhou, A. Mohamed, and M. Auli, “wav2vec 2.0: A framework for self-supervised learning of speech representations,” Advances in Neural Information Processing Systems, vol. 33, pp. 12449–12460, 2020.
- Y.-A. Chung, Y. Zhang, W. Han, C.-C. Chiu, J. Qin, R. Pang, and Y. Wu, “W2v-BERT: Combining contrastive learning and masked language modeling for self-supervised speech pre-training,” in 2021 IEEE Automatic Speech Recognition and Understanding Workshop (ASRU). IEEE, 2021, pp. 244–250.
- Y. Zhang, D. S. Park, W. Han, J. Qin, A. Gulati, J. Shor, A. Jansen, Y. Xu, Y. Huang, S. Wang, et al., “BigSSL: Exploring the frontier of large-scale semi-supervised learning for automatic speech recognition,” IEEE Journal of Selected Topics in Signal Processing, 2022.
- A. Graves, “Sequence transduction with recurrent neural networks,” in ICML Workshop on Representation Learning, 2012.
- K. Hu, A. Bruguier, T. N. Sainath, R. Prabhavalkar, and G. Pundak, “Phoneme-based contextualization for cross-lingual speech recognition in end-to-end models,” Proc. Interspeech 2019, pp. 2155–2159, 2019.
- S. Tong, P. N. Garner, and H. Bourlard, “An investigation of deep neural networks for multilingual speech recognition training and adaptation,” in Proc. Interspeech, 2017.
- S. Bansal, H. Kamper, K. Livescu, A. Lopez, and S. Goldwater, “Pre-training on high-resource speech recognition improves low-resource speech-to-text translation,” in Proc. NAACL, 2019.
- R. Fan, Y. Zhu, J. Wang, and A. Alwan, “Towards better domain adaptation for self-supervised models: A case study of child ASR,” IEEE Journal of Selected Topics in Signal Processing, pp. 1242–1252, 2022.
- K. Tomanek, V. Zayats, D. Padfield, K. Vaillancourt, and F. Biadsy, “Residual adapters for parameter-efficient ASR adaptation to atypical and accented speech,” in Proceedings of the 2021 Conference on Empirical Methods in Natural Language Processing, 2021, pp. 6751–6760.
- N. Houlsby, A. Giurgiu, S. Jastrzebski, B. Morrone, Q. De Laroussilhe, A. Gesmundo, M. Attariyan, and S. Gelly, “Parameter-efficient transfer learning for NLP,” in International Conference on Machine Learning. PMLR, 2019, pp. 2790–2799.
- C.-H. H. Yang, Y.-Y. Tsai, and P.-Y. Chen, “Voice2Series: Reprogramming acoustic models for time series classification,” in International Conference on Machine Learning. PMLR, 2021, pp. 11808–11819.
- K.-W. Chang, W.-C. Tseng, S.-W. Li, and H.-y. Lee, “An exploration of prompt tuning on generative spoken language model for speech processing tasks,” in Proc. Interspeech 2022, pp. 5005–5009.
- J. Devlin, M.-W. Chang, K. Lee, and K. Toutanova, “BERT: Pre-training of deep bidirectional transformers for language understanding,” in Proceedings of NAACL-HLT, 2019, pp. 4171–4186.
- A. Gulati, J. Qin, C.-C. Chiu, N. Parmar, Y. Zhang, J. Yu, W. Han, S. Wang, Z. Zhang, Y. Wu, et al., “Conformer: Convolution-augmented transformer for speech recognition,” Proc. Interspeech 2020, pp. 5036–5040, 2020.
- S.-A. Rebuffi, H. Bilen, and A. Vedaldi, “Learning multiple visual domains with residual adapters,” Advances in neural information processing systems, vol. 30, 2017.
- A. Kannan, A. Datta, T. N. Sainath, E. Weinstein, B. Ramabhadran, Y. Wu, A. Bapna, Z. Chen, and S. Lee, “Large-scale multilingual speech recognition with a streaming end-to-end model,” Proc. Interspeech 2019, pp. 2130–2134, 2019.
- P. Swietojanski, J. Li, and S. Renals, “Learning hidden unit contributions for unsupervised acoustic model adaptation,” IEEE/ACM Transactions on Audio, Speech, and Language Processing, vol. 24, no. 8, pp. 1450–1463, 2016.
- A. Bapna and O. Firat, “Simple, scalable adaptation for neural machine translation,” in Proc. EMNLP-IJCNLP, 2019, pp. 1538–1548.
- G. F. Elsayed, I. Goodfellow, and J. Sohl-Dickstein, “Adversarial reprogramming of neural networks,” in International Conference on Learning Representations, 2019.
- H. Yen, P.-J. Ku, C.-H. H. Yang, H. Hu, S. M. Siniscalchi, P.-Y. Chen, and Y. Tsao, “A study of low-resource speech commands recognition based on adversarial reprogramming,” arXiv preprint arXiv:2110.03894, 2021.
- K. Hambardzumyan, H. Khachatrian, and J. May, “WARP: Word-level adversarial reprogramming,” in Proc. ACL, 2021, pp. 4921–4933.
- F. Wu, A. Fan, A. Baevski, Y. Dauphin, and M. Auli, “Pay less attention with lightweight and dynamic convolutions,” in International Conference on Learning Representations, 2019.
- L. Chen, H. Zhang, J. Xiao, L. Nie, J. Shao, W. Liu, and T.-S. Chua, “SCA-CNN: Spatial and channel-wise attention in convolutional networks for image captioning,” in Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, 2017, pp. 5659–5667.
- R. Zhao, J. Xue, J. Li, W. Wei, L. He, and Y. Gong, “On addressing practical challenges for RNN-Transducer,” in 2021 IEEE Automatic Speech Recognition and Understanding Workshop (ASRU). IEEE, 2021, pp. 526–533.
- J. Shen, P. Nguyen, Y. Wu, Z. Chen, M. X. Chen, Y. Jia, A. Kannan, T. Sainath, Y. Cao, C.-C. Chiu, et al., “Lingvo: a modular and scalable framework for sequence-to-sequence modeling,” arXiv preprint arXiv:1902.08295, 2019.
- V. Pratap, Q. Xu, A. Sriram, G. Synnaeve, and R. Collobert, “MLS: A large-scale multilingual dataset for speech research,” in Proc. Interspeech, 2020.
- J. Bai, B. Li, Y. Zhang, A. Bapna, N. Siddhartha, K. C. Sim, and T. N. Sainath, “Joint unsupervised and supervised training for multilingual ASR,” in Proc. ICASSP, 2022, pp. 6402–6406.
- G. Venkatesh, A. Valliappan, J. Mahadeokar, Y. Shangguan, C. Fuegen, M. L. Seltzer, and V. Chandra, “Memory-efficient speech recognition on smart devices,” in Proc. ICASSP, 2021, pp. 8368–8372.
- A. Conneau, A. Baevski, R. Collobert, A. Mohamed, and M. Auli, “Unsupervised cross-lingual representation learning for speech recognition,” in Proc. Interspeech, 2020.
- E. B. Zaken, Y. Goldberg, and S. Ravfogel, “BitFit: Simple parameter-efficient fine-tuning for transformer-based masked language-models,” in Proc. ACL, 2022, pp. 1–9.
- J. Kahn, M. Rivière, W. Zheng, E. Kharitonov, Q. Xu, P.-E. Mazaré, J. Karadayi, V. Liptchinsky, R. Collobert, C. Fuegen, et al., “Libri-light: A benchmark for asr with limited or no supervision,” in Proc. ICASSP. IEEE, 2020, pp. 7669–7673.