Cross-modal Alignment with Optimal Transport for CTC-based ASR (2309.13650v1)
Abstract: Connectionist temporal classification (CTC)-based automatic speech recognition (ASR) is one of the most successful end-to-end (E2E) ASR frameworks. However, due to the token independence assumption in decoding, an external language model (LM) is required, which undermines its fast parallel decoding property. Several studies have proposed transferring linguistic knowledge from a pretrained LM (PLM) to CTC-based ASR. Since the PLM is built from text while the acoustic model is trained on speech, a cross-modal alignment is required in order to transfer context-dependent linguistic knowledge from the PLM to the acoustic encoding. In this study, we propose a novel cross-modal alignment algorithm based on optimal transport (OT). In the alignment process, a transport coupling matrix is obtained using OT and then used to transform the latent acoustic representation so that it matches the context-dependent linguistic features encoded by the PLM. Through this alignment, the latent acoustic feature is forced to encode context-dependent linguistic information. We integrate this latent acoustic feature to build a conformer encoder-based CTC ASR system. On the AISHELL-1 corpus, our system achieves character error rates (CERs) of 3.96% and 4.27% on the dev and test sets, respectively, corresponding to relative improvements of 28.39% and 29.42% over the baseline conformer CTC ASR system without cross-modal knowledge transfer.
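To make the alignment step concrete, below is a minimal NumPy sketch of entropy-regularized OT (Sinkhorn iterations) that computes a transport coupling matrix between acoustic frames and PLM token features and then uses it to project the acoustic representation into the linguistic space. The feature sizes, the squared-Euclidean cost, the uniform marginals, and the mean-squared alignment loss are illustrative assumptions for this sketch, not the paper's exact formulation.

```python
import numpy as np

def sinkhorn_coupling(cost, eps=0.1, n_iters=50):
    """Entropy-regularized optimal transport (Sinkhorn iterations).

    Returns a coupling matrix gamma whose marginals approximately match
    uniform distributions over the rows and columns of `cost`.
    """
    n, m = cost.shape
    a = np.full(n, 1.0 / n)            # uniform mass on acoustic frames
    b = np.full(m, 1.0 / m)            # uniform mass on text tokens
    K = np.exp(-cost / eps)            # Gibbs kernel
    u = np.ones(n)
    for _ in range(n_iters):           # alternating marginal projections
        v = b / (K.T @ u)
        u = a / (K @ v)
    return u[:, None] * K * v[None, :]  # gamma = diag(u) @ K @ diag(v)

# Hypothetical sizes: T acoustic frames, N PLM tokens, d-dim features.
rng = np.random.default_rng(0)
T, N, d = 120, 18, 256
acoustic = rng.standard_normal((T, d))    # latent acoustic representation (stand-in)
linguistic = rng.standard_normal((N, d))  # PLM (e.g., BERT) token features (stand-in)

# Pairwise transport cost; squared Euclidean distance is one common choice.
cost = ((acoustic[:, None, :] - linguistic[None, :, :]) ** 2).sum(-1)
cost = cost / cost.max()                  # normalize to avoid exp() underflow

gamma = sinkhorn_coupling(cost)           # (T, N) transport coupling matrix

# Barycentric projection: each token position receives a coupling-weighted
# average of acoustic frames, mapping the acoustic sequence into the
# linguistic feature space.
transported = (gamma.T @ acoustic) / gamma.sum(axis=0, keepdims=True).T

# One plausible alignment loss: pull the transported acoustic features
# toward the PLM features, so that minimizing it during training forces
# the acoustic encoder to absorb context-dependent linguistic information.
align_loss = ((transported - linguistic) ** 2).mean()
print(gamma.shape, float(align_loss))
```

In a full system, this alignment loss would be combined with the CTC objective so that the conformer encoder learns linguistically informed latent features while retaining fast parallel decoding.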
- Authors: Xugang Lu, Peng Shen, Yu Tsao, Hisashi Kawai