Hierarchical Cross-Modality Knowledge Transfer with Sinkhorn Attention for CTC-based ASR (2309.16093v1)
Abstract: Due to the modality discrepancy between textual and acoustic modeling, efficiently transferring linguistic knowledge from a pretrained language model (PLM) to acoustic encoding for automatic speech recognition (ASR) remains a challenging task. In this study, we propose a cross-modality knowledge transfer (CMKT) learning framework for a connectionist temporal classification (CTC) based ASR system, in which hierarchical acoustic alignments with the linguistic representation are applied. Additionally, we propose the use of Sinkhorn attention in the cross-modality alignment process, of which standard transformer attention is a special case. CMKT learning is intended to compel the acoustic encoder to encode rich linguistic knowledge for ASR. On the AISHELL-1 dataset, with CTC greedy decoding for inference (without using any language model), we achieved state-of-the-art performance with character error rates (CERs) of 3.64% and 3.94% on the development and test sets, corresponding to relative improvements of 34.18% and 34.88% over the baseline CTC-ASR system, respectively.
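The claim that transformer attention is a special case of the Sinkhorn attention process can be illustrated with a short sketch. The snippet below is a minimal, hypothetical PyTorch implementation assuming a Sinkformers-style scheme, where the attention logits are pushed toward a doubly stochastic matrix by alternating row and column normalizations in log space; the function name, tensor shapes, and iteration count are illustrative and not taken from the paper.

```python
import torch

def sinkhorn_attention(q, k, v, n_iters=3):
    """Cross-attention from acoustic queries q to linguistic keys/values k, v.

    q: (T_q, d), k and v: (T_k, d). n_iters Sinkhorn steps push the score
    matrix toward a doubly stochastic one; with n_iters=0 this reduces to
    ordinary scaled dot-product (softmax) attention.
    """
    d = q.size(-1)
    log_scores = (q @ k.transpose(-1, -2)) / d ** 0.5  # (T_q, T_k) logits

    # Sinkhorn iterations: alternate row and column normalization in log space.
    for _ in range(n_iters):
        log_scores = log_scores - torch.logsumexp(log_scores, dim=-1, keepdim=True)  # rows
        log_scores = log_scores - torch.logsumexp(log_scores, dim=-2, keepdim=True)  # columns

    # Final row softmax so each query's weights sum to one.
    attn = torch.softmax(log_scores, dim=-1)
    return attn @ v

# Hypothetical usage: align acoustic frames with PLM token embeddings.
acoustic = torch.randn(100, 256)    # e.g., acoustic encoder frame outputs
linguistic = torch.randn(20, 256)   # e.g., BERT token embeddings
aligned = sinkhorn_attention(acoustic, linguistic, linguistic)
print(aligned.shape)  # torch.Size([100, 256])
```

Intuitively, the extra column normalizations encourage every linguistic token to receive attention mass, which is a natural bias for cross-modal alignment; dropping them recovers the standard softmax attention used in the baseline transformer.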
- Xugang Lu (42 papers)
- Peng Shen (17 papers)
- Yu Tsao (199 papers)
- Hisashi Kawai (29 papers)