Temporal Order Preserved Optimal Transport-based Cross-modal Knowledge Transfer Learning for ASR (2409.02239v2)

Published 3 Sep 2024 in cs.SD, cs.AI, cs.CL, and eess.AS

Abstract: Transferring linguistic knowledge from a pretrained language model (PLM) to an acoustic model has been shown to greatly improve the performance of automatic speech recognition (ASR). However, due to the heterogeneous feature distributions across modalities, designing an effective model for feature alignment and knowledge transfer between linguistic and acoustic sequences remains a challenging task. Optimal transport (OT), which efficiently measures probability distribution discrepancies, holds great potential for aligning and transferring knowledge between acoustic and linguistic modalities. Nonetheless, the original OT treats acoustic and linguistic feature sequences as two unordered sets and neglects temporal order information when estimating the OT coupling. Consequently, a time-consuming pretraining stage is required to learn a good alignment between the acoustic and linguistic representations. In this paper, we propose a Temporal Order Preserved OT (TOT)-based Cross-modal Alignment and Knowledge Transfer (TOT-CAKT) method for ASR. In TOT-CAKT, local neighboring frames of the acoustic sequence are smoothly mapped to neighboring regions of the linguistic sequence, preserving their temporal order relationship during feature alignment and matching. Within this framework, we conduct Mandarin ASR experiments with a pretrained Chinese PLM for linguistic knowledge transfer. Our results demonstrate that the proposed TOT-CAKT significantly improves ASR performance compared to several state-of-the-art models employing linguistic knowledge transfer, and addresses the weaknesses of the original OT-based method in sequential feature alignment for ASR.
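
The key idea in the abstract is an OT coupling computed with a temporal-order constraint: rather than treating the two sequences as unordered sets, couplings that pair an acoustic frame with a linguistic token far from its relative position are penalized. Below is a minimal NumPy sketch of that idea, not the authors' implementation: it assumes a cosine-distance cost, a Gaussian prior on the distance from the normalized-position diagonal, and an entropic (Sinkhorn) solver; the function name and the hyperparameters (`epsilon`, `sigma`, `n_iters`) are illustrative assumptions.

```python
# Minimal sketch of temporal-order-preserving OT alignment between an
# acoustic feature sequence and a linguistic (PLM) feature sequence.
# All names and hyperparameter values are illustrative assumptions.
import numpy as np

def temporal_order_ot(acoustic, linguistic, epsilon=0.05, sigma=0.3, n_iters=100):
    """acoustic: (Ta, d) array, linguistic: (Tl, d) array with matching feature dim."""
    Ta, Tl = acoustic.shape[0], linguistic.shape[0]

    # Cosine-distance cost between every acoustic frame and linguistic token.
    a = acoustic / (np.linalg.norm(acoustic, axis=1, keepdims=True) + 1e-8)
    l = linguistic / (np.linalg.norm(linguistic, axis=1, keepdims=True) + 1e-8)
    cost = 1.0 - a @ l.T                               # (Ta, Tl)

    # Temporal-order prior: penalize couplings far from the diagonal of
    # normalized positions, so neighboring frames map to neighboring tokens.
    pos_a = np.arange(Ta)[:, None] / Ta
    pos_l = np.arange(Tl)[None, :] / Tl
    prior = (pos_a - pos_l) ** 2 / (2.0 * sigma ** 2)  # (Ta, Tl)

    # Entropic OT (Sinkhorn iterations) on the combined cost, uniform marginals.
    K = np.exp(-(cost + prior) / epsilon)
    mu = np.full(Ta, 1.0 / Ta)
    nu = np.full(Tl, 1.0 / Tl)
    u = np.ones(Ta)
    for _ in range(n_iters):
        v = nu / (K.T @ u + 1e-16)
        u = mu / (K @ v + 1e-16)
    coupling = u[:, None] * K * v[None, :]             # (Ta, Tl) transport plan

    # Barycentric mapping: project linguistic features onto the acoustic time
    # axis so a frame-level matching (knowledge-transfer) loss can be applied.
    aligned_linguistic = (coupling / coupling.sum(axis=1, keepdims=True)) @ linguistic
    return coupling, aligned_linguistic
```

In a knowledge-transfer setup of this kind, a matching loss between `acoustic` and `aligned_linguistic` would pull the acoustic encoder's representations toward the PLM's, with `sigma` controlling how strictly temporal order is enforced; the paper's exact prior and training objective may differ from this sketch.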
