
FastInject: Injecting Unpaired Text Data into CTC-based ASR training (2312.09100v1)

Published 14 Dec 2023 in eess.AS and cs.SD

Abstract: Recently, connectionist temporal classification (CTC)-based end-to-end (E2E) automatic speech recognition (ASR) models have achieved impressive results, especially with the development of self-supervised learning. However, E2E ASR models trained on paired speech-text data often suffer from domain shifts from training to testing. To alleviate this issue, this paper proposes a flat-start joint training method, named FastInject, which efficiently injects multi-domain unpaired text data into CTC-based ASR training. To maintain training efficiency, text units are pre-upsampled, and their representations are fed into the CTC model along with speech features. To bridge the modality gap between speech and text, an attention-based modality matching mechanism (AM3) is proposed, which retains the E2E flat-start training. Experiments show that FastInject gave a 22% relative WER reduction (WERR) on intra-domain LibriSpeech-100h data and a 20% relative WERR on out-of-domain test sets.
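
The abstract describes the mechanism only at a high level. As a rough illustration, the PyTorch sketch below shows one possible reading of it: text tokens are embedded, pre-upsampled by simple repetition to a frame-like rate, passed through a shared encoder with a CTC output head, and an attention module matches text representations to speech representations. The class name FastInjectSketch, all dimensions, the repetition-based upsampling, and the L1 matching loss are illustrative assumptions, not the authors' implementation.

```python
# Minimal sketch of the ideas in the abstract (not the authors' code).
import torch
import torch.nn as nn
import torch.nn.functional as F


class FastInjectSketch(nn.Module):
    def __init__(self, vocab_size=100, d_model=256, upsample_factor=3):
        super().__init__()
        self.upsample_factor = upsample_factor          # assumed fixed repetition factor
        self.text_embed = nn.Embedding(vocab_size, d_model)
        self.encoder = nn.TransformerEncoder(           # shared encoder for both modalities
            nn.TransformerEncoderLayer(d_model, nhead=4, batch_first=True),
            num_layers=2,
        )
        self.ctc_head = nn.Linear(d_model, vocab_size + 1)  # +1 for the CTC blank
        # Attention used to match text representations to speech representations.
        self.match_attn = nn.MultiheadAttention(d_model, num_heads=4, batch_first=True)

    def upsample_text(self, tokens):
        # Pre-upsample text units by repetition so their length is closer to a
        # speech frame sequence (a simplifying assumption for this sketch).
        emb = self.text_embed(tokens)                                # (B, U, D)
        return emb.repeat_interleave(self.upsample_factor, dim=1)    # (B, U*k, D)

    def forward(self, speech_feats, text_tokens):
        # Speech branch: encode features and compute CTC logits.
        speech_enc = self.encoder(speech_feats)                      # (B, T, D)
        speech_logits = self.ctc_head(speech_enc)

        # Text branch: encode pre-upsampled text with the same encoder and head.
        text_enc = self.encoder(self.upsample_text(text_tokens))     # (B, U*k, D)
        text_logits = self.ctc_head(text_enc)

        # Attention-based modality matching: text queries attend over speech
        # representations; an L1 distance (an assumption here) pulls the two
        # modalities together while keeping training end-to-end.
        matched, _ = self.match_attn(text_enc, speech_enc, speech_enc)
        match_loss = F.l1_loss(text_enc, matched)
        return speech_logits, text_logits, match_loss


# Example shapes (illustrative): speech features already projected to d_model.
model = FastInjectSketch()
speech = torch.randn(2, 50, 256)          # (batch, frames, d_model)
text = torch.randint(0, 100, (2, 12))     # (batch, text units)
speech_logits, text_logits, match_loss = model(speech, text)
```

In this reading, the paired speech branch would be trained with the usual CTC loss, the unpaired text branch reuses the same encoder and CTC head on upsampled text representations, and the matching loss is applied where paired data is available; how these losses are weighted and scheduled is not specified in the abstract.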

Authors (2)
  1. Keqi Deng (18 papers)
  2. Philip C. Woodland (50 papers)
Citations (2)
