FastInject: Injecting Unpaired Text Data into CTC-based ASR Training (2312.09100v1)
Abstract: Recently, connectionist temporal classification (CTC)-based end-to-end (E2E) automatic speech recognition (ASR) models have achieved impressive results, especially with the development of self-supervised learning. However, E2E ASR models trained only on paired speech-text data often suffer from domain shift between training and testing. To alleviate this issue, this paper proposes a flat-start joint training method, named FastInject, which efficiently injects multi-domain unpaired text data into CTC-based ASR training. To maintain training efficiency, text units are pre-upsampled, and their representations are fed into the CTC model along with speech features. To bridge the modality gap between speech and text, an attention-based modality matching mechanism (AM3) is proposed, which retains E2E flat-start training. Experiments show that FastInject gave a 22% relative WER reduction (WERR) on intra-domain LibriSpeech-100h data and a 20% relative WERR on out-of-domain test sets.
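The sketch below illustrates the core idea described in the abstract, not the authors' implementation: unpaired text tokens are pre-upsampled (here by simple repetition, an assumed choice) so their length is roughly frame-like, embedded, optionally drawn toward speech representations with a single cross-attention layer standing in for the AM3 module, and passed through the same CTC-trained encoder as speech features. All module names, dimensions, and the fixed upsampling factor are illustrative assumptions.

```python
# Minimal sketch of FastInject-style text injection (assumptions noted inline;
# this is not the paper's code).
import torch
import torch.nn as nn

class FastInjectSketch(nn.Module):
    def __init__(self, vocab_size, d_model=256, upsample_factor=3):
        super().__init__()
        self.upsample_factor = upsample_factor            # assumed fixed repetition rate
        self.text_embed = nn.Embedding(vocab_size, d_model)
        self.speech_proj = nn.Linear(80, d_model)         # e.g. 80-dim filterbank features
        self.cross_attn = nn.MultiheadAttention(d_model, num_heads=4, batch_first=True)
        self.encoder = nn.TransformerEncoder(
            nn.TransformerEncoderLayer(d_model, nhead=4, batch_first=True), num_layers=4)
        self.ctc_head = nn.Linear(d_model, vocab_size + 1)  # +1 for the CTC blank

    def encode_speech(self, feats):
        # Standard speech branch: project, encode, and predict CTC logits.
        return self.ctc_head(self.encoder(self.speech_proj(feats)))

    def encode_text(self, tokens, speech_feats=None):
        # Pre-upsample text units by repetition so their length is frame-like.
        up = tokens.repeat_interleave(self.upsample_factor, dim=1)
        rep = self.text_embed(up)
        if speech_feats is not None:
            # Stand-in for attention-based modality matching: text queries attend
            # to speech keys/values to pull both modalities into a shared space.
            spk = self.speech_proj(speech_feats)
            rep, _ = self.cross_attn(rep, spk, spk)
        return self.ctc_head(self.encoder(rep))

# Example text-only training step with a CTC loss (shapes are illustrative).
model = FastInjectSketch(vocab_size=100)
text = torch.randint(1, 100, (2, 30))                     # unpaired text token ids
log_probs = model.encode_text(text).log_softmax(-1).transpose(0, 1)  # (T, B, V)
loss = nn.CTCLoss(blank=100)(log_probs, text,
                             input_lengths=torch.full((2,), log_probs.size(0)),
                             target_lengths=torch.full((2,), 30))
```

Under the abstract's description, fixed pre-upsampling is what keeps the method efficient and flat-start: no separate duration model or synthesized speech is needed before the text branch can share the CTC objective with the speech branch.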
- Keqi Deng
- Philip C. Woodland