Advanced Long-Content Speech Recognition With Factorized Neural Transducer (2403.13423v1)

Published 20 Mar 2024 in cs.SD and eess.AS

Abstract: In this paper, we propose two novel approaches that integrate long-content information into the factorized neural transducer (FNT) architecture in both non-streaming (LongFNT) and streaming (SLongFNT) scenarios. We first investigate whether long-content transcriptions can improve vanilla conformer transducer (C-T) models. Our experiments indicate that vanilla C-T models do not benefit from long-content transcriptions, possibly because the predictor network of C-T models does not function as a pure language model. FNT, in contrast, shows its potential to exploit long-content information: we propose the LongFNT model and explore the impact of long-content information in both text (LongFNT-Text) and speech (LongFNT-Speech). The proposed LongFNT-Text and LongFNT-Speech models complement each other to achieve better performance, with transcription history proving more valuable to the model. The effectiveness of our LongFNT approach is evaluated on the LibriSpeech and GigaSpeech corpora, where it obtains relative word error rate (WER) reductions of 19% and 12%, respectively. Furthermore, we extend the LongFNT model to the streaming scenario, named SLongFNT, which consists of SLongFNT-Text and SLongFNT-Speech approaches that utilize long-content text and speech information. Experiments show that the proposed SLongFNT model achieves relative WER reductions of 26% and 17% on LibriSpeech and GigaSpeech, respectively, while maintaining good latency compared to the FNT baseline. Overall, our proposed LongFNT and SLongFNT highlight the significance of considering long-content speech and transcription knowledge for improving both non-streaming and streaming speech recognition systems.
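
To make the FNT factorization and the role of transcription history more concrete, the sketch below shows a simplified factorized transducer joint in PyTorch, with an optional long-content text history prepended to the vocabulary predictor's input in the spirit of LongFNT-Text. It is a minimal illustration: the module names, dimensions, and the way the history is concatenated are assumptions made here, not the authors' exact implementation.

```python
# Minimal sketch of a factorized neural transducer (FNT) joint with an
# optional long-content text history fed to the vocabulary predictor.
import torch
import torch.nn as nn


class FactorizedJoint(nn.Module):
    """Combines encoder states with a blank predictor and a vocabulary
    predictor that behaves as a standalone language model (the FNT idea)."""

    def __init__(self, vocab_size: int, enc_dim: int = 256, pred_dim: int = 256):
        super().__init__()
        self.embed = nn.Embedding(vocab_size, pred_dim)
        self.blank_predictor = nn.LSTM(pred_dim, pred_dim, batch_first=True)
        self.vocab_predictor = nn.LSTM(pred_dim, pred_dim, batch_first=True)  # acts as an LM
        self.enc_proj_blank = nn.Linear(enc_dim, 1)         # blank score from the encoder side
        self.pred_proj_blank = nn.Linear(pred_dim, 1)       # blank score from the predictor side
        self.enc_proj_vocab = nn.Linear(enc_dim, vocab_size)
        self.lm_head = nn.Linear(pred_dim, vocab_size)      # log P_LM(y), added to encoder vocab scores

    def forward(self, enc_out, prev_tokens, history_tokens=None):
        # enc_out: (B, T, enc_dim); prev_tokens: (B, U) label prefix of the
        # current utterance; history_tokens: (B, H) transcriptions of the
        # preceding utterances (the long-content text), simply prepended here
        # as an illustrative assumption.
        if history_tokens is not None:
            prev_tokens = torch.cat([history_tokens, prev_tokens], dim=1)
        emb = self.embed(prev_tokens)
        blank_state, _ = self.blank_predictor(emb)
        vocab_state, _ = self.vocab_predictor(emb)

        # Keep only the positions belonging to the current utterance's prefix.
        u = prev_tokens.size(1) if history_tokens is None else \
            prev_tokens.size(1) - history_tokens.size(1)
        blank_state, vocab_state = blank_state[:, -u:], vocab_state[:, -u:]

        # Broadcast to the usual (B, T, U, ...) transducer lattice.
        blank_logit = (self.enc_proj_blank(enc_out).unsqueeze(2) +
                       self.pred_proj_blank(blank_state).unsqueeze(1))          # (B, T, U, 1)
        lm_logprob = torch.log_softmax(self.lm_head(vocab_state), dim=-1)       # log P_LM
        vocab_logit = self.enc_proj_vocab(enc_out).unsqueeze(2) + lm_logprob.unsqueeze(1)  # (B, T, U, V)
        return torch.cat([blank_logit, vocab_logit], dim=-1)                    # blank + vocabulary scores


# Tiny smoke test with random tensors.
joint = FactorizedJoint(vocab_size=100)
scores = joint(torch.randn(2, 50, 256),
               torch.randint(0, 100, (2, 7)),
               history_tokens=torch.randint(0, 100, (2, 30)))
print(scores.shape)  # torch.Size([2, 50, 7, 101])
```

Because the vocabulary predictor is trained to behave like a standalone language model whose log-probabilities are added to the encoder's vocabulary scores, extending its context with previous-utterance transcriptions is straightforward, which is one plausible reading of why FNT benefits from long-content text while a vanilla C-T predictor does not.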
