
SSCFormer: Push the Limit of Chunk-wise Conformer for Streaming ASR Using Sequentially Sampled Chunks and Chunked Causal Convolution (2211.11419v4)

Published 21 Nov 2022 in cs.SD, cs.CL, and eess.AS

Abstract: Chunk-wise schemes are often used to make Automatic Speech Recognition (ASR) models support streaming deployment. However, existing approaches either fail to capture the global context, lack support for parallel training, or exhibit quadratic complexity in the computation of multi-head self-attention (MHSA). On the other hand, causal convolution, which uses no future context, has become the de facto module in the streaming Conformer. In this paper, we propose SSCFormer to push the limit of the chunk-wise Conformer for streaming ASR using two techniques: 1) a novel cross-chunk context generation method, named the Sequential Sampling Chunk (SSC) scheme, which re-partitions the regularly partitioned chunks so that long-term contextual interaction can be modeled efficiently within local chunks; and 2) the Chunked Causal Convolution (C2Conv), designed to capture the left context and the chunk-wise future context simultaneously. Evaluations on AISHELL-1 show that an End-to-End (E2E) character error rate (CER) of 5.33% can be achieved, which even outperforms the strong time-restricted baseline U2. Moreover, the chunk-wise MHSA computation in our model enables training with a large batch size and inference with linear complexity.
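
Since the abstract only describes the two components at a high level, the following is a minimal, illustrative sketch of one plausible reading of them, not the authors' implementation. `ssc_repartition` shows how frames from regular chunks could be interleaved so that chunk-local attention mixes long-range context, `chunk_attention_mask` shows the block-diagonal attention pattern that makes chunk-wise MHSA linear in sequence length, and `c2conv_context` shows a context pattern in the spirit of C2Conv (left context plus in-chunk future context). All function names, the (batch, time, feature) layout, and the concrete sampling rule are assumptions.

```python
# Illustrative-only sketch (PyTorch); not the authors' implementation.
import torch


def ssc_repartition(x: torch.Tensor, chunk_size: int) -> torch.Tensor:
    """One plausible reading of the Sequential Sampling Chunk (SSC) scheme:
    split frames into regular chunks, then regroup them by sampling one frame
    from each chunk in turn, so every re-partitioned chunk mixes frames from
    across the utterance and chunk-local MHSA can model long-range context."""
    b, t, d = x.shape
    assert t % chunk_size == 0, "pad the sequence to a multiple of chunk_size"
    n_chunks = t // chunk_size
    chunks = x.view(b, n_chunks, chunk_size, d)
    return chunks.transpose(1, 2).reshape(b, t, d)  # interleaved re-partition


def chunk_attention_mask(t: int, chunk_size: int) -> torch.Tensor:
    """Block-diagonal mask (True = attend): MHSA is computed only inside each
    chunk, so attention cost grows linearly with the sequence length."""
    chunk_id = torch.arange(t) // chunk_size
    return chunk_id.unsqueeze(1) == chunk_id.unsqueeze(0)


def c2conv_context(t: int, chunk_size: int, kernel_size: int) -> torch.Tensor:
    """Context pattern in the spirit of the Chunked Causal Convolution (C2Conv):
    frame i may use left context anywhere within the kernel's reach, but
    future context only from frames inside its own chunk."""
    idx = torch.arange(t)
    offset = idx.unsqueeze(0) - idx.unsqueeze(1)  # key index minus query index
    same_chunk = (idx // chunk_size).unsqueeze(1) == (idx // chunk_size).unsqueeze(0)
    past_ok = (offset <= 0) & (offset > -kernel_size)
    future_ok = (offset > 0) & (offset < kernel_size) & same_chunk
    return past_ok | future_ok


if __name__ == "__main__":
    x = torch.randn(2, 8, 4)                       # toy batch: 8 frames, dim 4
    print(ssc_repartition(x, chunk_size=4).shape)  # torch.Size([2, 8, 4])
    print(chunk_attention_mask(8, chunk_size=4).int())
    print(c2conv_context(8, chunk_size=4, kernel_size=3).int())
```

A real streaming encoder would additionally carry a left-context cache across chunks at inference time; the sketch above omits that bookkeeping and only illustrates the context patterns.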

References (28)
  1. A. Gulati, J. Qin, C.-C. Chiu, N. Parmar, Y. Zhang, J. Yu, W. Han, S. Wang, Z. Zhang, Y. Wu, and R. Pang, “Conformer: Convolution-augmented transformer for speech recognition,” in INTERSPEECH, 2020, pp. 5036–5040.
  2. B. Zhang, D. Wu, Z. Yao, X. Wang, and F. Y. et al., “Unified streaming and non-streaming two-pass end-to-end model for speech recognition,” arXiv preprint arXiv:2012.05481, 2020.
  3. A. Vaswani, N. M. Shazeer, N. Parmar, J. Uszkoreit, L. Jones, A. N. Gomez, L. Kaiser, and I. Polosukhin, “Attention is all you need,” in Advances in Neural Information Processing Systems, vol. 30, 2017.
  4. J. Li, G. Ye, A. Das, R. Zhao, and Y. Gong, “Advancing acoustic-to-word CTC model,” in ICASSP, 2018.
  5. A. Graves, S. Fernández, F. J. Gomez, and J. Schmidhuber, “Connectionist temporal classification: Labelling unsegmented sequence data with recurrent neural networks,” in ICML, 2006.
  6. E. Battenberg, J. Chen, R. Child, A. Coates, Y. G. Y. Li, H. Liu, S. Satheesh, A. Sriram, and Z. Zhu, “Exploring neural transducers for end-to-end speech recognition,” in ASRU, 2017, pp. 206–213.
  7. J. Li, Y. Wu, Y. Gaur, C. Wang, R. Zhao, and S. Liu, “On the comparison of popular end-to-end models for large scale speech recognition,” in INTERSPEECH, 2020, pp. 2846–2890.
  8. P. Guo, F. Boyer, X. Chang, T. Hayashi, and Y. H. et al., “Recent developments on ESPnet toolkit boosted by Conformer,” in ICASSP, 2021, pp. 5874–5878.
  9. K. An, H. Zheng, Z. Ou, H. Xiang, K. Ding, and G. Wan, “CUSIDE: Chunking, simulating future context and decoding for streaming ASR,” in INTERSPEECH, 2022, pp. 2103–2107.
  10. N. Moritz, T. Hori, and J. L. Roux, “Dual causal/non-causal self-attention for streaming end-to-end speech recognition,” in INTERSPEECH, 2021.
  11. Z. Tian, J. Yi, Y. Bai, J. Tao, S. Zhang, and Z. Wen, “Synchronous transformers for end-to-end speech recognition,” in ICASSP, 2020, pp. 7884–7888.
  12. C. Wang, Y. Wu, S. Liu, J. Li, L. Lu, G. Ye, and M. Zhou, “Reducing the latency of end-to-end streaming speech recognition models with a scout network,” in INTERSPEECH, 2020, pp. 1292–1296.
  13. X. Chen, Y. Wu, Z. Wang, S. Liu, and J. Li, “Developing real-time streaming transformer transducer for speech recognition on large-scale dataset,” in ICASSP, 2021, pp. 5904–5908.
  14. M. Li, C. Zorila, and R. Doddipatla, “Head-synchronous decoding for transformer-based streaming ASR,” in ICASSP, 2021, pp. 5909–5913.
  15. N. Moritz, T. Hori, and J. L. Roux, “Streaming automatic speech recognition with the transformer model,” in ICASSP, 2020, pp. 6074–6078.
  16. J. Yu, W. Han, A. Gulati, C.-C. Chiu, and B. L. et al., “Universal ASR: Unify and improve streaming ASR with full-context modeling,” in ICLR, 2021.
  17. D. Wu, B. Zhang, C. Yang, Z. Peng, and W. X. et al., “U2++: Unified two-pass bidirectional end-to-end model for speech recognition,” arXiv preprint arXiv:2106.05642, 2021.
  18. Z. Wang and W. Y. et al., “WNARS: WFST-based non-autoregressive streaming end-to-end speech recognition,” arXiv preprint arXiv:2104.03587, 2021.
  19. F. Weninger, M. Gaudesi, M. A. Haidar, N. Ferri, and J. A.-F. et al., “Conformer with dual-mode chunked attention for joint online and offline ASR,” in INTERSPEECH, 2022.
  20. C. Wu, Y. Wang, Y. Shi, C.-F. Yeh, and F. Zhang, “Streaming transformer-based acoustic models using self-attention with augmented memory,” in INTERSPEECH, 2020, pp. 2079–2083.
  21. S. Zhang, Z. Gao, H. Luo, M. Lei, J. Gao, Z. Yan, and L. Xie, “Streaming chunk-aware multihead attention for online end-to-end speech recognition,” in INTERSPEECH, 2020.
  22. H. Inaguma, M. Mimura, and T. Kawahara, “Enhancing monotonic multihead attention for streaming ASR,” in INTERSPEECH, 2020.
  23. Y. Shi, Y. Wang, and C. W. et al., “Emformer: Efficient memory transformer based acoustic model for low latency streaming speech recognition,” in ICASSP, 2021, pp. 6783–6787.
  24. F. Wang and B. Xu, “Shifted chunk encoder for transformer based streaming end-to-end ASR,” in International Conference on Neural Information Processing, 2022.
  25. D. S. Park, W. Chan, Y. Zhang, C.-C. Chiu, and B. Z. et al., “SpecAugment: A simple data augmentation method for automatic speech recognition,” in INTERSPEECH, 2019.
  26. H. Bu, J. Du, X. Na, B. Wu, and H. Zheng, “AISHELL-1: An open-source Mandarin speech corpus and a speech recognition baseline,” in O-COCOSDA, 2017, pp. 1–5.
  27. B. Zhang, D. Wu, Z. Peng, X. Song, Z. Yao, H. Lv, L. Xie, C. Yang, F. Pan, and J. Niu, “WeNet 2.0: More productive end-to-end speech recognition toolkit,” arXiv preprint arXiv:2203.15455, 2022.
  28. S. Kim, A. Gholami, A. E. Shaw, N. Lee, K. Mangalam, J. Malik, M. W. Mahoney, and K. Keutzer, “Squeezeformer: An efficient transformer for automatic speech recognition,” arXiv preprint arXiv:2206.00888, 2022.
