
Reducing the gap between streaming and non-streaming Transducer-based ASR by adaptive two-stage knowledge distillation (2306.15171v1)

Published 27 Jun 2023 in cs.CL

Abstract: The transducer is one of the mainstream frameworks for streaming speech recognition. A performance gap exists between streaming and non-streaming transducer models because the streaming model has limited context. An effective way to reduce this gap is to make their hidden representations and output distributions consistent, which can be achieved by hierarchical knowledge distillation. However, it is difficult to enforce both consistencies simultaneously, because learning the output distribution depends on the hidden representation. In this paper, we propose an adaptive two-stage knowledge distillation method consisting of hidden-layer learning and output-layer learning. In the first stage, we learn hidden representations with full context by applying a mean squared error loss. In the second stage, we design a power-transformation-based adaptive smoothness method to learn a stable output distribution. The method achieves a 19% relative reduction in word error rate and a faster response for the first token compared with the original streaming model on the LibriSpeech corpus.
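
The two distillation losses described in the abstract can be sketched as follows. This is a minimal illustration, not the paper's implementation: the function names are hypothetical, and the exact form of the "power transformation based adaptive smoothness" is an assumption (here, raising the teacher's probabilities to a power gamma in (0, 1] and renormalizing, which flattens overconfident distributions).

```python
import numpy as np

def hidden_mse_loss(student_hidden, teacher_hidden):
    """Stage 1 (assumed form): match streaming-student hidden states
    to the full-context teacher's hidden states with mean squared error."""
    return np.mean((student_hidden - teacher_hidden) ** 2)

def power_smoothed_kl_loss(student_logits, teacher_probs, gamma=0.5):
    """Stage 2 (assumed form): KL divergence from a power-smoothed
    teacher distribution to the student's output distribution.

    gamma < 1 flattens sharp teacher distributions, one plausible
    reading of the paper's power-transformation smoothing; gamma = 1
    recovers plain distillation against the raw teacher posteriors.
    """
    # Power transformation followed by renormalization.
    smoothed = teacher_probs ** gamma
    smoothed /= smoothed.sum(axis=-1, keepdims=True)
    # Numerically stable softmax over the student logits.
    z = student_logits - student_logits.max(axis=-1, keepdims=True)
    student_probs = np.exp(z) / np.exp(z).sum(axis=-1, keepdims=True)
    eps = 1e-12
    return np.sum(smoothed * (np.log(smoothed + eps) - np.log(student_probs + eps)))
```

In a two-stage schedule, stage one would optimize only `hidden_mse_loss` so the streaming encoder first absorbs full-context representations, and stage two would switch to (or add) `power_smoothed_kl_loss` on the output layer once the hidden representations have stabilized.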

Authors (11)
  1. Haitao Tang
  2. Yu Fu
  3. Lei Sun
  4. Jiabin Xue
  5. Dan Liu
  6. Yongchao Li
  7. Zhiqiang Ma
  8. Minghui Wu
  9. Jia Pan
  10. Genshun Wan
  11. Ming'en Zhao