
Improving the fusion of acoustic and text representations in RNN-T (2201.10240v1)

Published 25 Jan 2022 in eess.AS, cs.AI, cs.CL, and cs.SD

Abstract: The recurrent neural network transducer (RNN-T) has recently become the mainstream end-to-end approach for streaming automatic speech recognition (ASR). To estimate the output distributions over subword units, RNN-T uses a fully connected layer as the joint network to fuse the acoustic representations extracted using the acoustic encoder with the text representations obtained using the prediction network based on the previous subword units. In this paper, we propose to use gating, bilinear pooling, and a combination of them in the joint network to produce more expressive representations to feed into the output layer. A regularisation method is also proposed to enable better acoustic encoder training by reducing the gradients back-propagated into the prediction network at the beginning of RNN-T training. Experimental results on a multilingual ASR setting for voice search over nine languages show that the joint use of the proposed methods can result in 4% to 5% relative word error rate reductions with only a few million extra parameters.
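
To make the fusion step in the abstract concrete, here is a minimal, dependency-free Python sketch of three ways a joint network could combine one acoustic vector with one text vector before the output layer. The abstract gives no equations, so the gating and low-rank bilinear forms below are generic illustrative variants and should not be read as the authors' formulation. All function names, dimensions, and the random weights are assumptions made for this example.

```python
import math
import random


def matvec(W, x):
    """Multiply matrix W (a list of rows) by vector x."""
    return [sum(w * v for w, v in zip(row, x)) for row in W]


def rand_matrix(rows, cols, rng):
    """Small random weights; a stand-in for trained parameters."""
    return [[rng.uniform(-0.5, 0.5) for _ in range(cols)] for _ in range(rows)]


def sigmoid(z):
    return 1.0 / (1.0 + math.exp(-z))


def softmax(z):
    m = max(z)  # subtract the max for numerical stability
    exps = [math.exp(v - m) for v in z]
    total = sum(exps)
    return [e / total for e in exps]


def project(W, x):
    """Linear projection into the joint space followed by tanh."""
    return [math.tanh(v) for v in matvec(W, x)]


def joint_baseline(a, p, W_a, W_p):
    """Fully connected fusion: tanh(W_a a + W_p p)."""
    return [math.tanh(x + y) for x, y in zip(matvec(W_a, a), matvec(W_p, p))]


def joint_gated(a, p, W_a, W_p, W_g):
    """Assumed gating form: a sigmoid gate computed from [a; p]
    interpolates elementwise between the two projected inputs."""
    g = [sigmoid(v) for v in matvec(W_g, a + p)]  # a + p concatenates lists
    h_a, h_p = project(W_a, a), project(W_p, p)
    return [gi * x + (1.0 - gi) * y for gi, x, y in zip(g, h_a, h_p)]


def joint_bilinear(a, p, W_a, W_p):
    """Assumed low-rank bilinear pooling: elementwise product of the
    two projections, which captures multiplicative interactions."""
    return [x * y for x, y in zip(project(W_a, a), project(W_p, p))]


# Hypothetical toy sizes, chosen only for illustration.
rng = random.Random(0)
ENC_DIM, PRED_DIM, JOINT_DIM, VOCAB = 4, 3, 5, 6
a = [rng.uniform(-1, 1) for _ in range(ENC_DIM)]   # acoustic encoder output
p = [rng.uniform(-1, 1) for _ in range(PRED_DIM)]  # prediction network output
W_a = rand_matrix(JOINT_DIM, ENC_DIM, rng)
W_p = rand_matrix(JOINT_DIM, PRED_DIM, rng)
W_g = rand_matrix(JOINT_DIM, ENC_DIM + PRED_DIM, rng)
W_out = rand_matrix(VOCAB, JOINT_DIM, rng)

fused = {
    "baseline": joint_baseline(a, p, W_a, W_p),
    "gated": joint_gated(a, p, W_a, W_p, W_g),
    "bilinear": joint_bilinear(a, p, W_a, W_p),
}
# The output layer maps each fused vector to a distribution over subword units.
probs = {name: softmax(matvec(W_out, h)) for name, h in fused.items()}
```

In a real RNN-T the fusion runs for every pair of acoustic frame and label position, and the weights are learned; the sketch handles a single pair with fixed random weights so the three fusion styles can be compared side by side.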

Authors (5)
  1. Chao Zhang (907 papers)
  2. Bo Li (1107 papers)
  3. Zhiyun Lu (19 papers)
  4. Tara N. Sainath (79 papers)
  5. Shuo-yiin Chang (25 papers)
Citations (12)
