One In A Hundred: Select The Best Predicted Sequence from Numerous Candidates for Streaming Speech Recognition (2010.14791v3)
Abstract: RNN-Transducers and improved attention-based encoder-decoder models are widely applied to streaming speech recognition. Compared with these two end-to-end models, the CTC model is more efficient in training and inference. However, it cannot capture the linguistic dependencies between output tokens. Inspired by the success of two-pass end-to-end models, we introduce a transformer decoder and a two-stage inference method into the streaming CTC model. During inference, the CTC decoder first generates many candidates in a streaming fashion. Then the transformer decoder selects the best candidate based on the corresponding acoustic encoded states. The second-stage transformer decoder can be regarded as a conditional language model. We assume that a sufficiently large and diverse set of candidates generated in the first stage can compensate for the CTC model's lack of language modeling ability. All experiments are conducted on AISHELL-1, a Mandarin Chinese dataset. The results show that our proposed model can perform streaming decoding in a fast and straightforward way. Our model reduces the character error rate by up to 20% compared with the baseline CTC model. In addition, our model can also perform non-streaming inference with only a small performance degradation.
- Zhengkun Tian (24 papers)
- Jiangyan Yi (77 papers)
- Ye Bai (28 papers)
- Jianhua Tao (139 papers)
- Shuai Zhang (319 papers)
- Zhengqi Wen (69 papers)
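
The two-stage inference described in the abstract (first-stage CTC candidates, then selection by a second-stage decoder) can be illustrated with a minimal Python sketch. Everything named here is hypothetical and not taken from the paper: `rescore_candidates`, `toy_token_log_prob`, and the bigram table are stand-ins, and the toy scorer ignores the acoustic encoder states that a real transformer decoder would condition on. Scoring a candidate by its summed token log-probabilities is also an assumption; the paper's actual scoring rule may differ.

```python
import math
from typing import Callable, List, Sequence, Tuple

# Hypothetical scorer signature: log p(token | prefix, encoder_states).
TokenScorer = Callable[[Sequence[str], str, object], float]


def rescore_candidates(
    candidates: List[List[str]],
    encoder_states: object,
    token_log_prob: TokenScorer,
) -> Tuple[List[str], float]:
    """Pick the candidate with the highest second-stage score.

    Each candidate is scored as the sum of per-token log-probabilities,
    conditioned on the tokens before it and on the encoder states.
    """
    if not candidates:
        raise ValueError("need at least one first-stage candidate")
    best, best_score = candidates[0], -math.inf
    for cand in candidates:
        score = sum(
            token_log_prob(cand[:i], tok, encoder_states)
            for i, tok in enumerate(cand)
        )
        if score > best_score:
            best, best_score = cand, score
    return best, best_score


# Toy stand-in for the second-stage decoder: a fixed bigram table.
# Assumption for illustration only; it ignores encoder_states.
BIGRAM = {("<s>", "a"): 0.9, ("a", "b"): 0.8, ("a", "c"): 0.1}


def toy_token_log_prob(prefix: Sequence[str], token: str, encoder_states: object) -> float:
    prev = prefix[-1] if prefix else "<s>"
    return math.log(BIGRAM.get((prev, token), 0.05))


# Pretend these came from first-stage CTC decoding.
first_stage = [["a", "c"], ["a", "b"], ["b", "a"]]
best, score = rescore_candidates(first_stage, None, toy_token_log_prob)
print(best, round(score, 4))
```

In this toy setup the second stage prefers `["a", "b"]`, since the bigram table assigns it the highest sequence probability. The design point is that the second stage only ranks a fixed candidate set, so it needs no autoregressive search of its own.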