Transformer Based Deliberation for Two-Pass Speech Recognition (2101.11577v1)

Published 27 Jan 2021 in cs.CL

Abstract: Interactive speech recognition systems must generate words quickly while also producing accurate results. Two-pass models excel at these requirements by employing a first-pass decoder that quickly emits words, and a second-pass decoder that requires more context but is more accurate. Previous work has established that a deliberation network can be an effective second-pass model. The model attends to two kinds of inputs at once: encoded audio frames and the hypothesis text from the first-pass model. In this work, we explore using transformer layers instead of long short-term memory (LSTM) layers for deliberation rescoring. In transformer layers, we generalize the "encoder-decoder" attention to attend to both encoded audio and first-pass text hypotheses. The output context vectors are then combined by a merger layer. Compared to LSTM-based deliberation, our best transformer deliberation achieves a 7% relative word error rate improvement along with a 38% reduction in computation. We also compare against non-deliberation transformer rescoring, and find a 9% relative improvement.
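
The two-source attention described in the abstract can be illustrated with a minimal sketch. It is not the authors' implementation: it uses a single attention head with no learned projections, and it assumes the merger layer is a plain concatenation, which the abstract does not specify. The names `attend` and `deliberation_context` and the toy vectors are hypothetical.

```python
import math


def softmax(scores):
    """Numerically stable softmax over a list of floats."""
    m = max(scores)
    exps = [math.exp(s - m) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]


def attend(query, keys, values):
    """Single-head scaled dot-product attention for one query vector."""
    scale = math.sqrt(len(query))
    scores = [sum(q * k for q, k in zip(query, key)) / scale for key in keys]
    weights = softmax(scores)
    dim = len(values[0])
    return [sum(w * v[i] for w, v in zip(weights, values)) for i in range(dim)]


def deliberation_context(query, audio_enc, hyp_enc):
    """Attend to each source separately, then merge the two context vectors.

    Assumption: the merger is a concatenation; a real merger layer could
    instead sum the vectors or apply a learned projection.
    """
    audio_ctx = attend(query, audio_enc, audio_enc)  # context from encoded audio frames
    text_ctx = attend(query, hyp_enc, hyp_enc)       # context from first-pass hypothesis text
    return audio_ctx + text_ctx                      # list concatenation


# Toy inputs: one decoder query, three audio frames, two hypothesis tokens.
query = [1.0, 0.0]
audio_enc = [[1.0, 0.0], [0.0, 1.0], [1.0, 1.0]]
hyp_enc = [[0.5, 0.5], [0.5, 0.5]]
context = deliberation_context(query, audio_enc, hyp_enc)
```

Here `context` has twice the dimension of each source encoding, because one context vector per source is concatenated before being passed on to the rest of the decoder layer.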


Authors (4)

Ke Hu (57 papers)
Ruoming Pang (59 papers)
Tara N. Sainath (79 papers)
Trevor Strohman (38 papers)

Citations (37)
