Multi-encoder multi-resolution framework for end-to-end speech recognition (1811.04897v1)

Published 12 Nov 2018 in cs.CL

Abstract: Attention-based methods and Connectionist Temporal Classification (CTC) networks have been promising research directions for end-to-end Automatic Speech Recognition (ASR). The joint CTC/Attention model has achieved great success by utilizing both architectures during multi-task training and joint decoding. In this work, we present a novel Multi-Encoder Multi-Resolution (MEMR) framework based on the joint CTC/Attention model. Two heterogeneous encoders with different architectures, temporal resolutions and separate CTC networks work in parallel to extract complementary acoustic information. A hierarchical attention mechanism is then used to combine the encoder-level information. To demonstrate the effectiveness of the proposed model, experiments are conducted on Wall Street Journal (WSJ) and CHiME-4, resulting in a relative Word Error Rate (WER) reduction of 18.0-32.1%. Moreover, the proposed MEMR model achieves 3.6% WER on the WSJ eval92 test set, which is the best WER reported for an end-to-end system on this benchmark.
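
To make the hierarchical attention idea in the abstract concrete, here is a minimal sketch in plain Python. It is not the authors' implementation. The abstract does not say how attention scores are computed, so plain dot-product scoring is assumed here, and the names `attend`, `hierarchical_attention`, `enc_a`, `enc_b` and `query` are hypothetical. Each encoder's frames are first summarized into one context vector (frame-level attention), and a second attention over those per-encoder contexts then decides how much each encoder contributes (stream-level attention).

```python
import math


def softmax(scores):
    """Numerically stable softmax over a list of floats."""
    m = max(scores)
    exps = [math.exp(s - m) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]


def dot(a, b):
    return sum(x * y for x, y in zip(a, b))


def attend(query, frames):
    """Frame-level attention: weighted sum of one encoder's frames.

    Dot-product scoring is an assumption made for this sketch.
    """
    weights = softmax([dot(query, f) for f in frames])
    dim = len(frames[0])
    return [sum(w * f[d] for w, f in zip(weights, frames)) for d in range(dim)]


def hierarchical_attention(query, encoder_outputs):
    """Fuse several encoders: one context per encoder, then attend over contexts."""
    contexts = [attend(query, frames) for frames in encoder_outputs]
    stream_weights = softmax([dot(query, c) for c in contexts])
    dim = len(contexts[0])
    fused = [
        sum(w * c[d] for w, c in zip(stream_weights, contexts))
        for d in range(dim)
    ]
    return fused, stream_weights


# Hypothetical toy inputs: two encoders with different temporal resolutions.
enc_a = [[1.0, 0.0], [0.5, 0.5], [0.0, 1.0], [1.0, 1.0]]  # 4 frames
enc_b = [[0.2, 0.8], [0.9, 0.1]]                          # 2 frames (subsampled)
query = [1.0, 0.0]                                        # stand-in decoder state

fused, stream_weights = hierarchical_attention(query, [enc_a, enc_b])
```

Because each encoder is reduced to a single context vector before fusion, the two encoders can run at different frame rates without any alignment step, which is what lets a multi-resolution setup of this kind combine them.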


Authors (6)

Ruizhi Li (9 papers)
Xiaofei Wang (138 papers)
Sri Harish Mallidi (7 papers)
Takaaki Hori (41 papers)
Shinji Watanabe (416 papers)
Hynek Hermansky (15 papers)

Citations (13)
