Hierarchical Transformer-based Large-Context End-to-end ASR with Large-Context Knowledge Distillation (2102.07935v1)

Published 16 Feb 2021 in cs.CL and cs.LG

Abstract: We present a novel large-context end-to-end automatic speech recognition (E2E-ASR) model and an effective training method for it based on knowledge distillation. Common E2E-ASR models have mainly focused on utterance-level processing, in which each utterance is transcribed independently. In contrast, large-context E2E-ASR models, which take long-range sequential contexts beyond utterance boundaries into account, can effectively handle sequences of utterances such as discourses and conversations. However, the transformer architecture, which has recently achieved state-of-the-art ASR performance among utterance-level ASR systems, has not yet been introduced into large-context ASR systems. The transformer architecture can be expected to effectively capture not only input speech contexts but also long-range sequential contexts beyond utterance boundaries. Therefore, this paper proposes a hierarchical transformer-based large-context E2E-ASR model that combines the transformer architecture with hierarchical encoder-decoder based large-context modeling. In addition, to enable the proposed model to use long-range sequential contexts, we also propose large-context knowledge distillation, which distills knowledge from a pre-trained large-context language model during training. We evaluate the effectiveness of the proposed model and training method on Japanese discourse ASR tasks.
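
The abstract describes two components: a hierarchical encoder-decoder in which an utterance-level transformer is conditioned on a context representation carried across utterance boundaries, and a knowledge-distillation loss that transfers knowledge from a pre-trained large-context language model during training. The paper page includes no code, so the following is only a minimal PyTorch sketch of how these two ideas could be wired together; all module names, dimensions, the mean-pooled context summary, and the teacher interface are illustrative assumptions, not the authors' implementation.

```python
# Hedged sketch (not the authors' code): an utterance-level Transformer
# encoder-decoder whose decoder memory is augmented with a context vector
# carried across utterances, plus a distillation loss against a pre-trained
# large-context language model (teacher). Names and sizes are assumptions.
import torch
import torch.nn as nn
import torch.nn.functional as F


class HierarchicalContextASR(nn.Module):
    def __init__(self, vocab_size=4000, d_model=256, nhead=4, num_layers=4):
        super().__init__()
        enc_layer = nn.TransformerEncoderLayer(d_model, nhead, batch_first=True)
        dec_layer = nn.TransformerDecoderLayer(d_model, nhead, batch_first=True)
        self.speech_encoder = nn.TransformerEncoder(enc_layer, num_layers)
        self.decoder = nn.TransformerDecoder(dec_layer, num_layers)
        self.embed = nn.Embedding(vocab_size, d_model)
        self.out = nn.Linear(d_model, vocab_size)

    def forward(self, speech_feats, prev_tokens, context_vec):
        # speech_feats: (B, T, d_model) acoustic features already projected to d_model
        # prev_tokens:  (B, U) decoder input token ids for the current utterance
        # context_vec:  (B, 1, d_model) summary of preceding utterances (long-range context)
        memory = self.speech_encoder(speech_feats)
        memory = torch.cat([context_vec, memory], dim=1)  # inject context beyond utterance boundary
        dec = self.decoder(self.embed(prev_tokens), memory)
        logits = self.out(dec)
        # Carry a new context vector forward; mean pooling is an assumed, simple choice.
        new_context = dec.mean(dim=1, keepdim=True)
        return logits, new_context


def large_context_kd_loss(student_logits, teacher_logits, targets, alpha=0.5, T=2.0):
    """Cross-entropy on reference transcripts combined with KL divergence to the
    token distribution of a frozen large-context language-model teacher."""
    ce = F.cross_entropy(student_logits.transpose(1, 2), targets)
    kd = F.kl_div(
        F.log_softmax(student_logits / T, dim=-1),
        F.softmax(teacher_logits / T, dim=-1),
        reduction="batchmean",
    ) * (T * T)
    return (1 - alpha) * ce + alpha * kd
```

In such a setup, the utterances of a discourse would be decoded in order, with `new_context` from one utterance fed as `context_vec` to the next, while `teacher_logits` come from the frozen large-context language model; the interpolation weight `alpha` and temperature `T` are assumed hyperparameters.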

Authors (6)
  1. Ryo Masumura (28 papers)
  2. Naoki Makishima (17 papers)
  3. Mana Ihori (16 papers)
  4. Akihiko Takashima (16 papers)
  5. Tomohiro Tanaka (37 papers)
  6. Shota Orihashi (13 papers)
Citations (29)