A spelling correction model for end-to-end speech recognition (1902.07178v1)

Published 19 Feb 2019 in eess.AS, cs.AI, cs.CL, cs.LG, and cs.SD

Abstract: Attention-based sequence-to-sequence models for speech recognition jointly train an acoustic model, language model (LM), and alignment mechanism using a single neural network, requiring only parallel audio-text pairs. Consequently, the LM component of the end-to-end model is trained only on transcribed audio-text pairs, which degrades performance, especially on rare words. While a variety of prior work has looked at incorporating an external LM trained on text-only data into the end-to-end framework, none has taken into account the characteristic error distribution made by the model. In this paper, we propose a novel approach to utilizing text-only data by training a spelling correction (SC) model to explicitly correct those errors. On the LibriSpeech dataset, we demonstrate that the proposed model yields an 18.6% relative improvement in WER over the baseline when directly correcting the top ASR hypothesis, and a 29.0% relative improvement when further rescoring an expanded n-best list with an external LM.
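The n-best rescoring step mentioned in the abstract can be illustrated with a minimal sketch: each hypothesis carries an ASR score, an external LM score is added with an interpolation weight, and the highest-scoring hypothesis is selected. The scores, weight, and hypothesis texts below are hypothetical placeholders, not values from the paper.

```python
# Minimal sketch of n-best rescoring with an external LM.
# Scores are illustrative log-probabilities; lm_weight is an
# interpolation weight one would tune on a development set.

def rescore_nbest(hypotheses, lm_weight=0.5):
    """Return the hypothesis maximizing asr_score + lm_weight * lm_score."""
    return max(hypotheses, key=lambda h: h["asr_score"] + lm_weight * h["lm_score"])

# Toy n-best list: the acoustically preferred hypothesis is the second
# entry, but the LM favors the first, and rescoring flips the ranking.
nbest = [
    {"text": "the knight rode off", "asr_score": -3.2, "lm_score": -4.0},
    {"text": "the night rode off",  "asr_score": -3.0, "lm_score": -9.5},
]

best = rescore_nbest(nbest)
print(best["text"])  # the knight rode off
```

In the paper's pipeline, the SC model first expands and corrects the hypotheses before a step of this shape picks the final output.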

Authors (3)
  1. Jinxi Guo (15 papers)
  2. Tara N. Sainath (79 papers)
  3. Ron J. Weiss (30 papers)
Citations (135)