Syllable-Based Sequence-to-Sequence Speech Recognition with the Transformer in Mandarin Chinese (1804.10752v2)

Published 28 Apr 2018 in eess.AS, cs.CL, and cs.SD

Abstract: Sequence-to-sequence attention-based models, which integrate an acoustic, pronunciation and language model into a single neural network, have recently shown very promising results on automatic speech recognition (ASR) tasks. Among these models, the Transformer, a sequence-to-sequence attention-based model relying entirely on self-attention without using RNNs or convolutions, achieves a new single-model state-of-the-art BLEU score on neural machine translation (NMT) tasks. Given the outstanding performance of the Transformer, we extend it to speech and adopt it as the basic architecture of our sequence-to-sequence attention-based model for Mandarin Chinese ASR tasks. Furthermore, we compare a syllable-based model with a context-independent phoneme (CI-phoneme) based model using the Transformer in Mandarin Chinese. Additionally, a greedy cascading decoder with the Transformer is proposed for mapping CI-phoneme sequences and syllable sequences into word sequences. Experiments on the HKUST dataset demonstrate that the syllable-based model with the Transformer performs better than its CI-phoneme based counterpart, achieving a character error rate (CER) of 28.77%, which is competitive with the state-of-the-art CER of 28.0% obtained by the joint CTC-attention based encoder-decoder network.
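
To make the modeling setup concrete, below is a minimal sketch (not the authors' implementation) of a Transformer encoder-decoder that maps acoustic feature frames to syllable tokens and decodes greedily, in the spirit of the syllable-based model and greedy decoding described in the abstract. The feature dimension, syllable vocabulary size, and hyperparameters are illustrative assumptions, and positional encodings are omitted for brevity.

```python
import torch
import torch.nn as nn


class SyllableTransformerASR(nn.Module):
    """Toy Transformer mapping acoustic frames to syllable tokens (illustrative only)."""

    def __init__(self, feat_dim=80, vocab_size=1400, d_model=256, nhead=4,
                 num_layers=6, sos_id=1, eos_id=2):
        super().__init__()
        self.sos_id, self.eos_id = sos_id, eos_id
        self.input_proj = nn.Linear(feat_dim, d_model)   # acoustic frames -> model dim
        self.embed = nn.Embedding(vocab_size, d_model)   # syllable token embeddings
        self.transformer = nn.Transformer(
            d_model=d_model, nhead=nhead,
            num_encoder_layers=num_layers, num_decoder_layers=num_layers,
            batch_first=True)
        self.out_proj = nn.Linear(d_model, vocab_size)   # logits over the syllable set

    def forward(self, feats, tgt_tokens):
        # feats: (B, T, feat_dim); tgt_tokens: (B, U) teacher-forced syllable ids.
        memory = self.transformer.encoder(self.input_proj(feats))
        tgt_mask = self.transformer.generate_square_subsequent_mask(tgt_tokens.size(1))
        dec = self.transformer.decoder(self.embed(tgt_tokens), memory, tgt_mask=tgt_mask)
        return self.out_proj(dec)

    @torch.no_grad()
    def greedy_decode(self, feats, max_len=100):
        # Greedy decoding: emit the arg-max syllable at each step until <eos>.
        memory = self.transformer.encoder(self.input_proj(feats))
        ys = torch.full((feats.size(0), 1), self.sos_id, dtype=torch.long)
        for _ in range(max_len):
            tgt_mask = self.transformer.generate_square_subsequent_mask(ys.size(1))
            dec = self.transformer.decoder(self.embed(ys), memory, tgt_mask=tgt_mask)
            next_tok = self.out_proj(dec[:, -1]).argmax(dim=-1, keepdim=True)
            ys = torch.cat([ys, next_tok], dim=1)
            if (next_tok == self.eos_id).all():
                break
        return ys


model = SyllableTransformerASR()
hyps = model.greedy_decode(torch.randn(2, 300, 80))   # 2 utterances, 300 frames each
```

In the paper's cascading setup, the decoded syllable (or CI-phoneme) sequences are subsequently mapped to word sequences; the sketch above only covers the acoustic-to-syllable stage.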

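The reported character error rate is the standard edit-distance metric used for Mandarin ASR. A small sketch of how it is typically computed (not code from the paper):

```python
def cer(ref: str, hyp: str) -> float:
    """Character error rate: Levenshtein edit distance / reference length."""
    r, h = list(ref), list(hyp)
    d = [[0] * (len(h) + 1) for _ in range(len(r) + 1)]
    for i in range(len(r) + 1):
        d[i][0] = i                      # deletions only
    for j in range(len(h) + 1):
        d[0][j] = j                      # insertions only
    for i in range(1, len(r) + 1):
        for j in range(1, len(h) + 1):
            sub = d[i - 1][j - 1] + (r[i - 1] != h[j - 1])
            d[i][j] = min(sub, d[i - 1][j] + 1, d[i][j - 1] + 1)
    return d[len(r)][len(h)] / max(len(r), 1)


print(cer("今天天气很好", "今天天汽好"))   # 2 edits / 6 characters ≈ 0.333
```
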
Authors (4)
  1. Shiyu Zhou (32 papers)
  2. Linhao Dong (16 papers)
  3. Shuang Xu (59 papers)
  4. Bo Xu (212 papers)
Citations (115)