A systematic comparison of grapheme-based vs. phoneme-based label units for encoder-decoder-attention models (2005.09336v3)

Published 19 May 2020 in eess.AS, cs.CL, cs.LG, and cs.NE

Abstract: Following the rationale of end-to-end modeling, CTC, RNN-T or encoder-decoder-attention models for automatic speech recognition (ASR) use graphemes or grapheme-based subword units based on e.g. byte-pair encoding (BPE). The mapping from pronunciation to spelling is learned completely from data. In contrast to this, classical approaches to ASR employ secondary knowledge sources in the form of phoneme lists to define phonetic output labels and pronunciation lexica. In this work, we do a systematic comparison between grapheme- and phoneme-based output labels for an encoder-decoder-attention ASR model. We investigate the use of single phonemes as well as BPE-based phoneme groups as output labels of our model. To preserve a simplified and efficient decoder design, we also extend the phoneme set by auxiliary units to be able to distinguish homophones. Experiments performed on the Switchboard 300h and LibriSpeech benchmarks show that phoneme-based modeling is competitive to grapheme-based encoder-decoder-attention modeling.

Authors (6)
  1. Mohammad Zeineldeen (16 papers)
  2. Albert Zeyer (20 papers)
  3. Wei Zhou (311 papers)
  4. Thomas Ng (9 papers)
  5. Ralf Schlüter (73 papers)
  6. Hermann Ney (104 papers)
Citations (2)
