A systematic comparison of grapheme-based vs. phoneme-based label units for encoder-decoder-attention models (2005.09336v3)

Published 19 May 2020 in eess.AS, cs.CL, cs.LG, and cs.NE

Abstract: Following the rationale of end-to-end modeling, CTC, RNN-T or encoder-decoder-attention models for automatic speech recognition (ASR) use graphemes or grapheme-based subword units based on e.g. byte-pair encoding (BPE). The mapping from pronunciation to spelling is learned completely from data. In contrast to this, classical approaches to ASR employ secondary knowledge sources in the form of phoneme lists to define phonetic output labels and pronunciation lexica. In this work, we do a systematic comparison between grapheme- and phoneme-based output labels for an encoder-decoder-attention ASR model. We investigate the use of single phonemes as well as BPE-based phoneme groups as output labels of our model. To preserve a simplified and efficient decoder design, we also extend the phoneme set by auxiliary units to be able to distinguish homophones. Experiments performed on the Switchboard 300h and LibriSpeech benchmarks show that phoneme-based modeling is competitive to grapheme-based encoder-decoder-attention modeling.

Authors (6)
  1. Mohammad Zeineldeen (16 papers)
  2. Albert Zeyer (20 papers)
  3. Wei Zhou (311 papers)
  4. Thomas Ng (9 papers)
  5. Ralf Schlüter (73 papers)
  6. Hermann Ney (104 papers)
Citations (2)
