Do End-to-End Speech Recognition Models Care About Context? (2102.09928v1)

Published 17 Feb 2021 in eess.AS, cs.CL, cs.LG, and cs.SD

Abstract: The two most common paradigms for end-to-end speech recognition are connectionist temporal classification (CTC) and attention-based encoder-decoder (AED) models. It has been argued that the latter is better suited for learning an implicit language model. We test this hypothesis by measuring temporal context sensitivity and evaluate how the models perform when we constrain the amount of contextual information in the audio input. We find that the AED model is indeed more context sensitive, but that the gap can be closed by adding self-attention to the CTC model. Furthermore, the two models perform similarly when contextual information is constrained. Finally, in contrast to previous research, our results show that the CTC model is highly competitive on WSJ and LibriSpeech without the help of an external language model.
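To make the CTC paradigm mentioned above concrete, the following minimal sketch (not the authors' code) illustrates CTC's many-to-one label mapping: a frame-level path over the extended alphabet is collapsed by first merging repeated symbols and then removing the blank symbol. This per-frame independence is one reason CTC was thought to learn a weaker implicit language model than AED decoders.

```python
def ctc_collapse(path, blank="-"):
    """Collapse a frame-level CTC path into a transcript:
    merge consecutive repeats, then drop blanks."""
    out = []
    prev = None
    for sym in path:
        # Append only when the symbol changes and is not the blank.
        if sym != prev and sym != blank:
            out.append(sym)
        prev = sym
    return "".join(out)

# Many distinct frame-level paths map to the same transcript:
print(ctc_collapse("hh-e-ll-lo"))  # hello
print(ctc_collapse("h-e-l-l-o"))   # hello
```

The CTC loss marginalizes over all such paths for the target transcript; the sketch shows only the deterministic collapse function that defines that path-to-label mapping.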

Authors (6)
  1. Lasse Borgholt (11 papers)
  2. Jakob Drachmann Havtorn (5 papers)
  3. Željko Agić (7 papers)
  4. Anders Søgaard (120 papers)
  5. Lars Maaløe (23 papers)
  6. Christian Igel (47 papers)
Citations (7)