Papers
Topics
Authors
Recent
Detailed Answer
Quick Answer
Concise responses based on abstracts only
Detailed Answer
Well-researched responses based on abstracts and relevant paper content.
Custom Instructions Pro
Preferences or requirements that you'd like Emergent Mind to consider when generating responses
Gemini 2.5 Flash
Gemini 2.5 Flash 82 tok/s
Gemini 2.5 Pro 52 tok/s Pro
GPT-5 Medium 19 tok/s Pro
GPT-5 High 17 tok/s Pro
GPT-4o 107 tok/s Pro
Kimi K2 174 tok/s Pro
GPT OSS 120B 468 tok/s Pro
Claude Sonnet 4 37 tok/s Pro
2000 character limit reached

Evaluation of real-time transcriptions using end-to-end ASR models (2409.05674v2)

Published 9 Sep 2024 in cs.SD, cs.AI, and cs.CL

Abstract: Automatic Speech Recognition (ASR) or Speech-to-text (STT) has greatly evolved in the last few years. Traditional architectures based on pipelines have been replaced by joint end-to-end (E2E) architectures that simplify and streamline the model training process. In addition, new AI training methods, such as weak-supervised learning have reduced the need for high-quality audio datasets for model training. However, despite all these advancements, little to no research has been done on real-time transcription. In real-time scenarios, the audio is not pre-recorded, and the input audio must be fragmented to be processed by the ASR systems. To achieve real-time requirements, these fragments must be as short as possible to reduce latency. However, audio cannot be split at any point as dividing an utterance into two separate fragments will generate an incorrect transcription. Also, shorter fragments provide less context for the ASR model. For this reason, it is necessary to design and test different splitting algorithms to optimize the quality and delay of the resulting transcription. In this paper, three audio splitting algorithms are evaluated with different ASR models to determine their impact on both the quality of the transcription and the end-to-end delay. The algorithms are fragmentation at fixed intervals, voice activity detection (VAD), and fragmentation with feedback. The results are compared to the performance of the same model, without audio fragmentation, to determine the effects of this division. The results show that VAD fragmentation provides the best quality with the highest delay, whereas fragmentation at fixed intervals provides the lowest quality and the lowest delay. The newly proposed feedback algorithm exchanges a 2-4% increase in WER for a reduction of 1.5-2s delay, respectively, to the VAD splitting.

Citations (3)
List To Do Tasks Checklist Streamline Icon: https://streamlinehq.com

Collections

Sign up for free to add this paper to one or more collections.

Summary

We haven't generated a summary for this paper yet.

Dice Question Streamline Icon: https://streamlinehq.com

Follow-Up Questions

We haven't generated follow-up questions for this paper yet.

X Twitter Logo Streamline Icon: https://streamlinehq.com

Tweets

Don't miss out on important new AI/ML research

See which papers are being discussed right now on X, Reddit, and more:

“Emergent Mind helps me see which AI papers have caught fire online.”

Philip

Philip

Creator, AI Explained on YouTube