A Comparison of Methods for OOV-word Recognition on a New Public Dataset (2107.08091v1)

Published 16 Jul 2021 in cs.CL, cs.SD, and eess.AS

Abstract: A common problem for automatic speech recognition systems is how to recognize words that they did not see during training. Currently there is no established method of evaluating different techniques for tackling this problem. We propose using the CommonVoice dataset to create test sets for multiple languages which have a high out-of-vocabulary (OOV) ratio relative to a training set, and we release a new tool for calculating relevant performance metrics. We then evaluate, within the context of a hybrid ASR system, how much better subword models are at recognizing OOVs, and how much benefit one can get from incorporating OOV-word information into an existing system by modifying WFSTs. Additionally, we propose a new method for modifying a subword-based language model so as to better recognize OOV words. We showcase very large improvements in OOV-word recognition and make both the data and code available.
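
The abstract refers to test sets with a high OOV ratio relative to a training set. As a rough illustration of that quantity only, the sketch below computes a token-level OOV ratio from plain-text transcripts. It is not the tool released with the paper: the function name `oov_ratio`, the whitespace tokenization, and the lowercasing are all assumptions made for this example, and the paper's exact definition may differ.

```python
from typing import Iterable


def oov_ratio(train_sentences: Iterable[str], test_sentences: Iterable[str]) -> float:
    """Fraction of test word tokens whose word never occurs in the training text.

    Assumption: words are whitespace-separated and compared case-insensitively.
    """
    # Vocabulary = set of distinct words seen anywhere in the training transcripts.
    train_vocab = {w for s in train_sentences for w in s.lower().split()}

    # Flatten the test transcripts into a list of word tokens.
    test_tokens = [w for s in test_sentences for w in s.lower().split()]
    if not test_tokens:
        return 0.0

    # Count test tokens that are missing from the training vocabulary.
    n_oov = sum(1 for w in test_tokens if w not in train_vocab)
    return n_oov / len(test_tokens)


if __name__ == "__main__":
    train = ["the cat sat", "a dog ran"]
    test = ["the zebra sat", "a dog yawned"]
    # "zebra" and "yawned" are unseen, so 2 of the 6 test tokens are OOV.
    print(f"OOV ratio: {oov_ratio(train, test):.3f}")
```

A type-level variant (distinct unseen words divided by distinct test words) would be an equally plausible reading; the token-level version is shown here only because it is the simpler one to state.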

Authors (3)
  1. Rudolf A. Braun
  2. Srikanth Madikeri
  3. Petr Motlicek
Citations (6)