Korean Tokenization for Beam Search Rescoring in Speech Recognition (2203.03583v2)

Published 22 Feb 2022 in cs.CL, cs.SD, and eess.AS

Abstract: The performance of automatic speech recognition (ASR) models can be greatly improved by proper beam-search decoding with an external language model (LM). Interest in Korean speech recognition has been growing, but few studies have focused on the decoding procedure. In this paper, we propose a Korean tokenization method for the neural network-based LM used in Korean ASR. Although the common approach is to use the same tokenization for the external LM as for the ASR model, we show that this may not be the best choice for Korean. We propose a new tokenization method that inserts a special token, SkipTC, whenever a Korean syllable has no trailing consonant. With the proposed SkipTC token, the LM input sequence becomes very regularly patterned, so the LM can better learn the linguistic characteristics of Korean. Our experiments show that the proposed approach achieves a lower word error rate than the same LM trained without SkipTC. In addition, we are the first to report ASR performance on the recently introduced large-scale 7,600-hour Korean speech dataset.
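The abstract describes SkipTC only at a high level. Below is a minimal sketch of the idea, assuming jamo-level decomposition of precomposed Hangul syllables via standard Unicode arithmetic; the `<SkipTC>` string, the function name, and the pass-through handling of non-Hangul characters are illustrative assumptions, not the authors' released implementation.

```python
# Sketch of SkipTC-style tokenization (not the authors' code).
# A precomposed Hangul syllable S satisfies:
#   S = 0xAC00 + (choseong * 21 + jungseong) * 28 + jongseong
# where jongseong == 0 means the syllable has no trailing consonant.

CHOSEONG = [chr(0x1100 + i) for i in range(19)]   # initial consonants
JUNGSEONG = [chr(0x1161 + i) for i in range(21)]  # vowels
JONGSEONG = [chr(0x11A8 + i) for i in range(27)]  # trailing consonants

SKIP_TC = "<SkipTC>"  # hypothetical placeholder token name

def tokenize_with_skiptc(text: str) -> list[str]:
    tokens = []
    for ch in text:
        code = ord(ch)
        if 0xAC00 <= code <= 0xD7A3:  # precomposed Hangul syllable block
            offset = code - 0xAC00
            cho, rest = divmod(offset, 21 * 28)
            jung, jong = divmod(rest, 28)
            tokens.append(CHOSEONG[cho])
            tokens.append(JUNGSEONG[jung])
            # Insert SkipTC when the syllable lacks a trailing consonant,
            # so every syllable expands to exactly three tokens.
            tokens.append(JONGSEONG[jong - 1] if jong else SKIP_TC)
        else:
            tokens.append(ch)  # pass non-Hangul characters through unchanged
    return tokens

# Example: "하늘" -> ['ㅎ', 'ㅏ', '<SkipTC>', 'ㄴ', 'ㅡ', 'ㄹ']
# (the first syllable has no trailing consonant)
print(tokenize_with_skiptc("하늘"))
```

Because every syllable then expands to a fixed (initial, vowel, trailing-or-SkipTC) triple, the LM input follows the regular pattern that the abstract credits for the improved rescoring performance.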

Authors (3)
  1. Kyuhong Shim (26 papers)
  2. Hyewon Bae (2 papers)
  3. Wonyong Sung (33 papers)
