- The paper introduces ClovaCall as a domain-specific Korean speech corpus enhancing ASR for restaurant reservation dialogs.
- It comprises approximately 60,000 sentence-utterance pairs recorded from more than 11,000 speakers, capturing realistic phone-call audio conditions with concise utterances.
- Empirical tests show that fine-tuning ASR models on ClovaCall significantly outperforms general training, confirming robust domain transfer.
Analysis of ClovaCall: A Korean Goal-Oriented Dialog Speech Corpus for Automatic Speech Recognition in Contact Centers
The paper introduces ClovaCall, a large-scale Korean speech corpus tailored to goal-oriented dialog systems, particularly AI-driven contact center services. The dataset distinguishes itself from existing corpora by focusing on a targeted application, restaurant reservation scenarios, thereby addressing the scarcity of domain-specific ASR resources for low-resource, non-English languages.
Motivation and Background
The paper underscores the growing relevance of automatic speech recognition (ASR) in improving the efficiency and operation of contact centers. Traditional corpora such as Switchboard and LibriSpeech, despite their extensive use and historical significance, consist predominantly of open-domain English speech and lack the specificity needed for goal-oriented tasks in low-resource languages. ClovaCall fills this niche by providing data aligned with the conversational interactions typical of restaurant booking, supporting the development of more precise dialog management and task-specific ASR systems.
Dataset Composition
ClovaCall comprises approximately 60,000 sentence-utterance pairs recorded from more than 11,000 speakers. Notably, the utterances are captured over phone calls, reflecting the realistic audio-quality challenges of actual contact center environments. The dataset favors brevity and clarity: utterances are short, typically no longer than 10 seconds, which keeps data processing and annotation manageable.
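To make the corpus structure concrete, here is a minimal loading sketch under an assumed JSON-lines manifest with `wav` and `text` fields; the actual release format of ClovaCall may differ. It pairs each transcript with its audio file and enforces the short-utterance constraint described above.

```python
import json
import wave

def load_manifest(path, max_seconds=10.0):
    """Load sentence-utterance pairs, keeping clips at or under max_seconds."""
    pairs = []
    with open(path, encoding="utf-8") as f:
        for line in f:
            entry = json.loads(line)  # assumed layout: {"wav": "...", "text": "..."}
            with wave.open(entry["wav"], "rb") as w:
                duration = w.getnframes() / w.getframerate()
            if duration <= max_seconds:  # drop anything longer than the target length
                pairs.append((entry["wav"], entry["text"], duration))
    return pairs
```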
Findings and ASR Model Performance
Through a series of empirical evaluations, the paper shows that training on ClovaCall substantially improves ASR performance over models trained only on general-purpose datasets such as AIHub, which degrade markedly on goal-oriented dialog speech. Specifically, Deep Speech 2 (DS2) and Listen, Attend and Spell (LAS) models exhibit clear gains when fine-tuned on the ClovaCall data, highlighting the importance of task-specific datasets.
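Korean ASR quality in such comparisons is typically reported as character error rate (CER). The sketch below is a generic edit-distance implementation of CER, not the authors' evaluation code, but it illustrates how the fine-tuned and general-only models would be scored.

```python
def cer(reference: str, hypothesis: str) -> float:
    """Character error rate: edit distance divided by reference length."""
    ref, hyp = list(reference), list(hypothesis)
    # Standard dynamic-programming edit distance over characters.
    d = [[0] * (len(hyp) + 1) for _ in range(len(ref) + 1)]
    for i in range(len(ref) + 1):
        d[i][0] = i
    for j in range(len(hyp) + 1):
        d[0][j] = j
    for i in range(1, len(ref) + 1):
        for j in range(1, len(hyp) + 1):
            cost = 0 if ref[i - 1] == hyp[j - 1] else 1
            d[i][j] = min(d[i - 1][j] + 1,         # deletion
                          d[i][j - 1] + 1,         # insertion
                          d[i - 1][j - 1] + cost)  # substitution
    return d[len(ref)][len(hyp)] / max(len(ref), 1)

# Example: one substituted character out of five gives a CER of 0.2.
print(cer("안녕하세요", "안녕하세오"))  # 0.2
```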
Moreover, the work demonstrates that pretraining on a large-scale open-domain dataset followed by fine-tuning on ClovaCall outperforms both from-scratch training and conventional data augmentation techniques. This suggests that a hybrid approach, in which broad-coverage acoustic models are refined with domain-specific data, delivers superior results for targeted applications.
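The following sketch illustrates this pretrain-then-fine-tune recipe as a generic PyTorch loop with a CTC objective (the criterion used by DS2). The model interface, checkpoint path, and hyperparameters are illustrative assumptions and do not reproduce the authors' exact setup.

```python
import torch
import torch.nn.functional as F

def finetune(model, general_ckpt, domain_loader, epochs=10, lr=1e-4):
    """Adapt an ASR model pretrained on open-domain speech to the domain corpus."""
    # Start from weights pretrained on a large open-domain corpus (e.g. AIHub).
    model.load_state_dict(torch.load(general_ckpt))
    optimizer = torch.optim.Adam(model.parameters(), lr=lr)  # small LR for adaptation
    model.train()
    for _ in range(epochs):
        for features, feat_lens, targets, target_lens in domain_loader:
            optimizer.zero_grad()
            # Assumed interface: the model returns per-frame log-probabilities
            # of shape (batch, time, classes) plus the output lengths.
            log_probs, out_lens = model(features, feat_lens)
            loss = F.ctc_loss(log_probs.transpose(0, 1),  # ctc_loss expects (T, N, C)
                              targets, out_lens, target_lens)
            loss.backward()
            optimizer.step()
    return model
```

A smaller learning rate than in pretraining is the usual choice here, so that the domain data adjusts rather than overwrites the broad-coverage acoustic representation.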
Practical Implications and Future Directions
The release and subsequent utilization of ClovaCall represent a significant advancement for Korean language ASR technologies in domain-specific applications. The practical implications are noteworthy: enhancing the accuracy and responsiveness of AI contact centers in the hospitality industry aligns with broader commercial interests in automating and optimizing customer interaction workflows.
Theoretically, this work points towards the necessity of creating tailored datasets across diverse languages and applications, fostering models that are not only linguistically informed but also contextually relevant. Future research may delve into expanding ClovaCall to encompass additional domains beyond restaurant reservations or adapting its methodology to construct comparable corpora in other languages.
In summary, by releasing ClovaCall, the authors contribute a valuable tool to both the academic and commercial fields, advancing domain-specific ASR capabilities in the Korean language and setting a precedent for similar efforts in other areas. This work lays the groundwork for developing intelligent systems that more accurately mirror real-world dialog scenarios, thereby reinforcing the pivotal role of contextually rich datasets in the evolution of speech recognition technology.