- The paper introduces ClovaCall as a domain-specific Korean speech corpus enhancing ASR for restaurant reservation dialogs.
- It comprises approximately 60,000 sentence-utterance pairs recorded from more than 11,000 speakers, capturing realistic phone-call audio conditions with concise utterances.
- Empirical tests show that fine-tuning ASR models on ClovaCall significantly outperforms general training, confirming robust domain transfer.
Analysis of ClovaCall: A Korean Goal-Oriented Dialog Speech Corpus for Automatic Speech Recognition in Contact Centers
The paper introduces ClovaCall, a large-scale Korean speech corpus tailored to goal-oriented dialog systems, particularly AI-driven contact center services. The dataset distinguishes itself from existing corpora by focusing on a targeted application, restaurant reservation scenarios, thereby addressing the scarcity of domain-specific ASR resources for low-resource, non-English languages.
Motivation and Background
The paper underscores the growing relevance of automatic speech recognition (ASR) in improving the efficiency and operation of contact centers. Traditional corpora such as Switchboard and LibriSpeech, despite their extensive use and historical significance, consist predominantly of open-domain English speech and lack the specificity needed for goal-oriented tasks in low-resource languages. ClovaCall fills this niche by providing data aligned with the conversational interactions typical of restaurant booking, supporting the development of more precise dialog management and task-specific ASR systems.
Dataset Composition
ClovaCall comprises approximately 60,000 sentence-utterance pairs recorded from more than 11,000 speakers. Notably, the utterances are captured over phone calls, reflecting the realistic audio-quality challenges of actual contact center environments. The dataset favors brevity and clarity: utterances are short, typically no longer than 10 seconds, which keeps data processing and annotation manageable.
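To make the corpus structure concrete, here is a minimal loading sketch under an assumed JSON-lines manifest with `wav` and `text` fields; the actual release format of ClovaCall may differ. It pairs each transcript with its audio file and enforces the short-utterance constraint described above.

```python
import json
import wave

def load_manifest(path, max_seconds=10.0):
    """Load sentence-utterance pairs, keeping clips at or under max_seconds."""
    pairs = []
    with open(path, encoding="utf-8") as f:
        for line in f:
            entry = json.loads(line)  # assumed layout: {"wav": "...", "text": "..."}
            with wave.open(entry["wav"], "rb") as w:
                duration = w.getnframes() / w.getframerate()
            if duration <= max_seconds:  # drop anything longer than the target length
                pairs.append((entry["wav"], entry["text"], duration))
    return pairs
```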
Findings and ASR Model Performance
Through a series of empirical evaluations, the paper shows that training on ClovaCall substantially improves ASR performance over models trained only on general-purpose datasets such as AIHub, which degrade markedly on goal-oriented dialog speech. Specifically, Deep Speech 2 (DS2) and Listen, Attend and Spell (LAS) models exhibit clear gains when fine-tuned on the ClovaCall data, highlighting the importance of task-specific datasets.
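Korean ASR quality in such comparisons is typically reported as character error rate (CER). The sketch below is a generic edit-distance implementation of CER, not the authors' evaluation code, but it illustrates how the fine-tuned and general-only models would be scored.

```python
def cer(reference: str, hypothesis: str) -> float:
    """Character error rate: edit distance divided by reference length."""
    ref, hyp = list(reference), list(hypothesis)
    # Standard dynamic-programming edit distance over characters.
    d = [[0] * (len(hyp) + 1) for _ in range(len(ref) + 1)]
    for i in range(len(ref) + 1):
        d[i][0] = i
    for j in range(len(hyp) + 1):
        d[0][j] = j
    for i in range(1, len(ref) + 1):
        for j in range(1, len(hyp) + 1):
            cost = 0 if ref[i - 1] == hyp[j - 1] else 1
            d[i][j] = min(d[i - 1][j] + 1,         # deletion
                          d[i][j - 1] + 1,         # insertion
                          d[i - 1][j - 1] + cost)  # substitution
    return d[len(ref)][len(hyp)] / max(len(ref), 1)

# Example: one substituted character out of five gives a CER of 0.2.
print(cer("안녕하세요", "안녕하세오"))  # 0.2
```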
Moreover, the work demonstrates that pretraining on a large-scale open-domain dataset followed by fine-tuning on ClovaCall outperforms both from-scratch training and conventional data augmentation techniques. This suggests that a hybrid approach, in which broad-coverage acoustic models are refined with domain-specific data, delivers superior results for targeted applications.
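The following sketch illustrates this pretrain-then-fine-tune recipe as a generic PyTorch loop with a CTC objective (the criterion used by DS2). The model interface, checkpoint path, and hyperparameters are illustrative assumptions and do not reproduce the authors' exact setup.

```python
import torch
import torch.nn.functional as F

def finetune(model, general_ckpt, domain_loader, epochs=10, lr=1e-4):
    """Adapt an ASR model pretrained on open-domain speech to the domain corpus."""
    # Start from weights pretrained on a large open-domain corpus (e.g. AIHub).
    model.load_state_dict(torch.load(general_ckpt))
    optimizer = torch.optim.Adam(model.parameters(), lr=lr)  # small LR for adaptation
    model.train()
    for _ in range(epochs):
        for features, feat_lens, targets, target_lens in domain_loader:
            optimizer.zero_grad()
            # Assumed interface: the model returns per-frame log-probabilities
            # of shape (batch, time, classes) plus the output lengths.
            log_probs, out_lens = model(features, feat_lens)
            loss = F.ctc_loss(log_probs.transpose(0, 1),  # ctc_loss expects (T, N, C)
                              targets, out_lens, target_lens)
            loss.backward()
            optimizer.step()
    return model
```

A smaller learning rate than in pretraining is the usual choice here, so that the domain data adjusts rather than overwrites the broad-coverage acoustic representation.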
Practical Implications and Future Directions
The release and subsequent utilization of ClovaCall represent a significant advancement for Korean language ASR technologies in domain-specific applications. The practical implications are noteworthy: enhancing the accuracy and responsiveness of AI contact centers in the hospitality industry aligns with broader commercial interests in automating and optimizing customer interaction workflows.
Theoretically, this work points towards the necessity of creating tailored datasets across diverse languages and applications, fostering models that are not only linguistically informed but also contextually relevant. Future research may delve into expanding ClovaCall to encompass additional domains beyond restaurant reservations or adapting its methodology to construct comparable corpora in other languages.
In summary, by releasing ClovaCall, the authors contribute a valuable tool to both the academic and commercial fields, advancing domain-specific ASR capabilities in the Korean language and setting a precedent for similar efforts in other areas. This work lays the groundwork for developing intelligent systems that more accurately mirror real-world dialog scenarios, thereby reinforcing the pivotal role of contextually rich datasets in the evolution of speech recognition technology.