
KorNLI and KorSTS: New Benchmark Datasets for Korean Natural Language Understanding (2004.03289v3)

Published 7 Apr 2020 in cs.CL

Abstract: Natural language inference (NLI) and semantic textual similarity (STS) are key tasks in natural language understanding (NLU). Although several benchmark datasets for those tasks have been released in English and a few other languages, there are no publicly available NLI or STS datasets in the Korean language. Motivated by this, we construct and release new datasets for Korean NLI and STS, dubbed KorNLI and KorSTS, respectively. Following previous approaches, we machine-translate existing English training sets and manually translate development and test sets into Korean. To accelerate research on Korean NLU, we also establish baselines on KorNLI and KorSTS. Our datasets are publicly available at https://github.com/kakaobrain/KorNLUDatasets.

Examining KorNLI and KorSTS: Benchmark Datasets for Korean Language Understanding

The paper makes a significant contribution to natural language understanding (NLU) for Korean by introducing two benchmark datasets: KorNLI (Korean Natural Language Inference) and KorSTS (Korean Semantic Textual Similarity). These datasets fill a notable gap in Korean NLU resources, which have so far centered on tasks such as question answering and sentiment analysis, and enable researchers to develop and evaluate models aimed specifically at Korean sentence-level semantics.

Methodology

The construction of KorNLI and KorSTS combines machine and human translation. The authors machine-translated the English training sets, while the development and test sets were translated manually to ensure higher quality and consistency in the evaluation data. This approach mirrors methodologies used for earlier multilingual NLU datasets, trading some translation noise in the large training sets for reliable, human-verified evaluation sets.

For KorNLI, the dataset was derived by translating instances from the established SNLI, MNLI, and XNLI datasets; KorSTS was similarly developed from the STS-B dataset. For the development and test splits, human experts post-edited the machine-translated outputs, improving semantic correctness and preserving the intent of the original sentences. This two-tier process, combining neural machine translation with expert human oversight, helps ensure the integrity of the translated datasets.
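The released files follow the tab-separated layout of their English sources. As a minimal sketch (the three-column `sentence1 / sentence2 / gold_label` layout is assumed from the SNLI/MNLI format, and the Korean row below is illustrative, not taken from the actual dataset), a parser might look like:

```python
from typing import NamedTuple

VALID_LABELS = {"entailment", "neutral", "contradiction"}

class NLIExample(NamedTuple):
    premise: str
    hypothesis: str
    label: str  # one of: entailment, neutral, contradiction

def parse_nli_tsv(lines):
    """Parse KorNLI-style TSV lines: sentence1 <TAB> sentence2 <TAB> gold_label."""
    examples = []
    for line in lines:
        fields = line.rstrip("\n").split("\t")
        # Skip the header row and any malformed lines.
        if len(fields) != 3 or fields[2] not in VALID_LABELS:
            continue
        examples.append(NLIExample(*fields))
    return examples

# Illustrative rows (not drawn from the released data):
rows = [
    "sentence1\tsentence2\tgold_label",
    "그는 피아노를 친다\t그는 악기를 연주한다\tentailment",
]
examples = parse_nli_tsv(rows)
```

KorSTS uses the STS-B convention instead, pairing each sentence pair with a real-valued similarity score from 0 to 5.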

Benchmarks and Baselines

To facilitate further research, the authors established strong baselines using both cross-encoding and bi-encoding techniques. The cross-encoding approach fine-tunes large pre-trained language models, such as Korean RoBERTa and XLM-R, on the KorNLI and KorSTS tasks. These models performed competitively; the large version of Korean RoBERTa obtained the highest scores on both datasets when fine-tuned on the Korean data.
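In a cross-encoder, the two sentences are packed into a single input sequence so that attention operates over the pair jointly. A minimal sketch of this pairing, assuming BERT-style `[CLS]`/`[SEP]` special tokens (the function name is illustrative):

```python
def format_cross_encoder_input(premise: str, hypothesis: str,
                               cls: str = "[CLS]", sep: str = "[SEP]") -> str:
    """Pack a sentence pair into one sequence, as a cross-encoder consumes it.

    The model attends across both sentences at once, which is accurate but
    requires a fresh forward pass for every pair it scores.
    """
    return f"{cls} {premise} {sep} {hypothesis} {sep}"

pair = format_cross_encoder_input("A man is playing piano.",
                                  "A man is playing an instrument.")
```

This joint encoding is what makes cross-encoders accurate but expensive: scoring N candidates against a query requires N full forward passes, which motivates the bi-encoding baselines below.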

For bi-encoding approaches, the authors leveraged SentenceBERT-style architectures to establish baselines, capitalizing on the practical importance of this setup for tasks demanding efficient computation, such as semantic search. Again, the integration of KorNLI into the training pipeline proved beneficial, corroborating the dataset's role as a valuable intermediary learning resource.
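In the SentenceBERT-style setup, each sentence is encoded independently into a fixed vector (typically by mean-pooling token embeddings over non-padding positions) and pairs are compared with cosine similarity, so candidate embeddings can be precomputed and reused. A minimal sketch with NumPy, using random arrays as stand-ins for real encoder outputs:

```python
import numpy as np

def mean_pool(token_embeddings: np.ndarray, mask: np.ndarray) -> np.ndarray:
    """Average token vectors over non-padding positions (mask == 1)."""
    mask = mask[:, None].astype(token_embeddings.dtype)  # (seq_len, 1)
    summed = (token_embeddings * mask).sum(axis=0)       # (hidden,)
    count = mask.sum()
    return summed / np.maximum(count, 1e-9)

def cosine_similarity(u: np.ndarray, v: np.ndarray) -> float:
    """Cosine similarity in [-1, 1], used as the STS score."""
    return float(u @ v / (np.linalg.norm(u) * np.linalg.norm(v)))

# Toy stand-ins for encoder outputs: (seq_len, hidden) token embeddings.
rng = np.random.default_rng(0)
tok_a = rng.normal(size=(5, 8))
tok_b = rng.normal(size=(7, 8))
emb_a = mean_pool(tok_a, np.array([1, 1, 1, 1, 0]))  # last position is padding
emb_b = mean_pool(tok_b, np.ones(7))
score = cosine_similarity(emb_a, emb_b)
```

Because each sentence is embedded once, comparing a query against a large corpus reduces to cheap vector operations, which is why this setup matters for semantic search.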

Numerical Results and Evaluation

The paper reports encouraging test-set results: 83.67% accuracy on KorNLI (Korean RoBERTa-large) and a Spearman correlation of 85.27 (scaled by 100, as is conventional for STS) on KorSTS. These results indicate that the datasets and methodology support robust model training and evaluation, establishing KorNLI and KorSTS as reliable benchmarks for Korean NLU.
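The two tasks use different evaluation metrics: NLI is scored by classification accuracy, while STS is scored by the Spearman rank correlation between predicted and gold similarity scores. The latter can be computed as the Pearson correlation of the ranks; a minimal sketch (ignoring tie correction, which libraries such as SciPy handle):

```python
import numpy as np

def spearman(x, y) -> float:
    """Spearman rank correlation without tie correction:
    Pearson correlation of the rank positions of x and y."""
    x, y = np.asarray(x, float), np.asarray(y, float)
    rx = np.argsort(np.argsort(x)).astype(float)  # ranks 0..n-1
    ry = np.argsort(np.argsort(y)).astype(float)
    rx -= rx.mean()
    ry -= ry.mean()
    return float((rx @ ry) / np.sqrt((rx @ rx) * (ry @ ry)))

# Illustrative scores on a 0-5 STS scale (not from the dataset):
gold = [0.0, 1.2, 2.5, 3.8, 5.0]
pred = [0.1, 1.0, 2.9, 3.5, 4.8]  # same ordering as gold -> correlation 1.0
```

Because Spearman depends only on the ordering of the scores, a model can be rewarded for ranking pairs correctly even when its raw similarity values are miscalibrated.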

Implications and Future Work

The development of KorNLI and KorSTS facilitates the advancement of language models tailored to Korean, providing immediate resources for model evaluation while also setting a standard for future dataset development in less-represented languages. The translation-and-post-editing strategy serves as a template for building high-quality linguistic resources across diverse languages.

Future research could enhance the datasets, for example by expanding their size or incorporating additional nuanced linguistic features. Moreover, using these benchmarks in cross-lingual transfer settings could further improve the accuracy and generalization of NLU models across languages.

In conclusion, KorNLI and KorSTS are critical steps toward broadening the scope of NLU research beyond English-centric resources, offering substantial contributions to the diversification of language processing technologies. As NLU research continues to progress, these datasets will support both the theoretical advancement and practical deployment of models capable of understanding the Korean language.

Authors (5)
  1. Jiyeon Ham
  2. Yo Joong Choe
  3. Kyubyong Park
  4. Ilji Choi
  5. Hyungjoon Soh
Citations (72)