
CLIcK: A Benchmark Dataset of Cultural and Linguistic Intelligence in Korean (2403.06412v4)

Published 11 Mar 2024 in cs.CL

Abstract: Despite the rapid development of LLMs for the Korean language, there remains an obvious lack of benchmark datasets that test the requisite Korean cultural and linguistic knowledge. Because many existing Korean benchmark datasets are derived from their English counterparts through translation, they often overlook differences in cultural context. For the few benchmark datasets that are sourced from Korean data capturing cultural knowledge, only narrow tasks such as bias and hate speech detection are offered. To address this gap, we introduce a benchmark of Cultural and Linguistic Intelligence in Korean (CLIcK), a dataset comprising 1,995 QA pairs. CLIcK sources its data from official Korean exams and textbooks, partitioning the questions into eleven categories under the two main categories of language and culture. For each instance in CLIcK, we provide fine-grained annotation of which cultural and linguistic knowledge is required to answer the question correctly. Using CLIcK, we test 13 LLMs to assess their performance. Our evaluation uncovers insights into their performance across the categories, as well as the diverse factors affecting their comprehension. CLIcK offers the first large-scale comprehensive Korean-centric analysis of LLMs' proficiency in Korean culture and language.

Exploring Cultural and Linguistic Intelligence in Korean with CLIcK: A Comprehensive Benchmark Dataset

Introduction to CLIcK

The progression of LLMs, especially in languages other than English, has been a focal point in computational linguistics. However, the development of such models in the Korean language faces a significant roadblock: the dearth of comprehensive benchmark datasets that immerse these models in the cultural and linguistic intricacies of Korean. The Cultural and Linguistic Intelligence in Korean (CLIcK) benchmark dataset aims to bridge this gap, presenting a pioneering collection of 1,995 QA pairs meticulously drawn from official Korean exams and textbooks across eleven diverse categories.

Motivation for CLIcK

Korean language evaluation tasks have so far been either overly simplistic or derived from English benchmarks, inadequately representing Korean cultural and linguistic uniqueness. Although a few datasets touch upon Korean cultural aspects, their narrow focus on tasks such as bias and hate speech detection precludes a holistic assessment of LLMs' cultural and linguistic understanding. CLIcK fills this void by offering a culturally rich and linguistically diverse set of tasks sourced directly from native Korean educational materials.

Dataset Construction and Composition

CLIcK's construction involved selecting questions from standardized Korean exams and using GPT-4 to generate new questions from Korean textbooks. To ensure the questions' relevance and accuracy, a multi-stage validation process with native Korean speakers was employed. The categories cover not only traditional linguistic aspects such as grammar but also wide-ranging cultural elements, from politics to pop culture, offering an expansive view of Korean society and language. The result is a dataset partitioned into two main categories, Cultural Intelligence and Linguistic Intelligence, with the former spanning eight subcategories and the latter three.
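
To make the dataset's structure concrete, the sketch below shows one way a CLIcK-style instance could be represented: a multiple-choice question tagged with its main category, one of the eleven subcategories, and the fine-grained knowledge annotation described above. The field names, subcategory labels, and the example item itself are illustrative assumptions, not the dataset's released schema or contents.

```python
# Illustrative sketch (an assumption, not the paper's released schema) of how a
# CLIcK-style instance might be represented in Python.
from dataclasses import dataclass, field
from typing import List

# Subcategory labels are paraphrased for illustration; consult the dataset
# itself for the exact eleven category names.
CULTURE_SUBCATEGORIES = [
    "History", "Geography", "Law", "Politics",
    "Society", "Tradition", "Economy", "Pop Culture",
]
LANGUAGE_SUBCATEGORIES = ["Textual", "Functional", "Grammar"]

@dataclass
class ClickInstance:
    id: str
    category: str                 # "Culture" or "Language"
    subcategory: str              # one of the eleven subcategories
    question: str                 # question text in Korean
    choices: List[str]            # multiple-choice options
    answer: int                   # index of the correct option
    knowledge: List[str] = field(default_factory=list)  # fine-grained annotation

# A made-up example for illustration only, not an actual CLIcK item.
example = ClickInstance(
    id="culture-history-0001",
    category="Culture",
    subcategory="History",
    question="다음 중 조선을 건국한 인물은 누구인가?",  # "Who founded the Joseon dynasty?"
    choices=["이성계", "왕건", "김유신", "세종대왕"],
    answer=0,
    knowledge=["Korean history: founding of the Joseon dynasty"],
)
```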

Evaluation with LLMs

A comprehensive evaluation of thirteen LLMs of varying sizes and configurations on CLIcK yielded intriguing insights. Although some open-source LLMs showed competence in certain categories, overall performance was modest, with proprietary models such as GPT-3.5 and Claude-2 demonstrating more robust, albeit still imperfect, capabilities. These results underscore the persistent challenge of imbuing LLMs with deep cultural and linguistic understanding, particularly for a language as rich and complex as Korean.
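
Because CLIcK is a multiple-choice benchmark reported per category, the evaluation loop reduces to per-category accuracy over each model's selected options. The sketch below is a minimal stand-in, not the paper's actual evaluation code, and the prediction format and helper name are assumptions.

```python
# Minimal sketch of per-category accuracy scoring for multiple-choice QA,
# as one might use to evaluate models on a CLIcK-style benchmark.
from collections import defaultdict
from typing import Dict, List

def per_category_accuracy(
    examples: List[dict],    # each with "subcategory" and gold "answer" index
    predictions: List[int],  # model-chosen option index per example
) -> Dict[str, float]:
    correct: Dict[str, int] = defaultdict(int)
    total: Dict[str, int] = defaultdict(int)
    for ex, pred in zip(examples, predictions):
        cat = ex["subcategory"]
        total[cat] += 1
        if pred == ex["answer"]:
            correct[cat] += 1
    return {cat: correct[cat] / total[cat] for cat in total}

# Usage with toy data (not actual CLIcK items or model outputs):
examples = [
    {"subcategory": "History", "answer": 0},
    {"subcategory": "History", "answer": 2},
    {"subcategory": "Grammar", "answer": 1},
]
predictions = [0, 1, 1]
print(per_category_accuracy(examples, predictions))
# {'History': 0.5, 'Grammar': 1.0}
```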

Implications and Future Directions

CLIcK not only spotlights the current limitations of LLMs in grasping the nuances of Korean culture and language but also sets a precedent for constructing similar benchmarks for other underrepresented languages. It underscores the need for model training that places stronger emphasis on the idiosyncratic elements of individual languages and cultures. As the field moves forward, it is vital to remember that mastery of a language extends beyond syntactic proficiency to the cultural knowledge and linguistic subtleties that give a language its life.

Conclusion

The introduction of CLIcK opens new avenues in the evaluation of Korean LLMs, pushing toward understanding and generating Korean text in a manner that is culturally and linguistically authentic. As researchers and technologists engage with this dataset, the hope is for a gradual improvement in model performance, moving toward genuinely comprehensive linguistic intelligence. By homing in on the cultural and linguistic elements that define Korean, CLIcK paves the way for more nuanced and sophisticated models capable of navigating the full complexity of the language.

Authors (6)
  1. Eunsu Kim
  2. Juyoung Suk
  3. Philhoon Oh
  4. Haneul Yoo
  5. James Thorne
  6. Alice Oh