HAE-RAE Bench: Evaluation of Korean Knowledge in Language Models (2309.02706v5)
Abstract: Large language models (LLMs) trained on massive corpora demonstrate impressive capabilities across a wide range of tasks. While there are ongoing efforts to adapt these models to languages beyond English, the attention given to their evaluation methodologies remains limited. Current multilingual benchmarks often rely on back-translations or re-implementations of English tests, limiting their capacity to capture unique cultural and linguistic nuances. To bridge this gap for the Korean language, we introduce the HAE-RAE Bench, a dataset curated to challenge models lacking Korean cultural and contextual depth. The dataset encompasses six downstream tasks across four domains: vocabulary, history, general knowledge, and reading comprehension. Unlike traditional evaluation suites focused on token and sequence classification or on mathematical and logical reasoning, the HAE-RAE Bench emphasizes a model's aptitude for recalling Korean-specific knowledge and cultural context. Comparative analysis with prior Korean benchmarks indicates that the HAE-RAE Bench poses a greater challenge to non-Korean models by impeding the transfer of abilities and knowledge learned from English.
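A common way to score multiple-choice benchmark items like those in the HAE-RAE Bench is to compare the log-likelihood a causal language model assigns to each answer option given the question, and pick the highest-scoring option. The sketch below illustrates that general protocol; it is not necessarily the paper's exact evaluation setup, and the model name, example question, and options are illustrative placeholders rather than actual dataset content.

```python
# Minimal sketch of log-likelihood scoring for a multiple-choice item.
# The model, question, and options here are illustrative assumptions.
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

model_name = "EleutherAI/polyglot-ko-1.3b"  # any Korean causal LM works
tokenizer = AutoTokenizer.from_pretrained(model_name)
model = AutoModelForCausalLM.from_pretrained(model_name)
model.eval()

def option_logprob(question: str, option: str) -> float:
    """Sum of token log-probabilities of `option` conditioned on `question`."""
    prompt_ids = tokenizer(question, return_tensors="pt").input_ids
    full_ids = tokenizer(question + option, return_tensors="pt").input_ids
    with torch.no_grad():
        logits = model(full_ids).logits
    # Log-probs over the vocabulary for each next-token prediction.
    log_probs = torch.log_softmax(logits[0, :-1], dim=-1)
    # Tokens belonging to the option; boundary handling is approximate,
    # since tokenizing prompt and prompt+option separately can differ
    # slightly at the seam.
    continuation = full_ids[0, prompt_ids.shape[1]:]
    positions = range(prompt_ids.shape[1] - 1, full_ids.shape[1] - 1)
    return sum(log_probs[pos, tok].item() for pos, tok in zip(positions, continuation))

# Hypothetical item: "Which of the following is a traditional Korean holiday?"
question = "다음 중 한국의 전통 명절은? "
options = ["추석", "할로윈", "추수감사절", "오봉"]
prediction = max(options, key=lambda opt: option_logprob(question, opt))
print(prediction)  # the option the model deems most likely
```

Accuracy over the benchmark then follows by comparing each predicted option against the gold answer.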
Authors: Guijin Son, Hanwool Lee, Suwan Kim, Huiseo Kim, Jaecheol Lee, Je Won Yeom, Jihyu Jung, Jung Woo Kim, Songseong Kim