ViLLM-Eval: A Comprehensive Evaluation Suite for Vietnamese Large Language Models (2404.11086v2)
Abstract: The rapid advancement of large language models (LLMs) necessitates new benchmarks to accurately assess their capabilities. To address this need for Vietnamese, this work introduces ViLLM-Eval, a comprehensive evaluation suite designed to measure the advanced knowledge and reasoning abilities of foundation models in a Vietnamese context. ViLLM-Eval consists of multiple-choice questions and next-word prediction tasks spanning various difficulty levels and diverse disciplines, from the humanities to science and engineering. A thorough evaluation of the most advanced LLMs on ViLLM-Eval reveals that even the best-performing models have significant room for improvement in understanding and responding to Vietnamese language tasks. We believe ViLLM-Eval will be instrumental in identifying the key strengths and weaknesses of foundation models, ultimately promoting their development and enhancing their performance for Vietnamese users. This paper provides a thorough overview of ViLLM-Eval as part of the Vietnamese LLM shared task held at the 10th International Workshop on Vietnamese Language and Speech Processing (VLSP 2023).
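The abstract describes a multiple-choice task format. As a minimal sketch of how such a benchmark is typically scored (the paper itself does not specify the scoring rule here), one common convention is to pick the answer option the model scores highest and report accuracy; the `toy_score` function below is a purely illustrative stand-in for a real language-model likelihood:

```python
# Hedged sketch of multiple-choice benchmark scoring, as commonly done
# in evaluation suites. Assumption: the model assigns a numeric score
# to each (question, option) pair, and the top-scored option is taken
# as the model's answer.

def pick_answer(question, options, score_fn):
    """Return the index of the option the model scores highest."""
    scores = [score_fn(question, opt) for opt in options]
    return max(range(len(options)), key=lambda i: scores[i])

def accuracy(dataset, score_fn):
    """Fraction of items where the top-scored option matches the gold label."""
    correct = sum(
        pick_answer(question, options, score_fn) == gold
        for question, options, gold in dataset
    )
    return correct / len(dataset)

# Toy scorer standing in for a real LM log-likelihood: it simply
# prefers the longer option (illustrative only, not a real model).
def toy_score(question, option):
    return len(option)

toy_data = [
    ("Thu do cua Viet Nam la gi?", ["Hue", "Ha Noi la thu do"], 1),
    ("1 + 1 = ?", ["hai", "ba"], 0),
]
print(accuracy(toy_data, toy_score))
```

In practice the score function would be the sum of per-token log-probabilities the evaluated LLM assigns to each option given the question; the harness structure above stays the same.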
- Trong-Hieu Nguyen
- Anh-Cuong Le
- Viet-Cuong Nguyen