ViLLM-Eval: A Comprehensive Evaluation Suite for Vietnamese Large Language Models
Abstract: The rapid advancement of LLMs necessitates the development of new benchmarks to accurately assess their capabilities. To address this need for Vietnamese, this work introduces ViLLM-Eval, a comprehensive evaluation suite designed to measure the advanced knowledge and reasoning abilities of foundation models in a Vietnamese context. ViLLM-Eval consists of multiple-choice question answering and next-word prediction tasks spanning various difficulty levels and diverse disciplines, ranging from the humanities to science and engineering. A thorough evaluation of the most advanced LLMs on ViLLM-Eval revealed that even the best-performing models have significant room for improvement in understanding and responding to Vietnamese-language tasks. ViLLM-Eval is expected to be instrumental in identifying key strengths and weaknesses of foundation models, ultimately promoting their development and enhancing their performance for Vietnamese users. This paper provides a detailed overview of ViLLM-Eval, which was introduced as part of the Vietnamese LLM shared task held within the 10th International Workshop on Vietnamese Language and Speech Processing (VLSP 2023).
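To make the multiple-choice evaluation setup described above concrete, the sketch below shows one common way such items are scored with a causal language model: each candidate answer is scored by the model's log-likelihood given the question, and the highest-scoring option is taken as the prediction. This is a minimal illustration, not the paper's official harness; the record layout, function names, and the placeholder scorer are assumptions made for this example only.

```python
# Minimal sketch of log-likelihood-based multiple-choice scoring, as used by
# many LLM benchmarks. The item format and scorer interface are illustrative
# assumptions, not the ViLLM-Eval implementation.
from dataclasses import dataclass
from typing import Callable, List


@dataclass
class MCItem:
    question: str        # Vietnamese question text
    options: List[str]   # candidate answers
    answer_idx: int      # index of the gold option


def predict(item: MCItem, loglik: Callable[[str, str], float]) -> int:
    """Return the index of the option the model finds most likely.

    `loglik(context, continuation)` is any function giving the model's total
    log-probability of `continuation` given `context` (e.g. a wrapper around
    a causal LM); it is treated here as a black box.
    """
    scores = [loglik(item.question, opt) for opt in item.options]
    return max(range(len(scores)), key=scores.__getitem__)


def accuracy(items: List[MCItem], loglik: Callable[[str, str], float]) -> float:
    """Fraction of items where the argmax option matches the gold answer."""
    correct = sum(predict(it, loglik) == it.answer_idx for it in items)
    return correct / len(items)


if __name__ == "__main__":
    # Toy stand-in for a real LM scorer (placeholder only): favors longer options.
    fake_loglik = lambda ctx, cont: float(len(cont))
    demo = [MCItem("Thủ đô của Việt Nam là gì?", ["Hà Nội", "Huế"], 0)]
    print(accuracy(demo, fake_loglik))  # 1.0 on this toy example
```

The next-word prediction tasks mentioned in the abstract can be evaluated with the same `loglik` interface by scoring the gold continuation directly instead of ranking options.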