
ViLLM-Eval: A Comprehensive Evaluation Suite for Vietnamese Large Language Models (2404.11086v2)

Published 17 Apr 2024 in cs.CL and cs.AI

Abstract: The rapid advancement of LLMs necessitates new benchmarks to accurately assess their capabilities. To address this need for Vietnamese, this work introduces ViLLM-Eval, a comprehensive evaluation suite designed to measure the advanced knowledge and reasoning abilities of foundation models within a Vietnamese context. ViLLM-Eval consists of multiple-choice questions and next-word prediction tasks spanning various difficulty levels and diverse disciplines, ranging from the humanities to science and engineering. A thorough evaluation of the most advanced LLMs on ViLLM-Eval revealed that even the best-performing models have significant room for improvement in understanding and responding to Vietnamese language tasks. ViLLM-Eval is expected to be instrumental in identifying the key strengths and weaknesses of foundation models, ultimately promoting their development and enhancing their performance for Vietnamese users. This paper provides a thorough overview of ViLLM-Eval as part of the Vietnamese LLM shared task held at the 10th International Workshop on Vietnamese Language and Speech Processing (VLSP 2023).
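
The abstract does not spell out the scoring protocol for the multiple-choice component, but a common approach for suites of this kind is to score each answer option by its log-likelihood under the model and pick the highest-scoring one. Below is a minimal sketch of that approach using Hugging Face transformers; the model name, the example item, and the prompt layout are illustrative assumptions, not details taken from the paper.

```python
# Hedged sketch: log-likelihood scoring of multiple-choice options, a common
# protocol for benchmarks like ViLLM-Eval. The model name, item format, and
# prompt layout below are assumptions, not the paper's specification.
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

MODEL_NAME = "vilm/vinallama-7b"  # illustrative choice; any causal LM works
tokenizer = AutoTokenizer.from_pretrained(MODEL_NAME)
model = AutoModelForCausalLM.from_pretrained(MODEL_NAME)
model.eval()

def option_logprob(question: str, option: str) -> float:
    """Sum of log-probabilities of the option's tokens given the question."""
    prompt = tokenizer(question, return_tensors="pt").input_ids
    cont = tokenizer(" " + option, add_special_tokens=False,
                     return_tensors="pt").input_ids
    ids = torch.cat([prompt, cont], dim=1)
    with torch.no_grad():
        logits = model(ids).logits
    # Shift so position i predicts token i+1, then score only the option span.
    log_probs = torch.log_softmax(logits[0, :-1], dim=-1)
    targets = ids[0, 1:]
    start = prompt.shape[1] - 1  # index of the option's first predicted token
    idx = torch.arange(start, targets.shape[0])
    return log_probs[idx, targets[start:]].sum().item()

# Toy item ("What is the capital of Vietnam?"); not drawn from ViLLM-Eval.
question = "Thủ đô của Việt Nam là thành phố nào?"
options = ["Hà Nội", "Đà Nẵng", "Huế", "Cần Thơ"]
prediction = max(options, key=lambda o: option_logprob(question, o))
print(prediction)  # a capable model should print "Hà Nội"
```

Whether to sum or length-normalize the option log-likelihood is a design choice; normalizing by token count avoids biasing the argmax toward shorter options.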

Authors (3)
  1. Trong-Hieu Nguyen
  2. Anh-Cuong Le
  3. Viet-Cuong Nguyen