
ArabicMMLU: Assessing Massive Multitask Language Understanding in Arabic (2402.12840v2)

Published 20 Feb 2024 in cs.CL

Abstract: The focus of LLM evaluation has transitioned towards reasoning and knowledge-intensive tasks, driven by advancements in pretraining large models. While state-of-the-art models are partially trained on large Arabic texts, evaluating their performance in Arabic remains challenging due to the limited availability of relevant datasets. To bridge this gap, we present ArabicMMLU, the first multi-task language understanding benchmark for Arabic, sourced from school exams across diverse educational levels in countries spanning North Africa, the Levant, and the Gulf. Our data comprises 40 tasks and 14,575 multiple-choice questions in Modern Standard Arabic (MSA) and was carefully constructed in collaboration with native speakers in the region. Our comprehensive evaluation of 35 models reveals substantial room for improvement, particularly among the best open-source models. Notably, BLOOMZ, mT0, LLaMA2, and Falcon struggle to reach a score of 50%, while even the top-performing Arabic-centric model achieves only 62.3%.
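The benchmark is scored on multiple-choice questions. A common harness for MMLU-style evaluation picks the option whose continuation gets the highest length-normalized log-likelihood under a causal LM. Below is a minimal sketch of that generic approach, not the paper's actual evaluation code: the model name (`gpt2`) and the toy question are placeholders for illustration only.

```python
# Minimal sketch of MMLU-style multiple-choice scoring: choose the option
# with the highest average token log-probability given the question.
# Model and question here are placeholders, not the paper's setup or data.
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

MODEL = "gpt2"  # placeholder; the paper evaluates 35 models incl. LLaMA2, Falcon
tok = AutoTokenizer.from_pretrained(MODEL)
model = AutoModelForCausalLM.from_pretrained(MODEL)
model.eval()

def option_logprob(question: str, option: str) -> float:
    """Average log-probability of the option tokens, conditioned on the question."""
    q_ids = tok(question, return_tensors="pt").input_ids
    full = tok(question + " " + option, return_tensors="pt").input_ids
    with torch.no_grad():
        logits = model(full).logits
    # Log-prob of each token, predicted from the preceding position.
    logprobs = torch.log_softmax(logits[0, :-1], dim=-1)
    targets = full[0, 1:]
    token_lp = logprobs[torch.arange(targets.size(0)), targets]
    # Score only the option tokens, length-normalized
    # (boundary retokenization is ignored in this sketch).
    n_opt = full.size(1) - q_ids.size(1)
    return token_lp[-n_opt:].mean().item()

question = "What is 2 + 3?"  # toy stand-in for an MSA exam question
options = ["4", "5", "6", "7"]
pred = max(range(len(options)), key=lambda i: option_logprob(question, options[i]))
print("predicted option:", options[pred])
```

Accuracy over a benchmark is then simply the fraction of questions where the predicted option matches the gold answer; length normalization keeps longer options from being unfairly penalized.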

Authors (13)
  1. Fajri Koto (47 papers)
  2. Haonan Li (43 papers)
  3. Sara Shatnawi (6 papers)
  4. Jad Doughman (4 papers)
  5. Abdelrahman Boda Sadallah (2 papers)
  6. Aisha Alraeesi (3 papers)
  7. Khalid Almubarak (9 papers)
  8. Zaid Alyafeai (21 papers)
  9. Neha Sengupta (8 papers)
  10. Shady Shehata (17 papers)
  11. Nizar Habash (66 papers)
  12. Preslav Nakov (253 papers)
  13. Timothy Baldwin (125 papers)
Citations (21)