
Khayyam Challenge (PersianMMLU): Is Your LLM Truly Wise to The Persian Language? (2404.06644v1)

Published 9 Apr 2024 in cs.CL and cs.AI

Abstract: Evaluating LLMs is challenging due to their generative nature, necessitating precise evaluation methodologies. Additionally, non-English LLM evaluation lags behind English, resulting in the absence or weakness of LLMs for many languages. In response to this necessity, we introduce Khayyam Challenge (also known as PersianMMLU), a meticulously curated collection comprising 20,192 four-choice questions sourced from 38 diverse tasks extracted from Persian examinations, spanning a wide spectrum of subjects, complexities, and ages. The primary objective of the Khayyam Challenge is to facilitate the rigorous evaluation of LLMs that support the Persian language. Distinctive features of the Khayyam Challenge are (i) its comprehensive coverage of various topics, including literary comprehension, mathematics, sciences, logic, intelligence testing, etc., aimed at assessing different facets of LLMs such as language comprehension, reasoning, and information retrieval across various educational stages, from lower primary school to upper secondary school (ii) its inclusion of rich metadata such as human response rates, difficulty levels, and descriptive answers (iii) its utilization of new data to avoid data contamination issues prevalent in existing frameworks (iv) its use of original, non-translated data tailored for Persian speakers, ensuring the framework is free from translation challenges and errors while encompassing cultural nuances (v) its inherent scalability for future data updates and evaluations without requiring special human effort. Previous works lacked an evaluation framework that combined all of these features into a single comprehensive benchmark. Furthermore, we evaluate a wide range of existing LLMs that support the Persian language, with statistical analyses and interpretations of their outputs.

Evaluation and Insights from the Khayyam Challenge: A Benchmark for Persian Language Understanding in LLMs

Introduction

The landscape of LLM evaluation has been enriched by the introduction of the Khayyam Challenge, also known as PersianMMLU. This comprehensive benchmark aims to rigorously assess LLMs' understanding of the Persian language across a diverse array of subjects and complexities. The challenge is named after Omar Khayyam, reflecting the multidisciplinary nature of its tasks. Unique in its construction, the Khayyam Challenge draws its questions from the Iranian educational context, spanning lower primary through upper secondary education. This initiative addresses critical gaps in non-English LLM evaluation and sets the stage for future advances in Persian language processing.

Data Characteristics

The dataset, derived from Iran's "Pellekan Yadgiri" website and the Kanoon Farhangi Amoozesh educational institute, spans 38 subjects with a total of 20,192 four-choice questions. These subjects range widely from mathematics and science to the humanities, each requiring a mix of language comprehension, reasoning, and knowledge retrieval. Notable for its high-quality, expert-validated content, the Khayyam Challenge stands out by including:

  • Rich Metadata: Information on difficulty levels, educational stages, and detailed explanations for each question.
  • Original, Non-Translated Content: Specifically tailored for Persian, avoiding the common pitfalls of translated data.
  • Comprehensive Coverage and Scalability: From literary comprehension to logical reasoning across various educational stages.
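Concretely, each benchmark item pairs a four-choice question with this metadata. A minimal sketch of what one record might look like follows; the field names are illustrative assumptions, not the benchmark's actual schema:

```python
from dataclasses import dataclass
from typing import List, Optional

@dataclass
class KhayyamQuestion:
    """One four-choice item; field names are illustrative, not the official schema."""
    question: str                # Persian question text
    choices: List[str]           # exactly four answer options
    answer_index: int            # 0-based index of the correct choice
    subject: str                 # one of the 38 source tasks, e.g. "mathematics"
    educational_stage: str       # lower primary .. upper secondary
    difficulty: float            # estimated difficulty level
    human_response_rate: Optional[float] = None  # fraction of students answering correctly
    explanation: Optional[str] = None            # descriptive answer / rationale

# Hypothetical example record
q = KhayyamQuestion(
    question="...",  # Persian text elided
    choices=["option 1", "option 2", "option 3", "option 4"],
    answer_index=2,
    subject="mathematics",
    educational_stage="upper secondary",
    difficulty=0.7,
    human_response_rate=0.41,
)
assert len(q.choices) == 4
```

Modeling the record this way makes the metadata (difficulty, human response rate, explanation) first-class fields, which is what enables the per-difficulty and human-comparison analyses discussed later.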

Evaluation Methodology

The paper describes a meticulous evaluation of several state-of-the-art LLMs, including GPT-3.5 and GPT-4, on this dataset. A notable aspect is the use of multiple answer-extraction methods, such as regex-based and probability-based approaches, alongside traditional performance metrics. The analysis also includes a detailed comparison of LLM performance against human benchmarks, shedding light on current models' limitations and the areas requiring improvement.
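A regex-based extractor of the kind mentioned above might look like the following sketch. The patterns are illustrative guesses at common answer formats, not the paper's actual extraction rules:

```python
import re
from typing import Optional

_LETTERS = "ABCD"

def extract_choice(model_output: str) -> Optional[int]:
    """Return the 0-based index of the option a model chose, or None if no match.

    Patterns are illustrative: they cover "Answer: C", "(B)"-style replies,
    and a Persian "option N" phrasing; a real pipeline would need more cases.
    """
    patterns = [
        r"[Aa]nswer\s*(?:is)?\s*[:\-]?\s*\(?([ABCD])\)?",  # "Answer: C", "answer is (B)"
        r"\b([ABCD])\)",                                   # bare "B)" style
        r"گزینه\s*([1-4۱-۴])",                             # Persian "option N"
    ]
    for pat in patterns:
        m = re.search(pat, model_output)
        if m:
            token = m.group(1)
            if token in _LETTERS:
                return _LETTERS.index(token)
            # Normalize Persian digits to ASCII before converting
            digit = int(token.translate(str.maketrans("۱۲۳۴", "1234")))
            return digit - 1
    return None
```

A probability-based alternative would instead score each of the four options by the model's token likelihoods and pick the argmax, avoiding parsing failures at the cost of requiring logit access.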

Observations and Insights

The results presented underscore several important findings:

  • Performance Gaps: LLMs, including GPT-4, showed promising results but still lagged behind human benchmarks. The gap was most pronounced in tasks requiring advanced reasoning, such as mathematics and the natural sciences.
  • Model Comparisons: Among the evaluated models, GPT-4 showed superior performance, yet with a noted need for enhancement to reach human-like understanding and reasoning in Persian.
  • Rich Metadata Utilization: The analysis of metadata, such as question difficulty and the presence of traps, provided deeper insights into the models' operational nuances.

Implications for Future Research

The Khayyam Challenge not only marks a significant advancement in evaluating Persian language understanding in LLMs but also opens several avenues for future research. The detailed insights into model performances and the comprehensive nature of the dataset pave the way for targeted improvements in model architectures and training methodologies. Moreover, the scalable framework of the Khayyam Challenge allows for easy updates and expansions, ensuring its relevance and utility in the fast-evolving field of AI and language understanding.

Concluding Remarks

In summary, the Khayyam Challenge represents a pivotal step towards a deeper and more nuanced understanding of Persian language processing in LLMs. By offering a rigorous, varied, and scalable benchmark, it provides a valuable resource for researchers aiming to push the boundaries of AI language capabilities. The insights gained from this challenge highlight the existing gaps in LLMs' understanding and reasoning in Persian, offering clear directions for future advancements in the field.
