
FarsEval-PKBETS: A new diverse benchmark for evaluating Persian large language models (2504.14690v1)

Published 20 Apr 2025 in cs.CL and cs.AI

Abstract: Research on evaluating and analyzing LLMs has been extensive for resource-rich languages such as English, yet their performance in languages such as Persian has received considerably less attention. This paper introduces the FarsEval-PKBETS benchmark, a subset of the FarsEval project for evaluating LLMs in Persian. This benchmark consists of 4000 questions and answers in various formats, including multiple choice, short answer, and descriptive responses. It covers a wide range of domains and tasks, including medicine, law, religion, Persian language, encyclopedic knowledge, human preferences, social knowledge, ethics and bias, text generation, and respecting others' rights. This benchmark incorporates linguistic, cultural, and local considerations relevant to the Persian language and Iran. To ensure the questions are challenging for current LLMs, three models -- Llama3-70B, PersianMind, and Dorna -- were evaluated using this benchmark. Their average accuracy was below 50%, meaning they provided fully correct answers to fewer than half of the questions. These results indicate that current LLMs are still far from being able to solve this benchmark.

FarsEval-PKBETS: A Benchmark for Persian LLMs

The paper introduces FarsEval-PKBETS, a structured benchmark designed to evaluate the performance of LLMs in Persian. The benchmark responds to the relative neglect of LLM evaluation in lower-resource languages such as Persian, in contrast with the extensive body of research focused on high-resource languages, particularly English.

Composition and Methodology

FarsEval-PKBETS comprises 4,000 questions in various response formats, including multiple-choice, short-answer, and descriptive prompts. The benchmark covers an extensive range of domains: medicine, law, religion, the Persian language, social and encyclopedic knowledge, ethics, and text generation tasks. A significant highlight is its incorporation of linguistic, cultural, and local nuances pertinent to Persian and Iranian contexts, which are often absent from benchmarks for non-English languages. The questions are also deliberately designed to be challenging for current LLMs.
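To make the composition concrete, the sketch below shows one plausible way to represent and filter benchmark records of this kind. This is a minimal illustration, not the paper's published data format: the file name farseval_pkbets.jsonl and all field names (id, domain, format, question, answer) are assumptions.

```python
# Hypothetical record layout for a mixed-format benchmark like
# FarsEval-PKBETS. All field names and the file name are assumptions,
# not the paper's actual schema.
import json
from dataclasses import dataclass

@dataclass
class BenchmarkItem:
    id: str
    domain: str    # e.g. "medicine", "law", "religion", "persian_language"
    format: str    # "multiple_choice", "short_answer", or "descriptive"
    question: str  # question text in Persian
    answer: str    # reference answer (or the correct option label)

def load_items(path: str) -> list[BenchmarkItem]:
    """Load benchmark items from an assumed JSON-lines file."""
    with open(path, encoding="utf-8") as f:
        return [BenchmarkItem(**json.loads(line)) for line in f]

# Example: select only the multiple-choice medicine questions.
items = load_items("farseval_pkbets.jsonl")
mc_medicine = [it for it in items
               if it.domain == "medicine" and it.format == "multiple_choice"]
```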

The benchmark methodology included contributions from experts across different fields, ensuring that the questions are challenging and accurately reflect domain-specific knowledge. This diverse input is key to generating authentic and culturally resonant questions.

Evaluation and Results

The paper evaluates three models on FarsEval-PKBETS: Llama3-70B, PersianMind, and Dorna. Their average accuracy falls below 50%, underscoring the difficulty these LLMs face on this Persian-specific benchmark. Accuracy also varies significantly across categories: performance in general medicine and lexical semantics is notably higher than in Persian language tasks and text generation. These results highlight the models' difficulty with culturally and linguistically specific content and indicate substantial room for improvement in how LLMs handle non-English languages.
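The sketch below shows the kind of per-category scoring such results imply, reusing the hypothetical BenchmarkItem layout from the earlier snippet. The exact-match grading rule is a simplifying assumption: the paper's descriptive-answer items would require human or model-based judging, whose details are not given in this summary.

```python
# Per-domain accuracy sketch under an assumed exact-match grading rule.
# `items` is a list of objects with .domain, .question, and .answer
# attributes (see the BenchmarkItem sketch above); `predict` is any
# callable mapping a question string to the model's answer string.
from collections import defaultdict

def per_domain_accuracy(items, predict):
    """Return a {domain: accuracy} mapping over the benchmark items."""
    correct, total = defaultdict(int), defaultdict(int)
    for it in items:
        total[it.domain] += 1
        if predict(it.question).strip() == it.answer.strip():
            correct[it.domain] += 1
    return {d: correct[d] / total[d] for d in total}
```

Averaging the resulting per-item scores over the whole set yields the overall accuracy figure the paper reports as falling below 50% for all three models.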

Significance and Future Directions

This research carries both theoretical and practical significance. Theoretically, it stresses the need for LLMs to better integrate cultural and linguistic nuances, particularly those outside high-resource languages. Practically, it highlights the inadequacy of current models and evaluation practice and suggests a pathway for LLM improvement. The benchmark may guide future work on LLMs targeting Persian and potentially offer a framework for other underrepresented languages.

Further investigations might expand FarsEval-PKBETS to cover a wider array of linguistic features or incorporate real-world applications tailored to native speakers' cultural contexts. Additionally, continued advances in AI models call for benchmarks that keep pace with, and anticipate, the capabilities of evolving LLMs.

Authors (19)
  1. Mehrnoush Shamsfard (20 papers)
  2. Zahra Saaberi (1 paper)
  3. Mostafa Karimi manesh (1 paper)
  4. Seyed Mohammad Hossein Hashemi (2 papers)
  5. Zahra Vatankhah (1 paper)
  6. Motahareh Ramezani (1 paper)
  7. Niki Pourazin (1 paper)
  8. Tara Zare (1 paper)
  9. Maryam Azimi (5 papers)
  10. Sarina Chitsaz (1 paper)
  11. Sama Khoraminejad (1 paper)
  12. Morteza Mahdavi Mortazavi (1 paper)
  13. Mohammad Mahdi Chizari (1 paper)
  14. Sahar Maleki (1 paper)
  15. Seyed Soroush Majd (2 papers)
  16. Mostafa Masumi (2 papers)
  17. Sayed Ali Musavi Khoeini (1 paper)
  18. Amir Mohseni (2 papers)
  19. Sogol Alipour (1 paper)