FarsEval-PKBETS: A Benchmark for Persian LLMs
The paper introduces FarsEval-PKBETS, a structured benchmark designed to evaluate the performance of LLMs on the Persian language. The benchmark responds to the relative neglect of LLM evaluation in low-resource languages such as Persian, in contrast with the extensive body of research focused on high-resource languages, particularly English.
Composition and Methodology
FarsEval-PKBETS comprises 4,000 questions in various response formats, including multiple-choice, short-answer, and descriptive prompts. The benchmark covers an extensive range of domains: medicine, law, religion, the Persian language, social and encyclopedic knowledge, ethics, and text generation tasks. A significant highlight is its incorporation of linguistic, cultural, and local nuances pertinent to Persian and Iranian contexts, which benchmarks translated or adapted from English typically miss. The dataset also emphasizes deliberately challenging questions designed to probe the limits of current LLM competence.
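To make this composition concrete, the following is a minimal sketch of how one such benchmark item might be represented. The field names (question_id, domain, format, and so on) are illustrative assumptions for exposition, not the dataset's actual schema.

```python
from dataclasses import dataclass
from typing import Optional

# Hypothetical item schema; field names are assumptions,
# not the published FarsEval-PKBETS data format.
@dataclass
class BenchmarkItem:
    question_id: str
    domain: str                     # e.g., "medicine", "law", "Persian language"
    format: str                     # "multiple_choice" | "short_answer" | "descriptive"
    question: str                   # prompt text, in Persian
    choices: Optional[list[str]] = None       # present only for multiple-choice items
    reference_answer: Optional[str] = None    # gold answer, or a rubric for descriptive tasks
```

A schema along these lines would let a single evaluation harness dispatch on `format`, since multiple-choice items can be scored automatically while descriptive ones require rubric- or human-based grading.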
The benchmark was developed with contributions from experts across the covered fields, ensuring the questions are both challenging and faithful to domain-specific knowledge. This diverse input is key to producing authentic, culturally resonant questions.
Evaluation and Results
The paper evaluates three models, Llama3-70B, PersianMind, and Dorna, on the FarsEval-PKBETS benchmark. Their average accuracy falls below 50%, underscoring the difficulty these LLMs face on this Persian-specific benchmark. Accuracy also varies significantly across categories: performance in general medicine and lexical semantics is notably higher than in Persian-language tasks and text generation. These results highlight the models' difficulty with culturally and linguistically specific content and indicate substantial room for improvement in how LLMs handle non-English languages.
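As a rough illustration of how such per-category figures could be computed, the sketch below aggregates binary correctness judgments by domain. The item fields and the scoring callback are assumptions carried over from the hypothetical schema above; the paper's exact evaluation pipeline is not reproduced here.

```python
from collections import defaultdict

def per_category_accuracy(items, predictions, is_correct):
    """Compute accuracy per domain from per-item correctness.

    items: benchmark items, each with a .domain attribute
    predictions: model outputs, aligned index-for-index with items
    is_correct: callable judging a prediction against an item
        (e.g., exact match for multiple choice; descriptive tasks
        would need rubric- or human-based grading instead)
    """
    correct = defaultdict(int)
    total = defaultdict(int)
    for item, pred in zip(items, predictions):
        total[item.domain] += 1
        if is_correct(item, pred):
            correct[item.domain] += 1
    return {domain: correct[domain] / total[domain] for domain in total}
```

Averaging the resulting per-domain scores (or pooling all items) would yield the kind of overall below-50% accuracy the paper reports.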
Significance and Future Directions
This research has both theoretical and practical significance. Theoretically, it underscores the need for LLMs to better integrate cultural and linguistic nuances, particularly for languages outside the high-resource set. Practically, it exposes the inadequacy of current models and evaluation practices and suggests a path toward improvement. The benchmark may guide future work on Persian-capable LLMs and could serve as a template for evaluating other underrepresented languages.
Future work could expand FarsEval-PKBETS with a wider array of linguistic features or with real-world tasks grounded in native speakers' cultural contexts. Moreover, the rapid advancement of AI models demands benchmarks that continually adapt to, and anticipate, the capabilities of evolving LLMs.