FarsEval-PKBETS: Persian NLP Benchmark
- FarsEval-PKBETS comprises the first human-verified Persian KBP corpus (22,015 triples) and a 4,000-item, culturally tailored Q–A benchmark for evaluating LLMs.
- FarsEval-PKBETS is a rigorously curated benchmark set that addresses Persian-specific challenges in entity linking, relation extraction, and domain adaptation.
- Evaluation results reveal critical gaps in LLM performance on Persian tasks, underlining the need for domain-specific fine-tuning and improved linguistic models.
FarsEval-PKBETS denotes a suite of rigorously curated benchmarks targeting two fundamental tasks in Persian natural language processing: knowledge base population (KBP) for the FarsBase Persian knowledge graph, and comprehensive evaluation of LLMs on a culturally, linguistically, and domain-diverse set of reasoning tasks. The term delineates both (1) the first large-scale, human-verified corpus for Persian KBP (Asgari-Bidhendi et al., 2020) and (2) a diverse, expert-driven challenge set for assessing LLMs on a spectrum of domains critical for Persian applications (Shamsfard et al., 20 Apr 2025).
1. Motivation and Scope
The FarsEval-PKBETS benchmarks arose from the acute scarcity of representative, high-fidelity resources for evaluating language technologies in Persian. While languages such as English benefit from mature KBP corpora and a surfeit of LLM evaluation benchmarks spanning diverse knowledge, reasoning, and generative tasks, Persian has historically lacked both KB-scale gold datasets and challenge sets reflecting its unique cultural, linguistic, and regulatory contexts. Existing evaluation resources for Persian tend to derive from translated English datasets, focus narrowly on single domains (e.g., standardized exams, medicine), and rarely probe domain-transfer, generation skills, or culturally situated reasoning (Shamsfard et al., 20 Apr 2025). FarsEval-PKBETS addresses these gaps by offering:
- A human-verified KBP gold standard (22,015 triples) supporting system-level KBP evaluation (Asgari-Bidhendi et al., 2020).
- An LLM benchmark (4,000 Q–A pairs) spanning medicine, law, religion, language, encyclopedic knowledge, social and human preferences, ethics, bias, and generative text tasks, with explicit emphasis on Persian-specific content (Shamsfard et al., 20 Apr 2025).
2. FarsEval-PKBETS for Knowledge Base Population
The KBP component is built on the FarsBase-KBP gold corpus, the first fully human-verified evaluation set for Persian knowledge graph construction (Asgari-Bidhendi et al., 2020). Corpus creation entailed crawling Persian Wikipedia and selecting sentences that contain both the subject and object of an existing FarsBase triple, followed by expert annotation selecting the correct predicate from 406 predicate classes. The annotation schema encompasses the following (a sketch of an instance appears after the list):
- Normalization of subjects and objects to FarsBase entity IDs (via ParsEL).
- Use of FarsBase ontology URIs (e.g., fkgo:birthPlace) as predicates.
- Enrichment of each instance with tokens, POS tags, NER tags, dependency parse, linked entity types, and FarsBase classes.
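A minimal sketch of what a single annotated instance could look like; all field names and identifiers here are illustrative assumptions for exposition, not the released file format:

```python
# Hypothetical shape of one FarsBase-KBP instance (field names and entity IDs
# are illustrative assumptions; the released format may differ).
instance = {
    "sentence": "...",                        # raw Persian Wikipedia sentence
    "subject": "fkg:entity/example_1",        # ParsEL-normalized FarsBase entity ID (made up)
    "object": "fkg:entity/example_2",
    "predicate": "fkgo:birthPlace",           # one of the 406 FarsBase ontology predicates
    "tokens": [],                             # tokenized sentence
    "pos_tags": [],                           # part-of-speech tag per token
    "ner_tags": [],                           # named-entity tag per token
    "dependency_parse": [],                   # (head, relation) per token
    "entity_types": {"subject": "Person", "object": "Location"},
    "fkg_classes": {"subject": "fkgo:Person", "object": "fkgo:City"},  # assumed class URIs
}
```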
Key corpus statistics:
| Metric | Value |
|---|---|
| Sentences/triples | 22,015 |
| Unique predicates | 406 |
| Unique entities | 11,024 |
| Avg. examples/pred. | 54.2 (σ≈102.1) |
Entity types (subjects + objects): 45.2% Person, 20.1% Location, 14.8% Organization, 9.3% Work, 10.6% Miscellaneous.
Predicate coverage spans both dense and rare relations; the distribution is heavily skewed, with the top-10 predicates accounting for a disproportionate share of examples (consistent with the large standard deviation, σ≈102.1, around the mean of 54.2 examples per predicate).
The corpus is split 70% train / 15% dev / 15% test via random sampling stratified by predicate.
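The paper does not publish its splitting code; a straightforward per-predicate stratified split could be reproduced along these lines:

```python
import random
from collections import defaultdict

def stratified_split(instances, ratios=(0.70, 0.15, 0.15), seed=0):
    """Randomly split corpus instances into train/dev/test,
    stratified so each predicate keeps roughly the same proportions."""
    by_pred = defaultdict(list)
    for inst in instances:
        by_pred[inst["predicate"]].append(inst)
    rng = random.Random(seed)
    train, dev, test = [], [], []
    for group in by_pred.values():
        rng.shuffle(group)
        n_train = int(ratios[0] * len(group))
        n_dev = int(ratios[1] * len(group))
        train += group[:n_train]
        dev += group[n_train:n_train + n_dev]
        test += group[n_train + n_dev:]
    return train, dev, test
```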
3. FarsEval-PKBETS for LLM Evaluation
The LLM evaluation benchmark comprises 4,000 Persian question–answer items, designed to probe factual knowledge, reasoning, and text generation in a variety of formats:
| Format | Count | Percentage |
|---|---|---|
| Multiple-choice (MCQ) | 2,000 | 50% |
| Short-answer (SAQ) | 1,200 | 30% |
| Descriptive (DQ) | 800 | 20% |
Twelve primary domains are covered, including medicine (500), law (350), religion (250), Persian language (250), encyclopedic knowledge (300), social knowledge (500), text generation (450), and others. Sourcing draws on Iranian residency exams, official legal texts, religious jurisprudence, and real-world/social-media scenarios. Annotation deliberately exploits Persian-specific ambiguity, idiomatic expressions, and sub-dialectal variation.
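For concreteness, a benchmark item could be represented as follows (the field names are assumptions for exposition; the released schema may differ):

```python
from dataclasses import dataclass, field
from typing import List, Optional

@dataclass
class FarsEvalItem:
    question: str                        # Persian question text
    fmt: str                             # "MCQ", "SAQ", or "DQ"
    domain: str                          # e.g. "medicine", "law", "religion"
    choices: Optional[List[str]] = None  # MCQ options; None for SAQ/DQ
    gold_answer: str = ""                # reference answer or option label
    metadata: dict = field(default_factory=dict)  # source, dialect notes, etc.
```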
A web-based annotation and review platform (“Saba”) is used, supporting live LLM evaluation, chain-of-thought labeling, metadata tracking, and a public leaderboard.
4. Benchmark Evaluation Methodology
For KBP, model performance is assessed at the (subject, predicate, object) triple level, using the following metrics (a computational sketch follows the list):
- Precision ($P$), recall ($R$), and $F_1$ at the strict triple-match level: $P = \frac{TP}{TP+FP}$, $R = \frac{TP}{TP+FN}$, $F_1 = \frac{2PR}{P+R}$.
- Micro- and macro-averages over predicates: $P_{\text{micro}} = \frac{\sum_p TP_p}{\sum_p (TP_p + FP_p)}$ and $P_{\text{macro}} = \frac{1}{|\mathcal{P}|} \sum_{p \in \mathcal{P}} P_p$, where $\mathcal{P}$ is the set of predicates; analogous formulas apply for $R$ and $F_1$.
- Area under the precision-recall curve (AUC-PR).
- Extraction throughput (triples/sentence).
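A minimal sketch of how the micro/macro scores defined above can be computed from gold and predicted triple sets, assuming triples are (subject, predicate, object) tuples:

```python
from collections import defaultdict

def kbp_scores(gold, predicted):
    """Strict triple-match precision/recall/F1 with micro and macro
    averaging over predicates. `gold` and `predicted` are iterables of
    (subject, predicate, object) tuples."""
    gold, predicted = set(gold), set(predicted)
    tp, fp, fn = defaultdict(int), defaultdict(int), defaultdict(int)
    for t in predicted:
        (tp if t in gold else fp)[t[1]] += 1   # key counts by predicate
    for t in gold - predicted:
        fn[t[1]] += 1

    def prf(tp_, fp_, fn_):
        p = tp_ / (tp_ + fp_) if tp_ + fp_ else 0.0
        r = tp_ / (tp_ + fn_) if tp_ + fn_ else 0.0
        f = 2 * p * r / (p + r) if p + r else 0.0
        return p, r, f

    predicates = set(tp) | set(fp) | set(fn)
    micro = prf(sum(tp.values()), sum(fp.values()), sum(fn.values()))
    per_pred = [prf(tp[q], fp[q], fn[q]) for q in predicates]
    macro = tuple(sum(vals) / len(per_pred) for vals in zip(*per_pred))
    return {"micro": micro, "macro": macro}
```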
For LLMs, two accuracy metrics are used (Shamsfard et al., 20 Apr 2025):
- Strict accuracy: only fully correct answers are counted.
- Lenient accuracy: also counts “semi-correct” responses (e.g., answers with minor errors, or a correct answer paired with an incorrect option label).
The task setup is zero-shot; models are prompted exclusively in Persian and tested across all formats and domains without few-shot examples.
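Given human judgements of each answer (the paper's reviewers assign correct/semi-correct/incorrect labels), the two accuracy levels reduce to a simple count; a sketch:

```python
def accuracy(judgements, lenient=False):
    """judgements: human-assigned labels per model answer, each one of
    "correct", "semi_correct", or "incorrect".
    Strict accuracy counts only "correct"; lenient accuracy also
    counts "semi_correct" answers."""
    accepted = {"correct", "semi_correct"} if lenient else {"correct"}
    return sum(j in accepted for j in judgements) / len(judgements)
```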
5. System Architectures and Module Performance
KBP System
The FarsBase-KBP system (Asgari-Bidhendi et al., 2020) operates in five stages:
- Web crawling and preprocessing
- Entity linking (ParsEL: unsupervised, mixing contextual and surface features)
- Relation/information extraction via an ensemble of six modules: PredPatt (Universal Dependencies-based Open IE), DependencyPattern (frequent syntactic templates), PSIE (combined dependency and constituency parsing), RePersian (regular expressions derived from the Dadegan Treebank), TokensRegex (handcrafted rules), and Distant Supervision (PCNN with multi-instance learning)
- Canonicalization via infobox alignment and lexicon matching
- Knowledge fusion (ensemble: accept if found by ≥2 modules, or confidence ≥0.90 in a single module; sketched below)
Human experts verify all candidate triples before addition to FarsBase.
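The fusion rule itself is simple; a minimal sketch of the acceptance logic, assuming each candidate triple carries per-module confidence scores:

```python
def fuse(candidates, theta=0.90):
    """candidates: dict mapping a (subject, predicate, object) triple to a
    list of (module_name, confidence) votes from the six extractors.
    Accept a triple if at least two distinct modules found it, or if any
    single module found it with confidence >= theta."""
    accepted = []
    for triple, votes in candidates.items():
        modules = {name for name, _ in votes}
        if len(modules) >= 2 or any(conf >= theta for _, conf in votes):
            accepted.append(triple)
    return accepted  # accepted triples still go to human verification
```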
| Module | Precision | Recall | F₁ | Triples/Sentence |
|---|---|---|---|---|
| DependencyPattern | 0.7474 | 0.0032 | 0.0064 | 0.019 |
| DistantSup | 0.2610 | 0.2104 | 0.2330 | 0.806 |
| PredPatt | 0.1368 | 0.0006 | 0.0012 | 3.015 |
| RePersian | 0.1747 | 0.0023 | 0.0046 | 0.357 |
| TokensRegex | 0.7829 | 0.1502 | 0.2520 | 1.697 |
| PSIE | 0.1626 | 0.0043 | 0.0083 | 2.035 |
| Fusion (θ=0.90) | 0.7313 | 0.1779 | 0.2862 | 1.812 |
AUC-PR for the fusion system on the test set is 0.412; micro and macro F₁ on the test set are 0.285 and 0.231, respectively. Ablation experiments demonstrate the precision-recall trade-off as the fusion confidence threshold θ varies.
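Given fused-candidate confidences and gold-match labels, the threshold sweep and AUC-PR can be computed with scikit-learn; a sketch:

```python
from sklearn.metrics import auc, precision_recall_curve

def fusion_aucpr(confidences, labels):
    """confidences: fusion confidence per candidate triple;
    labels: 1 if the candidate matches a gold triple, else 0.
    Sweeping the threshold theta traces the precision-recall curve.
    Caveat: gold triples never proposed by any module are absent here,
    so full-corpus recall would be lower than this curve suggests."""
    precision, recall, _ = precision_recall_curve(labels, confidences)
    return auc(recall, precision)
```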
LLM Results
Three LLMs were evaluated (Shamsfard et al., 20 Apr 2025):
| Model | Strict Acc. | Lenient Acc. |
|---|---|---|
| Llama 3-70B | 0.47 | 0.54 |
| PersianMind | 0.19 | 0.23 |
| Dorna (8B) | 0.30 | 0.39 |
Performance varies widely across domains: the best results are on NLP paraphrase tasks (0.70 accuracy), and generative performance is strongest on personality- and style-centric tasks. The weakest results are in emergency medicine (0.14 accuracy), alternative medicine (0.27), and Persian grammar/proverbs (0.24). Even Llama 3-70B remains below 0.60 strict accuracy on core domain-knowledge tasks.
6. Error Analysis and Challenges
Systematic limitations identified in KBP include low recall and overgeneration in certain extractors, noisy distant supervision, entity-linking errors, shortcomings in parsing Persian morphosyntax, and canonicalization failures, especially for idiomatic verbs (~12% error rate). LLM evaluation exposes deficiencies in justification generation (43% of multiple-choice chain-of-thought justifications are inconsistent), weaknesses in the fluency and factuality of free-form answers, and pronounced struggles with genuinely local or specialized knowledge (medical, legal) as well as cultural reasoning (ethics, bias, respecting rights).
Common KBP failure cases: nested and coordinative structures (e.g., “A and B founded C”), implicit relations (requiring world knowledge), and non-standard Persian orthography in Wikipedia source texts.
Recommendations for the KBP track include expanding data sources, integrating multilingual embeddings (e.g., mBERT), crowd-based validation of ambiguous cases, semi-automatic rule induction for extraction, more robust canonicalization (e.g., via ELMo-based clustering), and the release of a second tier of negative examples.
For LLMs, the data highlights persistent bottlenecks in Persian-centric domain adaptation and reasoning, indicating the need for deeper, domain-specific fine-tuning and more deliberate integration of local factuality and reasoning priors.
7. Impact and Prospects
FarsEval-PKBETS sets new standards for rigor, coverage, and challenge in Persian-language NLP evaluation (Asgari-Bidhendi et al., 2020; Shamsfard et al., 20 Apr 2025). Its KBP corpus has enabled the first micro/macro comparative evaluation of Persian KBP systems at scale, with an established methodology for per-predicate metrics and detailed ablation of fusion rules. The LLM benchmark provides a uniquely demanding set of generation, reasoning, and cultural-adaptation tasks, sharply exposing the sub-50% strict-accuracy ceiling of current state-of-the-art models.
Potential extensions include dynamic updates to reflect legal and regulatory change, new modules for low-resource Persian dialects, and more rigorous generative and interactive evaluation (multi-turn legal/medical consultation scenarios). The released leaderboards and annotation APIs further support transparent, ongoing benchmarking.
FarsEval-PKBETS, by combining exhaustive expert-driven annotation, diverse domain/task distribution, and public infrastructure for evaluation, establishes a replicable framework for benchmarking in other under-resourced languages and offers empirical guidance for subsequent model and data development in Persian.