Hebrew LLM Benchmark Suite

Updated 5 March 2026

Hebrew LLM Benchmark Suite is a structured set of tasks and datasets designed to assess LLM performance on Hebrew, addressing morphological and syntactic challenges.
It employs specialized evaluation metrics like TLNLS and human/GPT-4 judgments to accurately gauge both extractive and generative abilities.
Empirical findings show that language-specific models often outperform multilingual baselines, highlighting the need for Hebrew-centric methodologies.

A Hebrew LLM Benchmark Suite is a structured collection of evaluation tasks and datasets designed to assess the performance of LLMs on Modern Hebrew, with explicit attention to the unique morphological and linguistic properties of the language. These suites target both closed-domain skills (e.g., factual knowledge, reasoning, and linguistic processing) and open-ended generative abilities (e.g., abstractive summarization, translation, and diacritization), addressing the particular challenges posed by Hebrew as a morphologically rich language (MRL). The emergence of these benchmarks follows the rapid development of sovereign Hebrew LLMs (such as DictaLM 2.0 and 3.0) and the need for rigorous, Hebrew-centric methodologies for evaluation and comparison (Shmidman et al., 2024, Shmidman et al., 2 Feb 2026, Paz-Argaman et al., 2024, Cohen et al., 3 Aug 2025). Benchmark suites are informed by advances in dataset curation, specialized evaluation metrics (e.g., TLNLS, pairwise GPT-4 judgments), and a commitment to public leaderboards for transparent progress tracking.

1. Motivation and Principles of Hebrew LLM Benchmarking

Recent advances in LLMs for English and other high-resource languages have highlighted gaps in both resources and evaluation methodologies for Hebrew. Most existing Hebrew NLP benchmarks have emphasized morpho-syntactic tasks (e.g., POS tagging, morphological inflection), with minimal coverage of semantic comprehension, generative abstraction, or complex dialogue (Cohen et al., 3 Aug 2025). Hebrew’s morphological richness (pervasive affixation, flexible word order, and text that is normally written without diacritics) complicates the application of standard evaluation metrics and motivates tailored approaches (Paz-Argaman et al., 2024, Shmidman et al., 2024). Benchmarking suites are thus constructed to:

Enable apples-to-apples comparison of both open and proprietary LLMs on a spectrum of Hebrew-specific NLP tasks.
Address both extractive and generative capabilities, including reading comprehension, classification, translation, and summarization.
Incorporate metrics aware of Hebrew’s morphological and orthographic properties, mitigating the biases of standard token-level metrics.
Provide open leaderboards and benchmark definitions for reproducibility and community benchmarking (Shmidman et al., 2024, Shmidman et al., 2 Feb 2026).

2. Core Datasets and Tasks

Hebrew LLM benchmark suites span multiple domains and evaluation regimes, integrating both well-established and novel datasets. The principal components across recent suites are:

Task	Key Dataset(s)	Evaluation Protocol
Machine Reading Comprehension (QA)	HeQ (Cohen et al., 3 Aug 2025), ParaShoot	TLNLS, Accuracy
Sentiment Analysis	Hebrew Sentiment (2024)	Accuracy, Few-shot prompts
Winograd Schema Challenge	Hebrew Winograd [Shwartz 2021]	Accuracy, Few-shot/zero-shot
Translation	NeuLabs-TedTalks, web-crawled corpora	BLEU, Pairwise GPT-4o judgments
Abstractive Summarization	HeSum (Paz-Argaman et al., 2024), news + wiki	ROUGE, BertScore, Human/GPT-4o ratings
Diacritization (Nikud Restoration)	In-house curated corpus	Word-level macro accuracy
Israeli Trivia QA	300 questions curated by Avraham Elitzur	Exact-match accuracy

This design covers comprehension (QA), reasoning (Winograd), classification (sentiment), generation (summarization, translation), and Hebrew-specific tasks (diacritization), each tailored to Hebrew’s linguistic specificity (Shmidman et al., 2024, Shmidman et al., 2 Feb 2026, Paz-Argaman et al., 2024).

3. Dataset Construction and Linguistic Characterization

Dataset curation methodologies prioritize both representativeness and control for Hebrew morphological diversity:

HeSum Abstractive Summarization (Paz-Argaman et al., 2024): 10,000 article–summary pairs from independent journalism (train: 8,000 / val: 1,000 / test: 1,000), with professional journalists as annotators and stringent filtering for summary length and fluency. Linguistic analyses reveal 42% novel unigrams and 73% novel bigrams in summaries (vs. 13%/36% in major English corpora), and extreme compression ratios (≈4.5%). Morphological ambiguity is measured by word-lattice ambiguity (≈50 segmentations/token) and construct-state usage.
HeQ Reading Comprehension (Cohen et al., 3 Aug 2025): 30,147 QA pairs, evenly split between Hebrew Wikipedia and GeekTime tech news, emphasizing questions that require both surface-form and inference-driven answer selection. Annotation permits flexible span boundaries within tokens and applies systematic crowdsourced quality control. Multiple valid spans are collected per instance.
Trivia, Sentiment, and Diacritization: The Israeli trivia and sentiment analysis sets are constructed through expert curation or by professional linguists. Diacritization datasets involve manual gold annotation by linguists to ensure coverage of ambiguous phenomena.
Translation and Generation: Test sets are constructed from held-out, domain-diverse sentences, balanced for length and genre, with comparative evaluation against strong multilingual baselines.

Such datasets are repeatedly characterized by basic statistics (token/lemma vocabulary, compression/abstraction, redundancy, morphological indicators) and tailored for morphologically rich language (MRL) evaluation (Paz-Argaman et al., 2024, Cohen et al., 3 Aug 2025).
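The abstraction and compression statistics mentioned above can be illustrated with a minimal Python sketch. It rests on several assumptions that are not taken from the cited papers: tokens are obtained by whitespace splitting (no morphological segmentation), novelty is counted over n-gram occurrences in the summary, and the compression ratio is taken as summary length divided by article length in words. The function names are hypothetical, and the cited work may define these quantities differently.

```python
from typing import List, Tuple


def ngrams(tokens: List[str], n: int) -> List[Tuple[str, ...]]:
    """Return all contiguous n-grams of a token list, in order."""
    return [tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1)]


def novel_ngram_ratio(article: str, summary: str, n: int) -> float:
    """Fraction of summary n-grams that never occur in the article.

    Assumption: whitespace tokenization; counts n-gram occurrences, not types.
    """
    article_ngrams = set(ngrams(article.split(), n))
    summary_ngrams = ngrams(summary.split(), n)
    if not summary_ngrams:
        return 0.0
    novel = sum(1 for gram in summary_ngrams if gram not in article_ngrams)
    return novel / len(summary_ngrams)


def compression_ratio(article: str, summary: str) -> float:
    """Summary length over article length, both measured in words.

    Assumption: this is one common convention, chosen for illustration.
    """
    article_len = len(article.split())
    if article_len == 0:
        return 0.0
    return len(summary.split()) / article_len


if __name__ == "__main__":
    article = "a b c d e f g h"
    summary = "a b x"
    print(novel_ngram_ratio(article, summary, 1))  # share of novel unigrams
    print(novel_ngram_ratio(article, summary, 2))  # share of novel bigrams
    print(compression_ratio(article, summary))
```

Because Hebrew attaches prepositions, conjunctions, and the definite article to the following word, a whitespace-based count is likely to overstate novelty relative to a count over morphologically segmented tokens, which is one reason such statistics are best compared only under the same tokenization.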

4. Evaluation Metrics and Methodologies

Hebrew LLM benchmarks innovate in both metric definition and evaluation protocol to ensure morphologically robust and linguistically pertinent assessment:

Token-Level Normalized Levenshtein Similarity (TLNLS) (Cohen et al., 3 Aug 2025): For extractive QA, TLNLS replaces EM/F1 with a score that rewards character-level proximity between predicted and gold tokens, which makes it robust to Hebrew affixation and span variability:

$TLNLS(P,G) = \frac{1}{\max(|G|,|P|)} \sum_{g_i \in G} \max_{p_j \in P} ls(g_i,p_j)$

where $P$ and $G$ are the token sequences of the predicted and gold answers, $ls(s_1,s_2) = 1 - lev(s_1,s_2)/\max(|s_1|,|s_2|)$, and $lev$ denotes Levenshtein distance.

ROUGE, BertScore, and Semantic Similarity: For summarization, standard ROUGE-N ($N=1,2$) and ROUGE-L metrics are used, but their limitations (e.g., negative correlation with human quality in Hebrew) are empirically shown (Paz-Argaman et al., 2024). BertScore is calculated with AlephBERT, a monolingual Hebrew backbone.
Pairwise Preference Judgments: For generative tasks (summarization, translation), model outputs are compared directly by LLM judges (GPT-4, GPT-4o), yielding win rates against strong baselines in the absence of human gold references (Shmidman et al., 2 Feb 2026).
Human Evaluation: Likert-scale rating protocols (coherence, completeness, relevance, fluency) are employed, with explicit reporting of inter-annotator agreement (Krippendorff’s α, Cohen’s κ) (Paz-Argaman et al., 2024).
Morphology-Sensitive Intrinsic Metrics: Novel intrinsic metrics, such as redundancy (RED), compression ratio (CMPw), and morphology-aware probes (e.g., for diacritics, smixut, anaphors), are used to gauge more fine-grained error types (Paz-Argaman et al., 2024, Shmidman et al., 2 Feb 2026).
Standardized Reporting: Leaderboards mandate reporting on both surface (token/morpheme) and semantic metrics, and recommend α ≥ 0.75 for inter-annotator agreement in human evaluation (Paz-Argaman et al., 2024).
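The TLNLS formula given earlier in this section can be sketched in a few lines of Python. This is an illustrative implementation under stated assumptions, not the reference code of the cited paper: tokens are obtained by whitespace splitting, the empty-input conventions are chosen here for convenience, and the function names are hypothetical.

```python
def levenshtein(a: str, b: str) -> int:
    """Character-level edit distance (insert, delete, substitute), row by row."""
    if len(a) < len(b):
        a, b = b, a  # keep the shorter string on the inner loop
    prev = list(range(len(b) + 1))
    for i, ch_a in enumerate(a, start=1):
        curr = [i]
        for j, ch_b in enumerate(b, start=1):
            cost = 0 if ch_a == ch_b else 1
            curr.append(min(prev[j] + 1,          # deletion
                            curr[j - 1] + 1,      # insertion
                            prev[j - 1] + cost))  # substitution or match
        prev = curr
    return prev[-1]


def ls(s1: str, s2: str) -> float:
    """Normalized similarity: 1 - lev(s1, s2) / max(|s1|, |s2|)."""
    longest = max(len(s1), len(s2))
    if longest == 0:
        return 1.0  # assumption: two empty strings count as identical
    return 1.0 - levenshtein(s1, s2) / longest


def tlnls(prediction: str, gold: str) -> float:
    """TLNLS(P, G): each gold token is matched to its closest predicted token."""
    pred_tokens = prediction.split()  # assumption: whitespace tokenization
    gold_tokens = gold.split()
    if not pred_tokens or not gold_tokens:
        # assumption: empty vs. empty scores 1, empty vs. non-empty scores 0
        return 1.0 if not pred_tokens and not gold_tokens else 0.0
    total = sum(max(ls(g, p) for p in pred_tokens) for g in gold_tokens)
    return total / max(len(gold_tokens), len(pred_tokens))


if __name__ == "__main__":
    # The prediction carries the definite-article prefix on both words.
    print(round(tlnls("הבית הגדול", "בית גדול"), 3))
```

In the usage example, exact match would score the prediction as wrong because each predicted word carries an extra one-letter prefix, whereas TLNLS gives partial credit of roughly 0.775 (token similarities of 0.75 and 0.8, averaged over two tokens).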

5. Empirical Findings and Model Performance

Comprehensive evaluations demonstrate several phenomena:

Reading Comprehension (HeQ): Multilingual models (mBERT) outperform monolingual AlephBERTs despite less Hebrew-specific pretraining, suggesting the importance of cross-lingual signal. TLNLS offers more realistic performance stratification than EM/F1, particularly for affix-rich answers (Cohen et al., 3 Aug 2025).
Abstractive Summarization (HeSum): Fine-tuned mLongT5 achieves the highest ROUGE (≈17.5), but GPT-4 and GPT-3.5 exhibit superior BertScore (77.3, 77.0) and human scores (coherence and completeness), highlighting the inadequacy of n-gram overlap for capturing Hebrew summarization quality (Paz-Argaman et al., 2024).
Suite-Wide Baselines: Instruction-tuned DictaLM 2.0 and 3.0 models outperform other open-weight and commercial LLMs on most Hebrew tasks. For instance, DictaLM-3.0-24B achieves diacritization accuracy of 86.86% (vs. 60.21% for Gemma-3-27B-it) and a summarization win rate more than 25 points above baseline (Shmidman et al., 2 Feb 2026).
Error Typology: All evaluated models exhibit Hebrew-specific error types, including incorrect gender/definiteness, improper diacritic restoration, over-copying (in seq2seq models), hallucinations, and cross-lingual transfer artifacts (such as gender misinflection due to English-style stereotypes) (Paz-Argaman et al., 2024, Shmidman et al., 2 Feb 2026).

Performance gaps are largest on morphology-intensive, Hebrew-only tasks, and when evaluation criteria appropriately reward semantic similarity in the presence of rich morphology.
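Win rates from pairwise judge comparisons, as reported above for summarization, can be aggregated as in the following minimal sketch. The label vocabulary ("win", "loss", "tie"), the half-credit treatment of ties, and the function name are assumptions made for illustration; the cited work may aggregate judgments differently.

```python
from collections import Counter
from typing import Iterable

VALID_LABELS = {"win", "loss", "tie"}  # hypothetical judge verdict labels


def win_rate(judgments: Iterable[str], tie_weight: float = 0.5) -> float:
    """Share of comparisons won by the candidate model against a baseline.

    Each judgment is the verdict for one prompt. Ties contribute
    `tie_weight` of a win (assumption: half credit by default).
    """
    counts = Counter(judgments)
    unknown = set(counts) - VALID_LABELS
    if unknown:
        raise ValueError(f"unexpected labels: {sorted(unknown)}")
    total = sum(counts.values())
    if total == 0:
        raise ValueError("no judgments provided")
    return (counts["win"] + tie_weight * counts["tie"]) / total


if __name__ == "__main__":
    verdicts = ["win", "win", "loss", "tie"]
    print(win_rate(verdicts))  # (2 wins + half of 1 tie) / 4 comparisons
```

With small test sets, a few flipped verdicts move the rate by several points, so win rates are most informative when reported together with the number of comparisons.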

6. Limitations and Recommendations for Future Extensions

The current Hebrew LLM benchmark landscape presents several challenges:

Dataset sizes are modest relative to English analogs (e.g., only 75 summarization samples in Shmidman et al., 2024), potentially limiting the statistical power to distinguish models on a given task.
Few-shot and zero-shot evaluation formats are prevalent, but do not assess model adaptivity under full fine-tuning scenarios.
Domains such as coreference resolution, code-switching, and conversational Hebrew remain underrepresented.
Robustness testing, including adversarial perturbations, social-media dialects, and noisy input, has yet to be systematically incorporated (Shmidman et al., 2024).

The benchmark community recommends expanding dataset scale and diversity, introducing additional task types (NLI, multi-turn dialogue), developing Hebrew-centric robustness tests, and integrating more morphology-aware and semantic metrics (e.g., COHESENTIA, MoverScore). Open-source releases of annotation protocols, metric scripts, and evaluation codebases are encouraged to standardize benchmarking practice (Paz-Argaman et al., 2024, Shmidman et al., 2024, Cohen et al., 3 Aug 2025).

7. Impact, Significance, and Open Benchmarks

The assembly of comprehensive Hebrew LLM benchmark suites has enabled the consistent, transparent evaluation of both sovereign and generalist LLMs on uniquely Hebrew tasks. The publication of leaderboards (e.g., https://huggingface.co/spaces/hebrew-LLM-leaderboard/leaderboard) accelerates community feedback, fosters model improvement, and sets state-of-the-art baselines for further research (Shmidman et al., 2024, Shmidman et al., 2 Feb 2026). The direct comparison of sovereign Hebrew-focused LLMs to large multilingual baselines demonstrates the importance of language-specific pretraining and tailored benchmarks, with implications for low-resource language modeling in general.

By integrating datasets such as HeSum, HeQ, and expert-curated evaluation resources under rigorously defined, morphology-aware protocols, the Hebrew LLM Benchmark Suite provides a principled foundation for both academic study and practical deployment of advanced NLU technology in Hebrew (Paz-Argaman et al., 2024, Shmidman et al., 2024, Shmidman et al., 2 Feb 2026, Cohen et al., 3 Aug 2025).