MEHDIE Hebrew-Arabic Benchmark
- The paper details a comprehensive framework for evaluating MT systems through the synthesis, curation, and validation of Arabic–Hebrew parallel corpora.
- MEHDIE categorizes data into spoken transcriptions, subtitles, and domain-specific corpora, ensuring reproducible train/dev/test splits and transparent preprocessing methods.
- MT architectures benchmarked include phrase-based SMT and neural MT, with techniques like tokenization, UNK-replacement, and attention-based charCNN enhancing translation quality.
The MEHDIE Hebrew-Arabic Benchmark is a standardized suite and methodological framework for evaluating machine translation (MT) systems and resources between Arabic and Hebrew. It is anchored in the synthesis, curation, and empirical validation of parallel corpora and robust evaluation protocols for both phrase-based statistical and neural MT, with a focus on reproducibility, domain coverage, and extensibility across research and application domains (Belinkov et al., 2016).
1. Parallel Corpora and Resource Specification
MEHDIE distinguishes three principal categories of Arabic–Hebrew parallel data: spoken language transcription, subtitle collections, and domain-specific/evaluation sets. Each corpus is characterized by domain, size, alignment method, normalization/tokenization protocol, and licensing status. The primary corpora are detailed below:
| Corpus | Sentences | Domain |
|---|---|---|
| WIT (TED) | 200 K | Talk transcripts |
| OpenSubtitles | 14.6 M | Movie/TV subtitles |
| OpenSubtitles-Alt | 9.5 M | Alternative subtitle lines |
| GNOME | 600 K | Software localization |
| KDE | 80.5 K | Software localization |
| Ubuntu | 51.3 K | Software localization |
| Shilon et al. | 1.6 K | News (eval only) |
| Tatoeba | 0.9 K | Community examples |
| GlobalVoices | 76 | News articles |
Alignment methods include automatic caption-to-sentence joining (WIT), HunAlign-based subtitle alignment (OpenSubtitles), TMX translation memories (software localization), and manual alignment for curated news and evaluation sets (Shilon et al.). Arabic text is normalized with Unicode procedures, with ATB-style clitic splitting via MADAMIRA or Farasa applied during training; Hebrew receives minimal normalization and punctuation separation. Licensing varies: WIT (CC-BY), OpenSubtitles (CC-BY-SA), OPUS-derived localization (MIT/GPL), Tatoeba (CC0), Shilon (research-only) (Belinkov et al., 2016).
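The Arabic Unicode normalization step can be illustrated with a minimal sketch. The specific rules below (tatweel and diacritic stripping, alef and alef-maqsura unification) are one common convention and an assumption here, not the paper's documented recipe:

```python
import re
import unicodedata

def normalize_arabic(text: str) -> str:
    """Toy Arabic normalization: NFKC, strip tatweel and short-vowel
    diacritics, unify alef variants and alef maqsura (one common
    convention; not a documented MEHDIE recipe)."""
    text = unicodedata.normalize("NFKC", text)
    text = text.replace("\u0640", "")                       # tatweel
    text = re.sub("[\u064B-\u0652]", "", text)              # harakat
    text = re.sub("[\u0622\u0623\u0625]", "\u0627", text)   # alef variants
    text = text.replace("\u0649", "\u064A")                 # alef maqsura -> yaa
    return text
```

Publishing such a script alongside the corpora is exactly the kind of transparent preprocessing the benchmark calls for.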
2. Machine Translation Architectures and Methodologies
Two core MT approaches are benchmarked: Phrase-Based Statistical MT (PBMT) and Neural MT (NMT).
- Phrase-Based SMT (PBMT):
- Decoding objective: the standard log-linear model,

  $$\hat{e} = \arg\max_{e} \sum_{i} \lambda_i \, h_i(e, f),$$

  where $f$ is the source sentence, the $h_i$ are feature functions (phrase translation, reordering, and language-model scores), and the $\lambda_i$ are tuned weights.
- Word alignments are generated via fast_align (symmetrized with grow-diag-final-and), phrase pairs are extracted up to 7 tokens, and lexicalized msd-bidirectional-fe reordering is used. The target-side language model is a 5-gram KenLM with modified Kneser-Ney smoothing. Model weights are tuned with MERT to optimize BLEU. Sentences longer than 80 tokens are discarded (Belinkov et al., 2016).
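PBMT decoding scores each candidate translation as a weighted sum of feature values, with the weights tuned by MERT. A minimal sketch, with illustrative hypothesis structure and feature values:

```python
def loglinear_score(features, weights):
    """Weighted sum of log-domain feature scores
    (phrase translation, reordering, language model, ...)."""
    return sum(w * h for w, h in zip(weights, features))

def best_hypothesis(hypotheses, weights):
    """Argmax over candidate translations; MERT tunes `weights` so
    that this argmax correlates with BLEU on a development set."""
    return max(hypotheses, key=lambda hyp: loglinear_score(hyp["features"], weights))

# Two toy hypotheses with (translation-model, language-model) log scores.
hyps = [
    {"text": "candidate a", "features": [-1.0, -2.0]},
    {"text": "candidate b", "features": [-0.5, -3.0]},
]
```

Which hypothesis wins depends entirely on the weight vector, which is why MERT tuning on a development set is a required step.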
- Neural MT (NMT):
- Architecture: Attention-based encoder–decoder, stacked LSTMs (configurations: 2×500 units and 4×1000 units).
- Model probability:

  $$p(y_t \mid y_{<t}, x) = \mathrm{softmax}(W s_t + b),$$

  with $s_t$ denoting the decoder state, computed from the previous decoder state $s_{t-1}$, the attention context $c_t$, and the embedding of the previous target word $y_{t-1}$.
- Input representation: character-level convolutional neural networks (charCNN) replace standard word embeddings, using 100-dim character embeddings, 4 filter widths (1–4), and 100 filters per width, followed by max-pooling. Sub-word modeling with charCNN was sufficient; no BPE experiments were reported.
- Vocabulary: 50,000 most frequent tokens per language.
- Preprocessing employs ATB tokenization (Arabic) and punctuation splitting (Hebrew) (Belinkov et al., 2016).
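The charCNN input layer can be sketched in NumPy. The dimensions mirror the reported configuration (100-dim character embeddings, filter widths 1–4, 100 filters per width), but the random weights, tanh nonlinearity, and zero-padding of short words are illustrative assumptions, not the paper's exact implementation:

```python
import numpy as np

rng = np.random.default_rng(0)

# Dimensions from the reported setup; weights here are random stand-ins.
CHAR_DIM, WIDTHS, N_FILTERS = 100, (1, 2, 3, 4), 100

char_vocab = {c: i for i, c in enumerate("abcdefghijklmnopqrstuvwxyz")}
char_emb = rng.normal(size=(len(char_vocab), CHAR_DIM))
filters = {w: rng.normal(size=(N_FILTERS, w * CHAR_DIM)) for w in WIDTHS}

def charcnn_word_vector(word: str) -> np.ndarray:
    """Slide each filter bank over the word's character embeddings,
    apply tanh, and max-pool over positions; concatenating the four
    pooled banks yields a 400-dim word representation."""
    chars = np.stack([char_emb[char_vocab[c]] for c in word])
    pooled = []
    for w, bank in filters.items():
        if len(word) < w:                       # word shorter than filter:
            pooled.append(np.zeros(N_FILTERS))  # emit zeros (a simplification)
            continue
        windows = np.stack(
            [chars[i:i + w].ravel() for i in range(len(word) - w + 1)]
        )
        pooled.append(np.tanh(windows @ bank.T).max(axis=0))
    return np.concatenate(pooled)
```

The resulting vector feeds the encoder in place of a word embedding, which is how the model sees sub-word structure without enlarging the vocabulary.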
3. Experimental Setup, Evaluation Protocols, and Results
The canonical split for WIT (TED) comprises 200K training pairs, 7.3K development, and 0.9K test sentences, with held-out data from 2015–16 reserved for future validation.
- PBMT training: fast_align, KenLM (5-gram), MERT for BLEU.
- NMT training: SGD (learning rate 1.0), batch size 64, maximum sentence length 50, beam search width 5, early stopping on dev perplexity.
Evaluation uses BLEU (multi-bleu.perl, case-sensitive), Meteor Universal v1.5 (using the phrase table as a paraphrase resource), and perplexity. Note that the perplexity figures differ by paradigm: for PBMT they are target-side language-model perplexities, while for NMT they derive from the model's cross-entropy loss, so the PBMT and NMT values are not directly comparable.
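A toy sentence-level BLEU with multi-reference clipping and a brevity penalty illustrates what multi-bleu.perl computes at corpus level; this sketch omits corpus aggregation and smoothing:

```python
import math
from collections import Counter

def ngrams(tokens, n):
    """Multiset of n-grams in a token list."""
    return Counter(tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1))

def bleu(hyp, refs, max_n=4):
    """Sentence-level BLEU: clipped n-gram precisions (max counts
    taken over all references) combined by geometric mean, scaled
    by a brevity penalty against the closest-length reference."""
    precisions = []
    for n in range(1, max_n + 1):
        hyp_ng = ngrams(hyp, n)
        if not hyp_ng:
            return 0.0
        max_ref = Counter()
        for ref in refs:
            for g, c in ngrams(ref, n).items():
                max_ref[g] = max(max_ref[g], c)
        clipped = sum(min(c, max_ref[g]) for g, c in hyp_ng.items())
        if clipped == 0:
            return 0.0
        precisions.append(clipped / sum(hyp_ng.values()))
    closest = min(refs, key=lambda r: abs(len(r) - len(hyp)))
    bp = 1.0 if len(hyp) >= len(closest) else math.exp(1 - len(closest) / len(hyp))
    return bp * math.exp(sum(math.log(p) for p in precisions) / max_n)
```

The multi-reference clipping step is where alternative subtitle lines (OpenSubtitles-Alt) would raise scores for legitimate translation variants.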
| System | BLEU | Meteor | PPL |
|---|---|---|---|
| PBMT (baseline) | 9.31 | 32.30 | 478.4 |
| PBMT + Farasa ATB | 9.51 | 33.38 | 335.5 |
| PBMT + MADAMIRA ATB | 9.63 | 32.90 | 342.5 |
| NMT (2×500LSTM) | 9.91 | 30.55 | 2.275 |
| NMT (4×1000LSTM) | 9.92 | 30.46 | 2.214 |
| NMT + UNK-replacement | 10.12 | 31.84 | 2.275 |
| NMT + charCNN | 10.65 | 32.43 | 2.239 |
| NMT + charCNN + UNK-replacement | 10.86 | 33.61 | 2.239 |
UNK-replacement operates by copying the highest-attention source word to the target, reducing OOV errors (Belinkov et al., 2016).
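The attention-guided copy described above can be sketched directly; the dictionary-lookup step that some UNK-replacement variants add before copying is omitted here:

```python
def replace_unks(target_tokens, source_tokens, attention):
    """Post-process NMT output: substitute each <unk> with the source
    word that received the highest attention weight at that step.

    attention[t][s] is the weight on source position s at target step t.
    """
    out = []
    for t, tok in enumerate(target_tokens):
        if tok == "<unk>":
            best = max(range(len(source_tokens)), key=lambda s: attention[t][s])
            tok = source_tokens[best]
        out.append(tok)
    return out
```

For closely related scripts and shared named entities, the raw copy alone is often a usable translation, which is why this simple heuristic recovers BLEU and Meteor points.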
4. Data Design, Tokenization, and Evaluation Best Practices
- Key corpora—WIT (TED), Shilon’s news set, GNOME/KDE/Ubuntu (localization), and optionally OpenSubtitles—should be integrated, supporting reproducible train/dev/test splits per domain and cross-domain generalization.
- Multi-reference test sets using OpenSubtitles-Alt are recommended to achieve robust BLEU and Meteor measurements.
- Arabic consistently benefits from ATB-style tokenization (clitic splitting), regardless of MT paradigm. Hebrew received only punctuation splitting in reported experiments; further morphological analysis is a noted area for future research.
- For PBMT, parameter tuning by MERT/PRO is essential; for NMT, charCNN or BPE mitigates vocabulary and morphology issues, and UNK-replacement (guided by attention weights) addresses rare word translation (Belinkov et al., 2016).
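ATB-style clitic splitting can be illustrated with a lexicon-gated toy rule on Buckwalter-transliterated Arabic. Real tools such as MADAMIRA and Farasa rely on full morphological analysis; the proclitic inventory and mini-lexicon below are hypothetical:

```python
def split_clitics(token: str, lexicon: set[str]) -> list[str]:
    """Split one leading proclitic (conjunction or preposition) only
    when the remainder is a known word; otherwise keep the token whole.
    A toy gate, not real morphological analysis."""
    if token in lexicon:
        return [token]
    if token[:1] in "wfblk" and token[1:] in lexicon:
        return [token[0] + "+", token[1:]]
    return [token]

# Hypothetical mini-lexicon in Buckwalter transliteration.
LEXICON = {"ktAb", "byt"}
```

The lexicon gate matters: without it, any word that happens to begin with w, f, b, l, or k would be split incorrectly, which is why production segmenters disambiguate morphologically.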
5. Recommendations for MEHDIE Benchmark Construction
The MEHDIE registry should encompass:
- All high-quality parallel corpora with accurate domain labels and licensing verifications.
- Multiple reference translations where feasible.
- Standardized data splits per domain, plus dedicated generalization tracks.
- Preprocessed and raw text releases, with explicit normalization/tokenization scripts.
- Evaluation protocols employing BLEU (case-sensitive) and Meteor with multi-reference capability.
Explicit separation of preprocessing from raw data, transparent licensing, and rigorous evaluation splits are thus central to MEHDIE's reproducibility and interoperability mandates.
6. Prospective Extensions and Open Challenges
Proposed MEHDIE extensions include the addition of a Hebrew→Arabic track and the collection or curation of symmetrical corpora for reverse translation. Further, the injection of morphosyntactic annotations (e.g., POS tags) into development and test sets is advised to foster targeted error analysis. The provision of both preprocessed and raw text further supports transparency, replication, and diverse downstream research use cases (Belinkov et al., 2016).