DictaLM 2.0: Adapting Mistral-7B for Hebrew
- DictaLM 2.0 is a family of large language models designed for Modern Hebrew, extending Mistral-7B with a specialized bilingual training corpus of 200B tokens.
- It employs a two-stage training process involving embedding distillation and LM-head calibration to seamlessly integrate an expanded Hebrew tokenizer.
- Evaluation on a dedicated Hebrew benchmark demonstrates state-of-the-art performance in translation, sentiment analysis, and other NLP tasks.
DictaLM 2.0 is a family of LLMs adapted for Modern Hebrew, derived from the Mistral-7B-v0.1 foundation and trained on a bilingual (Hebrew–English) corpus totaling approximately 200 billion tokens. The model family addresses the challenge of bringing state-of-the-art generative language modeling to low-resource languages through a methodology centered on tokenizer extension, embedding distillation, continuous pre-training, and instruction tuning. DictaLM 2.0 and DictaLM 2.0-Instruct are evaluated with a detailed Hebrew benchmark suite, establishing new performance standards on multiple Hebrew NLP tasks (Shmidman et al., 9 Jul 2024).
1. Adaptation of Mistral-7B Architecture for Hebrew
The base of DictaLM 2.0 is the Mistral-7B-v0.1 open-source generative model. The training corpus comprises roughly equal parts Hebrew and English, amounting to 100 billion tokens per language. The adaptation process is nontrivial due to Hebrew's agglutinative morphology and distinct orthography, which are not optimally supported by default Byte Pair Encoding (BPE) tokenizers designed for English. Applying the original Mistral tokenizer to Hebrew yielded a high average of 5.81 tokens per word, meaning Hebrew text was effectively tokenized at nearly the character level, which severely reduces modeling efficiency.
To address this, the tokenizer's vocabulary was expanded by 1,000 Hebrew-specific tokens. This intervention reduced Hebrew's tokens-per-word ratio by more than half, directly increasing modeling efficiency and capturing more semantically meaningful units. The extended vocabulary directly impacts memory usage, tokenization speed, and the semantic alignment between token representations and linguistic morphemes in Hebrew.
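The effect of the expanded vocabulary can be made concrete with a short sketch. The snippet below assumes the Hugging Face `transformers` API and uses a handful of placeholder Hebrew tokens (not the actual 1,000 tokens added): it measures a tokens-per-word ratio and resizes the embedding matrix after extending the tokenizer.

```python
# Sketch: measuring tokens-per-word and extending the vocabulary.
# The token list is illustrative, not the real added vocabulary.
from transformers import AutoTokenizer, AutoModelForCausalLM

tokenizer = AutoTokenizer.from_pretrained("mistralai/Mistral-7B-v0.1")

def tokens_per_word(text: str) -> float:
    """Average number of subword tokens per whitespace-separated word."""
    words = text.split()
    return len(tokenizer.tokenize(text)) / max(len(words), 1)

# Illustrative Hebrew-specific tokens; the real extension added 1,000 tokens
# selected from Hebrew corpus statistics.
new_hebrew_tokens = ["שלום", "ממשלה", "ירושלים"]
tokenizer.add_tokens(new_hebrew_tokens)

model = AutoModelForCausalLM.from_pretrained("mistralai/Mistral-7B-v0.1")
# Grow the embedding matrix (and LM head) to cover the new token IDs; the new
# rows start randomly initialized and must be trained (see Section 2).
model.resize_token_embeddings(len(tokenizer))
```

Resizing leaves the newly added embedding rows untrained, which is precisely the gap the two-stage procedure described next is designed to close.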
2. Tokenizer Extension, Embedding Distillation, and LM Head Calibration
The integration of new tokens necessitated a systematic two-stage training process to seamlessly align the expanded embedding space with the pretrained weights of the inherited Mistral model.
Stage 1: Embedding Distillation
- A corpus of 500,000 Hebrew sentences was selected.
- The objective function minimized the squared L2 distance between the last hidden state obtained with the original tokenizer and that obtained with the extended tokenizer, $\mathcal{L} = \lVert h_{\text{ext}} - h_{\text{orig}} \rVert_2^2$, where $h_{\text{orig}}$ and $h_{\text{ext}}$ denote the respective last hidden states.
- All model weights remained frozen except for the new embedding parameters, which were trained for one epoch (see the sketch below).
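The following PyTorch sketch illustrates a distillation step of this kind under simplifying assumptions: `model`, `orig_tokenizer`, and `ext_tokenizer` stand for the frozen base model and the two tokenizers from the earlier sketch, the "last hidden state" is taken to be the final layer's representation at the final position, and no attempt is made to restrict updates to only the newly added embedding rows. It sketches the stated objective rather than reproducing the authors' training code.

```python
# Illustrative sketch of the Stage-1 embedding-distillation objective.
import torch

# Freeze everything except the (resized) input-embedding matrix.
for p in model.parameters():
    p.requires_grad = False
model.get_input_embeddings().weight.requires_grad = True

def embedding_distillation_loss(sentence: str) -> torch.Tensor:
    orig_inputs = orig_tokenizer(sentence, return_tensors="pt")
    ext_inputs = ext_tokenizer(sentence, return_tensors="pt")

    # Teacher pass: original tokenization, no gradients.
    with torch.no_grad():
        h_orig = model(**orig_inputs, output_hidden_states=True).hidden_states[-1][:, -1, :]

    # Student pass: extended tokenization; gradients reach only the embedding matrix.
    h_ext = model(**ext_inputs, output_hidden_states=True).hidden_states[-1][:, -1, :]

    # Squared L2 distance between the two final hidden states.
    return ((h_ext - h_orig) ** 2).sum()
```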
Stage 2: LM-Head Calibration
- With the new embeddings initialized, the LM head required adjustment for correct next-token prediction.
- Only the embedding layer and LM-head parameters were unfrozen.
- Training used 100,000 documents for next-token prediction, with the same learning rate as in Stage 1 (see the sketch below).
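A minimal sketch of such a calibration loop follows, again assuming the `model` and `ext_tokenizer` objects from the earlier sketches; the optimizer choice, learning rate, sequence length, and `hebrew_documents` iterable are placeholders, not the paper's configuration.

```python
# Illustrative sketch of the Stage-2 LM-head calibration loop.
# Only the input embeddings and the LM head receive gradient updates.
import torch

for p in model.parameters():
    p.requires_grad = False
for p in model.get_input_embeddings().parameters():
    p.requires_grad = True
for p in model.get_output_embeddings().parameters():  # the LM head
    p.requires_grad = True

trainable = [p for p in model.parameters() if p.requires_grad]
optimizer = torch.optim.AdamW(trainable, lr=1e-5)  # placeholder learning rate

for doc in hebrew_documents:  # assumed iterable of ~100,000 raw-text documents
    batch = ext_tokenizer(doc, return_tensors="pt", truncation=True, max_length=2048)
    out = model(**batch, labels=batch["input_ids"])  # next-token (causal LM) loss
    out.loss.backward()
    optimizer.step()
    optimizer.zero_grad()
```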
This two-stage process ensures the new Hebrew tokens are embedded compatibly within the pretrained latent space, and that the generative head accurately produces the correct sequence of tokens, maintaining semantic alignment across both languages.
3. Bilingual Continuous Pre-training and Instructional Tuning
Following embedding and LM head adaptation, DictaLM 2.0 underwent continuous pre-training using an unsupervised next-token prediction objective over the entire 200B-token bilingual corpus.
- Training infrastructure consisted of 48 Intel Gaudi-2 chips.
- Sequence packing combined with a document-level causal attention mask prevented cross-document contamination during the attention computation (see the sketch after this list).
- This stage ran for one epoch, allowing the model parameters to adapt to the expanded token vocabulary and embedding representations.
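The document-attention mask can be made concrete with a short sketch: within a packed sequence, token *i* may attend to token *j* only if *j* precedes *i* and both tokens come from the same source document. The function and segment layout below are illustrative rather than the exact implementation used in training.

```python
# Sketch: document-aware causal mask for packed sequences.
import torch

def document_causal_mask(doc_ids: torch.Tensor) -> torch.Tensor:
    """doc_ids: (seq_len,) tensor of document indices for each packed token.
    Returns a (seq_len, seq_len) boolean mask, True where attention is allowed."""
    seq_len = doc_ids.size(0)
    causal = torch.tril(torch.ones(seq_len, seq_len, dtype=torch.bool))
    same_doc = doc_ids.unsqueeze(0) == doc_ids.unsqueeze(1)
    return causal & same_doc

# Example: three documents of lengths 3, 2, and 4 packed into one sequence.
doc_ids = torch.tensor([0, 0, 0, 1, 1, 2, 2, 2, 2])
mask = document_causal_mask(doc_ids)
```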
Instructional tuning for DictaLM 2.0-Instruct proceeded in two additional phases:
- Supervised Fine-Tuning (SFT): Curated dialogues in both Hebrew and English, including adapted OpenHermes and UltraChat datasets, were employed to promote task-generalization and instruction-following.
- Direct Preference Optimization (DPO): Training incorporated user feedback and reinforced language consistency (i.e., discouraging unintended code switching) using paired preferred/dispreferred dialogue instances (a sketch of the DPO objective follows below).
This layered methodology supports model generalization for both open-ended generation and explicit instruction adherence in multilingual contexts.
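For reference, the sketch below shows the standard DPO objective that such a phase optimizes: it rewards the policy for widening the log-probability margin of preferred over dispreferred responses relative to a frozen reference model. The log-probability inputs and the `beta` value are placeholders; this is the generic formulation, not the authors' exact configuration.

```python
# Sketch: standard DPO loss over paired preference data.
import torch
import torch.nn.functional as F

def dpo_loss(policy_logp_chosen: torch.Tensor,
             policy_logp_rejected: torch.Tensor,
             ref_logp_chosen: torch.Tensor,
             ref_logp_rejected: torch.Tensor,
             beta: float = 0.1) -> torch.Tensor:
    """-log sigmoid(beta * (policy margin - reference margin)), averaged over pairs."""
    policy_margin = policy_logp_chosen - policy_logp_rejected
    ref_margin = ref_logp_chosen - ref_logp_rejected
    return -F.logsigmoid(beta * (policy_margin - ref_margin)).mean()
```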
4. Hebrew LLM Benchmark Suite and Evaluation
A dedicated benchmark suite for Hebrew LLM evaluation was introduced to rigorously assess modeling capabilities across a spectrum of tasks:
| Task | Dataset / Evaluation | Metric / Details |
|---|---|---|
| Question Answering | HeQ (1,436 entries) | 3-shot; TLNLS scoring |
| Sentiment Analysis | Hebrew Sentiment (3,000 cases) | 3-class, few-shot |
| Winograd Schema | Translated WSC | Few-shot reasoning |
| Translation | NeuLabs-TedTalks | BLEU score; few-shot pairs |
| Summarization | 75 news docs | GPT-4-judged summaries |
Evaluation used task-appropriate formats, such as few-shot prompts (three examples per context) and external automated or human-based grading (e.g., GPT-4 for summarization). DictaLM 2.0 and DictaLM 2.0-Instruct achieved state-of-the-art performance for 7B-parameter models on Winograd Schema and Translation benchmarks within this suite.
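As an illustration of the few-shot evaluation format, the sketch below builds a three-shot translation prompt and scores model outputs with `sacrebleu`. The example pairs, prompt template, and `generate_fn` wrapper are hypothetical and do not reproduce the benchmark's exact prompts.

```python
# Sketch: three-shot translation prompting and BLEU scoring with sacrebleu.
import sacrebleu

few_shot_pairs = [  # illustrative English-Hebrew pairs, not benchmark data
    ("Hello, how are you?", "שלום, מה שלומך?"),
    ("The weather is nice today.", "מזג האוויר נעים היום."),
    ("I would like a cup of coffee.", "אני רוצה כוס קפה."),
]

def build_prompt(source: str) -> str:
    shots = "\n".join(f"English: {en}\nHebrew: {he}" for en, he in few_shot_pairs)
    return f"{shots}\nEnglish: {source}\nHebrew:"

def evaluate_bleu(generate_fn, sources, references) -> float:
    """generate_fn is an assumed wrapper that decodes a completion for a prompt."""
    hypotheses = [generate_fn(build_prompt(s)) for s in sources]
    return sacrebleu.corpus_bleu(hypotheses, [references]).score
```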
5. Broader Implications for Low-Resource and Multilingual NLP
The techniques developed for DictaLM 2.0 resolve key bottlenecks in extending LLMs to languages with limited resources.
- Tokenizer Extension and Embedding Distillation: These methods, centering on language- and corpus-specific vocabulary expansion and alignment loss minimization, are applicable to any language for which naïve BPE/WordPiece tokenization is inefficient.
- Continuous Pre-training from Strong Multilingual Initialization: By leveraging large monolingual and bilingual corpora with appropriately balanced proportions, model creators avoid data-inefficient and compute-intensive complete retraining.
- Instructional Fine-Tuning and Preference Optimization: The combined SFT+DPO pipeline, with explicit cross-lingual preferences, can be repurposed for other low-resource languages needing improved dialog consistency and code-switching management.
- Benchmark Suite as Research Scaffold: The comprehensive Hebrew benchmark provides a transferable template for constructing evaluation leaderboards in other non-English language contexts.
A plausible implication is the broader democratization of LLM capabilities, with reduced data and engineering requirements, for languages previously marginalized by mainstream NLP developments.
6. Model Availability, Community Impact, and Prospects
DictaLM 2.0 and DictaLM 2.0-Instruct are released openly, following the precedent set by the first-generation DictaLM (Shmidman et al., 2023); together with the earlier models, they strengthen the Hebrew NLP ecosystem for research and application development. The accessibility of these models, in conjunction with an open evaluation leaderboard, accelerates adoption, experimental reproducibility, and rapid domain-specific fine-tuning for both Modern and Historical Hebrew. The authors state their intent to continue reporting performance and to iterate on data selection, training efficiency, and evaluation specificity.
The suite of methodologies introduced (tokenizer adaptation, embedding alignment, mixed-objective pre-training, and rigorous benchmarking) provides a framework for scalable, high-performance LLM adaptation to understudied languages, with immediate impact on instruction following, translation, question answering, sentiment analysis, and summarization for Modern Hebrew and beyond.