GEITje-7B-ultra: Dutch Conversational LLM
- GEITje-7B-ultra is a Dutch conversational language model adapted from the Mistral 7B architecture with a tailored pretraining corpus.
- The model employs supervised fine-tuning and Direct Preference Optimization with synthetic Dutch data to enhance alignment and response quality.
- Evaluations show competitive performance in Dutch NLP and affective computing, while revealing challenges with regional and informal language nuances.
GEITje-7B-ultra is a Dutch conversational LLM based on the Mistral 7B architecture, further adapted using a Dutch-centric pretraining corpus and multiple alignment steps. Developed to enhance Dutch language generation capabilities and evaluated both as a general-purpose assistant and for affective computing tasks (notably valence prediction in low-resource Belgian-Dutch narratives), it embodies the recent focus on tailoring LLMs to culturally and linguistically diverse settings. All weights and datasets related to GEITje-7B-ultra are available under permissive open-source licensing.
1. Architectural Foundation and Adaptation
GEITje-7B-ultra is derived from Mistral-7B v0.1, a decoder-only transformer with approximately 7 billion parameters, grouped-query self-attention, dense feed-forward layers, and rotary position embeddings. All GEITje variants retain the original architecture: layer counts, hidden sizes, and attention-head configurations remain as in Mistral-7B. The context window is set to 8192 tokens. Pretraining employs bfloat16 precision and FlashAttention 2 to optimize GPU memory use and training efficiency (Vanroy, 5 Dec 2024).
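For reference, the released checkpoint loads directly with the Hugging Face `transformers` library. The snippet below is a minimal sketch, assuming a bfloat16-capable CUDA GPU and an installed `flash-attn` package; drop the `attn_implementation` argument to fall back to the default attention kernel.

```python
# Minimal loading and generation sketch for BramVanroy/GEITje-7B-ultra.
# Assumes a bfloat16-capable CUDA GPU and the flash-attn package.
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

model_id = "BramVanroy/GEITje-7B-ultra"
tokenizer = AutoTokenizer.from_pretrained(model_id)
model = AutoModelForCausalLM.from_pretrained(
    model_id,
    torch_dtype=torch.bfloat16,               # matches the bfloat16 training precision
    attn_implementation="flash_attention_2",  # FlashAttention 2, as used during training
    device_map="auto",
)

# Chat-style generation via the tokenizer's chat template.
messages = [{"role": "user", "content": "Wat is de hoofdstad van België?"}]
input_ids = tokenizer.apply_chat_template(
    messages, add_generation_prompt=True, return_tensors="pt"
).to(model.device)
output_ids = model.generate(input_ids, max_new_tokens=128, do_sample=True, temperature=0.7)
print(tokenizer.decode(output_ids[0][input_ids.shape[-1]:], skip_special_tokens=True))
```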
The foundational Dutch-tuning phase involved full-parameter continued pretraining, with no frozen layers, on approximately 10 billion Dutch-language tokens drawn from Wikipedia, news corpora, and web crawl. In contrast to adapter-based approaches, all weights are updated. The resulting Rijgersberg/GEITje-7B checkpoint serves as the basis for the subsequent instruction-tuning and alignment steps.
2. Supervised Fine-Tuning and Data Preparation
The model undergoes sequential adaptation: supervised fine-tuning (SFT) followed by Direct Preference Optimization (DPO). The SFT data consist of instruction-style Dutch dialogues, primarily synthetic: they are generated at scale by GPT-4 in Dutch and further diversified through varied user personas and broad topical coverage. The core SFT datasets and their proportions (by sample count) are:
| Dataset | Proportion (%) |
|---|---|
| ultrachat_200k_dutch | 85.42 |
| stackoverflow-chat-dutch | 8.38 |
| no_robots_dutch | 2.20 |
| alpaca-cleaned-dutch | 2.62 |
| dolly-15k-dutch | 1.39 |
Synthetic fine-tuning data are filtered after generation via language identification, exclusion of non-Latin scripts, and removal of meta-knowledge, apologetic, or model-self-referential content.
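The filtering pipeline itself is not reproduced here in detail; the following is an illustrative sketch of the three filter types just described, assuming the `langdetect` package for language identification and hypothetical keyword patterns for the apology/meta-content removal.

```python
# Illustrative post-generation filter for synthetic Dutch SFT samples.
# Only the filter types (language ID, non-Latin script exclusion, apology/meta removal)
# follow the text above; the regexes and keyword choices are hypothetical.
import re

from langdetect import detect
from langdetect.lang_detect_exception import LangDetectException

APOLOGY_OR_META = re.compile(
    r"als (een )?(ai|taalmodel)|as an ai language model|mijn excuses|het spijt me",
    re.IGNORECASE,
)
NON_LATIN = re.compile(r"[\u0400-\u04FF\u4E00-\u9FFF\u0600-\u06FF]")  # Cyrillic, CJK, Arabic ranges

def keep_sample(text: str) -> bool:
    """Return True if a generated sample passes all post-generation filters."""
    if NON_LATIN.search(text):          # exclude non-Latin scripts
        return False
    if APOLOGY_OR_META.search(text):    # drop apologies and model-self-referential content
        return False
    try:
        return detect(text) == "nl"     # language identification: keep Dutch only
    except LangDetectException:
        return False
```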
The SFT step spans 240,527,565 tokens (train) and 26,397,086 tokens (test), trained for one epoch with a batch size of 16 per GPU, cosine learning-rate decay (initial LR 2 × 10⁻⁵), and no gradient accumulation. All training was on 2 nodes × 4 NVIDIA A100 80GB (Vlaams Supercomputer), with gradient checkpointing enabled and a maximum sequence length of 8192 tokens.
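These hyperparameters map naturally onto a TRL `SFTTrainer` setup. The sketch below expresses them under that assumption (argument names vary somewhat across TRL releases); the dataset identifier and split are taken from the largest component in the mixture above but should be treated as assumptions, not as the published training script.

```python
# Sketch of the reported SFT hyperparameters expressed with TRL's SFTConfig/SFTTrainer.
# Multi-node launch, dataset mixing, and chat-template handling are omitted.
from datasets import load_dataset
from trl import SFTConfig, SFTTrainer

# Largest SFT component from the mixture above (dataset id and split assumed).
train_dataset = load_dataset("BramVanroy/ultrachat_200k_dutch", split="train_sft")

sft_config = SFTConfig(
    output_dir="geitje-7b-ultra-sft",
    num_train_epochs=1,
    per_device_train_batch_size=16,
    gradient_accumulation_steps=1,   # no gradient accumulation
    learning_rate=2e-5,
    lr_scheduler_type="cosine",
    bf16=True,
    gradient_checkpointing=True,
    max_seq_length=8192,
)

trainer = SFTTrainer(
    model="Rijgersberg/GEITje-7B",   # Dutch continued-pretraining checkpoint as the base
    args=sft_config,
    train_dataset=train_dataset,
)
trainer.train()
```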
3. Direct Preference Optimization Alignment
GEITje-7B-ultra’s alignment utilizes DPO, a method that increases the probability gap between preferred ("chosen") and rejected model responses relative to a frozen reference policy. The objective is defined as:

$$
\mathcal{L}_{\mathrm{DPO}}(\pi_\theta; \pi_{\mathrm{ref}}) = -\,\mathbb{E}_{(x,\, y_w,\, y_l) \sim \mathcal{D}} \left[ \log \sigma\!\left( \beta \log \frac{\pi_\theta(y_w \mid x)}{\pi_{\mathrm{ref}}(y_w \mid x)} - \beta \log \frac{\pi_\theta(y_l \mid x)}{\pi_{\mathrm{ref}}(y_l \mid x)} \right) \right],
$$

where $y_w$ and $y_l$ are the chosen and rejected responses to prompt $x$, $\pi_{\mathrm{ref}}$ is the SFT reference model, $\sigma$ is the logistic function, and $\beta$ scales the implicit KL-regularization toward the reference.
DPO is performed on Ultra Feedback Dutch (cleaned), a synthetic, GPT-4-judged feedback set. Only response pairs scored highly (average rating ≥ 4.0 out of 5.0 across Dutch-ness, helpfulness, conciseness; minimum per-criterion 3.5; inter-response difference ≥ 0.25) are retained, yielding 48,228 pairs (train) and 5,359 (test).
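One plausible reading of this selection rule, expressed in code (the criterion keys are hypothetical; only the thresholds follow the text):

```python
# Hypothetical pair-selection rule for the cleaned Dutch UltraFeedback data.
# `ratings` maps criterion name -> GPT-4 score on a 1-5 scale for one response.
CRITERIA = ("dutchness", "helpfulness", "conciseness")  # hypothetical keys

def mean_rating(ratings: dict[str, float]) -> float:
    return sum(ratings[c] for c in CRITERIA) / len(CRITERIA)

def keep_pair(chosen: dict[str, float], rejected: dict[str, float]) -> bool:
    """Keep a (chosen, rejected) pair if the chosen response is rated highly
    and clearly outscores the rejected one."""
    avg = mean_rating(chosen)
    return (
        avg >= 4.0                                      # high average rating
        and min(chosen[c] for c in CRITERIA) >= 3.5     # minimum per-criterion score
        and avg - mean_rating(rejected) >= 0.25         # sufficient inter-response gap
    )
```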
The DPO alignment step uses: batch size 4 per GPU (gradient accumulation 4), β = 0.1 (selected to minimize repetition/hallucination), learning rate 5 × 10⁻⁷, context length 8192, and one epoch of training. The recommended model checkpoint is BramVanroy/GEITje-7B-ultra (Vanroy, 5 Dec 2024).
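Expressed with TRL's `DPOTrainer`, these settings would look roughly as follows. This is a sketch under the same assumptions as the SFT example above, not the published training script; the preference-dataset identifier and split are likewise assumed.

```python
# Sketch of the reported DPO hyperparameters using TRL's DPOConfig/DPOTrainer.
from datasets import load_dataset
from transformers import AutoTokenizer
from trl import DPOConfig, DPOTrainer

# Filtered Dutch UltraFeedback preference pairs (dataset id and split assumed).
preference_pairs = load_dataset("BramVanroy/ultra_feedback_dutch_cleaned", split="train")
tokenizer = AutoTokenizer.from_pretrained("BramVanroy/GEITje-7B-ultra-sft")

dpo_config = DPOConfig(
    output_dir="geitje-7b-ultra",
    num_train_epochs=1,
    per_device_train_batch_size=4,
    gradient_accumulation_steps=4,
    learning_rate=5e-7,
    beta=0.1,            # preference-strength coefficient noted above
    max_length=8192,
    bf16=True,
)

trainer = DPOTrainer(
    model="BramVanroy/GEITje-7B-ultra-sft",  # SFT checkpoint as the policy to align
    args=dpo_config,
    train_dataset=preference_pairs,          # prompt/chosen/rejected triples after filtering
    processing_class=tokenizer,              # called `tokenizer` in older TRL releases
)
trainer.train()
```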
4. Evaluation in Sentiment and Dutch-language Tasks
GEITje-7B-ultra’s capabilities are evaluated on two axes: broad NLP benchmarks and affective computing with spontaneous Belgian-Dutch narratives.
General NL Benchmarks—ScandEval Subset:
| Model | Avg. Score |
|---|---|
| gpt-4-1106-preview | 62.32 |
| mistralai/Mistral-7B-v0.1 | 39.10 |
| BramVanroy/GEITje-7B-ultra | 35.37 |
| Rijgersberg/GEITje-7B | 35.22 |
| Rijgersberg/GEITje-7B-chat-v2 | 34.77 |
Benchmarks cover CoNLL-NL (NER), Dutch Social (sentiment), ScaLA NL (acceptability), SQuAD NL (QA), WikiLingua NL (summarization), MMLU NL (MCQ), and HellaSwag NL (commonsense). DPO shows modest gains in commonsense, with qualitative fluency and response conciseness improved post-alignment, yet substantial performance gaps with GPT-4 persist.
Valence Prediction on Belgian-Dutch Narratives (Kandala et al., 10 Nov 2025):
- Dataset: 24,854 text responses from 102 participants, transcribed and/or typed, each labeled by a self-assigned valence score (–50 to +50).
- Zero-shot prompting with a fixed English instruction template that requests a numerical valence rating on a 1–7 scale (a minimal sketch of this protocol follows this list).
- Coverage: GEITje-7B-ultra returns outputs for 9,445 texts (38.0%), considerably lower than lexicon tools (Pattern.nl and LIWC, ≈99.9%).
- Pearson’s r for valence correlation (model on its covered subset): 0.35 (polyserial r: 0.44). Pattern.nl, applied to all texts, achieves r = 0.31.
- No significant difference between GEITje’s and Pattern’s r (t = –1.20, p = 0.23).
- Lexicon-based methods, despite lacking contextual modeling, maintain better coverage and competitive correlations for this task.
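A minimal sketch of the evaluation protocol described above is given below. The instruction wording and output parsing are illustrative stand-ins for the study's actual template, and `generate` is any callable wrapping the model's text generation.

```python
# Illustrative zero-shot valence prediction and correlation analysis.
# The English instruction template is an illustrative stand-in, not the study's prompt.
import re
from typing import Callable, Optional

from scipy.stats import pearsonr

INSTRUCTION = (
    "Rate the emotional valence of the following Dutch text on a scale from 1 "
    "(very negative) to 7 (very positive). Answer with a single number.\n\nText: {text}\nRating:"
)

def predict_valence(text: str, generate: Callable[[str], str]) -> Optional[float]:
    """Prompt the model and parse the first 1-7 number in its reply, if any."""
    reply = generate(INSTRUCTION.format(text=text))
    match = re.search(r"[1-7](\.\d+)?", reply)
    return float(match.group()) if match else None  # None = no usable output (coverage gap)

def covered_correlation(texts, self_reports, generate):
    """Pearson r between model ratings and self-reported valence, restricted to covered texts."""
    pairs = [(pred, gold) for text, gold in zip(texts, self_reports)
             if (pred := predict_valence(text, generate)) is not None]
    preds, golds = zip(*pairs)
    coverage = len(pairs) / len(texts)
    return pearsonr(preds, golds), coverage
```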
5. Limitations and Qualitative Findings
Domain adaptation remains the most salient challenge. Although GEITje-7B-ultra is pretrained on 10 billion Dutch tokens, these derive largely from formal and social-media sources and insufficiently capture first-person colloquial and regional vernacular, an issue revealed by poor handling of Flemish idioms and region-specific lexis (e.g., "fuif," "gij"). The English-centric base model further compounds this mismatch, manifesting in missing outputs and an inability to robustly process informal affective language. In practical evaluation, coverage gaps (outputs missing for over 60% of narratives) suggest both capacity and calibration issues for real-world, low-resource, emotionally nuanced input.
Lexicon tools retain advantages in affective cue sensitivity via their psycholinguistically validated word lists and explicit negation/intensifier algorithms. This enables robust extraction of affect even in the presence of colloquial or elliptical language, whereas LLMs tuned on general-purpose or synthetic Dutch data are less adaptable.
6. Distribution, Usage Recommendations, and Prospects
GEITje-7B-ultra and related datasets are accessible via the Hugging Face Model Hub under open licenses. The DPO-aligned model is recommended by its developers as the primary checkpoint for Dutch conversational assistance, with the "ultra-sft" checkpoint provided as an intermediate artifact.
Key recommendations for future advancement include:
- Extending pretraining and alignment corpora to encompass Belgian Flemish narrative forms (journals, diaries, transcribed conversations).
- Integrating hybrid scoring mechanisms, such as prompt-based ensembling or feature fusion with established lexicon resources like LIWC, to complement LLM context modeling (a toy fusion sketch follows this list).
- Establishing a public, expert-annotated Flemish valence benchmark to facilitate model adaptation and evaluation.
- Employing data-efficient alignment (e.g., multi-task distillation, few-shot in-domain prompting) to bolster both model coverage and calibration in low-resource scenarios.
- Exploring the addition of cultural-region specific lexica to mitigate domain and dialectal gaps.
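As an illustration of the hybrid-scoring recommendation, a simple late fusion of a lexicon polarity score with an LLM rating might look as follows. The sketch assumes `pattern.nl`'s `sentiment()` call (polarity in [-1, 1]); the fallback rule and weighting are arbitrary placeholders, not a method from the paper.

```python
# Toy lexicon-LLM score fusion for Dutch valence prediction.
# sentiment() from pattern.nl returns (polarity, subjectivity) with polarity in [-1, 1];
# the LLM rating is assumed to be on the 1-7 scale used above (None if uncovered).
from typing import Optional

from pattern.nl import sentiment

def fused_valence(text: str, llm_rating: Optional[float], llm_weight: float = 0.5) -> float:
    """Combine a lexicon polarity score with an optional LLM rating, both mapped to [-1, 1]."""
    lexicon_score = sentiment(text)[0]        # polarity component of the lexicon score
    if llm_rating is None:                    # coverage gap: fall back to the lexicon alone
        return lexicon_score
    llm_score = (llm_rating - 4.0) / 3.0      # map the 1-7 scale onto [-1, 1]
    return llm_weight * llm_score + (1.0 - llm_weight) * lexicon_score
```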
Current limitations, including the synthetic nature of feedback for preference alignment, the absence of ablation studies on SFT dataset composition, and incomplete benchmarking for generative fluency, delimit the model’s applications and reliability for nuanced Dutch-language affective computing. Further research will be necessary for GEITje-7B-ultra and similar LLMs to surpass lexicon approaches for spontaneous first-person narrative sentiment analysis at scale.