Reynaerde-7B-chat: Dutch LLM for Safe Dialogue
- Reynaerde-7B-chat is a Dutch-specific large language model designed for safe conversational dialogue using adapter-based QLoRA fine-tuning and Direct Preference Optimization.
- It is built on the LLaMA-2-7B-hf architecture with 6.9B parameters and utilizes a SentencePiece tokenizer with a 32K vocabulary, tailored for Dutch chat interactions.
- Empirical evaluations on Flemish narratives reveal issues of low coverage and domain mismatch, underscoring challenges in affect-sensitive sentiment analysis.
Reynaerde-7B-chat is a Dutch-specific LLM designed for safe conversational dialogue, built upon Meta’s LLaMA-2-7B-hf architecture and subsequently fine-tuned with adapter-based and preference-optimization methodologies. Its performance and limitations in psychological sentiment analysis tasks on Flemish spontaneous narratives offer insight into the challenges of LLM adaptation for low-resource, affect-rich language domains.
1. Base Model Architecture and Fine-Tuning Procedures
Reynaerde-7B-chat utilizes the LLaMA-2-7B-hf foundation model, comprising approximately 6.9 billion parameters structured within 32 Transformer blocks, each with a hidden size of 4,096 and 32 self-attention heads. Tokenization is handled by a SentencePiece-based BPE tokenizer supporting a vocabulary of approximately 32,000 tokens.
Adapter-based fine-tuning was conducted via quantized low-rank adaptation (QLoRA) adapters of rank 8, with the base LLaMA-2 weights frozen throughout. Training used 400,000 Dutch chat-style question–answer pairs emphasizing safe dialogue responses, sourced from diverse open Dutch conversational channels (e.g., social media, support forums). No additional data augmentation or prompt-tuning techniques were applied. Alignment was enforced through Direct Preference Optimization (DPO), targeting improved safety and conversational appropriateness in generation. Training schedules, batch sizes, and epoch counts are not specified in the reported implementation.
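The reported setup can be approximated with the Hugging Face `peft` and `trl` libraries. The sketch below encodes only the two facts the paper states (rank-8 QLoRA adapters on a frozen 4-bit base, followed by DPO alignment); every other hyperparameter (alpha, dropout, target modules, DPO beta) is an illustrative assumption, since the implementation details are not reported.

```python
# Illustrative configuration sketch -- NOT the authors' settings.
# Only the adapter rank (8) and the QLoRA + DPO combination come from
# the paper; all other values are placeholders.
from transformers import BitsAndBytesConfig
from peft import LoraConfig
from trl import DPOConfig

# 4-bit quantization of the frozen LLaMA-2-7B base (the "Q" in QLoRA)
bnb_config = BitsAndBytesConfig(
    load_in_4bit=True,
    bnb_4bit_quant_type="nf4",
    bnb_4bit_compute_dtype="bfloat16",
)

# Rank-8 low-rank adapters; target modules and scaling are assumptions
lora_config = LoraConfig(
    r=8,                                  # rank reported in the paper
    lora_alpha=16,                        # assumed
    lora_dropout=0.05,                    # assumed
    target_modules=["q_proj", "v_proj"],  # assumed
    task_type="CAUSAL_LM",
)

# DPO alignment stage; beta is an assumed value
dpo_config = DPOConfig(output_dir="reynaerde-dpo", beta=0.1)
```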
2. Experimental Design for Valence Prediction
The key evaluation concerned valence prediction on real-world Flemish narratives, collected as part of a psychology experiment with 102 native Dutch-speaking Belgian participants over 70 days. This yielded 24,854 open-ended texts, each paired with the participant's self-assessed valence score, recorded on a continuous slider ranging from –50 (“very unpleasant”) to +50 (“very pleasant”).
Both typed and transcribed spoken responses (automatic transcription applied to 1-minute voice recordings) were lightly cleaned—system tokens and whitespace were trimmed, with no lemmatization or custom Flemish normalization. Each input was formatted with a standardized English zero-shot prompt designed to elicit a numerical sentiment rating from Reynaerde-7B-chat (“rate its sentiment from 1 (very negative) to 7 (very positive)…Return ONLY a single numerical rating enclosed in brackets…”). No multilingual prompting or instruction finetuning was performed beyond this template.
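The bracketed-rating convention in the prompt implies a simple post-processing step: extract the rating from the model's reply and treat unparseable replies as non-covered. A minimal sketch of such a parser (the regex and helper names are mine, not the authors' code) also makes the coverage metric used later concrete:

```python
import re

# Matches a single 1-7 rating enclosed in brackets, e.g. "[4]" or "[ 6 ]".
RATING_RE = re.compile(r"\[\s*([1-7])\s*\]")

def parse_rating(reply: str):
    """Return the 1-7 rating found in a model reply, or None if absent."""
    m = RATING_RE.search(reply)
    return int(m.group(1)) if m else None

def coverage(replies):
    """Fraction of replies that yield a usable rating."""
    parsed = [parse_rating(r) for r in replies]
    return sum(p is not None for p in parsed) / len(replies)

replies = ["[4]", "Sentiment: [6]", "I cannot rate this text.", "[2]"]
print(coverage(replies))  # 3 of 4 replies parse -> 0.75
```

Replies that refuse, ramble, or omit the brackets count against coverage, which is exactly the failure mode reported for Reynaerde-7B-chat below.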
3. Evaluation Metrics and Methodological Approach
Model outputs were compared against participants’ self-reported valence scores using correlation metrics. Despite presenting formulas for MAE and MSE, the paper exclusively reports Pearson and polyserial correlation coefficients, as outlined below:
- Mean Absolute Error (MAE): $\mathrm{MAE} = \frac{1}{n}\sum_{i=1}^{n}\lvert \hat{y}_i - y_i \rvert$
- Mean Squared Error (MSE): $\mathrm{MSE} = \frac{1}{n}\sum_{i=1}^{n}(\hat{y}_i - y_i)^2$
- Pearson correlation coefficient: $r = \dfrac{\sum_{i=1}^{n}(x_i - \bar{x})(y_i - \bar{y})}{\sqrt{\sum_{i=1}^{n}(x_i - \bar{x})^2\,\sum_{i=1}^{n}(y_i - \bar{y})^2}}$

where $\hat{y}_i$ is the model's rating and $y_i$ the self-reported valence for text $i$; in the Pearson formula, $x$ and $y$ denote the paired prediction and self-report series.
No cross-validation or train/test split was performed; each tool processed all available texts, and coverage (texts for which the model returns a prediction) was quantified. Inferential statistics included pairwise t-tests for correlation differences and significance testing for polyserial correlations.
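All three formulas are standard and can be implemented from the standard library alone; this sketch (variable names are mine) operates on the covered subset, i.e., only texts for which the model returned a rating, since that is what the paper's correlations are computed over:

```python
import math

def mae(pred, true):
    """Mean absolute error over paired predictions and self-reports."""
    return sum(abs(p - t) for p, t in zip(pred, true)) / len(pred)

def mse(pred, true):
    """Mean squared error over paired predictions and self-reports."""
    return sum((p - t) ** 2 for p, t in zip(pred, true)) / len(pred)

def pearson_r(x, y):
    """Pearson correlation between two equal-length series."""
    n = len(x)
    mx, my = sum(x) / n, sum(y) / n
    cov = sum((a - mx) * (b - my) for a, b in zip(x, y))
    sx = math.sqrt(sum((a - mx) ** 2 for a in x))
    sy = math.sqrt(sum((b - my) ** 2 for b in y))
    return cov / (sx * sy)

# Toy example: model ratings (1-7) vs. self-reported valence (-50..+50),
# restricted to texts the model actually covered.
ratings = [4, 5, 3, 6, 2]
valence = [-10, 20, -30, 40, -40]
print(round(pearson_r(ratings, valence), 3))  # -> 0.988
```

Note that Pearson's $r$ is invariant to the scale mismatch between the 1–7 ratings and the –50 to +50 slider, so no rescaling is needed before correlating.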
4. Performance Outcomes: Quantitative and Comparative Analysis
Reynaerde-7B-chat’s empirical results are summarized as follows, in comparison with alternative models and tools:
| Model | Coverage (n / %) | Pearson r | Polyserial r |
|---|---|---|---|
| LIWC (posemo) | 24,848/99.9% | 0.21 | 0.23 |
| LIWC (negemo) | 24,848/99.9% | –0.23 | –0.23 |
| Pattern.nl | 24,848/99.9% | 0.31 | 0.31 |
| ChocoLlama-8B-Instruct | 17,378/69.9% | 0.35 | 0.40 |
| GEITje-7B-ultra | 9,445/38.0% | 0.35 | 0.44 |
| Reynaerde-7B-chat | 446/1.8% | 0.18 | 0.24 |
Neither confidence intervals nor standard deviations for the reported correlations are given. In few-shot English-prompt settings, Reynaerde-7B-chat's coverage rose to only 29.0%.
Coverage—the proportion of inputs yielding usable predictions—is a salient failure mode; Reynaerde-7B-chat supplied a valid rating for only 1.8% of inputs in the primary zero-shot setup.
5. Error Modes, Contextual Limitations, and Qualitative Insights
Several error characteristics are evident in the reported findings:
- Low coverage: The model frequently failed to return a rating, constraining its utility and rendering reliability metrics sensitive to selection bias.
- Restricted rating variance: Outputs clustered within the middle of the 1–7 scale (typically 3–5), rarely aligning with users’ self-assessed extremes.
- Domain mismatch: Reynaerde-7B-chat misinterpreted cultural markers and regionally marked Flemish vocabulary, including colloquial terms (“fuif,” “plezant”) and pronouns (“gij”), resulting in neutral or erroneous sentiment assignments.
- Pragmatics: The model systematically failed to detect sarcasm or pragmatic cues (“Mooi zo, alweer file op weg naar het werk, geweldig!”), defaulting to neutral rather than negative ratings.
A reported illustrative failure underscores these issues: a participant describing recurrent flooding (“Alwéér water in mijn kelder, fantastisch…”, roughly “Water in my basement yet again, fantastic…”) rated their own sentiment as strongly negative (–30), but Reynaerde-7B-chat returned a middling rating of [4] on the 1–7 scale.
6. Interpretations, Recommendations, and Implications
Authors conclude that Reynaerde-7B-chat, despite sophisticated adapter-based fine-tuning and DPO alignment, underperforms against both lexicon-driven methods (Pattern.nl) and other Dutch LLMs in terms of coverage and credible valence prediction. Root causes are attributed to a pronounced domain mismatch—training on formal Dutch chat data fails to generalize to affective, regionally inflected Flemish narratives—and persistent English bias from the foundational LLaMA-2 pretraining.
Recommendations for future improvement include:
- Augmenting training corpora with colloquial, narrative-rich Flemish texts to reduce domain mismatch.
- Exploring hybrid architectures that combine rule-based sentiment lexica (for guaranteed coverage) with LLM-derived contextual encoding.
- Establishing a gold-standard, manually annotated Flemish valence dataset to enable more nuanced benchmarking.
- Maintaining the use of lexicon-based methods for psychological tasks in low-resource language variants pending improvements in LLM adaptation and evaluation.
This suggests that current LLM architectures and adapter-based fine-tuning procedures may be insufficient for valence analysis in spontaneous, culturally nuanced domains, emphasizing the need for tailored domain adaptation and hybrid frameworks for reliable, context-sensitive sentiment modeling.