ChocoLlama-8B-Instruct: Dutch LLM for Sentiment Analysis
- ChocoLlama-8B-Instruct is an instruction-tuned language model derived from LLaMA-3-8B, further adapted with Dutch-specific pretraining and alignment strategies.
- It employs a combination of continued pretraining with rank-16 LoRA adapters, supervised fine-tuning, and Direct Preference Optimization to enhance sentiment prediction.
- Evaluated on Flemish narratives, the model shows a modest Pearson correlation with self-report valence while suffering from limited coverage compared to traditional lexicon-based methods.
ChocoLlama-8B-Instruct is an instruction-tuned LLM derived from the LLaMA-3-8B family, adapted for Dutch through continued pretraining and alignment. It was evaluated in a comparative study of sentiment analysis in Flemish, a low-resource variant of Dutch, alongside both lexicon-based tools (LIWC, Pattern.nl) and other LLMs ("LLMs vs. Traditional Sentiment Tools in Psychology: An Evaluation on Belgian-Dutch Narratives", Kandala et al., 10 Nov 2025). Despite its architectural advances and Dutch-specific adaptation, it achieved only moderate correlation with self-assessed valence on open-ended narratives and, owing to limited coverage, underperformed traditional lexicon approaches in practice on spontaneous real-world data.
1. Model Architecture and Pretraining
ChocoLlama-8B-Instruct is based on the LLaMA-3-8B architecture, a decoder-only Transformer comprising 8 billion parameters and 32 attention heads. Its Dutch adaptation process involved continued pretraining via a rank-16 LoRA adapter on a domain-diverse Dutch corpus containing 32 billion tokens drawn from legal documents, news, and social media. This phase was followed by supervised fine-tuning (SFT) on human-written instruction–response pairs in both Dutch and English, adhering to the LLaMA-3 SFT conventions.
Alignment was achieved using Direct Preference Optimization (DPO), which distills human preferences over candidate model outputs into the final adapter weights. A significant proportion of the pretraining data remained English (approximately 90%), increasing the risk of domain mismatch when the model is applied to informal Flemish narratives.
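The rank-16 LoRA update described above replaces each frozen weight matrix W with W + (alpha/r) * B @ A, where A and B are small trainable matrices. A minimal, dependency-free sketch of that arithmetic (function names and the tiny shapes are ours for illustration; real training would use an adapter library rather than hand-rolled matrix code):

```python
def matmul(M, N):
    """Multiply two matrices given as lists of rows."""
    return [[sum(a * b for a, b in zip(row, col)) for col in zip(*N)] for row in M]

def lora_forward(x, W, A, B, alpha=16, r=16):
    """Compute y = (W + (alpha / r) * B @ A) @ x for a column vector x.

    W is the frozen base weight (d_out x d_in); A (r x d_in) and B (d_out x r)
    are the trainable low-rank LoRA factors.
    """
    delta = matmul(B, A)                       # low-rank update, d_out x d_in
    scale = alpha / r
    W_adapted = [[w + scale * d for w, d in zip(w_row, d_row)]
                 for w_row, d_row in zip(W, delta)]
    return matmul(W_adapted, x)
```

Because only A and B receive gradients, the trainable parameter count is 2 * r * d per adapted matrix instead of d * d, which is what makes continued pretraining on 32B Dutch tokens tractable.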
2. Instruction-Tuning for Valence Prediction
No explicit fine-tuning was conducted using valence-labeled data, yet ChocoLlama-8B-Instruct was indirectly influenced via Dutch instruction-tuning datasets, including social media posts, translation exercises, and question–answering tasks, alongside broad English multitask corpora. The instruction-tuning followed a cross-entropy objective for SFT and a pairwise ranking loss under DPO. Hyperparameters, as reported in Meeus et al. (2024), included a batch size of 128, learning rate of 1e-4 for LoRA weights, 3 epochs of continued pretraining, and a weight decay of 0.01.
Valence prediction during inference was a function of the model’s generic instruction-following abilities rather than any task-specific adaptation. No regression or classification loss targeting valence annotation was included during training.
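The pairwise ranking loss used under DPO can be stated concretely. A pure-Python sketch of the standard DPO objective, -log sigma(beta * [(log pi(y_w) - log pi_ref(y_w)) - (log pi(y_l) - log pi_ref(y_l))]) (argument names are ours; the beta value is illustrative, not taken from the paper):

```python
import math

def dpo_loss(logp_chosen, logp_rejected, ref_chosen, ref_rejected, beta=0.1):
    """Pairwise DPO loss for one preference pair.

    logp_* are the policy's log-probabilities of the chosen/rejected responses;
    ref_* are the frozen reference model's log-probabilities of the same responses.
    """
    margin = (logp_chosen - ref_chosen) - (logp_rejected - ref_rejected)
    sigmoid = 1.0 / (1.0 + math.exp(-beta * margin))
    return -math.log(sigmoid)
```

The loss is log(2) when the policy agrees with the reference on both responses, and shrinks as the policy assigns relatively more probability to the preferred response.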
3. Prompt Engineering and Inference Protocol
At inference, ChocoLlama-8B-Instruct and comparison models were deployed in zero-shot mode using a standardized English prompt template; Dutch instructions exhibited comparable results. The prompt was framed as follows:
```text
You are a Dutch language expert analyzing the valence of Belgian Dutch texts.
Participants responded to: 'What is going on now or since the last prompt, and how do you feel about it?'
Carefully read the response of the participant: {text}.
Your task is to rate its sentiment from 1 (very negative) to 7 (very positive).
Return ONLY a single numerical rating enclosed in brackets, e.g. [X], with no additional text.
Output Format: [number]
```
The {text} placeholder was dynamically replaced with each participant narrative. Few-shot prompting was tested but did not improve coverage or accuracy. This approach mapped narrative content onto a coarse 1–7 sentiment scale, even though the original self-assessments ranged from –50 (very unpleasant) to +50 (very pleasant).
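The inference protocol reduces to three steps: fill the template, extract the bracketed rating from the model output, and map the 1–7 scale onto the participants' slider range. A sketch (function names are ours; the regex-based parsing and the linear rescaling are assumptions about how such a pipeline would be implemented, not details from the paper):

```python
import re

# Rating-relevant lines of the zero-shot template; {text} is the only slot.
PROMPT = (
    "You are a Dutch language expert analyzing the valence of Belgian Dutch texts.\n"
    "Carefully read the response of the participant: {text}.\n"
    "Your task is to rate its sentiment from 1 (very negative) to 7 (very positive).\n"
    "Return ONLY a single numerical rating enclosed in brackets, e.g. [X], "
    "with no additional text.\n"
    "Output Format: [number]"
)

def build_prompt(text):
    """Substitute the participant narrative into the template."""
    return PROMPT.format(text=text)

def parse_rating(output):
    """Extract the bracketed 1-7 rating; return None when the model fails to comply."""
    m = re.search(r"\[([1-7])\]", output)
    return int(m.group(1)) if m else None

def to_self_report_scale(rating):
    """Linearly map the 1-7 rating onto the participants' -50..+50 slider."""
    return (rating - 4) * (100 / 6)
```

The None branch of parse_rating is where the reported coverage gap arises: any output without a well-formed bracketed digit yields no rating for that text.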
4. Dataset Composition and Preprocessing
The evaluation utilized 24,854 open-ended narrative responses produced by 102 native Dutch-speaking adults in Belgium (age range: 18–65, M = 26.47, SD = 8.87) collected over 70 days. Each participant submitted approximately four responses per day, either as manually typed messages (3–4 sentences) or 1-minute voice recordings, which underwent automatic transcription (KU Leuven ESAT).
Manual alignment was performed across five 14-day intervals, and entries shorter than 25 words or malformed transcripts were excluded. Participants provided valence ratings on a continuous –50 to +50 slider for each narrative. ChocoLlama-8B-Instruct returned a numerical valence for 17,378 texts, indicating 69.9% coverage. In contrast, Pattern.nl and LIWC yielded ratings for >99.9% of texts, reflecting their rule-based foundation and robustness to non-standard input.
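The exclusion rule and the coverage statistic described above are simple to state as code (a sketch; helper names are ours, and None is used here to represent a text for which the model returned no rating):

```python
def keep_entry(text, min_words=25):
    """Inclusion rule from the study: keep entries of at least 25 words."""
    return len(text.split()) >= min_words

def coverage(ratings):
    """Fraction of texts that received a numeric rating (None = no rating)."""
    rated = sum(1 for r in ratings if r is not None)
    return rated / len(ratings)
```

For example, 17,378 rated texts out of 24,854 gives coverage(…) ≈ 0.699, the 69.9% figure in the table below.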
| Model | Coverage | Pearson r |
|---|---|---|
| Pattern.nl | 99.9% | 0.31 |
| LIWC (PosEmo/NegEmo) | 99.9% | 0.21/–0.23 |
| ChocoLlama-8B-Instruct | 69.9% | 0.35 |
Because Mean Squared Error (MSE) was not reported, performance comparison centers on Pearson correlation coefficients between model predictions and self-reported valence.
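A caveat when reading the table: the LLM's correlation is necessarily computed only on the subset of texts it rated. A pure-Python sketch of that computation (a real analysis would use scipy.stats.pearsonr; function names are ours):

```python
import math

def pearson_r(xs, ys):
    """Pearson correlation between predictions and self-reported valence."""
    n = len(xs)
    mx, my = sum(xs) / n, sum(ys) / n
    cov = sum((x - mx) * (y - my) for x, y in zip(xs, ys))
    sx = math.sqrt(sum((x - mx) ** 2 for x in xs))
    sy = math.sqrt(sum((y - my) ** 2 for y in ys))
    return cov / (sx * sy)

def pearson_on_covered(preds, targets):
    """Drop (prediction, target) pairs where the model gave no rating (None)."""
    pairs = [(p, t) for p, t in zip(preds, targets) if p is not None]
    xs, ys = zip(*pairs)
    return pearson_r(list(xs), list(ys))
```

Because the None pairs are silently dropped, the subset that remains may be systematically easier (or harder) than the full corpus, which is the selection-bias concern raised in the next section.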
5. Quantitative and Qualitative Performance
On its processed data subset, ChocoLlama-8B-Instruct achieved a Pearson r of 0.35, modestly outperforming Pattern.nl (0.31) in terms of linear association with user ratings. However, this higher correlation is counterbalanced by its limited coverage: roughly 30% of texts received no rating, introducing selection bias and reducing ecological validity. LIWC performed less well, with r = 0.21 for positive emotion (PosEmo) and r = –0.23 for negative emotion (NegEmo) categories.
Several factors contributed to ChocoLlama-8B-Instruct’s relative underperformance:
- Domain mismatch: The Dutch pretraining corpus lacked the first-person, spontaneous style intrinsic to emotional narratives in naturalistic experience sampling.
- English-dominated token distribution: Exposure to colloquial Flemish and idiomatic Dutch was insufficient, amplifying the risk of missed cultural or linguistic nuances (e.g., “fuif,” “gij”).
- Coverage limitations: Failure to generate a rating for many valid texts constrained reliability in large-scale psychological studies.
- Sensitivity to negative affect: Lexicon-based methods (e.g., LIWC's NegEmo) explicitly count negative affect words, whereas LLMs like ChocoLlama-8B-Instruct may blur or under-detect such signals across narrative context.
- Prompt scale misalignment: The 1–7 rating scale imposed on a native –50 to +50 spectrum forces quantization, flattening the granularity of extremely positive or negative responses.
6. Methodological Implications and Future Directions
The comparative deficits observed in ChocoLlama-8B-Instruct prompt several methodological recommendations for sentiment analysis in low-resource dialects:
- Enrich pretraining with narrative corpora: Integrating large-scale, naturalistic, first-person diary or experience sampling texts may improve adaptation for emotional nuance.
- Hybrid approaches: Combining LLM inference with psycholinguistic lexica (e.g., LIWC metrics) as auxiliary features or constraints may enhance sensitivity to both context and explicit emotional cues.
- Targeted fine-tuning: Employing multi-task objectives that include direct regression to valence scores (e.g., MSE loss on –50 to +50) alongside instructions may yield superior mapping to ground-truth affect.
- Standardized benchmarks: Creating and releasing shared Flemish valence-annotated corpora would support rigorous evaluation and accelerated model adaptation in this understudied variant.
- Domain adaptation strategies: Techniques like unsupervised adaptation or continual learning may mitigate domain mismatch without requiring massive labeled datasets.
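As one concrete instance of the hybrid recommendation above, a scoring function could blend the LLM's rescaled rating with a lexicon-derived valence score and fall back to the lexicon alone whenever the LLM abstains, directly addressing the roughly 30% coverage gap. A sketch (the blending weight and all names are our assumptions, not a method from the paper):

```python
def hybrid_valence(llm_rating, lexicon_score, w=0.6):
    """Blend an LLM 1-7 rating with a lexicon score on the -50..+50 scale.

    When the LLM returned no rating (None), fall back to the lexicon score,
    so every text receives a valence estimate.
    """
    if llm_rating is None:
        return lexicon_score
    llm_scaled = (llm_rating - 4) * (100 / 6)   # map 1..7 onto -50..+50
    return w * llm_scaled + (1 - w) * lexicon_score
```

A scheme like this trades a small amount of the LLM's contextual sensitivity for the lexicon's near-total coverage, keeping the full corpus usable in downstream psychological analyses.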
Until such linguistically and culturally tailored methodologies are operationalized, lightweight lexicon-based tools such as Pattern.nl maintain pragmatic advantages for scalable, ecologically valid valence annotation in low-resource language variants.