
ChocoLlama-8B-Instruct: Dutch LLM for Sentiment Analysis

Updated 17 November 2025
  • ChocoLlama-8B-Instruct is an instruction-tuned language model derived from LLaMA-3-8B, further adapted with Dutch-specific pretraining and alignment strategies.
  • It was adapted through continued pretraining with a rank-16 LoRA adapter, supervised fine-tuning, and Direct Preference Optimization; sentiment prediction relies on this general instruction-following ability rather than task-specific training.
  • Evaluated on Flemish narratives, the model shows a modest Pearson correlation with self-reported valence but markedly lower coverage than traditional lexicon-based methods.

ChocoLlama-8B-Instruct is an instruction-tuned LLM derived from the LLaMA-3-8B family, specifically adapted for Dutch through further pretraining and alignment strategies. It was evaluated as part of a comparative study on sentiment analysis in Flemish, a low-resource variety of Dutch, alongside both lexicon-based tools (LIWC, Pattern.nl) and other LLMs ("LLMs vs. Traditional Sentiment Tools in Psychology: An Evaluation on Belgian-Dutch Narratives" (Kandala et al., 10 Nov 2025)). Despite architectural advancements and Dutch-specific adaptation, it achieved only a moderate correlation with self-assessed valence in open-ended narrative texts and, owing to its markedly lower coverage, remained less practical than traditional lexicon approaches when applied to spontaneous real-world data.

1. Model Architecture and Pretraining

ChocoLlama-8B-Instruct is based on the LLaMA-3-8B architecture, a decoder-only Transformer comprising 8 billion parameters and 32 attention heads. Its Dutch adaptation process involved continued pretraining via a rank-16 LoRA adapter on a domain-diverse Dutch corpus containing 32 billion tokens drawn from legal documents, news, and social media. This phase was followed by supervised fine-tuning (SFT) on human-written instruction–response pairs in both Dutch and English, adhering to the LLaMA-3 SFT conventions.
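
The adapter setup can be made concrete with a short configuration sketch. The snippet below is a minimal illustration, assuming the Hugging Face transformers and peft libraries; the base checkpoint name, target modules, alpha, and dropout values are assumptions, and only the rank-16 value is taken from the description above.

```python
# Minimal sketch (not the authors' training code): continued pretraining of a
# Llama-3-8B base with a rank-16 LoRA adapter via Hugging Face transformers + peft.
from transformers import AutoModelForCausalLM, AutoTokenizer
from peft import LoraConfig, get_peft_model

base_id = "meta-llama/Meta-Llama-3-8B"  # assumed base checkpoint
model = AutoModelForCausalLM.from_pretrained(base_id)
tokenizer = AutoTokenizer.from_pretrained(base_id)

lora_cfg = LoraConfig(
    r=16,                      # rank-16 adapter, as reported for ChocoLlama
    lora_alpha=32,             # assumed scaling factor
    lora_dropout=0.05,         # assumed
    target_modules=["q_proj", "k_proj", "v_proj", "o_proj"],  # assumed targets
    task_type="CAUSAL_LM",
)
model = get_peft_model(model, lora_cfg)
model.print_trainable_parameters()
# The adapter is then trained with the usual next-token (causal LM) objective
# on the ~32B-token Dutch corpus described above.
```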

Alignment was achieved using Direct Preference Optimization (DPO), distilling human preferences expressed over candidate model outputs into the final adapter weights. A significant proportion of the model's overall pretraining data nevertheless remained English (approximately 90%), which contributed to the risk of domain mismatch when applying the model to informal Flemish narratives.
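
For reference, the standard DPO objective optimized in this alignment step (with the SFT model serving as the frozen reference policy) is

$$\mathcal{L}_{\text{DPO}}(\pi_\theta; \pi_{\text{ref}}) = -\,\mathbb{E}_{(x,\, y_w,\, y_l) \sim \mathcal{D}}\left[\log \sigma\!\left(\beta \log \frac{\pi_\theta(y_w \mid x)}{\pi_{\text{ref}}(y_w \mid x)} - \beta \log \frac{\pi_\theta(y_l \mid x)}{\pi_{\text{ref}}(y_l \mid x)}\right)\right]$$

where $y_w$ and $y_l$ are the preferred and dispreferred responses to prompt $x$, $\sigma$ is the logistic function, and $\beta$ controls how far the aligned policy may drift from the reference.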

2. Instruction-Tuning for Valence Prediction

No explicit fine-tuning was conducted using valence-labeled data, yet ChocoLlama-8B-Instruct was indirectly influenced via Dutch instruction-tuning datasets, including social media posts, translation exercises, and question–answering tasks, alongside broad English multitask corpora. The instruction-tuning followed a cross-entropy objective for SFT and a pairwise ranking loss under DPO. Hyperparameters, as reported in Meeus et al. (2024), included a batch size of 128, learning rate of 1e-4 for LoRA weights, 3 epochs of continued pretraining, and a weight decay of 0.01.
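
As an illustration only (the actual training stack is not specified here), those reported hyperparameters map naturally onto a standard Hugging Face training configuration; the per-device batch size, gradient accumulation, precision, and logging choices below are assumptions.

```python
# Illustrative only: the hyperparameters reported in Meeus et al. (2024) expressed
# as a Hugging Face TrainingArguments object. Everything not named in the text
# (output_dir, per-device split, precision, logging) is an assumption.
from transformers import TrainingArguments

training_args = TrainingArguments(
    output_dir="chocollama-lora",        # hypothetical path
    per_device_train_batch_size=8,       # assumed per-device size ...
    gradient_accumulation_steps=16,      # ... giving an effective batch size of 128
    learning_rate=1e-4,                  # LoRA learning rate, as reported
    num_train_epochs=3,                  # 3 epochs of continued pretraining, as reported
    weight_decay=0.01,                   # as reported
    bf16=True,                           # assumed precision
    logging_steps=50,
)
```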

Valence prediction during inference was a function of the model’s generic instruction-following abilities rather than any task-specific adaptation. No regression or classification loss targeting valence annotation was included during training.

3. Prompt Engineering and Inference Protocol

At inference, ChocoLlama-8B-Instruct and the comparison models were deployed in zero-shot mode using a standardized English prompt template (Dutch-language instructions yielded comparable results). The prompt was framed as follows:

```
You are a Dutch language expert analyzing the valence of Belgian Dutch texts.
Participants responded to: ‘What is going on now or since the last prompt, and how do you feel about it?’
Carefully read the response of the participant: {text}.
Your task is to rate its sentiment from 1 (very negative) to 7 (very positive).
Return ONLY a single numerical rating enclosed in brackets, e.g. [X], with no additional text.
Output Format: [number]
```

The {text} token was dynamically replaced with participant narratives. Few-shot prompting was tested but did not improve coverage or accuracy. This approach mapped narrative content to a coarse 1–7 sentiment scale, despite the original self-assessment ranging from –50 (very unpleasant) to +50 (very pleasant).
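
A minimal inference sketch, assuming the Hugging Face transformers library, shows how the template is filled and how the bracketed rating is parsed; the Hub identifier, decoding settings, and chat formatting are assumptions, and unparseable outputs correspond to the coverage failures discussed below.

```python
# Illustrative zero-shot inference sketch (not the authors' pipeline):
# fill the prompt template, generate, and parse the bracketed rating.
import re
from transformers import AutoModelForCausalLM, AutoTokenizer

PROMPT = (
    "You are a Dutch language expert analyzing the valence of Belgian Dutch texts.\n"
    "Participants responded to: 'What is going on now or since the last prompt, "
    "and how do you feel about it?'\n"
    "Carefully read the response of the participant: {text}.\n"
    "Your task is to rate its sentiment from 1 (very negative) to 7 (very positive).\n"
    "Return ONLY a single numerical rating enclosed in brackets, e.g. [X], "
    "with no additional text.\n"
    "Output Format: [number]"
)

model_id = "ChocoLlama/ChocoLlama-8B-instruct"  # assumed Hub identifier; check the model card
tokenizer = AutoTokenizer.from_pretrained(model_id)
model = AutoModelForCausalLM.from_pretrained(model_id)

def rate_valence(narrative: str):
    """Return the parsed 1-7 rating, or None when the output cannot be parsed
    (the coverage failures discussed in Section 4)."""
    # Note: an instruct model may additionally expect its chat template; omitted for brevity.
    inputs = tokenizer(PROMPT.format(text=narrative), return_tensors="pt")
    output_ids = model.generate(**inputs, max_new_tokens=10, do_sample=False)
    completion = tokenizer.decode(output_ids[0][inputs["input_ids"].shape[1]:],
                                  skip_special_tokens=True)
    match = re.search(r"\[(\d)\]", completion)
    return int(match.group(1)) if match else None
```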

4. Dataset Composition and Preprocessing

The evaluation utilized 24,854 open-ended narrative responses produced by 102 native Dutch-speaking adults in Belgium (age range: 18–65, M = 26.47, SD = 8.87) collected over 70 days. Each participant submitted approximately four responses per day, either as manually typed messages (3–4 sentences) or 1-minute voice recordings, which underwent automatic transcription (KU Leuven ESAT).

Manual alignment was performed across five 14-day intervals, and entries shorter than 25 words or malformed transcripts were excluded. Participants provided valence ratings on a continuous –50 to +50 slider for each narrative. ChocoLlama-8B-Instruct returned a numerical valence for 17,378 texts, indicating 69.9% coverage. In contrast, Pattern.nl and LIWC yielded ratings for >99.9% of texts, reflecting their rule-based foundation and robustness to non-standard input.
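
The exclusion rule and the coverage statistic can be expressed compactly; the sketch below is illustrative, with hypothetical function and variable names.

```python
# Sketch of the preprocessing exclusion rule and the coverage computation
# described above (function and variable names are hypothetical).
def keep_entry(text: str, min_words: int = 25) -> bool:
    """Exclude entries shorter than 25 words, as in the preprocessing step."""
    return text is not None and len(text.split()) >= min_words

def coverage(ratings: list) -> float:
    """Share of texts for which the model returned a parseable rating."""
    rated = sum(1 for r in ratings if r is not None)
    return rated / len(ratings)

# e.g. 17,378 parsed ratings out of 24,854 narratives ≈ 0.699 coverage
```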

| Model | Coverage | Pearson r |
|---|---|---|
| Pattern.nl | 99.9% | 0.31 |
| LIWC (PosEmo/NegEmo) | 99.9% | 0.21 / –0.23 |
| ChocoLlama-8B-Instruct | 69.9% | 0.35 |

Because Mean Squared Error (MSE) was not reported, performance comparison centers on Pearson correlation coefficients between model predictions and self-reported valence.
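
Concretely, the reported correlations correspond to a Pearson coefficient computed only over the subset of texts for which a model actually returned a rating, as in the following illustrative sketch (variable names are assumptions).

```python
# Pearson correlation between model ratings and self-reported valence,
# restricted to the texts the model covered (illustrative sketch).
from scipy.stats import pearsonr

def pearson_on_covered(predictions, self_reports):
    """predictions: model ratings or None; self_reports: -50..+50 slider values."""
    pairs = [(p, s) for p, s in zip(predictions, self_reports) if p is not None]
    preds, reports = zip(*pairs)
    r, p_value = pearsonr(preds, reports)
    return r, p_value
```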

5. Quantitative and Qualitative Performance

On its processed data subset, ChocoLlama-8B-Instruct achieved a Pearson r of 0.35, modestly outperforming Pattern.nl (0.31) in terms of linear association with user ratings. However, this higher correlation is counterbalanced by its limited coverage: roughly 30% of texts received no rating, introducing selection bias and reducing ecological validity. LIWC performed less well, with r = 0.21 for positive emotion (PosEmo) and r = –0.23 for negative emotion (NegEmo) categories.

Several factors contributed to ChocoLlama-8B-Instruct’s relative underperformance:

  • Domain mismatch: The Dutch pretraining corpus lacked the first-person, spontaneous style intrinsic to emotional narratives in naturalistic experience sampling.
  • English-dominated token distribution: Exposure to colloquial Flemish and idiomatic Dutch was insufficient, amplifying the risk of missed cultural or linguistic nuances (e.g., “fuif,” “gij”).
  • Coverage limitations: Failure to generate a rating for many valid texts constrained reliability in large-scale psychological studies.
  • Sensitivity to negative affect: Lexicon-based methods (e.g., LIWC's NegEmo) explicitly count negative-affect words, whereas LLMs like ChocoLlama-8B-Instruct may dilute or under-detect such signals when they are spread across longer narrative context.
  • Prompt scale misalignment: The 1–7 rating scale imposed on a native –50 to +50 spectrum forces quantization, flattening the granularity of extremely positive or negative responses (see the remapping sketch after this list).
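
The scale mismatch in the last point can be quantified with a simple linear remapping (an illustrative assumption, not a step taken in the paper): each step on the 1–7 scale spans roughly 16.7 points of the participants' slider.

```python
# Illustration of the scale mismatch: a linear map from the 1-7 prompt scale
# onto the participants' -50..+50 slider (an assumed alignment, for intuition only).
def to_slider_scale(rating: int) -> float:
    """Map a 1-7 rating onto -50..+50; each step spans ~16.7 slider points."""
    return (rating - 4) * (100 / 6)

print([round(to_slider_scale(r), 1) for r in range(1, 8)])
# [-50.0, -33.3, -16.7, 0.0, 16.7, 33.3, 50.0]
```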

6. Methodological Implications and Future Directions

The comparative deficits observed in ChocoLlama-8B-Instruct prompt several methodological recommendations for sentiment analysis in low-resource dialects:

  • Enrich pretraining with narrative corpora: Integrating large-scale, naturalistic, first-person diary or experience sampling texts may improve adaptation for emotional nuance.
  • Hybrid approaches: Combining LLM inference with psycholinguistic lexica (e.g., LIWC metrics) as auxiliary features or constraints may enhance sensitivity to both context and explicit emotional cues.
  • Targeted fine-tuning: Employing multi-task objectives that include direct regression to valence scores (e.g., an MSE loss on the –50 to +50 scale) alongside instruction data may yield a superior mapping to ground-truth affect (a minimal sketch follows this list).
  • Standardized benchmarks: Creating and releasing shared Flemish valence-annotated corpora would support rigorous evaluation and accelerated model adaptation in this understudied variant.
  • Domain adaptation strategies: Techniques like unsupervised adaptation or continual learning may mitigate domain mismatch without requiring massive labeled datasets.
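
As a rough illustration of the targeted fine-tuning suggestion (an assumption about one possible design, not a method from the paper), a scalar regression head with an MSE loss could be attached to the model's final hidden state and combined with the usual instruction-tuning objective.

```python
# Minimal sketch of a valence-regression objective (assumed design, not from the paper):
# a scalar head over the LLM's final hidden state, trained with MSE against the
# -50..+50 self-report, optionally alongside the instruction-tuning cross-entropy.
import torch
import torch.nn as nn

class ValenceRegressionHead(nn.Module):
    def __init__(self, hidden_size: int = 4096):  # 4096 = Llama-3-8B hidden size
        super().__init__()
        self.proj = nn.Linear(hidden_size, 1)

    def forward(self, last_hidden_state: torch.Tensor) -> torch.Tensor:
        # Pool the final token's representation and map it to a scalar valence.
        pooled = last_hidden_state[:, -1, :]
        return self.proj(pooled).squeeze(-1)

mse = nn.MSELoss()
# total_loss = lm_cross_entropy + lambda_reg * mse(predicted_valence, slider_valence)
```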

Until such linguistically and culturally tailored methodologies are operationalized, lightweight lexicon-based tools such as Pattern.nl maintain pragmatic advantages for scalable, ecologically valid valence annotation in low-resource language variants.
