ChocoLlama-8B-Instruct: Dutch LLM for Sentiment Analysis
- ChocoLlama-8B-Instruct is an instruction-tuned language model derived from LLaMA-3-8B, further adapted with Dutch-specific pretraining and alignment strategies.
- It employs a combination of continued pretraining with rank-16 LoRA adapters, supervised fine-tuning, and Direct Preference Optimization to enhance sentiment prediction.
- Evaluated on Flemish narratives, the model shows a modest Pearson correlation with self-report valence while suffering from limited coverage compared to traditional lexicon-based methods.
ChocoLlama-8B-Instruct is an instruction-tuned LLM derived from the LLaMA-3-8B family, adapted for Dutch through continued pretraining and alignment. It was evaluated in a comparative study of sentiment analysis in Flemish, a low-resource variant of Dutch, alongside both lexicon-based tools (LIWC, Pattern.nl) and other LLMs ("LLMs vs. Traditional Sentiment Tools in Psychology: An Evaluation on Belgian-Dutch Narratives", Kandala et al., 10 Nov 2025). Despite its architectural advances and Dutch-specific adaptation, it achieved only moderate correlation with self-assessed valence on open-ended narratives and, owing to limited coverage, underperformed traditional lexicon approaches in practice on spontaneous real-world data.
1. Model Architecture and Pretraining
ChocoLlama-8B-Instruct is based on the LLaMA-3-8B architecture, a decoder-only Transformer comprising 8 billion parameters and 32 attention heads. Its Dutch adaptation process involved continued pretraining via a rank-16 LoRA adapter on a domain-diverse Dutch corpus containing 32 billion tokens drawn from legal documents, news, and social media. This phase was followed by supervised fine-tuning (SFT) on human-written instruction–response pairs in both Dutch and English, adhering to the LLaMA-3 SFT conventions.
Alignment was achieved using Direct Preference Optimization (DPO), which distills human preferences over candidate model outputs into the final adapter weights. A significant proportion of the pretraining data remained English (approximately 90%), increasing the risk of domain mismatch when the model is applied to informal Flemish narratives.
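The rank-16 LoRA update described above replaces each frozen weight matrix W with W + (alpha/r) * B @ A, where A and B are small trainable matrices. A minimal, dependency-free sketch of that arithmetic (function names and the tiny shapes are ours for illustration; real training would use an adapter library rather than hand-rolled matrix code):

```python
def matmul(M, N):
    """Multiply two matrices given as lists of rows."""
    return [[sum(a * b for a, b in zip(row, col)) for col in zip(*N)] for row in M]

def lora_forward(x, W, A, B, alpha=16, r=16):
    """Compute y = (W + (alpha / r) * B @ A) @ x for a column vector x.

    W is the frozen base weight (d_out x d_in); A (r x d_in) and B (d_out x r)
    are the trainable low-rank LoRA factors.
    """
    delta = matmul(B, A)                       # low-rank update, d_out x d_in
    scale = alpha / r
    W_adapted = [[w + scale * d for w, d in zip(w_row, d_row)]
                 for w_row, d_row in zip(W, delta)]
    return matmul(W_adapted, x)
```

Because only A and B receive gradients, the trainable parameter count is 2 * r * d per adapted matrix instead of d * d, which is what makes continued pretraining on 32B Dutch tokens tractable.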
2. Instruction-Tuning for Valence Prediction
No explicit fine-tuning was conducted using valence-labeled data, yet ChocoLlama-8B-Instruct was indirectly influenced via Dutch instruction-tuning datasets, including social media posts, translation exercises, and question–answering tasks, alongside broad English multitask corpora. The instruction-tuning followed a cross-entropy objective for SFT and a pairwise ranking loss under DPO. Hyperparameters, as reported in Meeus et al. (2024), included a batch size of 128, learning rate of 1e-4 for LoRA weights, 3 epochs of continued pretraining, and a weight decay of 0.01.
Valence prediction during inference was a function of the model’s generic instruction-following abilities rather than any task-specific adaptation. No regression or classification loss targeting valence annotation was included during training.
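The pairwise ranking loss used under DPO can be stated concretely. A pure-Python sketch of the standard DPO objective, -log sigma(beta * [(log pi(y_w) - log pi_ref(y_w)) - (log pi(y_l) - log pi_ref(y_l))]) (argument names are ours; the beta value is illustrative, not taken from the paper):

```python
import math

def dpo_loss(logp_chosen, logp_rejected, ref_chosen, ref_rejected, beta=0.1):
    """Pairwise DPO loss for one preference pair.

    logp_* are the policy's log-probabilities of the chosen/rejected responses;
    ref_* are the frozen reference model's log-probabilities of the same responses.
    """
    margin = (logp_chosen - ref_chosen) - (logp_rejected - ref_rejected)
    sigmoid = 1.0 / (1.0 + math.exp(-beta * margin))
    return -math.log(sigmoid)
```

The loss is log(2) when the policy agrees with the reference on both responses, and shrinks as the policy assigns relatively more probability to the preferred response.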
3. Prompt Engineering and Inference Protocol
At inference, ChocoLlama-8B-Instruct and comparison models were deployed in zero-shot mode using a standardized English prompt template; Dutch instructions exhibited comparable results. The prompt was framed as follows:
```text
You are a Dutch language expert analyzing the valence of Belgian Dutch texts.
Participants responded to: 'What is going on now or since the last prompt, and how do you feel about it?'
Carefully read the response of the participant: {text}.
Your task is to rate its sentiment from 1 (very negative) to 7 (very positive).
Return ONLY a single numerical rating enclosed in brackets, e.g. [X], with no additional text.
Output Format: [number]
```
The {text} placeholder was dynamically replaced with each participant narrative. Few-shot prompting was tested but did not improve coverage or accuracy. This approach mapped narrative content onto a coarse 1–7 sentiment scale, even though the original self-assessments ranged from –50 (very unpleasant) to +50 (very pleasant).
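The inference protocol reduces to three steps: fill the template, extract the bracketed rating from the model output, and map the 1–7 scale onto the participants' slider range. A sketch (function names are ours; the regex-based parsing and the linear rescaling are assumptions about how such a pipeline would be implemented, not details from the paper):

```python
import re

# Rating-relevant lines of the zero-shot template; {text} is the only slot.
PROMPT = (
    "You are a Dutch language expert analyzing the valence of Belgian Dutch texts.\n"
    "Carefully read the response of the participant: {text}.\n"
    "Your task is to rate its sentiment from 1 (very negative) to 7 (very positive).\n"
    "Return ONLY a single numerical rating enclosed in brackets, e.g. [X], "
    "with no additional text.\n"
    "Output Format: [number]"
)

def build_prompt(text):
    """Substitute the participant narrative into the template."""
    return PROMPT.format(text=text)

def parse_rating(output):
    """Extract the bracketed 1-7 rating; return None when the model fails to comply."""
    m = re.search(r"\[([1-7])\]", output)
    return int(m.group(1)) if m else None

def to_self_report_scale(rating):
    """Linearly map the 1-7 rating onto the participants' -50..+50 slider."""
    return (rating - 4) * (100 / 6)
```

The None branch of parse_rating is where the reported coverage gap arises: any output without a well-formed bracketed digit yields no rating for that text.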
4. Dataset Composition and Preprocessing
The evaluation utilized 24,854 open-ended narrative responses produced by 102 native Dutch-speaking adults in Belgium (age range: 18–65, M = 26.47, SD = 8.87) collected over 70 days. Each participant submitted approximately four responses per day, either as manually typed messages (3–4 sentences) or 1-minute voice recordings, which underwent automatic transcription (KU Leuven ESAT).
Manual alignment was performed across five 14-day intervals, and entries shorter than 25 words or malformed transcripts were excluded. Participants provided valence ratings on a continuous –50 to +50 slider for each narrative. ChocoLlama-8B-Instruct returned a numerical valence for 17,378 texts, indicating 69.9% coverage. In contrast, Pattern.nl and LIWC yielded ratings for >99.9% of texts, reflecting their rule-based foundation and robustness to non-standard input.
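The exclusion rule and the coverage statistic described above are simple to state as code (a sketch; helper names are ours, and None is used here to represent a text for which the model returned no rating):

```python
def keep_entry(text, min_words=25):
    """Inclusion rule from the study: keep entries of at least 25 words."""
    return len(text.split()) >= min_words

def coverage(ratings):
    """Fraction of texts that received a numeric rating (None = no rating)."""
    rated = sum(1 for r in ratings if r is not None)
    return rated / len(ratings)
```

For example, 17,378 rated texts out of 24,854 gives coverage(…) ≈ 0.699, the 69.9% figure in the table below.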
| Model | Coverage | Pearson r |
|---|---|---|
| Pattern.nl | 99.9% | 0.31 |
| LIWC (PosEmo/NegEmo) | 99.9% | 0.21/–0.23 |
| ChocoLlama-8B-Instruct | 69.9% | 0.35 |
Because Mean Squared Error (MSE) was not reported, performance comparison centers on Pearson correlation coefficients between model predictions and self-reported valence.
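A caveat when reading the table: the LLM's correlation is necessarily computed only on the subset of texts it rated. A pure-Python sketch of that computation (a real analysis would use scipy.stats.pearsonr; function names are ours):

```python
import math

def pearson_r(xs, ys):
    """Pearson correlation between predictions and self-reported valence."""
    n = len(xs)
    mx, my = sum(xs) / n, sum(ys) / n
    cov = sum((x - mx) * (y - my) for x, y in zip(xs, ys))
    sx = math.sqrt(sum((x - mx) ** 2 for x in xs))
    sy = math.sqrt(sum((y - my) ** 2 for y in ys))
    return cov / (sx * sy)

def pearson_on_covered(preds, targets):
    """Drop (prediction, target) pairs where the model gave no rating (None)."""
    pairs = [(p, t) for p, t in zip(preds, targets) if p is not None]
    xs, ys = zip(*pairs)
    return pearson_r(list(xs), list(ys))
```

Because the None pairs are silently dropped, the subset that remains may be systematically easier (or harder) than the full corpus, which is the selection-bias concern raised in the next section.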
5. Quantitative and Qualitative Performance
On its processed data subset, ChocoLlama-8B-Instruct achieved a Pearson r of 0.35, modestly outperforming Pattern.nl (0.31) in terms of linear association with user ratings. However, this higher correlation is counterbalanced by its limited coverage: roughly 30% of texts received no rating, introducing selection bias and reducing ecological validity. LIWC performed less well, with r = 0.21 for positive emotion (PosEmo) and r = –0.23 for negative emotion (NegEmo) categories.
Several factors contributed to ChocoLlama-8B-Instruct’s relative underperformance:
- Domain mismatch: The Dutch pretraining corpus lacked the first-person, spontaneous style intrinsic to emotional narratives in naturalistic experience sampling.
- English-dominated token distribution: Exposure to colloquial Flemish and idiomatic Dutch was insufficient, amplifying the risk of missed cultural or linguistic nuances (e.g., “fuif,” “gij”).
- Coverage limitations: Failure to generate a rating for many valid texts constrained reliability in large-scale psychological studies.
- Sensitivity to negative affect: Lexicon-based methods (e.g., LIWC's NegEmo) explicitly count negative affect words, whereas LLMs like ChocoLlama-8B-Instruct may blur or under-detect such signals across narrative context.
- Prompt scale misalignment: The 1–7 rating scale imposed on a native –50 to +50 spectrum forces quantization, flattening the granularity of extremely positive or negative responses.
6. Methodological Implications and Future Directions
The comparative deficits observed in ChocoLlama-8B-Instruct prompt several methodological recommendations for sentiment analysis in low-resource dialects:
- Enrich pretraining with narrative corpora: Integrating large-scale, naturalistic, first-person diary or experience sampling texts may improve adaptation for emotional nuance.
- Hybrid approaches: Combining LLM inference with psycholinguistic lexica (e.g., LIWC metrics) as auxiliary features or constraints may enhance sensitivity to both context and explicit emotional cues.
- Targeted fine-tuning: Employing multi-task objectives that include direct regression to valence scores (e.g., MSE loss on –50 to +50) alongside instructions may yield superior mapping to ground-truth affect.
- Standardized benchmarks: Creating and releasing shared Flemish valence-annotated corpora would support rigorous evaluation and accelerated model adaptation in this understudied variant.
- Domain adaptation strategies: Techniques like unsupervised adaptation or continual learning may mitigate domain mismatch without requiring massive labeled datasets.
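As one concrete instance of the hybrid recommendation above, a scoring function could blend the LLM's rescaled rating with a lexicon-derived valence score and fall back to the lexicon alone whenever the LLM abstains, directly addressing the roughly 30% coverage gap. A sketch (the blending weight and all names are our assumptions, not a method from the paper):

```python
def hybrid_valence(llm_rating, lexicon_score, w=0.6):
    """Blend an LLM 1-7 rating with a lexicon score on the -50..+50 scale.

    When the LLM returned no rating (None), fall back to the lexicon score,
    so every text receives a valence estimate.
    """
    if llm_rating is None:
        return lexicon_score
    llm_scaled = (llm_rating - 4) * (100 / 6)   # map 1..7 onto -50..+50
    return w * llm_scaled + (1 - w) * lexicon_score
```

A scheme like this trades a small amount of the LLM's contextual sensitivity for the lexicon's near-total coverage, keeping the full corpus usable in downstream psychological analyses.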
Until such linguistically and culturally tailored methodologies are operationalized, lightweight lexicon-based tools such as Pattern.nl maintain pragmatic advantages for scalable, ecologically valid valence annotation in low-resource language variants.