belabBERT: Dutch RoBERTa for Clinical NLP
- belabBERT is a monolingual Dutch language model based on RoBERTa-base that leverages a large, web-crawled corpus and a custom tokenizer to capture long-range semantic dependencies.
- The model outperforms prior Dutch models in standard sentiment analysis and psychiatric interview classification, achieving test accuracies up to 75.68% with robust per-label F1 scores.
- A hybrid text-and-audio late-fusion framework further improves accuracy for underrepresented classes, highlighting its potential for practical clinical applications.
belabBERT is a monolingual Dutch language model built on the RoBERTa-base Transformer architecture and trained from scratch on a large Dutch web-crawled corpus. Its primary motivation is to improve the capture of long-range semantic dependencies in Dutch text, particularly for difficult language understanding tasks such as the classification of psychiatric illness from interview transcripts. belabBERT leverages Dutch-specific data and tokenization strategies, empirically outperforming prior Dutch models on both standard sentiment analysis and clinically relevant classification tasks (Wouts et al., 2021, Wouts, 2020).
1. Model Architecture and Pre-training Objectives
belabBERT adopts the RoBERTa-base transformer configuration: 12 encoder layers, each with 12 self-attention heads, a hidden size of 768, and a feed-forward size of 3072, resulting in approximately 125 million parameters. The sequence length is capped at 512 tokens, with input segmented by a Dutch-specific Byte-Pair Encoding (BPE) tokenizer with a 50,265-token vocabulary. All activation functions are GeLU, and dropout is applied at 0.1 (Wouts et al., 2021, Wouts, 2020).
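As a rough sanity check on the ~125 million figure, the parameter count implied by the RoBERTa-base configuration can be tallied directly. This is a back-of-the-envelope sketch that counts only the main weight matrices, ignoring biases and layer norms:

```python
# Approximate parameter count for a RoBERTa-base-style encoder.
V, H, L, F, P = 50_265, 768, 12, 3072, 512  # vocab, hidden, layers, FFN, max positions

embeddings = V * H + P * H   # token + positional embeddings
attention  = 4 * H * H       # Q, K, V and output projections per layer
ffn        = 2 * H * F       # two feed-forward projections per layer
per_layer  = attention + ffn
total      = embeddings + L * per_layer

print(f"~{total / 1e6:.0f}M parameters (major weight matrices only)")
```

The tally lands at roughly 124 million, consistent with the ~125 million cited once biases and layer-norm parameters are included.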
Pre-training follows a masked language modeling (MLM) objective without Next Sentence Prediction. For the set of masked positions $\mathcal{M}$ in an input sequence $x$, the loss is:

$$\mathcal{L}_{\mathrm{MLM}} = -\sum_{i \in \mathcal{M}} \log p_\theta\!\left(x_i \mid x_{\setminus \mathcal{M}}\right)$$
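As a toy illustration, the MLM negative log-likelihood over masked positions reduces to summing log-probabilities of the true tokens. The probabilities below are hypothetical placeholders, not model outputs:

```python
import math

def mlm_loss(probs_at_masked):
    """Negative log-likelihood summed over masked positions.

    probs_at_masked: model probability assigned to the true token
    at each masked position (hypothetical values here).
    """
    return -sum(math.log(p) for p in probs_at_masked)

# E.g. three masked tokens predicted with these probabilities:
loss = mlm_loss([0.9, 0.5, 0.7])
```

In training, these probabilities come from a softmax over the 50,265-token vocabulary at each masked position.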
Pre-training uses the Dutch OSCAR corpus: 41GB of raw text, reduced to a cleaned 32GB corpus post filtering—fuzzy deduplication (lines with >90% overlap), removal of URLs, and exclusion of lines >2000 tokens. The BPE tokenizer is trained from scratch on this corpus, avoiding artifacts of English-centric tokenizations and preserving Dutch sentence structure (Wouts et al., 2021).
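The cleaning steps described above (URL stripping, length capping, and near-duplicate removal) might be sketched as a simple line filter. This is an illustrative assumption about the pipeline, and the 90%-overlap fuzzy deduplication is approximated here with a much cruder exact-prefix check:

```python
import re

URL_RE = re.compile(r"https?://\S+")
MAX_TOKENS = 2000  # lines longer than this are dropped

def clean_corpus(lines):
    """Yield cleaned lines: URLs removed, over-long and near-duplicate lines dropped."""
    seen = set()
    for line in lines:
        line = URL_RE.sub("", line).strip()
        if not line or len(line.split()) > MAX_TOKENS:
            continue
        key = line.lower()[:100]  # crude stand-in for >90%-overlap fuzzy dedup
        if key in seen:
            continue
        seen.add(key)
        yield line

sample = ["Dit is een zin. http://example.com", "Dit is een zin.", "kort"]
cleaned = list(clean_corpus(sample))  # duplicate sentence dropped, URL removed
```

A production pipeline would use a proper fuzzy-matching method (e.g. MinHash or shingled similarity) rather than a prefix key, but the filtering shape is the same.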
Optimization employs Adam (or its decoupled-weight-decay variant, AdamW) with weight decay 0.01 and otherwise RoBERTa-default hyperparameters: the learning rate warms up linearly to its peak and then decays, with batches of multiple 512-token sequences per GPU. Total training time is 60 hours on 16 Nvidia Titan RTX GPUs, covering approximately 125,000 training steps (Wouts et al., 2021, Wouts, 2020).
2. Dutch-specific Innovations: Corpus, Tokenizer, and Pre-processing
belabBERT's improvements over earlier Dutch models, notably RobBERT, derive from (a) its use of a Dutch-crawled corpus with original, non-shuffled document order, and (b) a tokenizer trained specifically on Dutch data (Wouts et al., 2021, Wouts, 2020). Preserving sentence and document order during pre-training allows self-attention to capture dependencies across sentence boundaries, an essential property for long-form or context-sensitive language modeling.
The custom Dutch BPE tokenizer reduces the tokenization artifacts associated with using a tokenizer trained on English data, resulting in fewer tokens per Dutch sentence and improved handling of the language's morphological richness.
3. Fine-tuning for Psychiatric Interview Classification
Fine-tuning employs a standard BERT-classification head: the [CLS] token from the final layer undergoes dropout, then is processed by a linear classifier to produce logits for each class (psychotic, depressed, healthy), followed by a softmax and cross-entropy loss. The interview corpus comprises 141 manually transcribed interviews, with sample label distributions stratified by psychiatric diagnosis (Wouts et al., 2021, Wouts, 2020).
Transcripts are chunked into either 220- or 505-token segments, yielding separate datasets for experimental comparison. Hyperparameter sweeps cover batch size (4–16), learning rate, and number of epochs (3–7), selecting the configuration with the lowest validation cross-entropy.
Splits are applied at the speaker level: 80% train, 10% validation, and 10% test. For the optimal 220-token chunk setting, the test distribution is: 63 psychotic, 59 healthy, and 6 depressed segments (Wouts, 2020).
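The fixed-length chunking and speaker-level splitting could be sketched as follows. The helper names are illustrative assumptions, and the whitespace tokens stand in for the model's BPE tokens; the key point is that whole speakers, not individual chunks, are assigned to splits to avoid leakage:

```python
def chunk_transcript(tokens, chunk_size=220):
    """Split one speaker's token sequence into fixed-length segments."""
    return [tokens[i:i + chunk_size] for i in range(0, len(tokens), chunk_size)]

def speaker_level_split(speakers, train=0.8, val=0.1):
    """Assign whole speakers (not chunks) to train/val/test to avoid leakage."""
    n = len(speakers)
    n_train, n_val = int(n * train), int(n * val)
    return (speakers[:n_train],
            speakers[n_train:n_train + n_val],
            speakers[n_train + n_val:])

tokens = ["woord"] * 500           # a hypothetical 500-token transcript
chunks = chunk_transcript(tokens)  # segments of 220, 220, and 60 tokens
```

Splitting by speaker rather than by chunk matters because chunks from the same interview are highly correlated; mixing them across splits would inflate test accuracy.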
4. Comparative Evaluation: belabBERT, RobBERT, Audio-only, and Hybrid Models
Experimental results demonstrate substantial gains for belabBERT over RobBERT and audio baselines (see Table 1).
Table 1. Psychiatric Classification Accuracy [%]
| Model | Chunk Size | Validation | Test |
|---|---|---|---|
| belabBERT | 220 | 71.18 | 75.68 |
| belabBERT | 505 | 70.25 | 73.91 |
| RobBERT | 220 | 69.64 | 69.06 |
| RobBERT | 505 | 68.93 | 65.69 |
belabBERT achieves a 6.62 percentage point improvement in test accuracy over RobBERT for 220-token chunks. Per-label test F1 for belabBERT-220: psychotic (80.58%), healthy (77.70%), depressed (22.21%). The low F1 for depression is attributed to severe class imbalance (n=6 test samples). Metrics demonstrate that the model specializes in distinguishing psychotic and healthy subjects, and rarely misclassifies others as depressed (Wouts et al., 2021, Wouts, 2020).
The audio-only classifier uses openSMILE eGeMAPS features and a 3-layer MLP (94–64–32), yielding 65.96% test accuracy and F1=0 for depression, confirming limited utility of unimodal audio features for this minority class.
5. Hybrid Text-and-Audio Late-Fusion Framework
A late-fusion hybrid architecture concatenates the softmax logits of the frozen belabBERT and audio classifiers into a single 6-dimensional vector, which is passed through a linear layer and softmax for final prediction. Only the fusion layer's parameters are trained, using the same transcript-aligned splits.
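The fusion step reduces to a linear map over the concatenated class probabilities. A minimal sketch in plain Python, where the weights are random placeholders standing in for the only parameters that would actually be trained:

```python
import math, random

def softmax(z):
    m = max(z)
    e = [math.exp(v - m) for v in z]
    s = sum(e)
    return [v / s for v in e]

def late_fusion(text_probs, audio_probs, W, b):
    """Concatenate the two 3-class softmax outputs into a 6-dim vector,
    then apply a single linear layer followed by softmax."""
    x = text_probs + audio_probs  # 6-dimensional fused input
    logits = [sum(w * xi for w, xi in zip(row, x)) + bi
              for row, bi in zip(W, b)]
    return softmax(logits)

random.seed(0)
W = [[random.gauss(0, 0.1) for _ in range(6)] for _ in range(3)]  # placeholder weights
b = [0.0, 0.0, 0.0]
probs = late_fusion([0.7, 0.2, 0.1], [0.4, 0.4, 0.2], W, b)
```

Because the text and audio backbones are frozen, only the 6×3 weight matrix and 3-dim bias are updated during fusion training, which keeps the hybrid cheap to fit on the small interview dataset.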
This hybrid approach yields a further increase in test accuracy (77.70%), with strong per-label F1—psychotic (81.82%), healthy (80.01%), and a substantial recovery for depression (52.94%), suggesting that audio features complement text inputs specifically for the underrepresented class (Wouts et al., 2021, Wouts, 2020).
Table 2. Hybrid Classifier Metrics (Test Set)
| Label | Recall | Precision | F1 |
|---|---|---|---|
| Depressed | 60.00 | 47.37 | 52.94 |
| Healthy | 81.25 | 78.79 | 80.01 |
| Psychotic | 78.26 | 85.71 | 81.82 |
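The F1 column in Table 2 is the harmonic mean of the precision and recall columns, which is easy to verify (the healthy row differs by one hundredth, attributable to rounding of the published precision/recall):

```python
def f1(precision, recall):
    """Harmonic mean of precision and recall (in percent)."""
    return 2 * precision * recall / (precision + recall)

# Reproducing Table 2's F1 column from its precision/recall columns:
depressed = round(f1(47.37, 60.00), 2)  # -> 52.94, matching the table
psychotic = round(f1(85.71, 78.26), 2)  # -> 81.82, matching the table
healthy   = round(f1(78.79, 81.25), 2)  # ~80.00 vs. the table's 80.01 (rounding)
```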
A plausible implication is that even limited audio signal can help mitigate recall issues in rare-class detection within clinical interview modeling.
6. Benchmarking and Real-world Applications
In addition to psychiatric classification, belabBERT demonstrates superior performance on standard Dutch NLP benchmarks. On the DBRD Dutch book review sentiment dataset, belabBERT achieves 95.92% accuracy, exceeding RobBERT (94.42%) and BERTje (93.00%) (Wouts et al., 2021).
Key applications include eHealth screening tools for early detection of psychosis and depression, automated triage for online helplines, and multimodal digital biomarker frameworks in psychiatric assessment.
7. Limitations and Prospective Directions
Current pre-training exploits only ~60% of the available corpus due to computational constraints; further training could yield incremental gains. The classification pipeline is limited by the small and imbalanced depressed class, and by the use of fixed-length chunking, possibly affecting discourse-level context modeling. Adaptive chunking or hierarchical segment modeling may address these issues.
No formal statistical significance testing is reported, but the consistent cross-validation/test performance and the magnitude of gains over baselines suggest robust improvements. belabBERT’s Dutch-centric data and tokenizer confer advantages for semantic dependency modeling, particularly for clinical language tasks involving structured interviews (Wouts et al., 2021, Wouts, 2020).