
Serbian-trained BERTić Model

Updated 6 January 2026
  • Serbian-trained BERTić is a Transformer-based encoder pre-trained on 8.39B words using ELECTRA's replaced-token detection, enhancing performance in Serbian NLP tasks.
  • The model leverages a diverse multilingual corpus with rigorous preprocessing that preserves diacritics and handles script duality for accurate language representation.
  • BERTić demonstrates strong downstream metrics on morphosyntactic tagging, NER, and synthetic QA, outperforming zero-shot baselines on Serbian benchmarks.

The Serbian-trained BERTić model is a Transformer-based encoder pre-trained specifically for Serbian and the closely related Western South Slavic languages. Built on the ELECTRA-base discriminator architecture with approximately 110 million parameters, BERTić offers competitive performance across a range of downstream Serbian NLP tasks, notably Question Answering (QA), Named Entity Recognition (NER), morphosyntactic tagging, and sentiment analysis. The model benefits from a large multilingual pre-training corpus, fine-tuning on synthetic datasets, and careful handling of Serbian lexical variation and script duality.

1. Training Data Composition and Preprocessing

The pre-training regime for BERTić leveraged approximately 8.39 billion words drawn from Bosnian, Croatian, Montenegrin, and Serbian web sources, with nearly 1.96 billion tokens attributable to the Serbian language itself (Ljubešić et al., 2021). Three principal Serbian corpora underpinned this segment:

  • srWaC: 493 million words from a 2014 crawl of the .rs domain.
  • CLASSLA-sr: 752 million words from a 2019 crawl, deduplicated against srWaC.
  • cc100-sr: 711 million deduplicated words from CommonCrawl.

Preprocessing included sentence-level deduplication (removing ~15% in cc100), Unicode-preserving cleaning (diacritics retained), standard normalization (whitespace collapsing, control-character removal), and WordPiece subword segmentation. The tokenizer vocabulary was capped at 32,000 tokens, trained on a random sample of 10 million paragraphs. No language-specific stemming or aggressive normalization was applied, with diacritic preservation to accommodate Serbian orthographic variation (Škorić, 2024).
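As a rough illustration of this tokenizer setup, the sketch below trains a cased, diacritic-preserving 32,000-token WordPiece vocabulary with the HuggingFace `tokenizers` library; the input file name is a placeholder for the sampled paragraphs, and the exact options used for BERTić are not documented here.

```python
# A minimal sketch, assuming a plain-text file of sampled paragraphs
# ("sr_paragraph_sample.txt" is a placeholder, not the original corpus file).
from tokenizers import BertWordPieceTokenizer

tokenizer = BertWordPieceTokenizer(
    lowercase=False,      # keep casing as-is
    strip_accents=False,  # preserve Serbian diacritics (č, ć, š, ž, đ)
)
tokenizer.train(
    files=["sr_paragraph_sample.txt"],
    vocab_size=32000,
    min_frequency=2,
    special_tokens=["[PAD]", "[UNK]", "[CLS]", "[SEP]", "[MASK]"],
)
tokenizer.save_model(".")  # writes vocab.txt for later use with an ELECTRA/BERT model
```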

2. Model Architecture and Pre-training Objectives

BERTić employs an ELECTRA-based architecture with 12 transformer layers, hidden size of 768, 12 attention heads, and 3072-dimensional feed-forward networks (Ljubešić et al., 2021). The self-attention mechanism per layer is:

$$\mathrm{Attention}(Q, K, V) = \mathrm{softmax}\!\left(\frac{QK^{T}}{\sqrt{d_k}}\right)V$$

with $Q, K, V \in \mathbb{R}^{n \times d_k}$.

Its distinguishing feature is the replaced-token detection (RTD) objective in place of traditional masked language modeling (MLM):

$$\mathcal{L}_{\mathrm{disc}} = -\sum_{i=1}^{L} \Bigl[\, y_i \log D(\tilde{x}_i) + (1 - y_i) \log\bigl(1 - D(\tilde{x}_i)\bigr) \Bigr]$$

where $y_i = 1$ if $\tilde{x}_i$ is the original token and $y_i = 0$ if it was replaced.
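A toy PyTorch rendering of this per-token objective is sketched below, with random tensors standing in for real discriminator logits and labels.

```python
# Per-token binary cross-entropy as in the RTD loss above; tensors are random placeholders.
import torch
import torch.nn.functional as F

batch_size, seq_len = 2, 8
logits = torch.randn(batch_size, seq_len)            # pre-sigmoid discriminator scores, one per token
labels = torch.randint(0, 2, (batch_size, seq_len))  # 1 = original token, 0 = replaced (convention above)

# binary_cross_entropy_with_logits computes -[y*log(sigmoid(s)) + (1-y)*log(1-sigmoid(s))],
# averaged over all token positions (the formula above sums instead of averaging).
loss = F.binary_cross_entropy_with_logits(logits, labels.float())
print(loss.item())
```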

No Next-Sentence Prediction (NSP) loss is used. Pre-training ran on 8 TPUv3 cores with a batch size of 1024, the Adam optimizer ($\beta_1 = 0.9$, $\beta_2 = 0.999$, $\epsilon = 10^{-6}$), weight decay 0.01, dropout 0.1, and a linear learning-rate warmup over the first 10% of 2 million steps, followed by linear decay to zero (Škorić, 2024).
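The sketch below wires up an ELECTRA-base-sized discriminator with this optimizer and warmup/decay schedule using HuggingFace `transformers`; the peak learning rate is not stated above, so the value shown is purely illustrative.

```python
# A sketch of the reported optimization setup, assuming an ELECTRA-base-sized discriminator.
import torch
from transformers import ElectraConfig, ElectraForPreTraining, get_linear_schedule_with_warmup

config = ElectraConfig(
    vocab_size=32000,
    hidden_size=768,
    num_hidden_layers=12,
    num_attention_heads=12,
    intermediate_size=3072,
    hidden_dropout_prob=0.1,
    attention_probs_dropout_prob=0.1,
)
model = ElectraForPreTraining(config)  # discriminator with the RTD head

total_steps = 2_000_000
optimizer = torch.optim.AdamW(
    model.parameters(),
    lr=2e-4,                # assumed value for illustration; not reported above
    betas=(0.9, 0.999),
    eps=1e-6,
    weight_decay=0.01,
)
scheduler = get_linear_schedule_with_warmup(
    optimizer,
    num_warmup_steps=int(0.1 * total_steps),  # linear warmup over 10% of training
    num_training_steps=total_steps,           # then linear decay to zero
)
```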

3. Synthetic QA Dataset and Fine-tuning Procedures

For QA, BERTić was fine-tuned on SQuAD-sr, the largest synthetic Serbian QA dataset (87,175 samples), generated via an adapted Translate-Align-Retrieve (TAR) pipeline. The pipeline starts from SQuAD v1.1 (context, question, answer) triples, translates the English contexts and questions into Serbian Cyrillic with the NLLB-200-1.3B NMT model, and optionally transliterates the result to Latin script. Word alignments are established with eflomal, and answer spans are mapped back via char2word and word2char indices. Separate QA models were trained on the Cyrillic and Latin versions to analyze script effects (Cvetanović et al., 2024).
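To make the first two pipeline stages concrete, the sketch below translates English text into Serbian Cyrillic with the public NLLB-200-1.3B checkpoint and then applies a simplified Cyrillic-to-Latin transliteration; the transliteration table is an illustrative approximation, and the eflomal alignment and span-mapping stages are omitted.

```python
# Translate (English -> Serbian Cyrillic) and transliterate (Cyrillic -> Latin); a sketch only.
from transformers import AutoTokenizer, AutoModelForSeq2SeqLM

nllb_name = "facebook/nllb-200-1.3B"
nllb_tok = AutoTokenizer.from_pretrained(nllb_name, src_lang="eng_Latn")
nllb = AutoModelForSeq2SeqLM.from_pretrained(nllb_name)

def translate_en_to_sr_cyrillic(text: str) -> str:
    inputs = nllb_tok(text, return_tensors="pt")
    out = nllb.generate(
        **inputs,
        forced_bos_token_id=nllb_tok.convert_tokens_to_ids("srp_Cyrl"),  # target language code
        max_new_tokens=256,
    )
    return nllb_tok.batch_decode(out, skip_special_tokens=True)[0]

# Simplified Serbian Cyrillic -> Latin transliteration: digraphs first, then 1:1 letters.
_DIGRAPHS = {"Љ": "Lj", "Њ": "Nj", "Џ": "Dž", "љ": "lj", "њ": "nj", "џ": "dž"}
_SINGLE = str.maketrans(
    "АБВГДЂЕЖЗИЈКЛМНОПРСТЋУФХЦЧШабвгдђежзијклмнопрстћуфхцчш",
    "ABVGDĐEŽZIJKLMNOPRSTĆUFHCČŠabvgdđežzijklmnoprstćufhcčš",
)

def cyrillic_to_latin(text: str) -> str:
    for cyr, lat in _DIGRAPHS.items():
        text = text.replace(cyr, lat)
    return text.translate(_SINGLE)
```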

Fine-tuning was carried out in HuggingFace Transformers on a single Tesla P100 GPU with batch size 16, learning rate $\eta = 3 \times 10^{-5}$, and 3 epochs; the QA head was a standard span classifier (Cvetanović et al., 2024).
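A minimal fine-tuning setup under these reported hyperparameters might look as follows; preparing the tokenized SQuAD-sr dataset (including answer-span alignment) is omitted, and the authors' exact training script may differ.

```python
# Fine-tuning setup sketch: batch size 16, lr 3e-5, 3 epochs, span-classification QA head.
from transformers import AutoTokenizer, AutoModelForQuestionAnswering, TrainingArguments

model_name = "classla/bcms-bertic"
tokenizer = AutoTokenizer.from_pretrained(model_name)
model = AutoModelForQuestionAnswering.from_pretrained(model_name)  # adds a fresh span-classification head

args = TrainingArguments(
    output_dir="bertic-squad-sr",
    per_device_train_batch_size=16,
    learning_rate=3e-5,
    num_train_epochs=3,
)
# Trainer(model=model, args=args, train_dataset=..., tokenizer=tokenizer).train()
# would then run the standard extractive-QA training loop on the prepared dataset.
```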

4. Downstream Evaluation and Comparative Performance

BERTić demonstrates strong performance across Serbian NLP benchmarks:

| Task | Dataset | BERTić Score | SOTA/Comparison |
|---|---|---|---|
| Morphosyntactic tagging | SETimes.SR (news) | 96.31 µF1 | cseBERT 96.41 |
| Morphosyntactic tagging | ReLDI-sr (Twitter) | 93.90 µF1 | cseBERT 93.54 |
| Named entity recognition | SETimes.SR (news) | 92.02 F1 | mBERT 92.41 |
| Named entity recognition | ReLDI-sr (Twitter) | 87.92 F1 | mBERT 81.29 |
| Social geo-location | SMG2020 | 37.96 km median | cseBERT 40.76 |
| QA (XQuAD) | SQuAD-sr (Latin) | 73.91% EM / 82.97% F1 | Human: 82.3 / 91.2 |
| Commonsense reasoning | COPA-HR | 65.76% accuracy | cseBERT 61.8 |

On SQuAD-sr QA, BERTić fine-tuned on the Latin script achieved 73.91% EM and 82.97% F1, outperforming the mBERT (58.6/71.71) and XLM-R (61.08/73.94) zero-shot baselines. Latin-script training yielded a gain of roughly 18 EM points over Cyrillic, attributed to a WordPiece vocabulary skewed toward Latin script (Cvetanović et al., 2024). Numeric and date questions were answered most accurately, while span extraction for case-marked and prepositional answers proved more difficult.
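For reference, the EM and F1 numbers above follow the standard SQuAD-style definitions; a simplified version (without the official answer-normalization rules for punctuation and articles) is sketched below.

```python
# Simplified SQuAD-style metrics: exact match and token-level F1 between prediction and gold answer.
from collections import Counter

def exact_match(prediction: str, gold: str) -> float:
    return float(prediction.strip().lower() == gold.strip().lower())

def token_f1(prediction: str, gold: str) -> float:
    pred_tokens, gold_tokens = prediction.lower().split(), gold.lower().split()
    overlap = sum((Counter(pred_tokens) & Counter(gold_tokens)).values())
    if overlap == 0:
        return 0.0
    precision = overlap / len(pred_tokens)
    recall = overlap / len(gold_tokens)
    return 2 * precision * recall / (precision + recall)

print(exact_match("1389. године", "1389. године"))  # 1.0
print(token_f1("у Новом Саду", "Новом Саду"))       # 0.8
```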

5. Script Duality and Error Analysis

Serbian’s script duality was addressed by generating parallel Cyrillic and Latin datasets. BERTić’s vocabulary skews toward Latin script, which explains its superior performance when fine-tuned on Latin data (Latin: 73.91% EM, Cyrillic: 55.42% EM). Question-type error analysis showed the highest EM/F1 on “When” and “How many” questions and the lowest on “What” and “Where”, attributed to linguistic ambiguity and the difficulty of extracting spans for genitive-case and prepositional answers. This suggests that morphological features and script-specific pretraining strategies are pivotal for optimal QA performance (Cvetanović et al., 2024).
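The vocabulary-skew explanation can be probed directly by tokenizing the same sentence in both scripts with the released tokenizer, as in the sketch below; the example sentence is arbitrary and the exact token counts will vary with the input.

```python
# Probing the vocabulary-skew claim: tokenize the same sentence in both scripts and compare.
from transformers import AutoTokenizer

tok = AutoTokenizer.from_pretrained("classla/bcms-bertic")

latin = "Nikola Tesla je rođen u Smiljanu."
cyrillic = "Никола Тесла је рођен у Смиљану."

for text in (latin, cyrillic):
    pieces = tok.tokenize(text)
    print(len(pieces), pieces)
# A longer subword sequence for the Cyrillic variant would indicate that the
# WordPiece vocabulary covers Latin script with fewer, larger pieces.
```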

6. Model Variants, Additional Pretraining, and Comparative Landscape

Beyond the original model, continued pretraining of XLM-RoBERTa (base/large) on combined HBS+Slovenian corpora (up to 11.5B tokens) delivers competitive performance, at times matching or surpassing BERTić’s results on Serbian NER, sentiment, and reasoning tasks (Ljubešić et al., 2024). BERTić sits in the "mid-rank" of vectorization models; larger or more strongly monolingual RoBERTa variants may outperform it on some tasks. A plausible implication is that fine-tuning on purely Serbian or domain-specific corpora can help close the human–machine gap, and that compute-efficient continued pretraining is an effective alternative to training dedicated new models from scratch (Ljubešić et al., 2024, Škorić, 2024).

7. Conclusions and Research Directions

BERTić’s combination of language-specific pretraining, synthetic corpus generation (SQuAD-sr), and careful script management enables it to outperform zero-shot multilingual baselines on Serbian QA and several other NLP tasks. Nevertheless, a gap of roughly 8 EM points to human-level QA performance remains, indicating that further improvements could come from manually annotated datasets, the inclusion of unanswerable questions, or larger and deeper architectures.

Future research is directed toward addressing morphosyntactic challenges (e.g., genitive case, preposition modeling), augmenting synthetic datasets with manual validation, exploiting cross-lingual co-training, and utilizing retrieval-augmented architectures as resource constraints allow (Cvetanović et al., 2024, Ljubešić et al., 2021, Škorić, 2024, Ljubešić et al., 2024).

The pre-trained Serbian BERTić model (“classla/bcms-bertic”) and associated tokenizers remain freely available for downstream fine-tuning and research through HuggingFace. This positions BERTić as an efficient, broadly accessible encoder for Serbian and Western South Slavic computational linguistics applications.
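A minimal loading example for feature extraction is shown below; task-specific fine-tuning would instead use the matching `AutoModelFor...` class, as in the QA sketch above.

```python
# Load the released checkpoint as a bare encoder and extract contextual embeddings.
from transformers import AutoModel, AutoTokenizer

tokenizer = AutoTokenizer.from_pretrained("classla/bcms-bertic")
model = AutoModel.from_pretrained("classla/bcms-bertic")

enc = tokenizer("Ово је пример реченице.", return_tensors="pt")  # "This is an example sentence."
hidden = model(**enc).last_hidden_state  # shape: (1, seq_len, 768) contextual embeddings
print(hidden.shape)
```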
