
Serbian-trained BERTić Model

Updated 6 January 2026
  • Serbian-trained BERTić is a Transformer-based encoder pre-trained on 8.39B words using ELECTRA's replaced-token detection, enhancing performance in Serbian NLP tasks.
  • The model leverages a diverse multilingual corpus with rigorous preprocessing that preserves diacritics and handles script duality for accurate language representation.
  • BERTić demonstrates strong downstream metrics on morphosyntactic tagging, NER, and synthetic QA, outperforming zero-shot baselines on Serbian benchmarks.

The Serbian-trained BERTić model is a Transformer-based encoder pre-trained specifically for Serbian and the closely related Western South Slavic languages. Built on the ELECTRA-base discriminator architecture with approximately 110 million parameters, BERTić offers competitive performance across a range of downstream Serbian NLP tasks, notably Question Answering (QA), Named Entity Recognition (NER), morphosyntactic tagging, and sentiment analysis. The model benefits from a large multilingual pre-training corpus, fine-tuning on synthetic datasets, and careful handling of Serbian lexical variation and script duality.

1. Training Data Composition and Preprocessing

The pre-training regime for BERTić leveraged approximately 8.39 billion words drawn from Bosnian, Croatian, Montenegrin, and Serbian web sources, with nearly 1.96 billion tokens attributable to the Serbian language itself (Ljubešić et al., 2021). Three principal Serbian corpora underpinned this segment:

  • srWaC: 493 million words from a 2014 crawl of the .rs domain.
  • CLASSLA-sr: 752 million words from a 2019 crawl, deduplicated against srWaC.
  • cc100-sr: 711 million deduplicated words from CommonCrawl.

Preprocessing included sentence-level deduplication (removing ~15% in cc100), Unicode-preserving cleaning (diacritics retained), standard normalization (whitespace collapsing, control-character removal), and WordPiece subword segmentation. The tokenizer vocabulary was capped at 32,000 tokens, trained on a random sample of 10 million paragraphs. No language-specific stemming or aggressive normalization was applied, with diacritic preservation to accommodate Serbian orthographic variation (Škorić, 2024).
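As a rough illustration of this tokenizer setup, the sketch below trains a cased, diacritic-preserving 32,000-token WordPiece vocabulary with the HuggingFace `tokenizers` library; the input file name is a placeholder for the sampled paragraphs, and the exact options used for BERTić are not documented here.

```python
# A minimal sketch, assuming a plain-text file of sampled paragraphs
# ("sr_paragraph_sample.txt" is a placeholder, not the original corpus file).
from tokenizers import BertWordPieceTokenizer

tokenizer = BertWordPieceTokenizer(
    lowercase=False,      # keep casing as-is
    strip_accents=False,  # preserve Serbian diacritics (č, ć, š, ž, đ)
)
tokenizer.train(
    files=["sr_paragraph_sample.txt"],
    vocab_size=32000,
    min_frequency=2,
    special_tokens=["[PAD]", "[UNK]", "[CLS]", "[SEP]", "[MASK]"],
)
tokenizer.save_model(".")  # writes vocab.txt for later use with an ELECTRA/BERT model
```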

2. Model Architecture and Pre-training Objectives

BERTić employs an ELECTRA-based architecture with 12 transformer layers, hidden size of 768, 12 attention heads, and 3072-dimensional feed-forward networks (Ljubešić et al., 2021). The self-attention mechanism per layer is:

$$\mathrm{Attention}(Q, K, V) = \mathrm{softmax}\!\left(\frac{QK^{T}}{\sqrt{d_k}}\right)V$$

with $Q, K, V \in \mathbb{R}^{n \times d_k}$.

Its distinguishing feature is the replaced-token detection (RTD) objective in place of traditional masked language modeling (MLM):

$$\mathcal{L}_{\mathrm{disc}} = -\sum_{i=1}^{L} \Bigl[\, y_i \log D(\tilde{x}_i) + (1 - y_i) \log\bigl(1 - D(\tilde{x}_i)\bigr) \Bigr]$$

where $y_i = 1$ if $\tilde{x}_i$ is the original token and $y_i = 0$ if it was replaced.
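A toy PyTorch rendering of this per-token objective is sketched below, with random tensors standing in for real discriminator logits and labels.

```python
# Per-token binary cross-entropy as in the RTD loss above; tensors are random placeholders.
import torch
import torch.nn.functional as F

batch_size, seq_len = 2, 8
logits = torch.randn(batch_size, seq_len)            # pre-sigmoid discriminator scores, one per token
labels = torch.randint(0, 2, (batch_size, seq_len))  # 1 = original token, 0 = replaced (convention above)

# binary_cross_entropy_with_logits computes -[y*log(sigmoid(s)) + (1-y)*log(1-sigmoid(s))],
# averaged over all token positions (the formula above sums instead of averaging).
loss = F.binary_cross_entropy_with_logits(logits, labels.float())
print(loss.item())
```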

No Next-Sentence Prediction (NSP) loss is used. Pre-training ran on 8 TPUv3 cores with a batch size of 1024, the Adam optimizer ($\beta_1 = 0.9$, $\beta_2 = 0.999$, $\epsilon = 10^{-6}$), weight decay 0.01, dropout 0.1, and a linear learning-rate warmup over the first 10% of 2 million steps, followed by linear decay to zero (Škorić, 2024).
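The sketch below wires up an ELECTRA-base-sized discriminator with this optimizer and warmup/decay schedule using HuggingFace `transformers`; the peak learning rate is not stated above, so the value shown is purely illustrative.

```python
# A sketch of the reported optimization setup, assuming an ELECTRA-base-sized discriminator.
import torch
from transformers import ElectraConfig, ElectraForPreTraining, get_linear_schedule_with_warmup

config = ElectraConfig(
    vocab_size=32000,
    hidden_size=768,
    num_hidden_layers=12,
    num_attention_heads=12,
    intermediate_size=3072,
    hidden_dropout_prob=0.1,
    attention_probs_dropout_prob=0.1,
)
model = ElectraForPreTraining(config)  # discriminator with the RTD head

total_steps = 2_000_000
optimizer = torch.optim.AdamW(
    model.parameters(),
    lr=2e-4,                # assumed value for illustration; not reported above
    betas=(0.9, 0.999),
    eps=1e-6,
    weight_decay=0.01,
)
scheduler = get_linear_schedule_with_warmup(
    optimizer,
    num_warmup_steps=int(0.1 * total_steps),  # linear warmup over 10% of training
    num_training_steps=total_steps,           # then linear decay to zero
)
```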

3. Synthetic QA Dataset and Fine-tuning Procedures

For QA, BERTić was fine-tuned on SQuAD-sr, the largest synthetic Serbian QA dataset (87,175 samples), generated via an adapted Translate-Align-Retrieve (TAR) pipeline. The pipeline starts from SQuAD v1.1 (context, question, answer) triples, translates the English contexts and questions into Serbian Cyrillic with the NLLB-200-1.3B NMT model, and optionally transliterates the result to Latin script. Word alignments are established with eflomal, and answer spans are mapped back via char2word and word2char indices. Separate QA models were trained on the Cyrillic and Latin versions to analyze script effects (Cvetanović et al., 2024).
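To make the first two pipeline stages concrete, the sketch below translates English text into Serbian Cyrillic with the public NLLB-200-1.3B checkpoint and then applies a simplified Cyrillic-to-Latin transliteration; the transliteration table is an illustrative approximation, and the eflomal alignment and span-mapping stages are omitted.

```python
# Translate (English -> Serbian Cyrillic) and transliterate (Cyrillic -> Latin); a sketch only.
from transformers import AutoTokenizer, AutoModelForSeq2SeqLM

nllb_name = "facebook/nllb-200-1.3B"
nllb_tok = AutoTokenizer.from_pretrained(nllb_name, src_lang="eng_Latn")
nllb = AutoModelForSeq2SeqLM.from_pretrained(nllb_name)

def translate_en_to_sr_cyrillic(text: str) -> str:
    inputs = nllb_tok(text, return_tensors="pt")
    out = nllb.generate(
        **inputs,
        forced_bos_token_id=nllb_tok.convert_tokens_to_ids("srp_Cyrl"),  # target language code
        max_new_tokens=256,
    )
    return nllb_tok.batch_decode(out, skip_special_tokens=True)[0]

# Simplified Serbian Cyrillic -> Latin transliteration: digraphs first, then 1:1 letters.
_DIGRAPHS = {"Љ": "Lj", "Њ": "Nj", "Џ": "Dž", "љ": "lj", "њ": "nj", "џ": "dž"}
_SINGLE = str.maketrans(
    "АБВГДЂЕЖЗИЈКЛМНОПРСТЋУФХЦЧШабвгдђежзијклмнопрстћуфхцчш",
    "ABVGDĐEŽZIJKLMNOPRSTĆUFHCČŠabvgdđežzijklmnoprstćufhcčš",
)

def cyrillic_to_latin(text: str) -> str:
    for cyr, lat in _DIGRAPHS.items():
        text = text.replace(cyr, lat)
    return text.translate(_SINGLE)
```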

Fine-tuning was carried out in HuggingFace Transformers on a single Tesla P100 GPU with batch size 16, learning rate $\eta = 3 \times 10^{-5}$, and 3 epochs; the QA head was a standard span classifier (Cvetanović et al., 2024).
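A minimal fine-tuning setup under these reported hyperparameters might look as follows; preparing the tokenized SQuAD-sr dataset (including answer-span alignment) is omitted, and the authors' exact training script may differ.

```python
# Fine-tuning setup sketch: batch size 16, lr 3e-5, 3 epochs, span-classification QA head.
from transformers import AutoTokenizer, AutoModelForQuestionAnswering, TrainingArguments

model_name = "classla/bcms-bertic"
tokenizer = AutoTokenizer.from_pretrained(model_name)
model = AutoModelForQuestionAnswering.from_pretrained(model_name)  # adds a fresh span-classification head

args = TrainingArguments(
    output_dir="bertic-squad-sr",
    per_device_train_batch_size=16,
    learning_rate=3e-5,
    num_train_epochs=3,
)
# Trainer(model=model, args=args, train_dataset=..., tokenizer=tokenizer).train()
# would then run the standard extractive-QA training loop on the prepared dataset.
```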

4. Downstream Evaluation and Comparative Performance

BERTić demonstrates strong performance across Serbian NLP benchmarks:

| Task | Dataset | BERTić Score | SOTA/Comparison |
|---|---|---|---|
| Morphosyntactic tagging | SETimes.SR (news) | 96.31 µF1 | cseBERT 96.41 |
| Morphosyntactic tagging | ReLDI-sr (Twitter) | 93.90 µF1 | cseBERT 93.54 |
| Named entity recognition | SETimes.SR (news) | 92.02 F1 | mBERT 92.41 |
| Named entity recognition | ReLDI-sr (Twitter) | 87.92 F1 | mBERT 81.29 |
| Social geo-location | SMG2020 | 37.96 km median | cseBERT 40.76 |
| QA (XQuAD) | SQuAD-sr (Latin) | 73.91% EM / 82.97% F1 | Human: 82.3 / 91.2 |
| Commonsense reasoning | COPA-HR | 65.76% accuracy | cseBERT 61.8 |

On SQuAD-sr QA, BERTić fine-tuned on the Latin script achieved 73.91% EM and 82.97% F1, outperforming the mBERT (58.6/71.71) and XLM-R (61.08/73.94) zero-shot baselines. Latin-script training yielded a gain of roughly 18 EM points over Cyrillic, attributed to a WordPiece vocabulary skewed toward Latin script (Cvetanović et al., 2024). Numeric and date questions were answered most accurately, while span extraction for case-marked and prepositional answers proved more difficult.
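For reference, the EM and F1 numbers above follow the standard SQuAD-style definitions; a simplified version (without the official answer-normalization rules for punctuation and articles) is sketched below.

```python
# Simplified SQuAD-style metrics: exact match and token-level F1 between prediction and gold answer.
from collections import Counter

def exact_match(prediction: str, gold: str) -> float:
    return float(prediction.strip().lower() == gold.strip().lower())

def token_f1(prediction: str, gold: str) -> float:
    pred_tokens, gold_tokens = prediction.lower().split(), gold.lower().split()
    overlap = sum((Counter(pred_tokens) & Counter(gold_tokens)).values())
    if overlap == 0:
        return 0.0
    precision = overlap / len(pred_tokens)
    recall = overlap / len(gold_tokens)
    return 2 * precision * recall / (precision + recall)

print(exact_match("1389. године", "1389. године"))  # 1.0
print(token_f1("у Новом Саду", "Новом Саду"))       # 0.8
```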

5. Script Duality and Error Analysis

Serbian’s script duality was addressed by generating parallel Cyrillic and Latin datasets. BERTić’s vocabulary skews toward Latin script, which explains its superior performance when fine-tuned on Latin data (Latin: 73.91% EM, Cyrillic: 55.42% EM). Question-type error analysis showed the highest EM/F1 on “When” and “How many” questions and the lowest on “What” and “Where”, attributed to linguistic ambiguity and the difficulty of extracting spans for genitive-case and prepositional answers. This suggests that morphological features and script-specific pretraining strategies are pivotal for optimal QA performance (Cvetanović et al., 2024).
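The vocabulary-skew explanation can be probed directly by tokenizing the same sentence in both scripts with the released tokenizer, as in the sketch below; the example sentence is arbitrary and the exact token counts will vary with the input.

```python
# Probing the vocabulary-skew claim: tokenize the same sentence in both scripts and compare.
from transformers import AutoTokenizer

tok = AutoTokenizer.from_pretrained("classla/bcms-bertic")

latin = "Nikola Tesla je rođen u Smiljanu."
cyrillic = "Никола Тесла је рођен у Смиљану."

for text in (latin, cyrillic):
    pieces = tok.tokenize(text)
    print(len(pieces), pieces)
# A longer subword sequence for the Cyrillic variant would indicate that the
# WordPiece vocabulary covers Latin script with fewer, larger pieces.
```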

6. Model Variants, Additional Pretraining, and Comparative Landscape

Beyond the original model, continued pretraining of XLM-RoBERTa (base/large) on combined HBS+Slovenian corpora (up to 11.5B tokens) delivers competitive performance, at times matching or surpassing BERTić’s results on Serbian NER, sentiment, and reasoning tasks (Ljubešić et al., 2024). BERTić sits in the "mid-rank" of vectorization models; larger or more strongly monolingual RoBERTa variants may outperform it on some tasks. A plausible implication is that fine-tuning on purely Serbian or domain-specific corpora can help close the human–machine gap, and that compute-efficient continued pretraining is an effective alternative to training dedicated new models from scratch (Ljubešić et al., 2024, Škorić, 2024).

7. Conclusions and Research Directions

BERTić’s combination of language-specific pretraining, synthetic corpus generation (SQuAD-sr), and careful script management enables it to outperform zero-shot multilingual baselines on Serbian QA and several other NLP tasks. Nevertheless, a gap of roughly 8 EM points to human-level QA performance remains, indicating that further improvements could come from manually annotated datasets, the inclusion of unanswerable questions, or larger and deeper architectures.

Future research is directed toward addressing morphosyntactic challenges (e.g., genitive case, preposition modeling), augmenting synthetic datasets with manual validation, exploiting cross-lingual co-training, and utilizing retrieval-augmented architectures as resource constraints allow (Cvetanović et al., 2024, Ljubešić et al., 2021, Škorić, 2024, Ljubešić et al., 2024).

The pre-trained Serbian BERTić model (“classla/bcms-bertic”) and associated tokenizers remain freely available for downstream fine-tuning and research through HuggingFace. This positions BERTić as an efficient, broadly accessible encoder for Serbian and Western South Slavic computational linguistics applications.
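A minimal loading example for feature extraction is shown below; task-specific fine-tuning would instead use the matching `AutoModelFor...` class, as in the QA sketch above.

```python
# Load the released checkpoint as a bare encoder and extract contextual embeddings.
from transformers import AutoModel, AutoTokenizer

tokenizer = AutoTokenizer.from_pretrained("classla/bcms-bertic")
model = AutoModel.from_pretrained("classla/bcms-bertic")

enc = tokenizer("Ово је пример реченице.", return_tensors="pt")  # "This is an example sentence."
hidden = model(**enc).last_hidden_state  # shape: (1, seq_len, 768) contextual embeddings
print(hidden.shape)
```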
