BERnaT: Encoder-Only Models for Basque
- BERnaT models are encoder-only language models designed for Basque, incorporating standard, dialectal, historical, and informal texts to bridge representational gaps.
- They use RoBERTa-style Transformer architectures across medium, base, and large configurations with masked language modeling pre-training and diverse corpora.
- Empirical results show that combined pre-training enhances robustness and inclusivity on both standard and diverse NLU tasks without sacrificing accuracy.
The BERnaT family is a suite of encoder-only language models designed for Basque, a morphologically rich and low-resource language. BERnaT advances the inclusion of linguistic diversity in pre-training corpora, incorporating not only standard texts but also dialectal, historical, and informal registers. The models address representational gaps common in LLMs trained exclusively on highly filtered corpora and aim to increase robustness and inclusivity across a wide spectrum of natural language understanding (NLU) tasks. BERnaT is architecturally grounded in RoBERTa-style Transformer encoders, and the accompanying work extensively evaluates the effect of corpus composition on downstream task performance, showing no loss of accuracy on standard benchmarks (Azurmendi et al., 3 Dec 2025).
1. Model Configurations and Corpus Diversity
BERnaT models are released in three configurations distinguished by pre-training data composition:
- BERnaT-standard: Trained on latxa-corpus-v1.1 (4.3 million documents, 1.22 billion words), comprising predominantly standardized Basque sources such as Wikipedia, newspaper archives (Egunkaria), EusCrawl, Colossal-OSCAR, CulturaX, HPLT-v1, and Booktegi. Corpus-level diversity for these sources ranges from approximately 0.001 to 0.073, measured as the mean ratio of non-standard to total sentences.
- BERnaT-diverse: Trained exclusively on the diverse corpora (~13,000 social media documents and 338 historical book titles; 209 million words), comprising the Basque Social Media (BSM) dataset (11 million posts, 188 million words; diversity 0.14 ± 0.17) and the EKC collection (338 books, 21 million words; diversity 0.73 ± 0.17), which captures classical Basque.
- BERnaT (combined): Trained on the union of latxa-corpus-v1.1 and the diverse corpora, totaling approximately 1.43 billion words and providing the broadest coverage of standard and non-standard varieties.
Linguistic diversity is quantified by auto-tagging sentences as “standard” or “non-standard” (following Fernandez de Landa & Agerri, 2021), computing document diversity as the proportion of non-standard sentences, and aggregating for corpus-level metrics. This approach explicitly encompasses diatopic, diaphasic, diastratic, and diachronic variations in the Basque language.
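This computation can be made concrete with a minimal sketch, assuming a hypothetical `is_standard` sentence classifier standing in for the tagger of Fernandez de Landa & Agerri (2021):

```python
from statistics import mean, stdev
from typing import Callable, Iterable, List, Tuple


def document_diversity(sentences: List[str],
                       is_standard: Callable[[str], bool]) -> float:
    """Proportion of non-standard sentences in a single document."""
    if not sentences:
        return 0.0
    non_standard = sum(1 for s in sentences if not is_standard(s))
    return non_standard / len(sentences)


def corpus_diversity(documents: Iterable[List[str]],
                     is_standard: Callable[[str], bool]) -> Tuple[float, float]:
    """Mean and standard deviation of document-level diversity over a corpus."""
    scores = [document_diversity(doc, is_standard) for doc in documents]
    spread = stdev(scores) if len(scores) > 1 else 0.0
    return mean(scores), spread


# Toy usage with a trivial stand-in classifier; the real tagger is model-based.
docs = [["Kaixo mundua.", "zmk ondo zaude"], ["Egun on."]]
print(corpus_diversity(docs, is_standard=lambda s: s[0].isupper()))
```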
2. Architecture, Pre-training Objectives, and Hyperparameters
All BERnaT models share a RoBERTa-style Transformer encoder backbone (Liu et al., 2019), instantiated in three model sizes:
| Size | Layers | Hidden Size | Attention Heads | Intermediate Size | Parameters |
|---|---|---|---|---|---|
| Medium | 6 | 512 | 8 | 2048 | ≈ 51M |
| Base | 12 | 768 | 12 | 3072 | ≈ 124M |
| Large | 24 | 1024 | 16 | 4096 | ≈ 355M |
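These dimensions map directly onto standard RoBERTa hyperparameters; the following sketch instantiates the three sizes with Hugging Face's `RobertaConfig` (the vocabulary size is an assumption, as the BERnaT tokenizer is not detailed here):

```python
from transformers import RobertaConfig

# Assumed vocabulary size, for illustration only; the BERnaT tokenizer is not specified here.
VOCAB_SIZE = 50_265

SIZES = {
    "medium": dict(num_hidden_layers=6, hidden_size=512, num_attention_heads=8, intermediate_size=2048),
    "base": dict(num_hidden_layers=12, hidden_size=768, num_attention_heads=12, intermediate_size=3072),
    "large": dict(num_hidden_layers=24, hidden_size=1024, num_attention_heads=16, intermediate_size=4096),
}


def make_config(size: str) -> RobertaConfig:
    """Build a RoBERTa-style encoder configuration for one of the three BERnaT sizes."""
    return RobertaConfig(
        vocab_size=VOCAB_SIZE,
        max_position_embeddings=514,  # 512-token sequences plus RoBERTa's positional offset
        **SIZES[size],
    )


config = make_config("base")  # roughly the ≈124M-parameter setting in the table
```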
The pre-training objective is Masked Language Modeling (MLM):

$$\mathcal{L}_{\mathrm{MLM}}(\theta) = -\sum_{i \in \mathcal{M}} \log p_\theta\!\left(x_i \mid \tilde{\mathbf{x}}\right),$$

where $\mathcal{M}$ is the set of masked positions, covering 15% of token positions sampled dynamically per sequence, and $\tilde{\mathbf{x}}$ is the corrupted input. Masking follows the RoBERTa recipe: 80% of masked tokens are replaced by [MASK], 10% are replaced by a random token, and 10% are left unchanged. No Next Sentence Prediction or auxiliary objectives are employed. Regularization mirrors standard RoBERTa settings with dropout and weight decay.
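This recipe matches the default Hugging Face MLM collator; a minimal sketch, with a placeholder tokenizer checkpoint rather than the actual BERnaT tokenizer:

```python
from transformers import AutoTokenizer, DataCollatorForLanguageModeling

# Placeholder tokenizer; in practice BERnaT's own Basque tokenizer would be loaded.
tokenizer = AutoTokenizer.from_pretrained("roberta-base")

# Dynamic masking: 15% of positions are re-sampled every time a batch is built.
# Of the selected tokens, 80% become the mask token, 10% a random token, 10% stay unchanged.
collator = DataCollatorForLanguageModeling(
    tokenizer=tokenizer, mlm=True, mlm_probability=0.15
)

encoded = tokenizer("Kaixo, zer moduz?", truncation=True, max_length=512)
batch = collator([encoded])
# batch["labels"] is -100 everywhere except the sampled positions, so the
# cross-entropy loss is computed only over masked tokens, matching the MLM objective above.
```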
Pre-training details include the following (a configuration sketch follows the list):
- Sequence length: 512 tokens (documents padded or chunked as necessary).
- Effective batch size: 1 million tokens.
- Up to 100 training epochs, early stopping based on lowest validation MLM loss.
- Optimizer: AdamW with β-parameters as in RoBERTa.
- Learning rates: 8×10⁻⁴ (medium), 4×10⁻⁴ (base), 1×10⁻⁴ (large), identical across the three data configurations.
- Mixed precision (FP16) and Flash-Attention v2 on NVIDIA GPUs (Leonardo supercomputer).
- No curriculum learning or progressive schedule beyond standard linear warmup/decay.
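A hedged sketch of how these settings could be wired together with `TrainingArguments` from a recent version of the transformers library; the per-device batch size and gradient-accumulation values are assumptions chosen only to illustrate an effective batch of roughly one million tokens:

```python
from transformers import RobertaConfig, RobertaForMaskedLM, TrainingArguments

# Base-size encoder from the table in Section 2.
config = RobertaConfig(num_hidden_layers=12, hidden_size=768,
                       num_attention_heads=12, intermediate_size=3072,
                       max_position_embeddings=514)
model = RobertaForMaskedLM(config)

# 32 sequences x 64 accumulation steps x 512 tokens ≈ 1.05M tokens per update;
# this split is an assumption, only the effective token count is taken from the text.
args = TrainingArguments(
    output_dir="bernat-base",
    per_device_train_batch_size=32,
    gradient_accumulation_steps=64,
    learning_rate=4e-4,                 # base-size rate reported above
    weight_decay=0.01,                  # RoBERTa-style regularization
    num_train_epochs=100,
    lr_scheduler_type="linear",
    warmup_ratio=0.06,                  # assumption: standard RoBERTa-style warmup
    fp16=True,                          # mixed precision; requires a CUDA device
    eval_strategy="epoch",
    save_strategy="epoch",
    load_best_model_at_end=True,        # keep the checkpoint with the lowest
    metric_for_best_model="eval_loss",  # validation MLM loss (early-stopping criterion)
    greater_is_better=False,
)
# `model`, `args`, an MLM data collator (see the masking sketch above), and the
# tokenized train/validation splits would then be passed to a `Trainer`.
```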
3. Task Suite and Evaluation Methodology
An evaluation framework was designed to separately assess language generalization on “standard” and “diverse” task sources. Tasks are split as follows:
- Standard NLU tasks: Topic classification (BHTC), coreference (Korref), NER in-domain (NERCid), NER out-of-domain (NERCod), QA as NLI (QNLI), word-in-context (WiC), natural NLI (XNLIeu-nat), POS tagging (UD-POS).
- Diverse NLU tasks: Sentiment (BEC, Twitter), intent and slot filling (Facebook), stance (Vaxx, Twitter), dialectal NLI (XNLIeu-var), historical POS (POShis), and normalized historical POS (POShis-nor).
Evaluation metrics are accuracy or F₁ score, as appropriate to each task. Each score is averaged over three random seeds, with mean and variance reported.
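Concretely, the per-task aggregation amounts to something like the following sketch (the scores are illustrative placeholders, not results from the paper):

```python
from statistics import mean, pvariance
from typing import List


def aggregate(task: str, scores: List[float]) -> str:
    """Report mean and variance of one task's metric over the three fine-tuning seeds."""
    return f"{task}: mean={mean(scores):.2f}, var={pvariance(scores):.2f}"


# Placeholder scores for three seeds; not results from the paper.
print(aggregate("BHTC (accuracy)", [74.2, 73.8, 74.5]))
```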
4. Comparative Results and Performance Trends
Aggregate accuracy results for all BERnaT variants and sizes are:
| Task subset | BERnaT-std med | BERnaT-std base | BERnaT-std large | BERnaT-div med | BERnaT-div base | BERnaT-div large | BERnaT med | BERnaT base | BERnaT large |
|---|---|---|---|---|---|---|---|---|---|
| Avg standard | 74.10 | 75.33 | 76.83 | 71.66 | 72.44 | 74.48 | 73.56 | 75.42 | 77.88 |
| Avg diverse | 70.30 | 71.26 | 73.13 | 69.91 | 71.43 | 71.87 | 70.59 | 71.28 | 73.77 |
| Avg overall | 72.58 | 73.70 | 75.35 | 70.96 | 72.04 | 73.43 | 72.37 | 73.76 | 76.24 |
Key findings:
- BERnaT (combined) consistently achieves the highest accuracy on both standard and diverse subsets. For example, BERnaT large improves over the standard-only variant from 76.83 to 77.88 on the standard subset and from 73.13 to 73.77 on the diverse subset.
- BERnaT-diverse exhibits higher scores on non-standard tasks (intent, stance, sentiment) at the expense of performance on standard benchmarks (2–3 percentage points lower).
- Incorporation of diverse corpora does not degrade standard benchmark accuracy.
- Model size correlates with performance gains from combined pre-training; the “large” models demonstrate the largest improvements, while “medium” models see modest or negligible gains.
5. Analysis: Generalization, Robustness, and Bias
Exposure to a range of dialectal, historical, and informal texts during pre-training leads to measurable robustness benefits, particularly on fine-tuning datasets from specialized or low-resource domains. The effect is most pronounced in large models. As model capacity increases, average task performance improves, with the greatest relative improvement due to combined pre-training observed at scale.
Fine-tuning data size critically interacts with pre-training diversity:
- For training sets under 10,000 examples, aligning pre-training and fine-tuning diversity can yield better results (specialized standard/diverse models may outperform the combined approach).
- With larger fine-tuning sets, combined BERnaT models outperform specialized variants across standard and diverse metrics.
Dialectal NLI benchmarking (XNLIeu-var) with the large models shows Central Basque at 75% accuracy, Western at 72.5%, and Navarrese at 66–69%. In this case, BERnaT-standard slightly outperforms the combined model, plausibly because the large fine-tuning dataset overrides the effects of pre-training diversity.
Challenging cases (strong code-switching, archaic orthography) remain difficult. The combined approach narrows, but does not eliminate, these performance gaps. No new unfair representational biases across dialects or registers were detected, suggesting broad-based, balanced corpora can mitigate skew.
6. Implications for Corpus Design and Model Development
The empirical findings support several broader recommendations:
- Inclusion of high-quality, naturally occurring “noisy” texts alongside standard corpora yields more robust and generalizable encoders, even in morphologically complex, low-resource settings.
- Document-level diversity quantification offers a principled mechanism for curating corpora to adequately sample linguistic varieties.
- No accuracy loss is observed on standard tasks from incorporating diverse data, contradicting a common concern in mainstream NLP model development.
A plausible implication is that related languages may also benefit from the described approach regardless of resource availability, provided corpus curation appropriately balances coverage and quality.
7. Societal and Ethical Considerations, Future Directions
The BERnaT initiative foregrounds equitable representation of all speaker communities—especially those associated with marginalized dialects—by actively including social media, pre-standard, and dialectal texts in pre-training corpora. Ethically, this counters the invisibility of non-standard language varieties in mainstream NLP pipelines.
Suggested future research includes:
- Extension of the methodology to decoder-only or encoder–decoder LLMs, targeting few-shot and zero-shot cross-dialect transfer.
- Generator-based evaluation metrics, such as fluency in dialectal text synthesis.
- Investigation into optimal mixing ratios of linguistic varieties to further enhance downstream generalization.
These directions underscore the broader impact of corpus design on both algorithmic fairness and the practical reach of language technology in diverse linguistic communities (Azurmendi et al., 3 Dec 2025).