Curió-Edu 7B: Portuguese LLM Pretraining
- Curió-Edu 7B is a 7-billion-parameter LLM designed for Brazilian Portuguese, employing a decoder-only Transformer architecture derived from LLaMA-2 7B.
- It uses a semantically filtered subset of the ClassiCC-PT corpus, focusing on high-quality STEM and educational texts, and is trained on roughly 20 billion tokens (two epochs over a 10-billion-token subset) to enhance domain-specific performance.
- Evaluation on the PoETa V2 benchmark shows a ≈14% relative gain over the compute-matched full-corpus baseline (Curió 7B at 20 billion randomly sampled tokens), demonstrating the efficiency and effectiveness of focused pretraining.
Curió-Edu 7B is a 7-billion-parameter LLM designed to examine the effects of targeted data selection in continued pretraining for domain- and language-specific adaptation. Developed as a variant of Curió 7B, it is built upon LLaMA-2 7B [Touvron et al., 2023] and specialized for Brazilian Portuguese through continued pretraining exclusively on a high-quality, education- and STEM-filtered corpus. It is the most extensive Portuguese-centric continued-pretraining effort above the three-billion-parameter scale and provides empirical evidence for the critical impact of semantic data selection, even at large scales (Almeida et al., 14 Dec 2025).
1. Model Architecture and Initialization
Curió-Edu 7B is a decoder-only Transformer initialized from the open-source LLaMA-2 7B checkpoint. The architecture consists of 32 Transformer layers, each with a hidden size of 4096, 32 attention heads, a maximum sequence length of 4096, and RoPE (rotary positional) embeddings. The base LLaMA-2 7B weights were pretrained on approximately two trillion tokens, with only about 0.05% of that data in Portuguese. Continued pretraining thus plays a pivotal role in shifting the model’s distribution toward high-quality Portuguese text, addressing the pronounced under-representation of the target language (Almeida et al., 14 Dec 2025).
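As a minimal illustration (not the authors' released configuration), the architecture described above corresponds to the standard LLaMA-2 7B hyperparameters and can be expressed with the Hugging Face transformers library; the FFN width and vocabulary size below are the usual LLaMA-2 7B defaults, assumed here rather than quoted from the paper.

```python
# Illustrative sketch of the LLaMA-2 7B architecture used as initialization.
from transformers import LlamaConfig, LlamaForCausalLM

config = LlamaConfig(
    hidden_size=4096,              # per-layer hidden dimension
    num_hidden_layers=32,          # 32 decoder-only Transformer layers
    num_attention_heads=32,        # 32 attention heads
    intermediate_size=11008,       # standard LLaMA-2 7B FFN width (assumed default)
    max_position_embeddings=4096,  # 4096-token context with RoPE
    vocab_size=32000,              # LLaMA-2 tokenizer vocabulary (assumed default)
)

# Continued pretraining starts from the released LLaMA-2 7B weights rather than
# a random initialization, e.g.:
# model = LlamaForCausalLM.from_pretrained("meta-llama/Llama-2-7b-hf")
model = LlamaForCausalLM(config)
```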
2. Corpus Construction and Semantic Filtering
Curió-Edu 7B leverages a semantically filtered subset of the ClassiCC-PT corpus, in contrast to its sibling Curió 7B, which uses the full corpus. The ClassiCC-PT corpus contains 100 billion unique Portuguese tokens obtained from cleaned and deduplicated Common Crawl web snapshots, subjected only to language filtering. For Curió-Edu 7B, a strict semantic filter is applied: documents must score at least 2.5 (on a 0–5 scale) for STEM/Education relevance according to domain-specific ClassiCC classifiers. The resulting 10 billion-token subset heavily features textbooks, lecture notes, tutorials, STEM blogs, and academic articles, yielding higher pedagogical quality and reduced noise relative to the broader corpus (Almeida et al., 14 Dec 2025). Only this curated subset is used for Curió-Edu 7B, representing 10% of the full corpus.
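The filtering step itself reduces to a simple threshold on the classifier score. The sketch below uses a hypothetical `stem_edu_score` function as a stand-in for the ClassiCC STEM/Education classifier; the released classifiers' actual interface may differ.

```python
# Minimal sketch of score-threshold filtering for the Curió-Edu subset.
from typing import Iterable, Iterator

STEM_EDU_THRESHOLD = 2.5  # minimum STEM/Education relevance score (0-5 scale)

def stem_edu_score(document: str) -> float:
    """Placeholder for the domain classifier's per-document relevance score."""
    raise NotImplementedError("plug in the ClassiCC STEM/Education classifier here")

def filter_corpus(documents: Iterable[str]) -> Iterator[str]:
    """Yield only documents that clear the STEM/Education relevance threshold."""
    for doc in documents:
        if stem_edu_score(doc) >= STEM_EDU_THRESHOLD:
            yield doc
```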
3. Training Regime and Optimization Parameters
Both Curió 7B and Curió-Edu 7B are continued-pretrained from LLaMA-2 7B under compute- and data-constrained protocols. Curió 7B is exposed to a single epoch over the full 100 billion tokens, whereas Curió-Edu 7B undergoes two epochs over its 10 billion curated tokens (20 billion tokens seen in total). Training uses the Adafactor optimizer [Shazeer & Stern, 2018], with the learning rate decayed from its peak to zero on a cosine schedule. Training is conducted with a global batch size of 256 sequences in mixed precision, using sequence packing for optimal context utilization. Compute resources are allocated on TPU v2-256 pods, with estimated costs of $7,000 for Curió 7B and $1,400 for Curió-Edu 7B (assuming TPU v6 pricing). Both employ the standard autoregressive cross-entropy loss

$$\mathcal{L}(\theta) = -\sum_{t=1}^{T} \log p_\theta\!\left(x_t \mid x_{<t}\right),$$

where $T$ denotes the sequence length and $\theta$ the model parameters (Almeida et al., 14 Dec 2025).
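For concreteness, a minimal PyTorch sketch of this objective on a packed batch is given below; it assumes a Hugging Face-style causal LM interface and is not the authors' TPU training code (which uses Adafactor and sequence packing at scale).

```python
# Next-token cross-entropy over a packed batch of token ids.
import torch
import torch.nn.functional as F

def causal_lm_loss(model, input_ids: torch.Tensor) -> torch.Tensor:
    """input_ids: (batch, seq_len) packed token ids; model returns .logits."""
    logits = model(input_ids).logits      # (batch, seq_len, vocab)
    shift_logits = logits[:, :-1, :]      # predict token t+1 from prefix up to t
    shift_labels = input_ids[:, 1:]
    return F.cross_entropy(
        shift_logits.reshape(-1, shift_logits.size(-1)),
        shift_labels.reshape(-1),
    )
```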
4. Evaluation Benchmarks and Core Results
Performance is evaluated using PoETa V2, a comprehensive Portuguese testbed covering over 40 tasks spanning domains such as Brazilian culture, standardized exams, reasoning, mathematics, ethics, and common sense. Metrics from these varied benchmarks are unified into the Normalized Preferred Metric (NPM), which normalizes task scores to a 0–100 scale.
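The paper's exact normalization constants are not reproduced here, but a common NPM-style aggregation rescales each task so that the random-guessing baseline maps to 0 and the maximum achievable score to 100, then averages across tasks; the sketch below illustrates that assumed formulation.

```python
# Assumed NPM-style aggregation: per-task rescaling followed by a simple mean.
def normalized_score(raw: float, random_baseline: float, max_score: float = 1.0) -> float:
    """Map the random baseline to 0 and the maximum score to 100."""
    return 100.0 * (raw - random_baseline) / (max_score - random_baseline)

def npm(task_results: list[tuple[float, float]]) -> float:
    """task_results: list of (raw_score, random_baseline) pairs, one per task."""
    scores = [normalized_score(raw, rnd) for raw, rnd in task_results]
    return sum(scores) / len(scores)

# Example: a 4-way multiple-choice task (random baseline 0.25) answered with
# 0.55 accuracy contributes normalized_score(0.55, 0.25) == 40.0 to the average.
```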
Comparison of NPM results for the relevant 7B-parameter models is shown below; the first two rows are compute-matched at a budget of 20 billion training tokens:
| Model | Token Exposure | NPM | Absolute Gain (NPM points) | Relative Gain (%) |
|---|---|---|---|---|
| Curió 7B | 20 B (random) | 32.9 | reference | reference |
| Curió-Edu 7B | 20 B (2×10 B) | 37.6 | +4.7 | ≈ 14 |
| Curió 7B | 100 B | 34.6 | – | – |
| LLaMA-2 7B Base | – | 29.2 | – | – |
Curió-Edu 7B, despite using only 20% of the compute (and 10% of the unique data) of the full 100-billion-token Curió 7B run, outperforms all tested models, including the full-corpus variant. In domain-specific subcategories, the following NPM improvements over Curió 7B (random, 20 B tokens) are observed: Exams (+7.8), Brazil (+4.0), Math (+3.8), Ethics (+1.7), with positive gains across all nine largest PoETa V2 categories (Almeida et al., 14 Dec 2025).
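The headline relative gain follows directly from the compute-matched rows of the table:

$$\frac{37.6 - 32.9}{32.9} \approx 0.143 \quad (\approx 14\% \text{ relative NPM gain}).$$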
5. Insights on Data Quality Versus Data Quantity
Learning curves indicate Curió-Edu 7B achieves rapid early gains, surpassing 32 NPM within only 5 billion tokens, whereas Curió 7B requires approximately 30 billion tokens to reach the same level. Ablation experiments under fixed compute budgets demonstrate that targeted semantic filtering consistently outperforms random sampling. Notably, while the 7B-parameter models fully leverage high-quality curated data across domains, smaller 1.1B-parameter variants (Curió 1.1B vs. Curió-Edu 1.1B) show mixed results, with some domains benefiting and others regressing. This suggests that larger models derive greater benefit from focused, high-quality data, while smaller models may require broader data diversity for generalization. Consistent multi-domain improvements and stable learning curves (±0.1 NPM oscillations) reinforce the robustness of these effects, although no formal significance testing is reported (Almeida et al., 14 Dec 2025).
6. Practical Recommendations and Model Release
Continued pretraining on a semantically curated corpus is shown to be more compute-efficient and more effective than naively increasing total data volume in the context of low-resource languages and specialized domains. Recommended strategies for practitioners include: developing lightweight classifiers to tag high-value documents (for educational, legal, medical domains); composing domain-focused data mixtures to guide adaptation; and aligning compute budgets to support multiple passes over high-quality curated data rather than single-pass over massive, noisy corpora.
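As one way to realize the first recommendation, the sketch below trains a simple TF-IDF plus logistic-regression tagger on a small labeled sample; it illustrates the idea of a lightweight document-quality classifier and is not the ClassiCC pipeline itself, whose classifiers are purpose-built.

```python
# Illustrative lightweight tagger for high-value documents (assumed approach).
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.linear_model import LogisticRegression
from sklearn.pipeline import make_pipeline

def train_quality_tagger(texts: list[str], labels: list[int]):
    """labels: 1 for high-value (e.g., educational/STEM) documents, 0 otherwise."""
    tagger = make_pipeline(
        TfidfVectorizer(max_features=50_000, ngram_range=(1, 2)),
        LogisticRegression(max_iter=1000),
    )
    tagger.fit(texts, labels)
    return tagger

# Documents whose predicted probability of being high-value exceeds a chosen
# threshold are kept for the curated pretraining mixture, e.g.:
# keep = [t for t in corpus if tagger.predict_proba([t])[0, 1] >= 0.5]
```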
Curió-Edu 7B is publicly available, including model weights, tokenizer, inference scripts, and licensing/hardware documentation, at https://huggingface.co/collections/ClassiCC-Corpus/curio-edu. The distribution is under an Apache 2.0 license, facilitating downstream use and further research (Almeida et al., 14 Dec 2025).
7. Significance and Outlook
Curió-Edu 7B empirically demonstrates that data quality can surpass mere data quantity in continued pretraining for LLMs, particularly when adapting to low-resource languages such as Portuguese. Using just 10 billion curated STEM/education tokens (with double-pass exposure for 20 billion total), Curió-Edu 7B outperforms a 100 billion-token unfiltered baseline in every major evaluation dimension, using only 20% of the compute of the naive scaling approach. These findings provide a concrete, resource-efficient protocol for tailoring LLMs using targeted semantic filtering, informing both practical deployments and future research directions in efficient language and domain adaptation (Almeida et al., 14 Dec 2025).