Curió 7B: Specialized Portuguese LLM
- Curió 7B is a large language model specialized for Portuguese that leverages a decoder-only transformer architecture with 32 layers.
- It employs continued pretraining on 100B tokens from a curated Portuguese corpus, leading to improved performance on benchmarks like PoETa V2.
- Targeted data filtering, evidenced by the superior results of Curió-Edu 7B, shows that data quality can outweigh sheer data volume in language adaptation.
Curió 7B is an LLM specialized for Portuguese, representing the largest Portuguese-only continued-pretraining effort above the three-billion-parameter scale as of its release. Built on Meta’s open-source LLaMA-2 7B, Curió 7B is the result of continued pretraining on a massive, filtered Portuguese corpus, offering a reference point for the comparative effectiveness of data volume and quality in language adaptation tasks (Almeida et al., 14 Dec 2025).
1. Model Architecture and Initialization
Curió 7B is directly initialized from the LLaMA-2 7B parameter set, employing a standard decoder-only transformer architecture:
- Layers: 32 transformer blocks
- Hidden dimension: 4096
- Feed-forward inner dimension: 11,008
- Attention heads: 32
- Rotary positional embeddings: Used for token position encoding
- Vocabulary: Approximately 32k tokens
- Masking: Standard causal masking
There are no modifications, ablations, or structural changes relative to the original LLaMA-2 7B specification; all core attention, embedding, and feed-forward layers are retained unchanged. The model thus preserves the performance and scaling properties of its predecessor while gaining new linguistic capabilities through continued pretraining in Portuguese [(Almeida et al., 14 Dec 2025), Sec. 1].
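For concreteness, the hyperparameters above correspond to the standard LLaMA-2 7B shape. The following is a minimal sketch of an equivalent configuration using the Hugging Face `transformers` `LlamaConfig`; it is illustrative only and not the authors' training code, and in practice continued pretraining starts from the released LLaMA-2 7B checkpoint rather than a random initialization.

```python
# Illustrative sketch: a LLaMA-2 7B-shaped configuration matching the
# hyperparameters listed above (not the authors' actual training setup).
from transformers import LlamaConfig, LlamaForCausalLM

config = LlamaConfig(
    vocab_size=32_000,             # ~32k-token SentencePiece vocabulary
    hidden_size=4096,              # hidden dimension
    intermediate_size=11_008,      # feed-forward inner dimension
    num_hidden_layers=32,          # 32 transformer blocks
    num_attention_heads=32,        # 32 attention heads
    max_position_embeddings=4096,  # matches the 4096-token training context
    rope_theta=10_000.0,           # rotary positional embeddings
)

# Continued pretraining would instead load the released weights, e.g.:
# model = LlamaForCausalLM.from_pretrained("meta-llama/Llama-2-7b-hf")
model = LlamaForCausalLM(config)  # randomly initialized, same architecture
print(sum(p.numel() for p in model.parameters()) / 1e9, "B parameters")
```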
2. Continued-Pretraining Methodology
Curió 7B employs continued pretraining—a practice that leverages an existing large-scale model and extends its capabilities by training with additional, often linguistically or domain-targeted data.
2.1. Corpus Construction
- Source: ClassiCC-PT, a cleaned, deduplicated Portuguese web-crawl corpus, originally ~120B tokens
- Selection: 100B unique tokens sampled to cover diverse genres, including news, forums, academic writing, legal texts, and code
- Filtering: Each document is scored for STEM/education relevance using a pretrained classifier; standard Curió 7B does not filter on this dimension
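As a rough illustration of this pipeline, the sketch below samples documents up to a 100B-token budget and attaches an education-relevance score to each; `count_tokens` and `edu_scorer` are hypothetical stand-ins for the LLaMA-2 tokenizer and the ClassiCC-PT classifier, which are not reproduced here.

```python
# Hypothetical sketch of the corpus-construction step described above:
# sample ~100B unique tokens from the ~120B-token ClassiCC-PT crawl and
# annotate each document with a 0-5 STEM/education score.
from typing import Callable, Iterable, Iterator

TOKEN_BUDGET = 100_000_000_000  # ~100B tokens retained for pretraining

def build_corpus(
    documents: Iterable[dict],            # deduplicated ClassiCC-PT documents
    count_tokens: Callable[[str], int],   # e.g. LLaMA-2 tokenizer length (placeholder)
    edu_scorer: Callable[[str], float],   # pretrained classifier, 0-5 scale (placeholder)
) -> Iterator[dict]:
    """Yield annotated documents until the 100B-token budget is reached."""
    seen = 0
    for doc in documents:
        n = count_tokens(doc["text"])
        if seen + n > TOKEN_BUDGET:
            break
        seen += n
        # The score is only recorded here; the standard Curió 7B run keeps
        # every sampled document regardless of its education score.
        yield {**doc, "edu_score": edu_scorer(doc["text"]), "n_tokens": n}
```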
2.2. Hyperparameters and Compute
- Sequence length: 4096 tokens with sequence packing
- Global batch size: 256 packed sequences (across data-parallel replicas)
- Optimizer: Adafactor, no weight decay
- Learning rate: Cosine schedule decaying from its peak value to zero
- Precision: Mixed (bfloat16/float32)
- Platform: TPU v2-256 nodes, T5x framework
- Compute: Approximately 7,000 TPU-v6–equivalent hours for 100B tokens
No additional architectural or regularization modifications were introduced; training follows the defaults of the LLaMA-2 codebase. The loss is standard cross-entropy, with perplexity monitored throughout [(Almeida et al., 14 Dec 2025), Sec. 2.3–2.4].
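Because training ran on TPUs with the T5x framework, the optimizer setup can be sketched with JAX's Optax library. This is a hedged illustration of the stated recipe (Adafactor without weight decay, cosine decay to zero); the peak learning rate is a placeholder, since its exact value is not restated here.

```python
# Illustrative Optax sketch of the optimization setup described above.
# PEAK_LR is a placeholder; the exact value comes from the paper/T5x configs.
import optax

SEQ_LEN = 4096          # packed sequence length
GLOBAL_BATCH = 256      # sequences per step across data-parallel replicas
TOKEN_BUDGET = 100_000_000_000
TOTAL_STEPS = TOKEN_BUDGET // (SEQ_LEN * GLOBAL_BATCH)  # ~95k steps

PEAK_LR = 1e-4          # placeholder peak learning rate

# Cosine decay from the peak value all the way to zero over the full run.
lr_schedule = optax.cosine_decay_schedule(
    init_value=PEAK_LR,
    decay_steps=TOTAL_STEPS,
    alpha=0.0,
)

# Adafactor with no weight decay, as in the stated recipe.
optimizer = optax.adafactor(learning_rate=lr_schedule, weight_decay_rate=None)
```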
3. Evaluation and Benchmarking
3.1. Benchmark Suite
Performance is assessed on PoETa V2 (Almeida et al. 2025), a unified benchmark for Portuguese spanning over 40 tasks across nine categories, including cultural reasoning, mathematics, text understanding, exams, social media, and general knowledge.
- Primary metric: Normalized Preferred Metric (NPM), mapping task-specific scores to the [0,100] range for consistent summarization [(Almeida et al., 14 Dec 2025), Sec. 3.1].
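A minimal sketch of the NPM aggregation follows, under the common definition in which each task's random-chance baseline maps to 0 and its maximum attainable score to 100; the per-task baselines themselves are fixed by PoETa V2 and are only illustrated here.

```python
# Minimal sketch of the Normalized Preferred Metric (NPM) aggregation:
# each task's preferred metric is linearly rescaled so that the task's
# random-chance baseline maps to 0 and its maximum score to 100, then the
# rescaled scores are averaged across tasks. Values here are illustrative.

def npm(raw_score: float, random_baseline: float, max_score: float) -> float:
    """Linearly map a task score onto the 0-100 NPM scale."""
    return 100.0 * (raw_score - random_baseline) / (max_score - random_baseline)

def aggregate_npm(task_results: list[tuple[float, float, float]]) -> float:
    """Average NPM over (raw_score, random_baseline, max_score) triples."""
    scores = [npm(*t) for t in task_results]
    return sum(scores) / len(scores)

# Example: a 4-way multiple-choice task (random accuracy 0.25) answered with
# 0.55 accuracy contributes npm(0.55, 0.25, 1.0) = 40.0 to the average.
```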
3.2. Aggregate Results
| Model | Tokens seen | Compute (TPU hours) | PoETa V2 NPM |
|---|---|---|---|
| LLaMA-2 7B (base) | — | — | 29.2 |
| Curió 7B | 100B | 7,000 | 34.5 |
| Curió-Edu 7B | 20B | 1,400 | 37.6 |
Curió 7B improves over its initialization baseline by 5.3 NPM, a level reached after seeing roughly 80B of its 100B training tokens. Notably, the educationally filtered Curió-Edu 7B, trained on only one-fifth of the data and compute, surpasses Curió 7B, reaching 37.6 NPM and converging faster (32 NPM after just 5B tokens) [(Almeida et al., 14 Dec 2025), Table 1; Sec. 4.1].
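A back-of-the-envelope comparison of the figures above makes the efficiency gap explicit; the numbers are taken from Table 1 and the stated compute estimates, and "NPM per 1k TPU hours" is simply an illustrative derived ratio.

```python
# Back-of-the-envelope efficiency comparison using the figures above.
base_npm = 29.2  # LLaMA-2 7B before adaptation

curio = {"npm": 34.5, "tokens_b": 100, "tpu_hours": 7_000}
curio_edu = {"npm": 37.6, "tokens_b": 20, "tpu_hours": 1_400}

for name, run in [("Curió 7B", curio), ("Curió-Edu 7B", curio_edu)]:
    gain = run["npm"] - base_npm
    print(f"{name}: +{gain:.1f} NPM, "
          f"{gain / (run['tpu_hours'] / 1000):.2f} NPM per 1k TPU hours")

# Output:
# Curió 7B: +5.3 NPM, 0.76 NPM per 1k TPU hours
# Curió-Edu 7B: +8.4 NPM, 6.00 NPM per 1k TPU hours
```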
3.3. Task Breakdown
Per-category results indicate Curió-Edu 7B yields the highest gains in Exams (+7.8) and General Knowledge (+8.9), with consistent improvements across Reasoning, Ethics, and Brazil-specific cultural tasks [(Almeida et al., 14 Dec 2025), Sec. 4.2, Fig. 2].
4. Data Selection: Quantity Versus Quality
4.1. Filtering Criteria
Curió-Edu 7B is trained on the subset

$$
\mathcal{D}_{\text{edu}} = \{\, d \in \mathcal{D}_{\text{ClassiCC-PT}} \;:\; s_{\text{edu}}(d) \geq \tau \,\},
$$

where $s_{\text{edu}}(d)$ is the ClassiCC-PT STEM/education score of document $d$ (on a 0–5 scale) and $\tau$ is the selection threshold. The resulting subset contains 10B tokens, repeated for two epochs (20B tokens total) [(Almeida et al., 14 Dec 2025), Sec. 4.3].
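Operationally, the subset construction is a threshold filter followed by two-epoch repetition. The sketch below assumes documents carry the `edu_score` field from the annotation sketch in Section 2.1; the threshold value `TAU` is a hypothetical stand-in, not the paper's actual cut-off.

```python
# Illustrative sketch of assembling the Curió-Edu training stream: keep only
# documents whose ClassiCC-PT STEM/education score clears the threshold, then
# repeat the retained ~10B tokens for two epochs (~20B tokens total).
from typing import Iterable, Iterator

TAU = 3.0     # hypothetical placeholder threshold on the 0-5 scale
EPOCHS = 2    # the 10B-token subset is seen twice (20B tokens total)

def edu_subset(annotated_docs: Iterable[dict]) -> list[dict]:
    """Select documents with edu_score >= TAU (the D_edu set defined above)."""
    return [doc for doc in annotated_docs if doc["edu_score"] >= TAU]

def training_stream(annotated_docs: Iterable[dict]) -> Iterator[dict]:
    """Yield the filtered subset twice, matching the two-epoch schedule."""
    subset = edu_subset(annotated_docs)
    for _ in range(EPOCHS):
        yield from subset
```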
4.2. Regimes Compared
- Data-constrained: Although Curió 7B sees five times as many tokens (100B vs. 20B), Curió-Edu 7B outperforms it across every PoETa V2 category.
- Compute-constrained: With both models trained on 20B tokens, Curió-Edu 7B leads by ~4.7 NPM, implying that semantic filtering offers a more effective adaptation route than random sampling under a fixed computational budget [(Almeida et al., 14 Dec 2025), Sec. 4.1–4.3].
A plausible implication is that data curation with an explicit semantic focus can accelerate adaptation even when the base model has had extremely limited target-language exposure (Portuguese accounted for only about 0.01% of the base model's pretraining data).
5. Comparative Scale and Generalization Insights
The 7B parameter scale is consequential: Curió 7B and Curió-Edu 7B both benefit from continued pretraining, but targeted data filtering yields especially clear improvements at this size. The same strategy applied to a 1.1B model exhibits more variable gains, suggesting a scale-dependent effect for curation efficacy [(Almeida et al., 14 Dec 2025), Sec. 5].
Continued pretraining with targeted, domain-relevant datasets is identified as a cost-effective adaptation approach for large, underrepresented languages. The main limitation remains corpus coverage and the granularity of selection thresholds. Future research directions outlined include multi-phase curricula (broad corpus pretraining followed by focused expert data) and systematic investigation of how filtering interacts with model capacity [(Almeida et al., 14 Dec 2025), Sec. 5].
6. Implications and Future Directions
The Curió 7B experiments constitute a critical case study on the quantitative and qualitative limits of continued pretraining for language adaptation. Key empirical conclusions:
- Data quality can outweigh quantity; filtered, specialized data enable superior performance with reduced compute and data requirements.
- No architectural changes are necessary to realize significant adaptation gains through curation.
- Finer-grained quality metrics or dynamic curricula may unlock further efficiency in future large model adaptation regimes.
- The paradigm holds practical potential for other underrepresented languages or domain-specific deployments, contingent on the availability of large, classified corpora.
These insights position Curió 7B, and specifically its Curió-Edu 7B variant, as archetypes for domain-focused LLM adaptation, underlining the competitive advantage of explicit data selection over brute-force data scaling (Almeida et al., 14 Dec 2025).