ClassiCC-PT Corpus for Portuguese LLM Training
- ClassiCC-PT Corpus is a large-scale, domain-annotated Portuguese text resource that supports both general LLM pretraining and specialized adaptation for education and STEM.
- Its construction uses industrial-scale cleaning, language filtering, deduplication, and semantic annotation to ensure high data quality and relevance.
- Evaluation reveals that semantic filtering significantly improves data efficiency and model performance, as evidenced by lower perplexity on domain-specific subsets.
The ClassiCC-PT Corpus is a large-scale, domain-annotated, and semantically filtered Portuguese-language textual resource developed for continued pretraining of LLMs in Portuguese. Constructed via industrial-scale cleaning of Common Crawl data and augmented with domain labels for educational and STEM relevance, it supports both broad-coverage and specialized adaptation of transformer models for natural language understanding and generation in Portuguese (Almeida et al., 14 Dec 2025). The corpus underpins advances in dialog systems and educational models, and its construction procedure exemplifies state-of-the-art practices in data curation and semantic filtering for LLM pretraining.
1. Corpus Composition and Statistics
The ClassiCC-PT Corpus comprises approximately documents and a total token count . Documents were sourced primarily from the Common Crawl, reflecting the diversity and scale needed for robust LLM adaptation. The corpus includes multiple domain-tagged subsets suitable for domain-adaptive pretraining.
| Domain | Tokens | Percentage (%) |
|---|---|---|
| General web | | |
| News and press | | |
| Social-media style | | |
| Educational material | | |
| STEM content | | |
A curated 10-billion-token educational/STEM subset is also available, comprising roughly 8 billion tokens of educational documents and 2 billion tokens of STEM texts, i.e., 80% and 20% of this semantic slice, respectively.
2. Data Acquisition and Preprocessing
Raw text extraction follows the CCNet/Trafilatura pipeline, emphasizing high-yield Portuguese document retrieval. Boilerplate and irrelevant content are eliminated using the GoldMiner/Justext algorithm. The following preprocessing steps are enforced:
- Minimum document length: tokens; documents longer than tokens are split or truncated.
- Language filtering: documents are retained only when a fastText classifier identifies Portuguese with high confidence.
- Deduplication relies on 9-gram locality-sensitive hashing: near-duplicate pairs whose Jaccard similarity exceeds a fixed threshold are collapsed, and only the shorter or higher-quality page is retained.
- Tokenization is standardized to the LLaMA-2 BPE tokenizer (vocabulary size: 32k).
These steps yield a corpus with minimal redundancy and high language purity, supporting robust downstream model performance.
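The listed filters can be approximated with off-the-shelf tooling. The sketch below is illustrative only, not the authors' pipeline: it assumes the `trafilatura`, `fasttext` (with the public `lid.176.bin` language-ID model), and `datasketch` packages, and the 0.9 confidence and 0.8 similarity thresholds are placeholder values.

```python
import trafilatura                               # main-content extraction (boilerplate removal)
import fasttext                                  # language identification
from datasketch import MinHash, MinHashLSH       # n-gram LSH deduplication

lang_model = fasttext.load_model("lid.176.bin")  # public fastText language-ID model
lsh = MinHashLSH(threshold=0.8, num_perm=128)    # Jaccard threshold is illustrative

def extract_text(html: str) -> str | None:
    """Strip boilerplate and keep the main textual content of a crawled page."""
    return trafilatura.extract(html)

def is_portuguese(text: str, min_conf: float = 0.9) -> bool:
    """Keep a document only if fastText identifies Portuguese with high confidence."""
    labels, probs = lang_model.predict(text.replace("\n", " "), k=1)
    return labels[0] == "__label__pt" and probs[0] >= min_conf

def ngram_minhash(text: str, n: int = 9) -> MinHash:
    """MinHash signature over word 9-grams, mirroring the corpus dedup granularity."""
    words = text.split()
    sig = MinHash(num_perm=128)
    for i in range(max(len(words) - n + 1, 1)):
        sig.update(" ".join(words[i:i + n]).encode("utf-8"))
    return sig

def keep_document(doc_id: str, html: str) -> str | None:
    """Apply extraction, language filtering, and near-duplicate removal in sequence."""
    text = extract_text(html)
    if not text or not is_portuguese(text):
        return None
    sig = ngram_minhash(text)
    if lsh.query(sig):                           # near-duplicate of an already-kept page
        return None
    lsh.insert(doc_id, sig)
    return text
```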
3. Semantic Annotation and Subset Selection
Each corpus document receives real-valued "educational relevance" and "STEM relevance" scores from logistic regression classifiers trained on annotated Portuguese corpora. For the Curió-Edu training set, selection applies a semantic threshold over these scores, yielding the roughly 10-billion-token educational/STEM subset described above. This enables both large-scale general pretraining and highly curated expert-domain continued pretraining for LLMs.
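The document does not specify the feature representation behind these classifiers; the following minimal sketch stands in with scikit-learn TF-IDF features and logistic regression, toy training data, and an illustrative 0.5 selection threshold.

```python
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.linear_model import LogisticRegression
from sklearn.pipeline import make_pipeline

# Toy stand-in for the educational-relevance classifier; a STEM scorer would be analogous.
# In practice the training data comes from annotated Portuguese corpora.
train_texts = ["aula de matemática sobre frações", "promoção de loja online"]  # toy examples
train_labels = [1, 0]                                                          # 1 = educational

edu_scorer = make_pipeline(
    TfidfVectorizer(max_features=100_000, ngram_range=(1, 2)),
    LogisticRegression(max_iter=1000),
)
edu_scorer.fit(train_texts, train_labels)

def edu_score(text: str) -> float:
    """Real-valued educational-relevance score in [0, 1]."""
    return float(edu_scorer.predict_proba([text])[0, 1])

# Illustrative semantic filter over a document collection (threshold is a placeholder).
corpus = ["exercícios de física para o ensino médio", "notícias de celebridades"]
edu_subset = [doc for doc in corpus if edu_score(doc) >= 0.5]
```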
4. Corpus-level Quality Assessment
The ClassiCC-PT Corpus was evaluated for vocabulary coverage and LLM perplexity:
- Vocabulary coverage ($C$) on a 5M-token test set is approximately , with coverage defined as
$$C = \frac{\lvert \{\, t \in T : t \in V \,\} \rvert}{\lvert T \rvert},$$
where $T$ is the sequence of test-set tokens and $V$ is the LLaMA-2 BPE vocabulary.
- Perplexity on the unfiltered corpus is , while on the educational/STEM filtered subset, perplexity falls to . This reduction is indicative of more "learnable" and less noisy text, confirming that the semantic filter amplifies data quality.
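As an illustration of how such perplexity comparisons are typically computed (the document does not give the exact evaluation protocol), the hedged sketch below scores a causal LM over a sample of documents with the Hugging Face transformers API; the model name and sequence length are placeholders.

```python
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

def corpus_perplexity(model_name: str, texts, max_len: int = 1024, device: str = "cuda") -> float:
    """Token-weighted perplexity of a causal LM over a list of documents."""
    tok = AutoTokenizer.from_pretrained(model_name)
    model = AutoModelForCausalLM.from_pretrained(model_name).to(device).eval()
    total_nll, total_preds = 0.0, 0
    with torch.no_grad():
        for text in texts:
            ids = tok(text, return_tensors="pt",
                      truncation=True, max_length=max_len).input_ids.to(device)
            if ids.size(1) < 2:
                continue
            loss = model(ids, labels=ids).loss   # mean NLL over next-token predictions
            n_preds = ids.size(1) - 1
            total_nll += loss.item() * n_preds
            total_preds += n_preds
    return float(torch.exp(torch.tensor(total_nll / total_preds)))

# e.g. corpus_perplexity("meta-llama/Llama-2-7b-hf", filtered_docs)  # model name illustrative
```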
5. Impact on LLM Continued Pretraining
Continued pretraining of LLaMA-2 7B on the full ClassiCC-PT corpus (Curió-7B) versus the 10B-token educational/STEM subset (Curió-Edu-7B) reveals that semantic filtering yields superior data efficiency and end-task performance. Curió-Edu-7B surpasses 32 NPM (Normalized Perplexity Metric) after tokens, whereas the full-corpus model requires nearly tokens to reach the same level. Peak NPM is 36.3 for Curió-Edu-7B, compared to 34.5 for the unfiltered baseline, demonstrating that targeted semantic selection can outweigh sheer corpus size for domain-adapted LLMs, even when pretraining from a minimal Portuguese base (Almeida et al., 14 Dec 2025).
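The comparison above concerns standard causal-LM continued pretraining; a minimal skeleton of such a run with the Hugging Face Trainer is sketched below. Hyperparameters, the dataset path, and the output directory are placeholders, not the configuration used for Curió-7B or Curió-Edu-7B.

```python
from datasets import load_dataset
from transformers import (AutoModelForCausalLM, AutoTokenizer,
                          DataCollatorForLanguageModeling, Trainer, TrainingArguments)

base = "meta-llama/Llama-2-7b-hf"                # base checkpoint (illustrative)
tok = AutoTokenizer.from_pretrained(base)
tok.pad_token = tok.eos_token
model = AutoModelForCausalLM.from_pretrained(base)

# Placeholder data path; in practice this would point at ClassiCC-PT (or its edu/STEM subset).
ds = load_dataset("text", data_files={"train": "classicc_pt_edu/*.txt"})["train"]
ds = ds.map(lambda ex: tok(ex["text"], truncation=True, max_length=2048),
            batched=True, remove_columns=["text"])

trainer = Trainer(
    model=model,
    args=TrainingArguments(output_dir="curio-cpt", per_device_train_batch_size=1,
                           gradient_accumulation_steps=64, learning_rate=2e-5,
                           num_train_epochs=1, bf16=True, logging_steps=100),
    train_dataset=ds,
    data_collator=DataCollatorForLanguageModeling(tok, mlm=False),  # causal-LM objective
)
trainer.train()
```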
6. Relation to Dialogue and Affective Modeling
While the mainline ClassiCC-PT corpus is focused on general and domain-specific continued pretraining, it is important to distinguish the "positively transitioned sentiment dialogue corpus" of the same name (Wang et al., 2022). This resource comprises 67,205 multi-turn Twitter dialogues (302,475 utterances) labeled for explicit transitions from negative or neutral to positive sentiment, constructed to augment open-domain chatbot models with the ability to recognize and actively steer affective state transitions in conversation. Dialogue labeling is achieved by the intersection ("voting" scheme) of BERT-based sentiment prediction (trained on Google Play Store app reviews) and a 1,008-entry emoji-to-sentiment mapping. Cohen's kappa on a held-out sample demonstrates substantial reliability. This dialogue corpus is integral for affective model fine-tuning, as evidenced by improved metrics in emotion-affective chatbot benchmarks (Wang et al., 2022).
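A hedged sketch of the described voting scheme is given below: a transformer sentiment classifier and an emoji-to-sentiment lookup must agree before an utterance receives a label. The checkpoint name and the tiny emoji table are placeholders; the original work used a BERT classifier trained on Google Play Store app reviews and a 1,008-entry emoji mapping.

```python
from transformers import pipeline

# Placeholder checkpoint standing in for the review-trained BERT sentiment model.
clf = pipeline("sentiment-analysis",
               model="distilbert-base-uncased-finetuned-sst-2-english")

# Tiny illustrative slice of the emoji-to-sentiment mapping.
EMOJI_SENTIMENT = {"😀": "POSITIVE", "😢": "NEGATIVE", "😐": "NEUTRAL"}

def emoji_vote(utterance: str) -> str | None:
    """Sentiment suggested by emojis, only when they are unanimous."""
    votes = {EMOJI_SENTIMENT[ch] for ch in utterance if ch in EMOJI_SENTIMENT}
    return votes.pop() if len(votes) == 1 else None

def label_utterance(utterance: str) -> str | None:
    """Label only when classifier and emoji lookup agree (the 'voting' scheme)."""
    model_vote = clf(utterance)[0]["label"]      # e.g. "POSITIVE" / "NEGATIVE"
    lexicon_vote = emoji_vote(utterance)
    return model_vote if model_vote == lexicon_vote else None
```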
7. Availability and Use Cases
ClassiCC-PT and its domain-labeled subsets are publicly released resources suitable for a broad range of NLP tasks, including but not limited to: LLM continued pretraining for Portuguese, educational content generation, STEM question answering, and affective dialog modeling. Pretrained models such as Curió-7B and Curió-Edu-7B are accessible at dedicated repositories.
A plausible implication is that the combination of large-scale, domain-filtered, and semantically labeled language resources embodied by ClassiCC-PT establishes new best practices for resource construction in underrepresented languages and specialized application domains (Almeida et al., 14 Dec 2025, Wang et al., 2022).