ClassiCC-PT Corpus for Portuguese LLM Training
- ClassiCC-PT Corpus is a large-scale, domain-annotated Portuguese text resource that supports both general LLM pretraining and specialized adaptation for education and STEM.
- Its construction uses industrial-scale cleaning, language filtering, deduplication, and semantic annotation to ensure high data quality and relevance.
- Evaluation reveals that semantic filtering significantly improves data efficiency and model performance, as evidenced by lower perplexity on domain-specific subsets.
The ClassiCC-PT Corpus is a large-scale, domain-annotated, and semantically filtered Portuguese-language textual resource developed for continued pretraining of LLMs in Portuguese. Constructed via industrial-scale cleaning of Common Crawl data and augmented with domain labels for educational and STEM relevance, it supports both broad-coverage and specialized adaptation of transformer models for natural language understanding and generation in Portuguese (Almeida et al., 14 Dec 2025). The corpus underpins advances in dialog systems and educational models, and its construction procedure exemplifies state-of-the-art practices in data curation and semantic filtering for LLM pretraining.
1. Corpus Composition and Statistics
The ClassiCC-PT Corpus comprises approximately documents and a total token count . Documents were sourced primarily from the Common Crawl, reflecting the diversity and scale needed for robust LLM adaptation. The corpus includes multiple domain-tagged subsets suitable for domain-adaptive pretraining.
| Domain | Tokens | Percentage (%) |
|---|---|---|
| General web | | |
| News and press | | |
| Social-media style | | |
| Educational material | | |
| STEM content | | |
A curated 10-billion-token educational/STEM subset is also available, comprising roughly 8 billion tokens of educational documents and 2 billion tokens of STEM texts, i.e., 80% and 20% of this semantic slice, respectively.
2. Data Acquisition and Preprocessing
Raw text extraction follows the CCNet/Trafilatura pipeline, emphasizing high-yield Portuguese document retrieval. Boilerplate and irrelevant content are eliminated using the GoldMiner/Justext algorithm. The following preprocessing steps are enforced:
- Minimum document length: tokens; documents longer than tokens are split or truncated.
- Language filtering: documents are retained only when a fastText classifier identifies Portuguese with high confidence.
- Deduplication relies on 9-gram locality-sensitive hashing: near-duplicate pairs whose Jaccard similarity exceeds a fixed threshold are collapsed, and only the shorter or higher-quality page is retained.
- Tokenization is standardized to the LLaMA-2 BPE tokenizer (vocabulary size: 32k).
These steps yield a corpus with minimal redundancy and high language purity, supporting robust downstream model performance.
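The listed filters can be approximated with off-the-shelf tooling. The sketch below is illustrative only, not the authors' pipeline: it assumes the `trafilatura`, `fasttext` (with the public `lid.176.bin` language-ID model), and `datasketch` packages, and the 0.9 confidence and 0.8 similarity thresholds are placeholder values.

```python
import trafilatura                               # main-content extraction (boilerplate removal)
import fasttext                                  # language identification
from datasketch import MinHash, MinHashLSH       # n-gram LSH deduplication

lang_model = fasttext.load_model("lid.176.bin")  # public fastText language-ID model
lsh = MinHashLSH(threshold=0.8, num_perm=128)    # Jaccard threshold is illustrative

def extract_text(html: str) -> str | None:
    """Strip boilerplate and keep the main textual content of a crawled page."""
    return trafilatura.extract(html)

def is_portuguese(text: str, min_conf: float = 0.9) -> bool:
    """Keep a document only if fastText identifies Portuguese with high confidence."""
    labels, probs = lang_model.predict(text.replace("\n", " "), k=1)
    return labels[0] == "__label__pt" and probs[0] >= min_conf

def ngram_minhash(text: str, n: int = 9) -> MinHash:
    """MinHash signature over word 9-grams, mirroring the corpus dedup granularity."""
    words = text.split()
    sig = MinHash(num_perm=128)
    for i in range(max(len(words) - n + 1, 1)):
        sig.update(" ".join(words[i:i + n]).encode("utf-8"))
    return sig

def keep_document(doc_id: str, html: str) -> str | None:
    """Apply extraction, language filtering, and near-duplicate removal in sequence."""
    text = extract_text(html)
    if not text or not is_portuguese(text):
        return None
    sig = ngram_minhash(text)
    if lsh.query(sig):                           # near-duplicate of an already-kept page
        return None
    lsh.insert(doc_id, sig)
    return text
```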
3. Semantic Annotation and Subset Selection
Each corpus document receives real-valued "educational relevance" and "STEM relevance" scores from logistic regression classifiers trained on annotated Portuguese corpora. For the Curió-Edu training set, selection applies a semantic threshold over these scores, yielding the roughly 10-billion-token educational/STEM subset described above. This enables both large-scale general pretraining and highly curated expert-domain continued pretraining for LLMs.
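The document does not specify the feature representation behind these classifiers; the following minimal sketch stands in with scikit-learn TF-IDF features and logistic regression, toy training data, and an illustrative 0.5 selection threshold.

```python
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.linear_model import LogisticRegression
from sklearn.pipeline import make_pipeline

# Toy stand-in for the educational-relevance classifier; a STEM scorer would be analogous.
# In practice the training data comes from annotated Portuguese corpora.
train_texts = ["aula de matemática sobre frações", "promoção de loja online"]  # toy examples
train_labels = [1, 0]                                                          # 1 = educational

edu_scorer = make_pipeline(
    TfidfVectorizer(max_features=100_000, ngram_range=(1, 2)),
    LogisticRegression(max_iter=1000),
)
edu_scorer.fit(train_texts, train_labels)

def edu_score(text: str) -> float:
    """Real-valued educational-relevance score in [0, 1]."""
    return float(edu_scorer.predict_proba([text])[0, 1])

# Illustrative semantic filter over a document collection (threshold is a placeholder).
corpus = ["exercícios de física para o ensino médio", "notícias de celebridades"]
edu_subset = [doc for doc in corpus if edu_score(doc) >= 0.5]
```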
4. Corpus-level Quality Assessment
The ClassiCC-PT Corpus was evaluated for vocabulary coverage and LLM perplexity:
- Vocabulary coverage ($C$) on a 5M-token test set is approximately , with coverage defined as
$$C = \frac{\lvert \{\, t \in T : t \in V \,\} \rvert}{\lvert T \rvert},$$
where $T$ is the sequence of test-set tokens and $V$ is the LLaMA-2 BPE vocabulary.
- Perplexity on the unfiltered corpus is , while on the educational/STEM filtered subset, perplexity falls to . This reduction is indicative of more "learnable" and less noisy text, confirming that the semantic filter amplifies data quality.
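As an illustration of how such perplexity comparisons are typically computed (the document does not give the exact evaluation protocol), the hedged sketch below scores a causal LM over a sample of documents with the Hugging Face transformers API; the model name and sequence length are placeholders.

```python
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

def corpus_perplexity(model_name: str, texts, max_len: int = 1024, device: str = "cuda") -> float:
    """Token-weighted perplexity of a causal LM over a list of documents."""
    tok = AutoTokenizer.from_pretrained(model_name)
    model = AutoModelForCausalLM.from_pretrained(model_name).to(device).eval()
    total_nll, total_preds = 0.0, 0
    with torch.no_grad():
        for text in texts:
            ids = tok(text, return_tensors="pt",
                      truncation=True, max_length=max_len).input_ids.to(device)
            if ids.size(1) < 2:
                continue
            loss = model(ids, labels=ids).loss   # mean NLL over next-token predictions
            n_preds = ids.size(1) - 1
            total_nll += loss.item() * n_preds
            total_preds += n_preds
    return float(torch.exp(torch.tensor(total_nll / total_preds)))

# e.g. corpus_perplexity("meta-llama/Llama-2-7b-hf", filtered_docs)  # model name illustrative
```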
5. Impact on LLM Continued Pretraining
Continued pretraining of LLaMA-2 7B on the full ClassiCC-PT corpus (Curió-7B) versus the 10B-token educational/STEM subset (Curió-Edu-7B) reveals that semantic filtering yields superior data efficiency and end-task performance. Curió-Edu-7B surpasses 32 NPM (Normalized Perplexity Metric) after tokens, whereas the full-corpus model requires nearly tokens to reach the same level. Peak NPM is 36.3 for Curió-Edu-7B, compared to 34.5 for the unfiltered baseline, demonstrating that targeted semantic selection can outweigh sheer corpus size for domain-adapted LLMs, even when pretraining from a minimal Portuguese base (Almeida et al., 14 Dec 2025).
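The comparison above concerns standard causal-LM continued pretraining; a minimal skeleton of such a run with the Hugging Face Trainer is sketched below. Hyperparameters, the dataset path, and the output directory are placeholders, not the configuration used for Curió-7B or Curió-Edu-7B.

```python
from datasets import load_dataset
from transformers import (AutoModelForCausalLM, AutoTokenizer,
                          DataCollatorForLanguageModeling, Trainer, TrainingArguments)

base = "meta-llama/Llama-2-7b-hf"                # base checkpoint (illustrative)
tok = AutoTokenizer.from_pretrained(base)
tok.pad_token = tok.eos_token
model = AutoModelForCausalLM.from_pretrained(base)

# Placeholder data path; in practice this would point at ClassiCC-PT (or its edu/STEM subset).
ds = load_dataset("text", data_files={"train": "classicc_pt_edu/*.txt"})["train"]
ds = ds.map(lambda ex: tok(ex["text"], truncation=True, max_length=2048),
            batched=True, remove_columns=["text"])

trainer = Trainer(
    model=model,
    args=TrainingArguments(output_dir="curio-cpt", per_device_train_batch_size=1,
                           gradient_accumulation_steps=64, learning_rate=2e-5,
                           num_train_epochs=1, bf16=True, logging_steps=100),
    train_dataset=ds,
    data_collator=DataCollatorForLanguageModeling(tok, mlm=False),  # causal-LM objective
)
trainer.train()
```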
6. Relation to Dialogue and Affective Modeling
While the mainline ClassiCC-PT corpus is focused on general and domain-specific continued pretraining, it is important to distinguish the "positively transitioned sentiment dialogue corpus" of the same name (Wang et al., 2022). This resource comprises 67,205 multi-turn Twitter dialogues (302,475 utterances) labeled for explicit transitions from negative or neutral to positive sentiment, constructed to augment open-domain chatbot models with the ability to recognize and actively steer affective state transitions in conversation. Dialogue labeling is achieved by the intersection ("voting" scheme) of BERT-based sentiment prediction (trained on Google Play Store app reviews) and a 1,008-entry emoji-to-sentiment mapping. Cohen's kappa on a held-out sample demonstrates substantial reliability. This dialogue corpus is integral for affective model fine-tuning, as evidenced by improved metrics in emotion-affective chatbot benchmarks (Wang et al., 2022).
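A hedged sketch of the described voting scheme is given below: a transformer sentiment classifier and an emoji-to-sentiment lookup must agree before an utterance receives a label. The checkpoint name and the tiny emoji table are placeholders; the original work used a BERT classifier trained on Google Play Store app reviews and a 1,008-entry emoji mapping.

```python
from transformers import pipeline

# Placeholder checkpoint standing in for the review-trained BERT sentiment model.
clf = pipeline("sentiment-analysis",
               model="distilbert-base-uncased-finetuned-sst-2-english")

# Tiny illustrative slice of the emoji-to-sentiment mapping.
EMOJI_SENTIMENT = {"😀": "POSITIVE", "😢": "NEGATIVE", "😐": "NEUTRAL"}

def emoji_vote(utterance: str) -> str | None:
    """Sentiment suggested by emojis, only when they are unanimous."""
    votes = {EMOJI_SENTIMENT[ch] for ch in utterance if ch in EMOJI_SENTIMENT}
    return votes.pop() if len(votes) == 1 else None

def label_utterance(utterance: str) -> str | None:
    """Label only when classifier and emoji lookup agree (the 'voting' scheme)."""
    model_vote = clf(utterance)[0]["label"]      # e.g. "POSITIVE" / "NEGATIVE"
    lexicon_vote = emoji_vote(utterance)
    return model_vote if model_vote == lexicon_vote else None
```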
7. Availability and Use Cases
ClassiCC-PT and its domain-labeled subsets are publicly released resources suitable for a broad range of NLP tasks, including but not limited to: LLM continued pretraining for Portuguese, educational content generation, STEM question answering, and affective dialog modeling. Pretrained models such as Curió-7B and Curió-Edu-7B are accessible at dedicated repositories.
A plausible implication is that the combination of large-scale, domain-filtered, and semantically labeled language resources embodied by ClassiCC-PT establishes new best practices for resource construction in underrepresented languages and specialized application domains (Almeida et al., 14 Dec 2025, Wang et al., 2022).