
ClassiCC-PT Corpus for Portuguese LLM Training

Updated 21 December 2025
  • ClassiCC-PT Corpus is a large-scale, domain-annotated Portuguese text resource that supports both general LLM pretraining and specialized adaptation for education and STEM.
  • Its construction uses industrial-scale cleaning, language filtering, deduplication, and semantic annotation to ensure high data quality and relevance.
  • Evaluation reveals that semantic filtering significantly improves model data efficiency and performance, as evidenced by lower perplexity in domain-specific subsets.

The ClassiCC-PT Corpus is a large-scale, domain-annotated, and semantically filtered Portuguese-language textual resource developed for effective continued pretraining of LLMs in Portuguese. Constructed via industrial-scale cleaning of Common Crawl data and augmented with domain labels for education and STEM relevance, it is designed to support both broad-coverage and specialized adaptation of transformer models, particularly for high-fidelity natural language understanding and generation in the Portuguese language (Almeida et al., 14 Dec 2025). The corpus underpins advances in dialog systems and educational models, and its construction procedure epitomizes state-of-the-art practices in data curation and semantic filtering for LLM pretraining.

1. Corpus Composition and Statistics

The ClassiCC-PT Corpus comprises $N_{\text{docs}} \approx 30 \times 10^6$ documents with a total token count of $N_{\text{total}} = 120 \times 10^9$. Documents were sourced primarily from Common Crawl, reflecting the diversity and scale needed for robust LLM adaptation. The corpus includes multiple domain-tagged subsets suitable for domain-adaptive pretraining.

| Domain | Tokens ($N_i$) | Percentage ($p_i$) |
| --- | --- | --- |
| General web | $72 \times 10^9$ | 60% |
| News and press | $9.6 \times 10^9$ | 8% |
| Social-media style | $8.4 \times 10^9$ | 7% |
| Educational material | $20 \times 10^9$ | 16.7% |
| STEM content | $10 \times 10^9$ | 8.3% |

A curated 10-billion-token educational/STEM subset is also available, comprising $8 \times 10^9$ tokens of educational documents and $2 \times 10^9$ tokens of STEM texts, representing 80% and 20% of this semantic slice, respectively.
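
As a quick sanity check, the reported domain mix can be reproduced from the table above. The following minimal sketch (plain Python, token counts restated in billions) verifies that the subsets sum to $120 \times 10^9$ tokens and recovers the listed percentages and the 80/20 split of the curated slice.

```python
# Sanity check of the ClassiCC-PT domain mix reported in the table above.
# Token counts are restated from the table, in billions of tokens.
domain_tokens_b = {
    "General web": 72.0,
    "News and press": 9.6,
    "Social-media style": 8.4,
    "Educational material": 20.0,
    "STEM content": 10.0,
}

total_b = sum(domain_tokens_b.values())
assert abs(total_b - 120.0) < 1e-9  # matches N_total = 120e9 tokens

for domain, tokens_b in domain_tokens_b.items():
    print(f"{domain:22s} {tokens_b:6.1f}B  {100 * tokens_b / total_b:5.1f}%")

# Curated educational/STEM slice: 8B edu + 2B STEM = 10B tokens (80% / 20%).
edu_b, stem_b = 8.0, 2.0
print(f"edu/STEM split of slice: {100 * edu_b / (edu_b + stem_b):.0f}% / "
      f"{100 * stem_b / (edu_b + stem_b):.0f}%")
```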

2. Data Acquisition and Preprocessing

Raw text extraction follows the CCNet/Trafilatura pipeline, emphasizing high-yield Portuguese document retrieval. Boilerplate and irrelevant content are eliminated using the GoldMiner/Justext algorithm. The following preprocessing steps are enforced:

  • Minimum document length: $L_{\min} = 50$ tokens; documents longer than $L_{\max} = 2000$ tokens are split or truncated.
  • Language filtering: documents are retained only if $P(\text{lang} = \text{pt}) > 0.95$ according to a fastText classifier, ensuring high-confidence Portuguese identification.
  • Deduplication relies on 9-gram locality-sensitive hashing: near-duplicate pairs with Jaccard similarity exceeding $S_{\max} = 0.8$ are collapsed, retaining only the shorter or higher-quality page.
  • Tokenization is standardized to the LLaMA-2 BPE tokenizer (vocabulary size: 32k).

These steps yield a corpus with minimal redundancy and high language purity, supporting robust downstream model performance.
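
As a concrete illustration of these steps, the sketch below chains the length, language, and near-duplicate filters. It is not the authors' pipeline: the public fastText lid.176 language-identification model and datasketch's MinHash-LSH are used here as stand-ins for the corresponding stages, and the MinHash parameters are illustrative.

```python
# Sketch of the length, language, and near-duplicate filters described above.
# NOT the authors' pipeline: fastText lid.176 and datasketch MinHash-LSH are
# stand-ins for the corresponding stages; parameters are illustrative.
import fasttext                              # pip install fasttext
from datasketch import MinHash, MinHashLSH   # pip install datasketch

LANG_MODEL = fasttext.load_model("lid.176.bin")  # public fastText LID model
L_MIN, L_MAX = 50, 2000    # document length bounds (tokens)
P_MIN_PT = 0.95            # Portuguese-confidence threshold
S_MAX = 0.8                # Jaccard similarity threshold for deduplication

def token_9grams(tokens, n=9):
    """Shingle a token sequence into 9-grams for MinHash."""
    return {" ".join(tokens[i:i + n]) for i in range(max(len(tokens) - n + 1, 1))}

def keep_document(text, doc_id, lsh):
    tokens = text.split()
    # 1. Length filter: drop too-short documents, truncate overly long ones
    #    (the corpus pipeline may instead split long documents).
    if len(tokens) < L_MIN:
        return False
    tokens = tokens[:L_MAX]
    # 2. Language filter: keep only high-confidence Portuguese.
    labels, probs = LANG_MODEL.predict(text.replace("\n", " "))
    if labels[0] != "__label__pt" or float(probs[0]) <= P_MIN_PT:
        return False
    # 3. Near-duplicate filter: 9-gram MinHash; drop the document if a
    #    previously indexed page is estimated to exceed Jaccard 0.8.
    mh = MinHash(num_perm=128)
    for gram in token_9grams(tokens):
        mh.update(gram.encode("utf-8"))
    if lsh.query(mh):
        return False
    lsh.insert(doc_id, mh)
    return True

lsh = MinHashLSH(threshold=S_MAX, num_perm=128)
# Example usage over an iterable of raw documents:
# kept = [d for i, d in enumerate(documents) if keep_document(d, f"doc-{i}", lsh)]
```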

3. Semantic Annotation and Subset Selection

Each corpus document receives real-valued "educational relevance" ($s_{\text{edu}}$) and "STEM relevance" ($s_{\text{stem}}$) scores from logistic regression classifiers trained on annotated Portuguese corpora. For the Curió-Edu training set, selection is performed with the semantic filter $s = \max(s_{\text{edu}}, s_{\text{stem}}) > 2.5$, yielding $N_{\text{edu-subset}} = 10 \times 10^9$ tokens. This enables both large-scale general pretraining and highly curated expert-domain continued pretraining for LLMs.
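
In code, the selection rule is straightforward. The sketch below is an assumption-laden stand-in: the two relevance scorers are modeled as TF-IDF plus logistic-regression pipelines whose decision_function output serves as the real-valued score; the authors' classifiers, features, and score scale may differ, and only the filter $s = \max(s_{\text{edu}}, s_{\text{stem}}) > 2.5$ is taken from the text.

```python
# Sketch of the semantic subset selection. The scorers are stand-ins
# (TF-IDF + logistic regression, decision_function as the real-valued score);
# only the selection rule max(s_edu, s_stem) > 2.5 comes from the corpus spec.
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.linear_model import LogisticRegression
from sklearn.pipeline import make_pipeline

S_THRESHOLD = 2.5  # semantic filter threshold from the corpus description

def train_scorer(texts, labels):
    """Fit a simple relevance scorer on annotated Portuguese documents."""
    scorer = make_pipeline(
        TfidfVectorizer(max_features=200_000),
        LogisticRegression(max_iter=1000),
    )
    scorer.fit(texts, labels)
    return scorer

def relevance_score(scorer, text):
    # decision_function yields a real-valued margin, used here as the score.
    return float(scorer.decision_function([text])[0])

def in_edu_stem_subset(text, edu_scorer, stem_scorer):
    s = max(relevance_score(edu_scorer, text), relevance_score(stem_scorer, text))
    return s > S_THRESHOLD
```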

4. Corpus-level Quality Assessment

The ClassiCC-PT Corpus was evaluated for vocabulary coverage and LLM perplexity:

  • Vocabulary coverage ($C$) on a 5M-token test set is approximately $92\%$, with coverage defined as

$$C = \frac{|\mathrm{types}(\text{test}) \cap \mathcal{V}|}{|\mathrm{types}(\text{test})|} \times 100\%,$$

where $\mathcal{V}$ is the LLaMA-2 BPE vocabulary.

  • Perplexity on the unfiltered corpus is $\mathrm{PPL}_{\mathrm{full}} \approx 28.3$, while on the educational/STEM-filtered subset it falls to $\mathrm{PPL}_{\mathrm{edu}} \approx 24.1$. This reduction indicates more "learnable", less noisy text, confirming that the semantic filter improves data quality.
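
Both corpus-level metrics can be approximated with a short script. The sketch below is illustrative rather than the authors' evaluation code: "types" are taken to be whitespace word types (the normalization is an assumption), $\mathcal{V}$ is loaded from a Hugging Face LLaMA-2 tokenizer with the SentencePiece word-boundary marker stripped, and perplexity is the exponentiated mean token-level cross-entropy of an assumed causal LM over sampled documents.

```python
# Illustrative computation of the two quality metrics above (not the authors'
# evaluation code). Assumptions: "types" = whitespace word types; V = LLaMA-2
# BPE vocabulary via Hugging Face; PPL = exp(mean token-level NLL) of a causal LM.
import math
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

MODEL_NAME = "meta-llama/Llama-2-7b-hf"  # assumed model for V and PPL scoring
tok = AutoTokenizer.from_pretrained(MODEL_NAME)

def vocabulary_coverage(test_docs):
    """C = |types(test) ∩ V| / |types(test)| * 100%."""
    # Strip the SentencePiece word-boundary marker so vocabulary entries
    # compare against plain word types.
    vocab = {t.lstrip("▁") for t in tok.get_vocab()}
    types = {w for doc in test_docs for w in doc.split()}
    return 100.0 * len(types & vocab) / len(types)

@torch.no_grad()
def corpus_perplexity(docs, max_len=1024):
    model = AutoModelForCausalLM.from_pretrained(MODEL_NAME).eval()
    total_nll, total_tokens = 0.0, 0
    for doc in docs:
        ids = tok(doc, return_tensors="pt",
                  truncation=True, max_length=max_len).input_ids
        if ids.size(1) < 2:
            continue
        out = model(ids, labels=ids)   # loss = mean NLL over the shifted targets
        n = ids.size(1) - 1
        total_nll += out.loss.item() * n
        total_tokens += n
    return math.exp(total_nll / total_tokens)
```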

5. Impact on LLM Continued Pretraining

Continued pretraining of LLaMA-2 7B on the full ClassiCC-PT corpus (Curió-7B) versus the 10B-token educational/STEM subset (Curió-Edu-7B) reveals that semantic filtering yields superior data efficiency and end-task performance. Curió-Edu-7B surpasses 32 NPM (Normalized Preferred Metric) after $5 \times 10^9$ tokens, whereas the full-corpus model requires nearly $80 \times 10^9$ tokens to reach the same level. Peak NPM is 36.3 for Curió-Edu-7B, compared to 34.5 for the unfiltered baseline, demonstrating that targeted semantic selection can outweigh sheer corpus size for domain-adapted LLMs, even when pretraining from a minimal Portuguese base (Almeida et al., 14 Dec 2025).

6. Relation to Dialogue and Affective Modeling

While the mainline ClassiCC-PT corpus is focused on general and domain-specific continued pretraining, it is important to distinguish the "positively transitioned sentiment dialogue corpus" of the same name (Wang et al., 2022). This resource comprises 67,205 multi-turn Twitter dialogues (302,475 utterances) labeled for explicit transitions from negative or neutral to positive sentiment, and was constructed to augment open-domain chatbot models with the ability to recognize and actively steer affective state transitions in conversation. Dialogue labeling is achieved by the intersection ("voting" scheme) of BERT-based sentiment prediction (trained on Google Play Store app reviews) and a 1,008-entry emoji-to-sentiment mapping. Cohen's $\kappa \approx 0.75$ on a held-out sample demonstrates substantial reliability. This dialogue corpus is integral for affective model fine-tuning, as evidenced by improved metrics in emotion-affective chatbot benchmarks (Wang et al., 2022).
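
A minimal sketch of this voting scheme follows. The sentiment model and the emoji table are placeholders, not the resources used by Wang et al. (2022), and the treatment of emoji-free utterances is an assumption.

```python
# Sketch of the "voting" labeling scheme: keep a sentiment label only when the
# BERT-based classifier and the emoji-to-sentiment lookup agree. The model and
# the (tiny) emoji table below are placeholders, not Wang et al.'s resources.
from transformers import pipeline

# Stand-in multilingual review-sentiment model (the original classifier was a
# BERT model trained on Google Play Store app reviews).
bert_sentiment = pipeline(
    "sentiment-analysis",
    model="nlptown/bert-base-multilingual-uncased-sentiment",
)

# Illustrative slice of the 1,008-entry emoji-to-sentiment mapping.
EMOJI_SENTIMENT = {"😊": "positive", "😍": "positive", "😠": "negative", "😐": "neutral"}

def emoji_vote(utterance):
    votes = [EMOJI_SENTIMENT[ch] for ch in utterance if ch in EMOJI_SENTIMENT]
    return max(set(votes), key=votes.count) if votes else None

def bert_vote(utterance):
    stars = int(bert_sentiment(utterance)[0]["label"][0])  # "1 star" .. "5 stars"
    return "negative" if stars <= 2 else "neutral" if stars == 3 else "positive"

def voted_label(utterance):
    b, e = bert_vote(utterance), emoji_vote(utterance)
    # Keep the label only when both sources agree; utterances without emojis
    # fall back to the BERT label (an assumption about the original scheme).
    if e is None:
        return b
    return b if b == e else None
```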

7. Availability and Use Cases

ClassiCC-PT and its domain-labeled subsets are publicly released resources suitable for a broad range of NLP tasks, including but not limited to: LLM continued pretraining for Portuguese, educational content generation, STEM question answering, and affective dialog modeling. Pretrained models such as Curió-7B and Curió-Edu-7B are accessible at dedicated repositories.

A plausible implication is that the combination of large-scale, domain-filtered, and semantically labeled language resources embodied by ClassiCC-PT establishes new best practices for resource construction in underrepresented languages and specialized application domains (Almeida et al., 14 Dec 2025, Wang et al., 2022).
