
Aleph-Alpha-GermanWeb Curation Pipeline

Updated 27 December 2025
  • Aleph-Alpha-GermanWeb is a comprehensive German-language dataset curated through meticulous filtering, synthetic augmentation, and quality classification.
  • The pipeline stages include Common Crawl filtering with language identification, extensive deduplication, and synthetic document generation using advanced LLM prompts.
  • Its scalable architecture and rigorous quality assessment yield significant performance gains in German-language LLM pre-training benchmarks.

Aleph-Alpha-GermanWeb is a large-scale German-language dataset designed for pre-training LLMs, constructed through a multi-stage curation pipeline that integrates heuristic and model-based filtering with substantial synthetic data generation. Its architecture leverages diverse web sources and advanced quality-control mechanisms to balance data scale against quality, yielding significant performance improvements on German-language modeling benchmarks (Burns et al., 24 Apr 2025).

1. Pipeline Architecture and Workflow

The Aleph-Alpha-GermanWeb curation pipeline is implemented with the NeMo Curator framework and is structured in three primary stages:

  1. Common-Crawl Filtering: Sequential processing of Common Crawl (CC) web dumps through URL and content filtering, language identification, repetition and quality heuristics, followed by rigorous exact and fuzzy deduplication.
  2. Synthetic Data Generation: Production of varied synthetic documents (rephrasings, summaries, Q&A pairs, item lists, Wikipedia-style passages) conditioned on organic web data, primarily using the Mistral-Nemo-Instruct-2407 (12B) model.
  3. Quality Classification & Bucketing: Heuristic- and model-based classifiers assess the grammatical and educational quality of all candidate documents, assign integer “quality scores,” and bucket the data for final assembly.

The final Aleph-Alpha-GermanWeb dataset is the union of (a) filtered Common Crawl output, (b) synthetic documents, and (c) the full German portion of FineWeb2. The integration process is visualized as:

| Source | Processing Path Summary |
|---|---|
| Raw CC dumps | URL Filtering → Text Extraction → Lang. ID → Repetition & Quality Heuristics → Deduplication |
| FineWeb2 | Synthetic Document Generation |
| All Sources | Grammar & Edu Classifiers → Scoring & Bucket Assignment → Final Dataset Assembly |
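
Read end to end, the pipeline can be summarized in a short orchestration sketch. The stage functions below are trivial placeholders standing in for the NeMo Curator stages described above, not actual API calls.

# Minimal, self-contained sketch of the three-stage flow; the stage bodies are
# illustrative placeholders, not the NeMo Curator implementation.

def filter_common_crawl(docs):
    # Placeholder for Stage 1: URL/content filtering, language ID,
    # repetition/quality heuristics, exact and fuzzy deduplication.
    return [d for d in docs if d.get("lang") == "de"]

def generate_synthetic(docs, generate_fn):
    # Placeholder for Stage 2: LLM-generated rephrasings, summaries, Q&A, etc.
    return [{"text": generate_fn(d["text"]), "source": "synthetic"} for d in docs]

def score_and_bucket(docs):
    # Placeholder for Stage 3: grammar/educational scoring and bucket assignment.
    return [{**d, "bucket": "High"} for d in docs]

def build_dataset(cc_dumps, fineweb2_de, generate_fn):
    filtered_cc = filter_common_crawl(cc_dumps)
    synthetic = generate_synthetic(fineweb2_de, generate_fn)
    return score_and_bucket(filtered_cc + synthetic + list(fineweb2_de))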

2. Heuristic Filtering on Web Data

A comprehensive suite of heuristics is applied to Common Crawl data, following procedures from RefinedWeb and FineWeb2:

  • URL Filtering: Junk domains (4.6M blocklist entries), regex for “adult”/“scam” keywords, and exclusion of high-quality domains (which are later re-injected in controlled amounts).
  • Text Extraction: Use of the resiliparse extractor, selected based on downstream LLM-task performance.
  • German-Language Identification: fastText CCNet classifier over character n-grams (176 languages), retaining only documents with German as the top prediction.
  • Repetition Removal: Multi-level duplication checks (lines, paragraphs, n-grams) with precise thresholds, e.g., a duplicate-line fraction above 28.2% triggers document removal (see the sketch after this list).
  • Document and Line Heuristics: Hard thresholds on length, mean word length, symbol density, bullet/ellipsis prevalence, alphabetic content, and frequency of specific German stopwords. Additional line-based checks remove documents with anomalously high digit content, uppercase rates, low word-per-line averages, or boilerplate rate > 40%.
  • Deduplication: Hash-based global deduplication, followed by MinHash with LSH (14 buckets × 8 hash functions on character 5-grams of shingle length 23) for fuzzy duplicate detection—retaining one document per near-duplicate cluster.
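
As an illustration, the following is a minimal sketch of the duplicate-line heuristic referenced above. The 28.2% threshold comes from the list; the choice to measure the fraction over lines (rather than characters) and the helper name are assumptions.

from collections import Counter

DUP_LINE_FRACTION = 0.282  # threshold quoted in the repetition-removal bullet

def exceeds_duplicate_line_threshold(text: str) -> bool:
    # Count how many lines repeat an earlier line; documents above the
    # threshold are removed. Whether the paper measures this over lines or
    # characters is not specified here, so this sketch uses the line fraction.
    lines = [ln.strip() for ln in text.splitlines() if ln.strip()]
    if not lines:
        return True  # empty documents are dropped by other filters anyway
    counts = Counter(lines)
    duplicate_lines = sum(n - 1 for n in counts.values() if n > 1)
    return duplicate_lines / len(lines) > DUP_LINE_FRACTION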

Reduction at each stage is substantial: Out of 10.8B raw CC documents across six recent dumps, only 151.6M remain after global fuzzy deduplication.

| Stage | Documents remaining (millions) |
|---|---|
| Original | 10,879.6 |
| URL + Lang. ID | 664.4 |
| Content filtered | 375.9 |
| Exact dedup | 269.9 |
| Fuzzy dedup | 253.6 |
| Global fuzzy dedup | 151.6 |
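
For intuition, the per-stage retention implied by these figures can be computed directly; the snippet below only reproduces the numbers in the table.

# Per-stage retention implied by the table above (document counts in millions).
stages = [
    ("Original", 10_879.6),
    ("URL + Lang. ID", 664.4),
    ("Content filtered", 375.9),
    ("Exact dedup", 269.9),
    ("Fuzzy dedup", 253.6),
    ("Global fuzzy dedup", 151.6),
]

for (_, prev), (name, cur) in zip(stages, stages[1:]):
    print(f"{name:20s} keeps {cur / prev:6.1%} of the previous stage")

print(f"overall retention: {stages[-1][1] / stages[0][1]:.2%}")  # ~1.39% of raw CC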

3. Model-Based Filtering and Quality Assessment

Model-based filtering incorporates two main axes: grammar and “educational” quality, each handled by dedicated classifier ensembles.

  • Grammar Classifiers: fastText and BERT models are trained on 400k FineWeb2 documents labeled via LanguageTool's DE_AGREEMENT rule (precision/recall up to 67%).
  • Educational-Quality Classifiers: Mistral-Nemo-Instruct-2407 labels 600k documents along three axes (content, language, orthography), each scored 1–5. Separate fastText (binary) and BERT (multi-class) classifiers are trained to predict composite educational scores.
    • The minimum of the three axis scores is assigned per document to enforce a stringent acceptance criterion.
    • Classifier metrics: fastText 92% precision/91.5% recall for distinguishing high (4–5) vs. low (1–2); BERT multi-class 42% accuracy/46% macro-F1.
  • Ensembling and Bucketing: A numerical quality score $S(d)$ is computed for each document by summing points as detailed in Table 3.1 of the paper. Documents are then bucketed by final score:
| Bucket | Score Range | Filtered CC (%) | Synthetic (%) | FineWeb2 (%) |
|---|---|---|---|---|
| High | ≥ 10 | 20.7 | 22.1 | 16.5 |
| Medium-High | 8–9 | 23.7 | 28.0 | 19.5 |
| Medium | 6–7 | 17.3 | 21.9 | 15.4 |
| Medium-Low | 4–5 | 14.8 | 14.6 | 14.8 |
| Low | < 4 | 23.4 | 13.4 | 33.8 |

Numerical scoring formula: $S(d) = \sum_{i=1}^{9} w_i\,\mathbf{1}_{\mathrm{condition}_i(d)} - 3\,\mathbf{1}_{\{|d| < 100 \,\wedge\, S(d) > 6\}}$, where $w_i \in \{3, 2, 1\}$ as assigned in Table 3.1 and $\mathbf{1}$ denotes the indicator function.
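
A sketch of how such scoring and bucketing could be implemented is given below. The weighted conditions are placeholders (the paper's nine conditions live in its Table 3.1); only the weights {3, 2, 1}, the short-document penalty, and the bucket boundaries follow the description above.

# Illustrative ensemble scoring and bucketing. The individual conditions are
# placeholders; the weights, short-document penalty, and bucket boundaries
# follow the formula and table above.

def quality_score(doc: dict) -> int:
    weighted_conditions = [
        (3, doc["edu_fasttext_high"]),      # e.g. fastText flags high educational value
        (3, doc["edu_bert_score"] >= 4),    # e.g. BERT multi-class score of 4 or 5
        (2, doc["grammar_fasttext_ok"]),
        (2, doc["grammar_bert_ok"]),
        (1, doc["passes_heuristics"]),
        # ... further conditions per Table 3.1 of the paper
    ]
    score = sum(w for w, condition in weighted_conditions if condition)
    if len(doc["text"]) < 100 and score > 6:   # short-document penalty
        score -= 3
    return score

def assign_bucket(score: int) -> str:
    if score >= 10:
        return "High"
    if score >= 8:
        return "Medium-High"
    if score >= 6:
        return "Medium"
    if score >= 4:
        return "Medium-Low"
    return "Low"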

4. Synthetic Data Generation

Synthetic documents are generated to enrich the corpus and amplify data quality:

  • Templates: Five prompt types—rephrasings, summaries, Q&A, item lists, Wikipedia-style passages—with prompts in German (examples in Appendix A).
  • Chunking: To fit within LLM context windows, FineWeb2 documents are segmented into chunks of at most $N$ characters (e.g., $N = 2048$); splits respect sentence and paragraph boundaries (a chunking sketch appears at the end of this section).
  • Generation Loop:

# Pseudocode: FineWeb2, P0–P4, LLM, chunk_text, and post_process are placeholders
synthetic_corpus = []
for doc in FineWeb2:
    for prompt_template in [P0, P1, P2, P3, P4]:   # the five prompt templates
        for chunk in chunk_text(doc, max_chars):   # boundary-respecting chunks
            out = LLM.generate(prompt=prompt_template.format(document=chunk))
            cleaned = post_process(out)            # strip prefixes like "Umformulierung:"
            synthetic_corpus.append(cleaned)

  • Post-processing: Removal of LLM-generated prefix artifacts and repeated prompt tokens using regular expressions and heuristic stripping (a minimal example follows this list).
  • Epoching: To mitigate overfitting, synthetic expansion is capped at five epochs per document/template pair, consistent with multi-epoch studies.
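
As an illustration of the post-processing step, the sketch below strips a few prefix artifacts with a regular expression; only "Umformulierung:" appears in the loop above, and the remaining prefixes and patterns are assumptions.

import re

# Illustrative post-processing of raw generations. Only "Umformulierung:" is
# taken from the loop above; the other prefixes and patterns are assumptions.
PREFIX_PATTERN = re.compile(
    r"^\s*(Umformulierung|Zusammenfassung|Antwort)\s*:\s*",
    re.IGNORECASE,
)

def post_process(generation: str) -> str:
    text = PREFIX_PATTERN.sub("", generation.strip())
    text = re.sub(r"\n{3,}", "\n\n", text)   # collapse runs of blank lines
    return text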

Synthetic data scale: up to 5× the number of FineWeb2 documents, contingent on chunking granularity.
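
A minimal sketch of boundary-respecting chunking is shown below, assuming paragraph breaks (blank lines) and sentence-final punctuation as split points; the helper name chunk_text matches the one used in the generation loop, and everything else is an assumption rather than the paper's exact procedure.

import re

def chunk_text(text: str, max_chars: int = 2048) -> list[str]:
    # Greedily pack paragraphs (and, for oversized paragraphs, sentences) into
    # chunks of at most max_chars characters. Sentences longer than max_chars
    # are passed through untruncated in this sketch.
    paragraphs = [p.strip() for p in text.split("\n\n") if p.strip()]
    pieces = []
    for paragraph in paragraphs:
        if len(paragraph) <= max_chars:
            pieces.append(paragraph)
        else:
            pieces.extend(re.split(r"(?<=[.!?])\s+", paragraph))

    chunks, current = [], ""
    for piece in pieces:
        if current and len(current) + len(piece) + 1 > max_chars:
            chunks.append(current)
            current = piece
        else:
            current = (current + " " + piece).strip()
    if current:
        chunks.append(current)
    return chunks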

5. Dataset Composition and Scale

After all stages, the Aleph-Alpha-GermanWeb dataset has the following structure:

  • Filtered CC: ~151.6M documents post-heuristics/deduplication.
  • Synthetic: Up to 5× the number of FineWeb2 documents, given prompt and chunk diversity.
  • FineWeb2: Complete German subset (~1.4% of the 8 TB FineWeb2 dataset).

For model pre-training, three corpus settings are used:

  1. Filtered CC only
  2. Synthetic only
  3. 1 : 1 : 1 mixture (by token-count) of {Filtered CC, Synthetic, FineWeb2}

The 1 : 1 : 1 mixture isolates the impact of each source under equivalent token budgets (e.g., ~84B tokens for 1B models, ~150B words for 8B models).
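
As a sketch of the equal-budget idea, the snippet below splits a fixed token budget evenly across the three sources; the 84B figure matches the 1B setting quoted above, and the code is illustrative rather than the paper's training configuration.

# Illustrative split of a fixed token budget into a 1:1:1 mixture. Sources
# smaller than their share are repeated over multiple epochs (synthetic data
# is capped at five epochs per document/template pair, as noted in Section 4).

def equal_token_split(token_budget: int, source_names: list[str]) -> dict[str, int]:
    per_source = token_budget // len(source_names)
    return {name: per_source for name in source_names}

print(equal_token_split(84_000_000_000, ["filtered_cc", "synthetic", "fineweb2_de"]))
# {'filtered_cc': 28000000000, 'synthetic': 28000000000, 'fineweb2_de': 28000000000}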

6. Evaluation Methodology and Benchmarks

Pre-training experiments with Aleph-Alpha-GermanWeb compare against FineWeb2 baselines using two architectures:

  • 1B Llama-style: 16-layer, 2048-dim, 32 heads, RoPE, SwiGLU, cosine LR decay; trained for 40,000 steps on 84B tokens with batch size 512.
  • 8B Tokenizer-Free HAT: Hierarchical autoregressive transformer utilizing a 3-module byte-to-word approach, eliminating tokenizer bias; trained on 150B words for CC/FineWeb2 mixtures, 63B for synthetic/FineWeb2.
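
For reference, the 1B configuration quoted above can be written down as a small config sketch; the field names are illustrative and not tied to any particular training framework.

from dataclasses import dataclass

# Hyperparameters of the 1B Llama-style baseline as listed above; field names
# are illustrative, not tied to a specific training framework.
@dataclass
class Llama1BConfig:
    num_layers: int = 16
    hidden_dim: int = 2048
    num_attention_heads: int = 32
    positional_encoding: str = "rope"
    mlp_activation: str = "swiglu"
    lr_schedule: str = "cosine_decay"
    train_steps: int = 40_000
    batch_size: int = 512
    total_train_tokens: int = 84_000_000_000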

Benchmarked on:

  • MMMLU (German: 14k examples, 57 subjects)
  • ARC-Easy (German translation)
  • HellaSwag (German translation)
  • TruthfulQA (excluded at 1B due to lack of signal)

Results indicate consistent and significant performance gains for Aleph-Alpha-GermanWeb over FineWeb2, persisting even when FineWeb2 is augmented with human-curated data such as Wikipedia.

7. Significance and Broader Context

The Aleph-Alpha-GermanWeb curation pipeline exemplifies the synthesis of large-scale web data with model-driven filtering and advanced synthetic document generation. Precise removal of noise, aggressive deduplication, and the use of ensemble classifier scoring yield a corpus with demonstrable benefits for German-language LLM pre-training. The methodology affirms that improvements in data curation—especially the combination of model-based scoring and large-scale synthetic augmentation—translate directly into higher downstream LLM performance in controlled benchmark evaluations (Burns et al., 24 Apr 2025). This suggests that further optimization of curation and synthetic methodologies may continue to advance pre-training efficacy for low-resource and non-English languages.
