SCAND Corpus in The Nordic Pile
- The SCAND Corpus is a curated dataset featuring web-crawled texts and historical newspapers for pre-training large language models in North Germanic languages.
- It employs advanced cleaning, filtering, and deduplication techniques, ensuring high-quality data through rigorous metadata structuring and language segmentation.
- Its two main components—LES Nordic Web Data and ENO historical newspapers—support cross-domain analysis and diachronic studies of Scandinavian texts.
The SCAND Corpus is a major subcomponent of The Nordic Pile, a curated 1.2TB multilingual dataset designed for pre-training LLMs for North Germanic languages. As described in the cited literature, it consists primarily of two resources: (1) the web-crawled “LES – Nordic Web Data” and (2) the ENO (“Enevaeldens Nyheder Online”) historical Danish-Norwegian newspaper dataset. Together they are presented as the largest targeted collection of large-scale Nordic web text and diachronic press material, with documented cleaning, metadata structuring, and deduplication procedures that support model training and cross-domain analysis (Öhman et al., 2023, Heinsen et al., 2 Sep 2025).
1. Corpus Scope and Definition
The SCAND Corpus as used in The Nordic Pile comprises two principal collections:
- LES – Nordic Web Data: Web-crawled textual data in Danish, Icelandic, Norwegian, and Swedish, sourced and filtered from Common Crawl (via OSCAR, mC4, and a custom Trafilatura-based pipeline) (Öhman et al., 2023).
- ENO (Enevaeldens Nyheder Online): Digitized historical Danish and Norwegian newspaper texts (1660–1849), extracted from microfilm using neural OCR and advanced post-processing (Heinsen et al., 2 Sep 2025).
Both datasets exclude English at the corpus assembly stage and are grouped under the SCAND (“Scandinavian”) section in The Nordic Pile’s folder hierarchy.
2. Data Acquisition and Preprocessing
LES – Nordic Web Data
Crawling and extraction utilize Common Crawl slices filtered via OSCAR and mC4, augmented by a custom LES pipeline employing Trafilatura for more aggressive and language-specific extraction. All domains are publicly available, with exclusion of personal-data-heavy sites enforced at the URL level for compliance with GDPR and licensing standards.
Metadata attached to each document includes:
- Predicted language (lang) via fastText
- Numerical statistics (num_chars, num_utf8bytes, num_words, num_sents)
- 128-bit md5 hash of the document’s UTF-8 encoded text
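As a minimal sketch of how such a per-document record could be assembled: the field names follow the list above, while the function name `document_metadata`, the whitespace-based word count, and the regex sentence splitter are assumptions made for illustration. Language prediction is passed in as an argument rather than computed, since the source performs it with fastText.

```python
import hashlib
import re


def document_metadata(text: str, lang: str) -> dict:
    """Build a metadata record for one document (illustrative only).

    `lang` is supplied by the caller; the source predicts it with fastText.
    """
    encoded = text.encode("utf-8")
    # Naive sentence split after ., ! or ? followed by whitespace (assumption).
    sentences = [s for s in re.split(r"(?<=[.!?])\s+", text.strip()) if s]
    return {
        "lang": lang,
        "num_chars": len(text),           # Unicode code points
        "num_utf8bytes": len(encoded),    # bytes after UTF-8 encoding
        "num_words": len(text.split()),   # whitespace tokens (assumption)
        "num_sents": len(sentences),
        "md5": hashlib.md5(encoded).hexdigest(),  # 128 bits = 32 hex digits
    }


meta = document_metadata("Hej verden. Det er en prøve!", lang="da")
```

Because the hash is computed over the UTF-8 bytes of the text, two documents with identical text always receive the same `md5` value, which is what makes it usable as a key for exact deduplication.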
ENO Historical Newspapers
Source materials comprise:
- ~565,000 microfilm pages from 28 Danish/Norwegian newspaper titles; digitized in varying stages since the 1950s
- Danish content via the Royal Library’s Mediestream API; Norwegian via high-resolution digitization of three major urban titles
Preprocessing comprises image binarization, de-skewing, and denoising, followed by automated layout segmentation (Transkribus "Fields Model") to preserve column structure and reading order.
3. Cleaning, Filtering, and Deduplication Procedures
LES – Nordic Web Data
The pipeline consists of seven ordered stages:
- Normalization: Removal of control/non-printing characters, whitespace normalization, Unicode NFC.
- Computation of metrics.
- Sixteen Boolean Quality Filters: Passing them is mandatory for inclusion. Representative filters include:
  - ≥80% of words must contain alphabetic characters
  - Digit fraction <0.2
  - Mean word length in [2,10] characters
  - Stop-word coverage: ≥2 stop-words, making up ≥10% of total words
  - Supported language: da, is, nb, sv
  - “Repetitive Gopher” controls for duplicate n-grams, lines, and paragraphs (e.g., duplicate line/paragraph fraction ≤0.35, duplicate 2-gram character fraction ≤0.25, duplicate 10-gram fraction ≤0.15)
- Exact Deduplication: Removal of documents with duplicate md5 hashes.
- Language Segmentation: Partitioning by predicted language for subsequent fuzzy deduplication.
- Fuzzy Deduplication: Uses a MinHash LSH approach with 10-character shingle sets, a 10-integer signature, and b=2 bands. Documents C_i and C_j are treated as near-duplicates when their Jaccard similarity satisfies sim(C_i, C_j) ≥ 0.5.
- Merge and assembly.
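The normalization stage and four of the representative filters listed above can be sketched roughly as follows. This is not the pipeline's actual code: the function names, the tiny `STOP_WORDS` set, the punctuation stripping, and the choice to measure the digit fraction over characters are all assumptions made for illustration.

```python
import re
import unicodedata

# Tiny illustrative stop-word list (assumption; real lists are much larger).
STOP_WORDS = {"og", "i", "det", "er", "en", "at", "på", "som"}


def normalize(text: str) -> str:
    """Apply Unicode NFC, drop control/format characters, collapse whitespace."""
    text = unicodedata.normalize("NFC", text)
    kept = (
        ch for ch in text
        if ch.isspace() or unicodedata.category(ch) not in ("Cc", "Cf")
    )
    return re.sub(r"\s+", " ", "".join(kept)).strip()


def quality_flags(text: str) -> dict[str, bool]:
    """Return one Boolean per filter; True means the document passes it."""
    names = ("alpha_words", "digit_fraction", "mean_word_length", "stop_words")
    words = text.split()
    if not words:
        return dict.fromkeys(names, False)
    n = len(words)
    alpha_frac = sum(any(c.isalpha() for c in w) for w in words) / n
    digit_frac = sum(c.isdigit() for c in text) / len(text)  # over characters
    mean_len = sum(len(w) for w in words) / n
    stop_hits = sum(w.lower().strip(".,;:!?") in STOP_WORDS for w in words)
    return {
        "alpha_words": alpha_frac >= 0.8,
        "digit_fraction": digit_frac < 0.2,
        "mean_word_length": 2 <= mean_len <= 10,
        "stop_words": stop_hits >= 2 and stop_hits / n >= 0.10,
    }


def passes_all(text: str) -> bool:
    """A document is kept only if every filter passes."""
    return all(quality_flags(normalize(text)).values())
```

Returning a dictionary of named flags, rather than a single Boolean, makes it possible to count afterwards which filter rejected the most documents.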
This workflow removed approximately 20% of the original web-crawled data; most of the removed documents failed the stop-word or repetitive-pattern thresholds (Öhman et al., 2023).
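The fuzzy deduplication stage can be illustrated with a small, dependency-free MinHash LSH sketch using the stated parameters (10-character shingles, a 10-integer signature, 2 bands). Several details are assumptions rather than facts from the source: the salted BLAKE2b hash family, the even split into 5 rows per band, and the final check of candidate pairs against the exact Jaccard similarity. All function names are hypothetical.

```python
import hashlib
from collections import defaultdict
from itertools import combinations

SHINGLE_LEN = 10               # characters per shingle
NUM_HASHES = 10                # integers per signature
NUM_BANDS = 2                  # LSH bands
ROWS = NUM_HASHES // NUM_BANDS # rows per band (assumed even split)


def shingles(text: str) -> set[str]:
    """Set of overlapping character shingles; short texts yield one shingle."""
    if len(text) <= SHINGLE_LEN:
        return {text}
    return {text[i:i + SHINGLE_LEN] for i in range(len(text) - SHINGLE_LEN + 1)}


def _hash(shingle: str, seed: int) -> int:
    # Salted BLAKE2b stands in for a family of hash functions (assumption).
    digest = hashlib.blake2b(
        shingle.encode("utf-8"), digest_size=8, salt=seed.to_bytes(8, "little")
    ).digest()
    return int.from_bytes(digest, "little")


def signature(text: str) -> tuple[int, ...]:
    """MinHash signature: the minimum hash value per seeded hash function."""
    sh = shingles(text)
    return tuple(min(_hash(s, seed) for s in sh) for seed in range(NUM_HASHES))


def jaccard(a: str, b: str) -> float:
    sa, sb = shingles(a), shingles(b)
    return len(sa & sb) / len(sa | sb)


def near_duplicates(docs: dict[str, str], threshold: float = 0.5) -> set[tuple[str, str]]:
    """Bucket documents by signature band, then verify candidate pairs."""
    buckets: dict[tuple, list[str]] = defaultdict(list)
    for doc_id, text in docs.items():
        sig = signature(text)
        for band in range(NUM_BANDS):
            key = (band, sig[band * ROWS:(band + 1) * ROWS])
            buckets[key].append(doc_id)
    pairs = set()
    for ids in buckets.values():
        for a, b in combinations(sorted(ids), 2):
            if jaccard(docs[a], docs[b]) >= threshold:
                pairs.add((a, b))
    return pairs
```

Under the standard LSH analysis, a pair with Jaccard similarity s lands in a shared bucket with probability 1 − (1 − s^r)^b. With r = 5 rows and b = 2 bands this is only about 6% at s = 0.5, so in this configuration banding mostly surfaces pairs that are far more similar than the nominal threshold.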
ENO – Historical Press Corpus
OCR is performed using a Transkribus-based architecture:
- Four convolutional "feature extraction" blocks
- Two bi-directional LSTM layers
- CTC loss for alignment
- Character set: Fraktur (historic) plus Antikva fallback
Post-OCR word accuracy (PWA) is computed per text as the fraction of words found in a union dictionary (literary corpora, census names, frequent tokens). Texts with low PWA or a language shift (e.g., German passages) are flagged but retained, with the score recorded in an explicit "pwa" metadata field.
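The PWA definition above amounts to a dictionary-lookup ratio. A minimal sketch follows, assuming lowercased `\w+` tokenization and a caller-supplied dictionary; both are illustrative choices, and the function name is hypothetical.

```python
import re


def post_ocr_word_accuracy(text: str, dictionary: set[str]) -> float:
    """Fraction of word tokens in `text` that appear in `dictionary`."""
    words = re.findall(r"\w+", text.lower())  # tokenization is an assumption
    if not words:
        return 0.0
    return sum(w in dictionary for w in words) / len(words)


# Toy dictionary; the source uses a union of corpora, census names and tokens.
toy_dictionary = {"kongen", "er", "i", "byen"}
pwa = post_ocr_word_accuracy("Kongen er i Bycn", toy_dictionary)
```

In this toy case three of the four tokens are found ("Bycn" is a simulated OCR error for "Byen"), giving a PWA of 0.75.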
Standalone articles are derived using a two-stage, line-level segmentation:
- Random Forest based on structural features
- BERT-augmented Random Forest classifier, using contextual embeddings from a DA-BERT_Old_News_V1 model pretrained on 260M words from ENO
Line segmentation reaches F₁ = 98.9% on 7,000 held-out lines. Key parameters, including batch sizes, maximum sequence lengths, and learning rates, are specified in Heinsen et al. (2 Sep 2025).
4. Corpus Composition and Statistics
Aggregate Composition
| Category | Documents (M) | Size (GB) | Mean doc size (KB) |
|---|---|---|---|
| LES – Nordic Web Data | 79.16 | 76.83 | 0.97 |
SCAND (LES) constitutes 38.18% of the 1,208.7GB Nordic Pile aggregate.
Language Breakdown (All Categories; Not Exclusive to LES)
- Danish: 10.8% (~130.5GB total; Web CC subset 111.33GB)
- Icelandic: 1.59% (~19.2GB total; Web CC subset 8.79GB)
- Norwegian: 11.55% (~139.6GB total; Web CC subset 90.00GB)
- Swedish: 26.02% (~314.5GB total; Web CC subset 188.94GB)
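As an arithmetic cross-check, the percentages in the language breakdown can be recomputed from the reported sizes, and the mean document size can be recomputed from the table row. The sketch assumes decimal units (1 GB = 10^9 bytes, 1 KB = 10^3 bytes); the variable names are illustrative.

```python
TOTAL_GB = 1208.7  # Nordic Pile aggregate size as reported

language_gb = {
    "Danish": 130.5,
    "Icelandic": 19.2,
    "Norwegian": 139.6,
    "Swedish": 314.5,
}

# Share of the aggregate per language, in percent.
shares = {lang: round(100 * gb / TOTAL_GB, 2) for lang, gb in language_gb.items()}

# Mean document size for the LES row: 76.83 GB spread over 79.16 M documents.
mean_doc_kb = round(76.83e9 / 79.16e6 / 1e3, 2)
```

The recomputed values agree with the listed figures: 10.8, 1.59, 11.55, and 26.02 percent, and a mean document size of 0.97 KB.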
Exact per-language document counts within LES are not specified; the only available aggregate for historical newspapers (ENO) is ≈474 million word tokens (Heinsen et al., 2 Sep 2025).
5. Metadata, Licensing, and Integration
Metadata Practices and Schema
For ENO, each standalone text is associated with:
    {
      "text_id": "KobenhAv_17850315_00023",
      "publication": "Københavns Adresseavis",
      "date": "1785-03-15",
      "page_id": "KobenhAv_17850315_p005",
      "text_segment": 23,
      "pwa": 0.96,
      "lang": "da",
      "ocr_model": "ENO-Fraktur-CTC-v1",
      "bert_seg_model": "ENO-SetFit-Seg-v0.1"
    }
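Records of this shape lend themselves to simple diachronic slicing. The sketch below reads JSON Lines records, skips texts below a PWA cutoff, and counts the remainder per language and decade. The function name and the `min_pwa=0.9` default are hypothetical; the source does not state a cutoff.

```python
import json
from collections import Counter
from datetime import date


def decade_counts(jsonl_lines, min_pwa: float = 0.9) -> Counter:
    """Count texts per (lang, decade), skipping records below `min_pwa`."""
    counts: Counter = Counter()
    for line in jsonl_lines:
        record = json.loads(line)
        if record["pwa"] < min_pwa:  # hypothetical quality cutoff
            continue
        year = date.fromisoformat(record["date"]).year
        counts[(record["lang"], year // 10 * 10)] += 1
    return counts


example_lines = [
    '{"text_id": "KobenhAv_17850315_00023", "date": "1785-03-15", "pwa": 0.96, "lang": "da"}',
]
example_counts = decade_counts(example_lines)
```

Since low-PWA texts are retained in the corpus with their score, a filter like this is applied at analysis time and can be tightened or relaxed per study.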
Recommended organization for the SCAND section is:
    scand/
      eno/
        metadata.jsonl
        texts/{publication_name}/{YYYY-MM}/{publication}_{date}_{text_id}.txt
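A path following this template can be derived from a record's fields. In the sketch below the helper name `eno_text_path` is hypothetical, and it assumes that `{publication_name}` and `{publication}` refer to the same short publication identifier, which the template itself does not make explicit.

```python
from pathlib import PurePosixPath


def eno_text_path(publication: str, iso_date: str, text_id: str) -> PurePosixPath:
    """Build texts/{publication_name}/{YYYY-MM}/{publication}_{date}_{text_id}.txt."""
    year_month = iso_date[:7]  # "1785-03-15" -> "1785-03"
    filename = f"{publication}_{iso_date}_{text_id}.txt"
    return PurePosixPath("scand", "eno", "texts", publication, year_month, filename)


path = eno_text_path("KobenhAv", "1785-03-15", "KobenhAv_17850315_00023")
```

Grouping files by publication and then by year and month keeps each directory small and lets a date range be selected by path alone.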
All ENO data and models are licensed under CC BY-SA 4.0. Web data are derived from publicly available sources and filtered for legal compliance (Heinsen et al., 2 Sep 2025, Öhman et al., 2023).
6. Intended Use and Interoperability
SCAND (LES – Nordic Web Data plus ENO) is one of nine major categories in The Nordic Pile, responsible for 38% of the corpus by volume and providing a critical backbone of general-domain, web-sourced, and specialized historical data for Danish, Icelandic, Norwegian, and Swedish. Its intended use is for pre-training LLMs (multi-billion parameter scale) to achieve both broad and Nordic-specific linguistic competence.
Alignment fields across datasets (e.g., "publication", "date", "lang", "text_id", "pwa") support diachronic, cross-lingual, and region-wide studies; LLMs and BERT embeddings are recommended for shared representation. A plausible implication is that the SCAND Corpus design fosters reproducibility and comparability across complementary corpora in the Nordic Pile, and supports the layered cleaning/filtering regimes critical for high-quality model pre-training (Öhman et al., 2023, Heinsen et al., 2 Sep 2025).