SCAND Corpus in The Nordic Pile

Updated 1 February 2026

The SCAND Corpus is a curated dataset of web-crawled texts and historical newspapers for pre-training large language models in North Germanic languages.
Its web data pass through normalization, quality filtering, and exact and fuzzy deduplication, and its documents carry structured metadata, including a predicted language used for language segmentation.
Its two main components, LES Nordic Web Data and ENO historical newspapers, support cross-domain analysis and diachronic studies of Scandinavian texts.

The SCAND Corpus is a major subcomponent of The Nordic Pile, a curated 1.2TB multilingual dataset designed to facilitate pre-training of LLMs for North Germanic languages. As described in the two source papers, the SCAND Corpus consists primarily of two resources: (1) the web-crawled “LES – Nordic Web Data” and (2) the ENO (“Enevaeldens Nyheder Online”) historical Danish-Norwegian newspaper dataset. These resources collectively represent the largest targeted collection of large-scale Nordic web text and diachronic press corpora, with data cleaning, metadata structuring, and deduplication regimes that support model training and cross-domain analysis (Öhman et al., 2023; Heinsen et al., 2 Sep 2025).

1. Corpus Scope and Definition

The SCAND Corpus as used in The Nordic Pile comprises two principal collections:

LES – Nordic Web Data: Web-crawled textual data in Danish, Icelandic, Norwegian, and Swedish, sourced and filtered from Common Crawl (via OSCAR, mC4, and a custom Trafilatura-based pipeline) (Öhman et al., 2023).
ENO (Enevaeldens Nyheder Online): Digitized historical Danish and Norwegian newspaper texts (1660–1849), extracted from microfilm using neural OCR and advanced post-processing (Heinsen et al., 2 Sep 2025).

Both datasets exclude English at the corpus assembly stage and are grouped under the SCAND (“Scandinavian”) section in The Nordic Pile’s folder hierarchy.

2. Data Acquisition and Preprocessing

LES – Nordic Web Data

Crawling and extraction utilize Common Crawl slices filtered via OSCAR and mC4, augmented by a custom LES pipeline employing Trafilatura for more aggressive and language-specific extraction. All domains are publicly available, with exclusion of personal-data-heavy sites enforced at the URL level for compliance with GDPR and licensing standards.

Metadata attached to each document includes:

Predicted language (lang) via fastText
Numerical statistics (num_chars, num_utf8bytes, num_words, num_sents)
128-bit md5 hash of the document’s UTF-8 encoded text
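
The metadata fields above can be illustrated with a minimal sketch. The function name `document_metadata`, the whitespace tokenization, and the punctuation-based sentence splitter are assumptions made for illustration; the language label is passed in rather than predicted, since in the pipeline it comes from a fastText model.

```python
import hashlib
import re


def document_metadata(text: str, lang: str) -> dict:
    """Build a metadata record with the fields listed above.

    `lang` is supplied by the caller; in the pipeline it would be the
    prediction of a fastText language-identification model.
    """
    # Assumption: words are whitespace-separated tokens.
    words = text.split()
    # Assumption: a sentence ends at '.', '!' or '?' followed by whitespace.
    sents = [s for s in re.split(r"(?<=[.!?])\s+", text.strip()) if s]
    encoded = text.encode("utf-8")
    return {
        "lang": lang,
        "num_chars": len(text),
        "num_utf8bytes": len(encoded),
        "num_words": len(words),
        "num_sents": len(sents),
        # md5 yields a 128-bit digest, rendered here as 32 hex characters.
        "md5": hashlib.md5(encoded).hexdigest(),
    }


record = document_metadata("Hej världen. Det här är ett test!", lang="sv")
print(record)
```

Because characters such as "ä" take two bytes in UTF-8, num_utf8bytes exceeds num_chars for most Nordic text, which is one reason to record both.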

ENO Historical Newspapers

Source materials comprise:

~565,000 microfilm pages from 28 Danish/Norwegian newspaper titles, digitized in varying stages since the 1950s
Danish content via the Royal Library’s Mediestream API; Norwegian via high-resolution digitization of three major urban titles

Preprocessing comprises image binarization, de-skewing, and denoising, followed by automated layout segmentation (Transkribus "Fields Model") to preserve column structure and reading order.

3. Cleaning, Filtering, and Deduplication Procedures

LES – Nordic Web Data

The pipeline consists of seven ordered stages:

(1) Normalization: Removal of control and non-printing characters, whitespace normalization, and Unicode NFC normalization.
(2) Computation of per-document metrics.
(3) Sixteen Boolean Quality Filters: A document must pass all of them to be included. Representative filters include:
- ≥80% of words must contain alphabetic characters
- Digit fraction <0.2
- Mean word length in [2,10] characters
- Stop-word coverage: ≥2 stop-words and ≥10% of total words
- Supported language: da, is, nb, sv
- “Repetitive Gopher” controls for duplicate n-grams, lines, and paragraphs (e.g., duplicate line/paragraph fraction ≤0.35, duplicate 2-gram character fraction ≤0.25, duplicate 10-gram fraction ≤0.15)
(4) Exact Deduplication: Removal of documents with duplicate md5 hashes.
(5) Language Segmentation: Partitioning by predicted language for subsequent fuzzy deduplication.
(6) Fuzzy Deduplication: MinHash LSH over 10-character shingle sets, with a 10-integer signature split into b=2 bands. Documents are treated as near-duplicates when their Jaccard similarity sim(Ci,Cj) ≥ 0.5.
(7) Merge and assembly.
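
The normalization stage and four of the representative filters can be sketched as follows. This is an illustration, not the LES implementation: the stop-word list is a hypothetical toy set, the digit fraction is assumed to be measured over characters, and the function names are invented here.

```python
import re
import unicodedata

# Hypothetical toy stop-word list; the real lists are language-specific.
STOP_WORDS = {"och", "att", "det", "som", "en", "är", "på", "i"}


def normalize(text: str) -> str:
    """Apply Unicode NFC, drop control characters, collapse whitespace."""
    text = unicodedata.normalize("NFC", text)
    # Remove characters in the Unicode "C" categories (control, format, ...),
    # keeping newlines and tabs so line structure survives.
    text = "".join(
        ch for ch in text
        if ch in "\n\t" or not unicodedata.category(ch).startswith("C")
    )
    # Collapse runs of spaces and tabs into a single space.
    return re.sub(r"[ \t]+", " ", text).strip()


def passes_filters(text: str) -> bool:
    """Return True only if every Boolean filter below holds."""
    words = text.split()
    if not words:
        return False
    n = len(words)
    alpha_frac = sum(any(c.isalpha() for c in w) for w in words) / n
    # Assumption: digit fraction is measured over all characters.
    digit_frac = sum(c.isdigit() for c in text) / len(text)
    mean_len = sum(len(w) for w in words) / n
    stop_hits = sum(w.lower().strip(".,!?") in STOP_WORDS for w in words)
    checks = [
        alpha_frac >= 0.80,                        # words with alphabetic chars
        digit_frac < 0.2,                          # digit fraction
        2 <= mean_len <= 10,                       # mean word length
        stop_hits >= 2 and stop_hits / n >= 0.10,  # stop-word coverage
    ]
    return all(checks)


clean = normalize("Det  är en bok som\u0000 ligger på bordet.")
print(clean, passes_filters(clean))
```

Treating every filter as a hard Boolean gate keeps the decision auditable: a rejected document can be traced to the specific check it failed.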

This workflow removed approximately 20% of the original web-crawled data, most of it documents that failed the stop-word or repetitive-pattern thresholds (Öhman et al., 2023).
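
The three deduplication-related stages (exact deduplication, language segmentation, fuzzy deduplication) can be sketched in pure Python using the parameters given above: 10-character shingles, a 10-integer signature, b=2 bands, and a Jaccard threshold of 0.5. Several details are assumptions for illustration: seeded md5 stands in for the hash family, candidate pairs from the LSH buckets are verified with exact Jaccard similarity, and the first document of each near-duplicate pair is the one kept.

```python
import hashlib
from collections import defaultdict
from itertools import combinations

NUM_HASHES = 10   # signature length
BANDS = 2         # b = 2 bands, hence 5 rows per band
SHINGLE = 10      # shingle width in characters
THRESHOLD = 0.5   # Jaccard similarity threshold


def shingles(text: str) -> set[str]:
    """Set of overlapping character shingles of width SHINGLE."""
    if len(text) <= SHINGLE:
        return {text}
    return {text[i:i + SHINGLE] for i in range(len(text) - SHINGLE + 1)}


def _hash(seed: int, shingle: str) -> int:
    # Assumption: seeded md5 approximates a family of independent hashes.
    digest = hashlib.md5(f"{seed}:{shingle}".encode("utf-8")).digest()
    return int.from_bytes(digest[:8], "big")


def signature(sh: set[str]) -> tuple[int, ...]:
    """MinHash signature: minimum hash value per seed."""
    return tuple(min(_hash(seed, s) for s in sh) for seed in range(NUM_HASHES))


def jaccard(a: set[str], b: set[str]) -> float:
    return len(a & b) / len(a | b)


def deduplicate(docs: list[dict]) -> list[dict]:
    """docs are dicts with 'text' and 'lang'; returns the survivors."""
    # Exact deduplication on the md5 of the UTF-8 encoded text.
    seen, unique = set(), []
    for doc in docs:
        digest = hashlib.md5(doc["text"].encode("utf-8")).hexdigest()
        if digest not in seen:
            seen.add(digest)
            unique.append(doc)

    # Language segmentation: fuzzy deduplication runs per language.
    by_lang = defaultdict(list)
    for doc in unique:
        by_lang[doc["lang"]].append(doc)

    kept = []
    rows = NUM_HASHES // BANDS
    for lang_docs in by_lang.values():
        sh = [shingles(d["text"]) for d in lang_docs]
        # LSH: documents sharing any full band land in the same bucket.
        buckets = defaultdict(list)
        for idx, sig in enumerate(signature(s) for s in sh):
            for band in range(BANDS):
                buckets[(band, sig[band * rows:(band + 1) * rows])].append(idx)
        dropped = set()
        for members in buckets.values():
            for i, j in combinations(members, 2):
                if i in dropped or j in dropped:
                    continue
                # Assumption: verify candidates with exact Jaccard similarity.
                if jaccard(sh[i], sh[j]) >= THRESHOLD:
                    dropped.add(j)
        kept.extend(d for k, d in enumerate(lang_docs) if k not in dropped)
    return kept
```

With only two bands of five rows each, a pair becomes a candidate only when an entire five-value band matches, so the LSH step is selective and mostly surfaces pairs that are already highly similar.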

ENO – Historical Press Corpus

OCR is performed using a Transkribus-based architecture:

Four convolutional "feature extraction" blocks
Two bi-directional LSTM layers
CTC loss for alignment, $L_{CTC} = -\log p(y \mid x)$
Character set: Fraktur (historic) plus Antikva fallback
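
For orientation, the CTC loss listed above can be expanded in its standard textbook form. The symbols $\pi$ (a frame-level label path), $T$ (the number of input frames), and $\mathcal{B}$ (the map that merges repeated labels and removes blanks) are notation introduced here, not taken from the source.

```latex
L_{CTC} = -\log p(y \mid x)
        = -\log \sum_{\pi \in \mathcal{B}^{-1}(y)} \prod_{t=1}^{T} p(\pi_t \mid x)
```

The sum runs over every frame-level path that collapses to the target transcription $y$, which is why no explicit character-level alignment between the line image and its text is required for training.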

Post-OCR word accuracy per text, $PWA(T)$, is computed as the fraction of words found in a union dictionary (literary corpora, census names, frequent tokens). Low-PWA texts and language shifts (e.g., German passages) are flagged but retained, with the score recorded in an explicit "pwa" metadata field.
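
The PWA score described above reduces to a dictionary-lookup ratio. The sketch below assumes case-insensitive matching and a letters-only tokenizer; the function name `pwa` and the five-word dictionary are hypothetical stand-ins for the union dictionary.

```python
import re


def pwa(text: str, dictionary: set[str]) -> float:
    """Fraction of word tokens in `text` that occur in `dictionary`."""
    # Assumption: tokens are runs of letters; digits and punctuation are skipped.
    tokens = re.findall(r"[^\W\d_]+", text.lower())
    if not tokens:
        return 0.0
    return sum(t in dictionary for t in tokens) / len(tokens)


# Hypothetical mini-dictionary standing in for the union dictionary.
dictionary = {"kongen", "er", "kommen", "til", "staden"}

# "Stadcn" mimics an OCR misreading of "Staden".
score = pwa("Kongen er kommen til Stadcn.", dictionary)
print(round(score, 2))  # 0.8
```

Since the score is a lookup ratio, passages in another language (such as German) also score low, which is consistent with flagging language shifts alongside OCR errors.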

Standalone articles are derived using a two-stage, line-level segmentation:

Random Forest based on structural features
BERT-augmented Random Forest classifier, using contextual embeddings from a DA-BERT_Old_News_V1 model pretrained on 260M words from ENO

Line segmentation reaches F₁ = 98.9% on a held-out set of 7,000 lines. Key parameters, including batch sizes, maximum sequence lengths, and learning rates, are specified in Heinsen et al. (2 Sep 2025).

4. Corpus Composition and Statistics

Aggregate Composition

Category	Documents (M)	Size (GB)	Mean doc size (KB)
LES – Nordic Web Data	79.16	76.83	0.97

SCAND (LES) constitutes 38.18% of the 1,208.7GB Nordic Pile aggregate.

Language Breakdown (All Categories; Not Exclusive to LES)

Danish: 10.8% (~130.5GB total; Web CC subset 111.33GB)
Icelandic: 1.59% (~19.2GB total; Web CC subset 8.79GB)
Norwegian: 11.55% (~139.6GB total; Web CC subset 90.00GB)
Swedish: 26.02% (~314.5GB total; Web CC subset 188.94GB)
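
The approximate per-language totals above follow from applying each percentage to the 1,208.7GB aggregate, and the mean document size in the LES table row follows from its document count and size. The short check below reproduces both; decimal units (1 GB = 10^9 bytes, 1 KB = 10^3 bytes) are an assumption.

```python
TOTAL_GB = 1208.7  # Nordic Pile aggregate size stated in the text

# Percentage shares from the language breakdown.
shares = {"Danish": 10.8, "Icelandic": 1.59, "Norwegian": 11.55, "Swedish": 26.02}
sizes = {lang: round(TOTAL_GB * pct / 100, 1) for lang, pct in shares.items()}
print(sizes)

# LES table row: 76.83 GB across 79.16 million documents.
# Assumption: decimal units (1 GB = 10**9 bytes, 1 KB = 10**3 bytes).
mean_kb = 76.83e9 / 79.16e6 / 1e3
print(round(mean_kb, 2))  # 0.97
```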

Exact per-language document counts within LES are not specified; the only available aggregate for historical newspapers (ENO) is ≈474 million word tokens (Heinsen et al., 2 Sep 2025).

5. Metadata, Licensing, and Integration

Metadata Practices and Schema

For ENO, each standalone text is associated with:

{
  "text_id": "KobenhAv_17850315_00023",
  "publication": "Københavns Adresseavis",
  "date": "1785-03-15",
  "page_id": "KobenhAv_17850315_p005",
  "text_segment": 23,
  "pwa": 0.96,
  "lang": "da",
  "ocr_model": "ENO-Fraktur-CTC-v1",
  "bert_seg_model": "ENO-SetFit-Seg-v0.1"
}
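
A record of this shape can be loaded into a typed structure as sketched below. The class name `EnoRecord` is hypothetical, and the field types (for example, an ISO-formatted date string) are inferred from the example above rather than from a published schema.

```python
import datetime as dt
import json
from dataclasses import dataclass


@dataclass(frozen=True)
class EnoRecord:
    text_id: str
    publication: str
    date: dt.date
    page_id: str
    text_segment: int
    pwa: float
    lang: str
    ocr_model: str
    bert_seg_model: str

    @classmethod
    def from_json(cls, line: str) -> "EnoRecord":
        """Parse one JSON line, converting the date string to a date."""
        raw = json.loads(line)
        # Assumption: dates are ISO formatted (YYYY-MM-DD), as in the example.
        raw["date"] = dt.date.fromisoformat(raw["date"])
        return cls(**raw)


line = (
    '{"text_id": "KobenhAv_17850315_00023", '
    '"publication": "Københavns Adresseavis", "date": "1785-03-15", '
    '"page_id": "KobenhAv_17850315_p005", "text_segment": 23, "pwa": 0.96, '
    '"lang": "da", "ocr_model": "ENO-Fraktur-CTC-v1", '
    '"bert_seg_model": "ENO-SetFit-Seg-v0.1"}'
)
record = EnoRecord.from_json(line)
print(record.publication, record.date.year, record.pwa)
```

Parsing the date up front makes later selection by year or period a plain attribute comparison instead of repeated string handling.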

Recommended organization for the SCAND section is:

scand/
  eno/
    metadata.jsonl
    texts/{publication_name}/{YYYY-MM}/{publication}_{date}_{text_id}.txt

All ENO data and models are licensed CC BY-SA 4.0. Web data are derived from publicly available sources and filtered for legal compliance (Heinsen et al., 2 Sep 2025; Öhman et al., 2023).

6. Intended Use and Interoperability

SCAND (LES – Nordic Web Data plus ENO) is one of nine major categories in The Nordic Pile, responsible for 38% of the corpus by volume and providing a critical backbone of general-domain, web-sourced, and specialized historical data for Danish, Icelandic, Norwegian, and Swedish. Its intended use is for pre-training LLMs (multi-billion parameter scale) to achieve both broad and Nordic-specific linguistic competence.

Alignment fields across datasets (e.g., "publication", "date", "lang", "text_id", "pwa") support diachronic, cross-lingual, and region-wide studies, and LLM or BERT embeddings are recommended as a shared representation across the collections. This design plausibly fosters reproducibility and comparability across complementary corpora in The Nordic Pile, and it supports the layered cleaning and filtering regimes needed for high-quality model pre-training (Öhman et al., 2023; Heinsen et al., 2 Sep 2025).
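
As a small illustration of how the alignment fields enable diachronic analysis, the sketch below groups records by language and decade and averages their "pwa" scores. The records and the function name `mean_pwa_by_decade` are hypothetical; only the field names come from the text.

```python
from collections import defaultdict
from statistics import mean

# Hypothetical records carrying the shared alignment fields.
records = [
    {"text_id": "a", "publication": "X", "date": "1785-03-15", "lang": "da", "pwa": 0.96},
    {"text_id": "b", "publication": "X", "date": "1789-07-01", "lang": "da", "pwa": 0.90},
    {"text_id": "c", "publication": "Y", "date": "1812-01-20", "lang": "da", "pwa": 0.80},
]


def mean_pwa_by_decade(records: list[dict]) -> dict:
    """Average "pwa" per (language, decade), keyed in sorted order."""
    groups = defaultdict(list)
    for r in records:
        # The year is the first four characters of the ISO date string.
        decade = int(r["date"][:4]) // 10 * 10
        groups[(r["lang"], decade)].append(r["pwa"])
    return {key: mean(vals) for key, vals in sorted(groups.items())}


summary = mean_pwa_by_decade(records)
print(summary)
```

The same grouping key could be extended with "publication" to compare titles within a decade.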


References

Öhman et al. (2023). The Nordic Pile: A 1.2TB Nordic Dataset for Language Modeling.

Heinsen et al. (2025). A World in Print: Introducing a Danish-Norwegian corpus of historical newspapers.
