
CommonCrawl Corpus Overview

Updated 21 January 2026
  • The CommonCrawl corpus is a public, multi-petabyte dataset of web snapshots dating back to 2008, capturing hundreds of billions of pages across diverse languages.
  • It provides versatile data in formats like WARC, WET, WAT, and CDX, supporting NLP, LLM pretraining, and web science through advanced extraction and cleaning pipelines.
  • Researchers leverage CommonCrawl’s open-access model to build specialized corpora that mitigate bias, enhance language coverage, and enable scalable analyses of global web content.

The CommonCrawl corpus is an extensive, regularly updated public dataset of multi-petabyte-scale web-crawled data, spanning hundreds of billions of web pages in hundreds of languages and captured by the non-profit Common Crawl Foundation. The corpus serves as the backbone for a large and growing class of datasets used in NLP, LLM pretraining, web science, geography, and computational social science. The core data asset is the set of WARC (Web ARChive) files released per monthly crawl, accompanied by auxiliary products such as WET (text-only) files, WAT (metadata), and richly structured CDX indices. At current scale, a single monthly archive surpasses 50–100 TB (compressed) of raw HTML, enabling analyses of web structure and language use at a previously inaccessible global scale (Kolias et al., 2014, Thompson, 2024).

1. Structure, Evolution, and Distribution of the Corpus

The CommonCrawl corpus consists of monthly web snapshots dating back to 2008, with each release ("CC-MAIN-yyyy-xx") comprising thousands of WARC files. These files archive HTTP responses with full page content, URL, HTTP headers, and crawl metadata. From each WARC dump, several derivative products are released (a minimal reading sketch in Python follows the list):

  • WARC: Full HTTP payloads (HTML, CSS, JS, PDF, etc.), packaged into multi-gigabyte GZip files.
  • WET: Extracted UTF-8 plain-text content from HTML responses; line-segmented, with the original document structure discarded.
  • WAT: Extracted metadata and link graphs.
  • CDX: Sorted, block-indexed URI indices with per-capture metadata (URL, timestamp, checksum, language code, MIME type) (Thompson, 2024).
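
To make the record layout concrete, the following sketch iterates over response records in a single WARC shard and prints basic fields. It assumes the third-party warcio library; the filename is a placeholder for a locally downloaded shard.

```python
# Minimal sketch: iterate over response records in a Common Crawl WARC shard.
# Assumes `pip install warcio`; the shard path below is a placeholder.
from warcio.archiveiterator import ArchiveIterator

warc_path = "CC-MAIN-example.warc.gz"  # placeholder for a downloaded WARC file

with open(warc_path, "rb") as stream:
    for record in ArchiveIterator(stream):
        if record.rec_type != "response":
            continue  # skip request/metadata records
        url = record.rec_headers.get_header("WARC-Target-URI")
        content_type = record.http_headers.get_header("Content-Type")
        payload = record.content_stream().read()  # raw HTTP body, e.g. HTML bytes
        print(url, content_type, len(payload))
```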

The corpus grows continually, with hundreds of billions of page captures now available across crawls. Its open-access model makes it the primary source for web-scale public datasets in NLP and web science.

2. Language, Domain, and Content Coverage

CommonCrawl offers remarkable breadth across languages, topics, and geographies. Recent snapshots contain web pages in hundreds of languages and scripts. Language detection is carried out with models such as FastText-LID, CLD2, idNet, and custom classifiers, operating at the line, paragraph, or document level (Kargaran et al., 2024, Abadji et al., 2022, Dunn, 2020). For example, the GlotCC pipeline's LID model distinguishes 2102 language-script labels, and over 1275 ISO 639-3 + ISO 15924 pairs are detected in a 2 TB extracted subset (Kargaran et al., 2024).
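
As an illustration of document-level LID, the sketch below uses the fastText library with its publicly distributed lid.176.bin identification model; the model path and the 0.5 confidence threshold are illustrative assumptions rather than settings taken from the cited pipelines.

```python
# Minimal sketch: document-level language identification with fastText.
# Assumes `pip install fasttext` and the public lid.176.bin model downloaded locally.
import fasttext

model = fasttext.load_model("lid.176.bin")  # path is a placeholder

def detect_language(text: str, min_confidence: float = 0.5):
    """Return (language_code, confidence) or None when the prediction is too uncertain."""
    labels, probs = model.predict(text.replace("\n", " "), k=1)  # fastText expects one line
    code = labels[0].replace("__label__", "")
    confidence = float(probs[0])
    return (code, confidence) if confidence >= min_confidence else None

print(detect_language("Der schnelle braune Fuchs springt über den faulen Hund."))
```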

Web content is primarily HTML (≈90–95%), but other resource types—including PDFs, XML, CSS, and JSON—are included and can be indexed (Kolias et al., 2014, Turski et al., 2023). Specialized corpora such as CCpdf have been constructed by mining CC for PDF documents spread across thousands of hosts in multiple languages, yielding over a million downloadable PDFs (Turski et al., 2023).
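
One minimal way to locate such non-HTML captures is to query the public Common Crawl index server and filter on MIME type. The sketch below is illustrative only and is not the CCpdf pipeline: the crawl id, queried domain, and JSON field names are assumptions that should be verified against a live response.

```python
# Minimal sketch: list PDF captures for a domain via the Common Crawl index server.
# Crawl id and domain are placeholders; field names should be checked against live output.
import json
import requests

INDEX_URL = "https://index.commoncrawl.org/CC-MAIN-2024-10-index"  # example crawl id
params = {"url": "example.com/*", "output": "json"}

resp = requests.get(INDEX_URL, params=params, timeout=60)
resp.raise_for_status()

pdf_captures = []
for line in resp.text.splitlines():
    record = json.loads(line)
    if record.get("mime") == "application/pdf":
        # filename/offset/length locate the capture inside a WARC shard
        pdf_captures.append((record["url"], record["filename"], record["offset"], record["length"]))

print(f"{len(pdf_captures)} PDF captures found")
```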

Geographic coverage is quantified using domain TLDs (.de, .fr, etc.), CDX metadata, and explicit information in page content (coordinates, street addresses) (Ilyankou et al., 2024, Dunn, 2020, Dunn, 2024). For explicit geospatial references, 18.7% of analyzed CC documents contain addresses or coordinates, a prevalence matched across English and non-English data (Ilyankou et al., 2024).
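
As a toy illustration of the TLD-based proxy, the sketch below tallies top-level domains from capture URLs using only the standard library; the URL list is a placeholder, whereas real analyses read URLs from CDX or WAT records at scale.

```python
# Toy sketch: tally top-level domains as a crude geographic proxy.
# The URL list is a placeholder; in practice URLs come from CDX or WAT records.
from collections import Counter
from urllib.parse import urlparse

urls = [
    "https://www.example.de/impressum",
    "https://blog.example.fr/articles/1",
    "https://shop.example.com/item?id=42",
]

def top_level_domain(url: str) -> str:
    host = urlparse(url).hostname or ""
    return host.rsplit(".", 1)[-1].lower() if "." in host else ""

tld_counts = Counter(top_level_domain(u) for u in urls)
print(tld_counts.most_common())  # e.g. [('de', 1), ('fr', 1), ('com', 1)]
```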

3. Extraction, Cleaning, Language ID, and Filtering Pipelines

Extraction and cleaning pipelines are implemented to transform raw CC data into application-ready corpora. These pipelines have grown increasingly elaborate:

  • Text extraction: HTML content is parsed to segment main content from boilerplate, advertisements, code snippets, and navigation elements. Early work used heuristic filtering (e.g., line density, tag structures); more recent approaches employ model-based HTML parsers (e.g., MinerU-HTML) optimized for semantic fidelity and structure preservation (Ma et al., 20 Nov 2025).
  • Language identification (LID): Multiple methods are employed—FastText-based classifiers, ensemble n-gram models, neural networks (GlotLID, idNet). These are trained on Wikipedia, news, and domain-specific corpora, and operate at document, line, or token granularity, with open-set rejection to filter out unknown scripts or web noise (Kargaran et al., 2024, Dunn, 2020).
  • Noise and content filtering: Quality filters check for characteristics such as anomalously short lines, excessive boilerplate, repetition, script inconsistency, and technical (numeric/punctuation-heavy) or adult content. Many pipelines employ explicit blacklists (e.g., adult-domain blocklists), regular expressions, script-consistency heuristics, and crowd-verified PII/unsafe content classifiers (Kargaran et al., 2024, Qiu et al., 2024).
  • Deduplication: Both exact and fuzzy deduplication are performed. Techniques include cryptographic hashes at the document, paragraph, or sentence level (SHA-256, TLSH), and MinHash/LSH-based shingling for approximate matching. For example, the FuLG Romanian pipeline applies an initial exact hash deduplication followed by a fuzzy shingling pass with a Jaccard threshold ≥0.8, reducing duplicates by as much as 90% in large slices (Bădoiu et al., 2024, Gutiérrez-Fandiño et al., 2022); a minimal MinHash/LSH sketch is given after this list.
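
The following sketch illustrates the fuzzy-deduplication idea with MinHash/LSH via the third-party datasketch library; the shingle size, permutation count, and 0.8 Jaccard threshold are illustrative assumptions in the spirit of the settings cited above.

```python
# Minimal sketch: near-duplicate detection with MinHash + LSH (datasketch library).
# Shingle size, num_perm, and the 0.8 Jaccard threshold are illustrative choices.
from datasketch import MinHash, MinHashLSH

def minhash_of(text: str, num_perm: int = 128, shingle: int = 5) -> MinHash:
    m = MinHash(num_perm=num_perm)
    tokens = text.split()
    for i in range(max(len(tokens) - shingle + 1, 1)):
        m.update(" ".join(tokens[i:i + shingle]).encode("utf-8"))
    return m

docs = {
    "doc1": "the quick brown fox jumps over the lazy dog near the river bank",
    "doc2": "the quick brown fox jumps over the lazy dog near the river",
    "doc3": "completely unrelated text about web archives and crawl metadata",
}

lsh = MinHashLSH(threshold=0.8, num_perm=128)
signatures = {doc_id: minhash_of(text) for doc_id, text in docs.items()}
for doc_id, sig in signatures.items():
    lsh.insert(doc_id, sig)

# Candidates whose estimated Jaccard similarity to doc1 exceeds the threshold
# (the query result includes doc1 itself).
print(lsh.query(signatures["doc1"]))
```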

A summary of the main pipeline stages (following the GlotCC design) is provided below:

Stage | Tool/Algorithm | Output/Decision
----- | -------------- | ---------------
Data ingestion | WAT/WET map-reduce | Extracted text blocks
LID (document/line) | GlotLID v3.0 | Language tag, confidence
Noise filtering | Heuristics/regex | Drop on quality warnings
Cleaning | Script/word filters | Drop inconsistent/unsafe docs
Deduplication | TLSH | Assign hash for downstream dedup

(Kargaran et al., 2024, Gutiérrez-Fandiño et al., 2022, Qiu et al., 2024)
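
As a sketch of the hash-assignment stage in the table above, the snippet below computes per-document TLSH digests and a pairwise distance, assuming the py-tlsh binding; the input texts are placeholders, and the distance threshold for calling two documents duplicates is left to downstream tooling.

```python
# Sketch: assign a locality-sensitive TLSH digest per document and compare two
# documents by TLSH distance (lower = more similar). Assumes `pip install py-tlsh`;
# note that TLSH needs a minimum input length (roughly 50-256 bytes, build-dependent).
import tlsh

doc_a = ("Common Crawl releases monthly snapshots of the web as WARC files, together "
         "with WET plain text, WAT metadata, and CDX indices. The archives cover "
         "hundreds of billions of page captures and are widely used for NLP research.")
doc_b = ("Common Crawl releases monthly snapshots of the web as WARC archives, together "
         "with WET plain text, WAT metadata, and CDX index files. The archives cover "
         "hundreds of billions of page captures and are widely used in NLP research.")

digest_a = tlsh.hash(doc_a.encode("utf-8"))
digest_b = tlsh.hash(doc_b.encode("utf-8"))

print(digest_a)                       # hex digest stored alongside the document
print(tlsh.diff(digest_a, digest_b))  # small distance suggests near-duplicate content
```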

4. Specialized Corpora and Derivatives

A proliferation of derivative corpora has been engineered atop CommonCrawl. Key examples:

  • GlotCC: A 684.7M-document, 512.6B-word corpus covering 1275 language-script pairs, designed to maximize minority language inclusion with reproducible, open pipelines (Kargaran et al., 2024).
  • esCorpius: 322.5 GB, 50B-token Spanish corpus preserving document/paragraph structure and deduplicated by exact and LSH-based methods (Gutiérrez-Fandiño et al., 2022).
  • FuLG: 156B-token Romanian corpus built via dual-pass deduplication and Gopher-style data cleaning (Bădoiu et al., 2024).
  • WanJuan-CC: >1T-token English web corpus with explicit quality and safety filtering and open-source data (Qiu et al., 2024).
  • DepCC: 252B-token linguistically-annotated English corpus with full dependency parses and named-entity tags in CoNLL format (Panchenko et al., 2017).
  • CGLU/CGLU+: Multi-hundred-billion word multi-country/multilingual corpora, mapping language and geography at fine granularity and offering metrics on global variation, dialects, and geospatial presence (Dunn, 2020, Dunn, 2024).
  • AICC: 7.3T-token multilingual corpus that uses a model-based HTML-to-structured-Markdown pipeline for high-fidelity extraction, with evidence that improved extraction increases LLM downstream performance (Ma et al., 20 Nov 2025).

Across these and similar efforts, cleaning, deduplication, and domain balancing are essential for curating high-utility resources for training and evaluation.

5. Undesirable Content, Bias, and Corpus Quality

Despite aggressive filtering, empirical studies show CC-based resources retain a non-trivial fraction of undesirable content. For instance, analysis of a 2020 English CC sample found 4–6% of documents contain hate speech and 1–2% contain sexually explicit content, measured by multiple detector types (BERT classifiers, n-gram lexica, etc.) (Luccioni et al., 2021). Perplexity-based filtering is insufficient to eliminate such content, as hate/explicit indicators correlate poorly (Pearson r ≈ 0) with LM perplexity (Luccioni et al., 2021). Scaling law results predict that as LLMs increase capacity, they more readily memorize rare toxic examples unless such data is explicitly downweighted or removed.
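
To make the near-zero correlation concrete, the toy sketch below computes a Pearson coefficient between per-document toxicity scores and log-perplexities; the arrays are synthetic placeholders, not data from Luccioni et al. (2021).

```python
# Toy sketch: correlation between toxicity scores and LM perplexity.
# The arrays are synthetic placeholders; real analyses use per-document detector
# scores and perplexities computed over a CommonCrawl sample.
import numpy as np

rng = np.random.default_rng(0)
toxicity_scores = rng.uniform(0.0, 1.0, size=1_000)            # e.g. classifier probabilities
perplexities = rng.lognormal(mean=4.0, sigma=0.5, size=1_000)   # e.g. KenLM perplexities

r = np.corrcoef(toxicity_scores, np.log(perplexities))[0, 1]
print(f"Pearson r = {r:.3f}")  # close to zero here because the toy data are independent
```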

Corpus quality tradeoffs and their relation to downstream NLU performance have been investigated in low-resource settings (Artetxe et al., 2022). While domain-specific, high-quality crawls (e.g., EusCrawl for Basque) yield cleaner data (66% high-quality docs vs. <33% for mC4/CC100), downstream NLU performance is found to depend more on corpus size and domain breadth than raw text quality for many tasks.

Systematic cleaning steps—agreement across LID models, deduplication, outlier detection, and location-sensitive models—yield corpora that are measurably closer (using measures such as Jensen–Shannon divergence) to hand-curated baselines and social media registers (e.g., geolocated Twitter corpora), but with a risk of over-pruning under-represented populations (Dunn, 2024).
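
The sketch below shows one way such a comparison can be computed with SciPy over two unigram distributions; the word counts are placeholders, and note that jensenshannon returns the JS distance (the square root of the divergence).

```python
# Minimal sketch: Jensen-Shannon comparison of two unigram distributions.
# The counts are placeholders; real comparisons use vocabulary-aligned frequency
# vectors from a cleaned CC slice and a reference corpus (e.g. geolocated tweets).
import numpy as np
from scipy.spatial.distance import jensenshannon

corpus_a = {"the": 120, "web": 30, "crawl": 25, "data": 40}
corpus_b = {"the": 100, "web": 10, "crawl": 5, "data": 15, "tweet": 50}

vocab = sorted(set(corpus_a) | set(corpus_b))
p = np.array([corpus_a.get(w, 0) for w in vocab], dtype=float)
q = np.array([corpus_b.get(w, 0) for w in vocab], dtype=float)
p /= p.sum()
q /= q.sum()

js_distance = jensenshannon(p, q, base=2)  # distance in [0, 1] when base=2
print(f"JS distance = {js_distance:.3f}, JS divergence = {js_distance ** 2:.3f}")
```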

6. Licensing, Access, and Reproducibility

CommonCrawl data and derivative corpora are generally released under open data licenses, most often variants of CC-BY (for attribution) or the original Common Crawl Terms of Use (see http://commoncrawl.org/terms-of-use) (Kolias et al., 2014, Panchenko et al., 2017, Kargaran et al., 2024). Many pipelines and datasets are open-sourced (code, indices, filters, annotated metadata) to enable full reproducibility and auditability (e.g., Kargaran et al., 2024, Qiu et al., 2024).

Standard interfaces for querying or downloading processed corpora include:

  • S3/GCS URIs for bulk downloading WARC/WET shards (a minimal fetch sketch follows this list).
  • JSONL, CoNLL, or Markdown for filtered/document-level corpora.
  • Elasticsearch/Kibana indices for linguistically enriched corpora (DepCC) (Panchenko et al., 2017).
  • Open Hugging Face Datasets for easy access to language- or task-specific corpora.
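
As a concrete access pattern, the sketch below fetches a single capture from data.commoncrawl.org with an HTTP byte-range request, using the filename/offset/length fields that a CDX index lookup returns; the specific values shown are placeholders.

```python
# Minimal sketch: fetch one WARC record over HTTPS with a byte-range request.
# filename/offset/length are placeholders; real values come from a CDX index lookup.
import gzip
import requests

filename = "crawl-data/CC-MAIN-2024-10/segments/0000000000000.00/warc/CC-MAIN-example.warc.gz"
offset, length = 123456, 7890  # placeholder byte offset and record length

url = f"https://data.commoncrawl.org/{filename}"
headers = {"Range": f"bytes={offset}-{offset + length - 1}"}

resp = requests.get(url, headers=headers, timeout=60)
resp.raise_for_status()

# Each capture is an independently gzipped member, so it decompresses on its own.
record_bytes = gzip.decompress(resp.content)
print(record_bytes[:200].decode("utf-8", errors="replace"))  # WARC header + start of payload
```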

Hardware and compute requirements for parsing and processing full snapshots are substantial (e.g., 340 wall-clock hours with 48 processes for a single GlotCC snapshot on an Intel Xeon E7-8857 host) (Kargaran et al., 2024).

7. Implications, Limitations, and Future Directions

The pervasive use of CommonCrawl-derived corpora in LLM and NLP research is both a driver of innovation and a locus of recurring challenges:

  • Bias and coverage: Overrepresentation of English, .com domains, and high-resource languages persists even after balancing, influencing LLM knowledge distribution (Dunn, 2020, Ilyankou et al., 2024).
  • Content quality: Aggressive filtering (including removal of math/code) can exclude valuable scientific/technical documents, while monolingual or list-based filters can bias against code-switching or domain-mixed content (Kargaran et al., 2024, Ma et al., 20 Nov 2025).
  • Methodological best practices: Use of robust LID, MinHash-based deduplication, transparent cleaning, language/country mapping, and segment-based proxy analysis (using index metadata for representative sampling) are emerging standards (Thompson, 2024).
  • Audit and interpretability: Open metadata, audit logs, and structured tags are key for error analysis and selective corpus construction (including for GDPR/PII takedowns) (Gutiérrez-Fandiño et al., 2022, Panchenko et al., 2017).

Ongoing research targets enhanced extraction (with model-driven parsers), better safety filtering (adversarial robustness, PII masking), finer granular geolocation and dialect mapping, and sustainable workflows for continuous updating and evaluation (Kargaran et al., 2024, Ma et al., 20 Nov 2025, Ilyankou et al., 2024).


References:
  • Kolias et al., 2014: Exploratory analysis of large-scale CommonCrawl.
  • Panchenko et al., 2017: DepCC: Dependency-Parsed CommonCrawl.
  • Kúdela et al., 2018: Parallel paragraph extraction from CommonCrawl.
  • Dunn, 2020: CGLU: Mapping Languages.
  • Luccioni et al., 2021: Undesirable content in CommonCrawl.
  • Abadji et al., 2022: Document-oriented OSCAR from CommonCrawl.
  • Artetxe et al., 2022: Corpus quality in low-resource languages.
  • Gutiérrez-Fandiño et al., 2022: esCorpius: large-scale Spanish CC corpus.
  • Turski et al., 2023: CCpdf: PDF extraction from CC.
  • Qiu et al., 2024: WanJuan-CC: Safe English webtext.
  • Dunn, 2024: Geographic validation in CC corpora.
  • Thompson, 2024: Efficient longitudinal analytics using CC.
  • Okazaki et al., 2024: Japanese web CC corpus.
  • Ilyankou et al., 2024: Quantifying geospatial content in CC.
  • Bădoiu et al., 2024: FuLG: Romanian web corpus.
  • Kargaran et al., 2024: GlotCC: Broad-coverage CC for minority languages.
  • Ma et al., 20 Nov 2025: AICC: Model-based HTML extraction for LLMs.
