EU Law Compliance of Using Common Crawl Data for LLM Training

Determine whether using web content scraped and redistributed by the Common Crawl association as training data for large language models complies with European Union copyright law, including the scope and implementation requirements of the Text and Data Mining exception in Article 4 of the CDSM Directive.

Background

The authors explain that although Common Crawl aggregates publicly accessible web content, the legality of using such data for LLM training in the EU is uncertain. The Text and Data Mining exception applies only where rightsholders have not reserved their rights, and such reservations can be expressed through machine‑readable opt‑outs that go beyond robots.txt. The authors report that no robust tools exist to reliably detect all forms of machine‑readable opt‑out, which increases compliance risk.

Because of this uncertainty, the GPT-NL Public Corpus includes only those portions of Common Crawl data that carry clearly permissive licenses (selected via the authors' C5 pipeline) and excludes broad web‑scraped content whose compliance cannot be assured.
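
To make the detection problem concrete, the following is a minimal sketch of a partial opt-out check. It is not the authors' tooling and not a compliance test. It assumes two signals only: a robots.txt rule that disallows the crawler (the user agent name "CCBot" is an assumption here) and a `tdm-reservation: 1` HTTP response header as used by the TDM Reservation Protocol. The function name `detect_opt_out_signals` and its parameters are hypothetical. Any opt-out expressed in another form (page metadata, terms of service, license files) would go undetected, which is exactly the gap described above.

```python
from urllib.robotparser import RobotFileParser


def detect_opt_out_signals(robots_txt, headers, url, user_agent="CCBot"):
    """Return the names of the opt-out signals found for one URL.

    Hypothetical helper: it checks only two machine-readable signals and
    therefore cannot establish that no rights reservation exists.
    """
    signals = []

    # Signal 1: robots.txt disallows this user agent for this URL.
    parser = RobotFileParser()
    parser.parse(robots_txt.splitlines())
    if not parser.can_fetch(user_agent, url):
        signals.append("robots.txt")

    # Signal 2: the response carries a TDM reservation header.
    # Header names are case-insensitive, so normalize them first.
    normalized = {name.lower(): value.strip() for name, value in headers.items()}
    if normalized.get("tdm-reservation") == "1":
        signals.append("tdm-reservation")

    return signals


# Usage: the content here is fetched elsewhere and passed in as plain values.
found = detect_opt_out_signals(
    robots_txt="User-agent: CCBot\nDisallow: /\n",
    headers={"TDM-Reservation": "1"},
    url="https://example.org/article",
)
print(found)
```

An empty result from such a check means only that these two signals were absent, not that the content may lawfully be mined.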

References

However, it is not clear-cut whether using data scraped by the Common Crawl association is in compliance with EU law when used in LLM training.

GPT-NL Public Corpus: A Permissively Licensed, Dutch-First Dataset for LLM Pre-training (2604.00920 - Oort et al., 1 Apr 2026) in Section 3.2 (The Law Perspective), paragraph “TDM-exception and unclear licensing of web data”