EU Law Compliance of Using Common Crawl Data for LLM Training
Determine whether using web content scraped and redistributed by the Common Crawl association as training data for large language models (LLMs) complies with European Union copyright law, including the scope and implementation requirements of the Text and Data Mining (TDM) exception in Article 4 of the CDSM Directive (Directive (EU) 2019/790).
References
However, it is not clear-cut whether using data scraped by the Common Crawl association is in compliance with EU law when used in LLM training.
— GPT-NL Public Corpus: A Permissively Licensed, Dutch-First Dataset for LLM Pre-training
(2604.00920 - Oort et al., 1 Apr 2026) in Section 3.2 (The Law Perspective), paragraph “TDM-exception and unclear licensing of web data”