Intentional inclusion of copyrighted bestsellers in LLM training data
Determine whether companies developing large language models intentionally include copyrighted bestselling books in their training datasets, distinguishing deliberate curation from incidental inclusion via large-scale web scraping, in order to clarify the provenance of model exposure to such works.
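One way to make "exposure" concrete, regardless of how a book entered the training data, is a verbatim-reproduction probe: feed the model a prefix from a candidate work and measure how closely its continuation matches the true text. The sketch below is illustrative only and is not the RECAP pipeline; `query_model`, the probe list, and the similarity threshold are hypothetical placeholders (the example passage is public-domain Dickens, used purely to show the data shape).

```python
from difflib import SequenceMatcher


def query_model(prefix: str) -> str:
    """Hypothetical stand-in for an LLM completion call.

    Replace with a real API client; here it returns an empty continuation
    so the script runs end-to-end without network access.
    """
    return ""


def reproduction_score(prefix: str, reference_continuation: str) -> float:
    """Return a 0-1 similarity between the model's continuation and the true text."""
    completion = query_model(prefix)
    return SequenceMatcher(None, completion, reference_continuation).ratio()


# Hypothetical probe set: (prefix, true continuation) pairs drawn from a candidate work.
probes = [
    ("It was the best of times, it was the worst of times,",
     " it was the age of wisdom, it was the age of foolishness,"),
]

for prefix, reference in probes:
    score = reproduction_score(prefix, reference)
    # Consistently high similarity across many probes suggests exposure; it does
    # not, by itself, distinguish deliberate curation from incidental scraping.
    print(f"{score:.2f}  {prefix[:40]}...")
```

Note that such probing only establishes whether a model has memorized portions of a work; attributing that exposure to deliberate dataset curation versus web-scraped copies requires additional provenance evidence.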
References
Secondly, we include a set of 15 copyrighted bestsellers. Although it is unclear whether companies intentionally incorporate these works, the widespread unauthorized distribution of such books across the internet makes it highly probable that most models have, to some extent, been exposed to them through large-scale web scraping.
— RECAP: Reproducing Copyrighted Data from LLMs Training with an Agentic Pipeline
(2510.25941 - Duarte et al., 29 Oct 2025) in Subsection 3.1.1 (EchoTrace Benchmark) — Books paragraph