Intentional inclusion of copyrighted bestsellers in LLM training data

Ascertain whether companies developing large language models intentionally include copyrighted bestselling books in their training datasets. Distinguishing deliberate curation from incidental inclusion via large-scale web scraping would clarify the provenance of model exposure to such works.

Background

The EchoTrace benchmark includes 35 full-length books spanning public domain works, copyrighted bestsellers, and recent non-training titles. This composition is designed to evaluate verbatim memorization in LLMs across texts likely and unlikely to be present in training corpora.
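As a loose illustration of how verbatim memorization can be probed (this is not the paper's RECAP pipeline; the `generate` callable is a hypothetical stand-in for any model API), one can prompt a model with the opening of a book passage and score its continuation against the ground truth:

```python
from difflib import SequenceMatcher

def verbatim_score(generate, passage: str, prefix_words: int = 50) -> float:
    """Prompt the model with the opening words of `passage` and score how
    closely its continuation matches the true continuation (0.0 to 1.0)."""
    words = passage.split()
    prefix = " ".join(words[:prefix_words])
    reference = " ".join(words[prefix_words:])
    continuation = generate(prefix)  # hypothetical model call
    return SequenceMatcher(None, continuation, reference).ratio()
```

Near-perfect scores on long continuations from a copyrighted bestseller would indicate memorization, though not, by themselves, whether the book was deliberately curated or incidentally scraped.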

Within this setup, the authors explicitly note uncertainty regarding whether companies intentionally incorporate copyrighted bestsellers into training data. Resolving this uncertainty is important for interpreting extraction results and for understanding whether observed reproductions arise from deliberate inclusion or from incidental exposure through web-scraped content.
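As a hedged sketch of how that interpretation could be operationalized (an illustrative heuristic, not the paper's methodology; the category labels are assumed here to mirror the EchoTrace composition), per-category averages of the scores above can be compared:

```python
# Hypothetical aggregation: compare mean verbatim scores per book category.
from collections import defaultdict
from statistics import mean

def category_means(scores: list[tuple[str, float]]) -> dict[str, float]:
    """scores: (category, verbatim_score) pairs, e.g. ('bestseller', 0.83)."""
    by_cat: dict[str, list[float]] = defaultdict(list)
    for category, score in scores:
        by_cat[category].append(score)
    return {cat: mean(vals) for cat, vals in by_cat.items()}

# Illustrative reading: if bestsellers score close to public-domain works
# (almost certainly present in training data) and well above recent
# non-training titles, the models were likely exposed to them, whether
# through deliberate curation or incidental scraping.
```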

References

Secondly, we include a set of 15 copyrighted bestsellers. Although it is unclear whether companies intentionally incorporate these works, the widespread unauthorized distribution of such books across the internet makes it highly probable that most models have, to some extent, been exposed to them through large-scale web scraping.

RECAP: Reproducing Copyrighted Data from LLMs Training with an Agentic Pipeline (arXiv:2510.25941, Duarte et al., 29 Oct 2025), Subsection 3.1.1 (EchoTrace Benchmark), Books paragraph