
DE-COP: Detecting Copyrighted Content in Language Models Training Data (2402.09910v2)

Published 15 Feb 2024 in cs.CL and cs.LG

Abstract: How can we detect if copyrighted content was used in the training process of an LLM, considering that the training data is typically undisclosed? We are motivated by the premise that an LLM is likely to identify verbatim excerpts from its training text. We propose DE-COP, a method to determine whether a piece of copyrighted content was included in training. DE-COP's core approach is to probe an LLM with multiple-choice questions, whose options include both verbatim text and their paraphrases. We construct BookTection, a benchmark with excerpts from 165 books published prior and subsequent to a model's training cutoff, along with their paraphrases. Our experiments show that DE-COP surpasses the prior best method by 9.6% in detection performance (AUC) on models with logits available. Moreover, DE-COP also achieves an average accuracy of 72% for detecting suspect books on fully black-box models where prior methods give approximately 4% accuracy. The code and datasets are available at https://github.com/LeiLiLab/DE-COP.


Summary

  • The paper presents DE-COP, a novel method using multiple-choice probing to reliably detect copyrighted text in LLM training datasets.
  • It introduces the BookTection and arXivTection benchmarks of book and research-paper excerpts, achieving a 9.6% AUC improvement on models with accessible logits and 72% average accuracy on fully black-box models.
  • The approach offers practical benefits for legal compliance and deepens our understanding of model memorization and data privacy in AI.

An Analysis of DE-COP: Detecting Copyrighted Content in LLMs' Training Data

The paper "DE-COP: Detecting Copyrighted Content in LLMs Training Data" addresses a pressing issue in the field of AI and ethics: identifying copyrighted content in the training data of LLMs. As LLMs grow in complexity and utility, the ethical implications of their training data become even more significant, particularly when such data may contain copyrighted material. The authors present DE-COP, a novel methodology designed to determine whether particular copyrighted text has been included in the training corpus of LLMs.

Methodological Overview

DE-COP operates by probing LLMs with multiple-choice questions that ask the model to distinguish a verbatim passage from its paraphrased versions. The core assumption is that a model identifies verbatim excerpts from its training data more reliably than excerpts it has never seen. To substantiate the methodology, the authors introduce two datasets, BookTection and arXivTection, comprising excerpts from books and research papers, respectively. These datasets serve as benchmarks for evaluating the effectiveness of DE-COP against prior methods.
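To make the probing step concrete, the sketch below shows how such a multiple-choice question might be assembled and scored. It is a minimal illustration rather than the authors' released code: `ask_model` is a hypothetical placeholder for any chat-completion call, and the prompt wording is an assumption.

```python
import random

def build_probe(book_title, verbatim, paraphrases):
    """Build a multiple-choice probe: one verbatim excerpt hidden among paraphrases."""
    options = [verbatim] + list(paraphrases)
    random.shuffle(options)
    correct = options.index(verbatim)            # position where the true excerpt landed
    letters = "ABCD"
    lines = [f'Which of the following passages is the exact text from "{book_title}"?']
    lines += [f"{letters[i]}. {opt}" for i, opt in enumerate(options)]
    lines.append("Answer with a single letter.")
    return "\n".join(lines), letters[correct]

def probe_once(ask_model, book_title, verbatim, paraphrases):
    """Return 1 if the model picks the verbatim passage, else 0."""
    prompt, answer = build_probe(book_title, verbatim, paraphrases)
    reply = ask_model(prompt).strip().upper()    # ask_model: placeholder for any chat API call
    return int(reply[:1] == answer)
```

Repeating `probe_once` over many excerpts from the same book yields a per-book score that can serve as the detection signal.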

Experimentation shows that DE-COP significantly advances the state of the art in detecting copyrighted content. On models with accessible logits, DE-COP achieves a 9.6% improvement in AUC over the prior best method, and on fully black-box models, where previous methods reach roughly 4% accuracy, DE-COP attains 72% average accuracy. This positions DE-COP as not only more accurate but also more versatile across different model configurations.
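As a rough illustration of how such detection scores can be evaluated, the following sketch aggregates hypothetical per-passage probe outcomes into per-book scores and computes an AUC against publication-date labels, with books published before the training cutoff treated as suspected members and books published after it as non-members. The numbers and the use of `roc_auc_score` are illustrative assumptions, not the paper's evaluation code.

```python
from statistics import mean
from sklearn.metrics import roc_auc_score

# Hypothetical probe outcomes per book (1 = verbatim excerpt identified, 0 = missed).
results = {
    "suspect_book_a":   [1, 1, 0, 1, 1],   # published before the training cutoff
    "suspect_book_b":   [1, 0, 1, 1, 0],   # published before the training cutoff
    "post_cutoff_book": [0, 1, 0, 0, 0],   # published after the cutoff, cannot be in training data
}
labels = {"suspect_book_a": 1, "suspect_book_b": 1, "post_cutoff_book": 0}

scores = [mean(outcomes) for outcomes in results.values()]   # per-book detection score
y_true = [labels[name] for name in results]
print("Detection AUC:", roc_auc_score(y_true, scores))
```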

Implications and Observations

The implications of DE-COP extend across both practical and theoretical dimensions. Practically, the methodology bears directly on the AI industry's compliance with intellectual property law. By facilitating the identification of copyrighted content within training datasets, DE-COP gives AI developers a pathway to ensure their models respect copyright legislation, avoiding potential legal liabilities and enhancing public trust in AI technologies.

Theoretically, DE-COP contributes to ongoing discussions around membership inference, a field primarily concerned with deducing whether a specific piece of data was part of a model's training set. By innovatively applying multiple-choice questions and leveraging paraphrased content, DE-COP offers an alternative to traditional methods that often require access to token probabilities, thus extending its applicability to black-box models like ChatGPT or Claude.
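Because the decision depends only on which option the model selects, not on token probabilities, the probe runs unchanged against a purely conversational interface. The sketch below illustrates how a black-box variant might also average over permuted option orderings to blunt the position bias that multiple-choice answers are known to exhibit; the permutation averaging shown here is an illustrative assumption in the spirit of the method, not necessarily the paper's exact procedure, and `ask_model` again stands in for any chat API call.

```python
from itertools import permutations

def probe_all_orderings(ask_model, book_title, verbatim, paraphrases):
    """Average correctness over every ordering of the options to reduce
    the model's bias toward particular answer positions."""
    letters = "ABCD"
    orderings = list(permutations([verbatim] + list(paraphrases)))
    hits = 0
    for order in orderings:
        correct = letters[order.index(verbatim)]
        prompt = "\n".join(
            [f'Which passage appears verbatim in "{book_title}"?']
            + [f"{letters[i]}. {opt}" for i, opt in enumerate(order)]
            + ["Answer with a single letter."]
        )
        reply = ask_model(prompt).strip().upper()   # placeholder black-box chat call
        hits += int(reply[:1] == correct)
    return hits / len(orderings)
```

Under the method's core assumption, a score near chance level (0.25 for four options) suggests the passage was unseen, while a score well above chance points toward training-set membership.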

Interestingly, the paper highlights a gap between human and machine capabilities in this context, noting that human annotators struggle to identify whether content was part of a model's training data, whereas models achieve high accuracy on these tasks. This observation suggests that LLMs harbor more detailed knowledge from their training data than previously recognized, reinforcing concerns related to data privacy and the potential memorization of sensitive or proprietary information.

Future Directions

Looking toward the future, the development of tools like DE-COP can influence the trajectory of AI in several ways. First, it encourages transparency and accountability in the AI sector, potentially leading to the establishment of standard practices for auditing training data. Second, the methodology may inspire further research into the nuances of model memorization, especially how models retain and utilize information from their training corpus. Future iterations of DE-COP might refine its accuracy or expand its scope, such as detecting more abstract forms of intellectual property beyond verbatim text.

In conclusion, DE-COP provides a significant contribution to the landscape of AI ethics and legal compliance, offering a refined tool for identifying copyrighted content in the training data of LLMs. Its ability to surpass existing methods demonstrates the potential of innovative approaches in addressing ethical and technical challenges in AI development. This paper thus lays the groundwork for enhancing both the responsibility and the robustness of LLMs moving forward.
