Copyright Violations and Large Language Models (2310.13771v1)
Abstract: LLMs may memorize more than just facts, including entire chunks of text seen during training. Fair use exemptions to copyright laws typically allow for limited use of copyrighted material without permission from the copyright holder, but such exemptions generally cover the extraction of information from copyrighted materials rather than *verbatim* reproduction. This work explores the issue of copyright violations and LLMs through the lens of verbatim memorization, focusing on possible redistribution of copyrighted text. We present experiments with a range of LLMs over a collection of popular books and coding problems, providing a conservative characterization of the extent to which LLMs can redistribute these materials. Overall, this research highlights the need for further examination and the potential impact of memorization on future developments in natural language processing, to ensure adherence to copyright regulations. Code is at https://github.com/coastalcph/CopyrightLLMs.
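The probing setup described in the abstract can be illustrated concretely: prompt a model with the opening of a work it may have seen during training and measure how much of the original continuation it reproduces verbatim. The sketch below is a minimal illustration, not the paper's exact protocol; the choice of GPT-Neo 125M, the 50-token prefix, greedy decoding, the character-level longest-common-substring metric, and the function names are all assumptions made here for concreteness.

```python
# Minimal sketch of a verbatim-memorization probe. Assumes an open model
# available through Hugging Face transformers and a locally held excerpt of
# the text under study; prefix length and metric are illustrative choices.
from difflib import SequenceMatcher

from transformers import AutoModelForCausalLM, AutoTokenizer

model_name = "EleutherAI/gpt-neo-125m"  # small model, for illustration only
tokenizer = AutoTokenizer.from_pretrained(model_name)
model = AutoModelForCausalLM.from_pretrained(model_name)


def longest_common_substring(a: str, b: str) -> int:
    """Length (in characters) of the longest contiguous match between a and b."""
    match = SequenceMatcher(None, a, b, autojunk=False).find_longest_match(
        0, len(a), 0, len(b)
    )
    return match.size


def probe_verbatim_memorization(passage: str, prefix_tokens: int = 50,
                                max_new_tokens: int = 200) -> int:
    """Prompt the model with the opening of a passage and measure how much of
    its continuation overlaps verbatim with the original text."""
    ids = tokenizer(passage, return_tensors="pt").input_ids
    prefix_ids = ids[:, :prefix_tokens]
    reference = tokenizer.decode(ids[0, prefix_tokens:], skip_special_tokens=True)

    output_ids = model.generate(
        prefix_ids,
        max_new_tokens=max_new_tokens,
        do_sample=False,  # greedy decoding gives a conservative lower bound
        pad_token_id=tokenizer.eos_token_id,
    )
    continuation = tokenizer.decode(
        output_ids[0, prefix_ids.shape[1]:], skip_special_tokens=True
    )
    return longest_common_substring(continuation, reference)
```

A larger longest-common-substring value indicates more verbatim reproduction; greedy decoding keeps the estimate conservative, since sampling could surface additional memorized spans.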