Digger: Detecting Copyright Content Mis-usage in Large Language Model Training (2401.00676v1)
Abstract: Pre-training on extensive and varied datasets is a critical factor in the success of LLMs across numerous applications. However, the detailed makeup of these datasets is often not disclosed, leading to concerns about data security and potential misuse. This is particularly relevant when copyrighted material, still under legal protection, is used inappropriately, whether intentionally or unintentionally, infringing on the rights of the authors. In this paper, we introduce a detailed framework designed to detect and assess the presence of content from potentially copyrighted books within the training datasets of LLMs. The framework also provides a confidence estimate of the likelihood that each content sample was included. To validate our approach, we conduct a series of simulated experiments whose results confirm the framework's effectiveness in identifying and addressing instances of content misuse in LLM training. Furthermore, we investigate the presence of recognizable quotes from famous literary works within these datasets. Our findings have significant implications for ensuring the ethical use of copyrighted materials in the development of LLMs, highlighting the need for more transparent and responsible data management practices in this field.
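The abstract describes the detection framework only at a high level. As a rough illustration of the kind of signal such training-data audits commonly rely on, the sketch below scores a candidate passage by the average per-token loss a causal LM assigns to it and flags passages that score markedly lower than comparable reference text. This is a minimal sketch under assumed names (the `gpt2` placeholder model, the `margin` threshold, and the `likely_seen` helper are all illustrative), not the exact procedure proposed in the paper.

```python
# Minimal sketch (not the paper's exact pipeline): score a candidate passage by the
# average per-token cross-entropy loss a causal LM assigns to it. Passages seen
# (and partially memorized) during training tend to receive lower loss than
# stylistically similar unseen text.
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

MODEL_NAME = "gpt2"  # placeholder; the target LLM under audit would go here

tokenizer = AutoTokenizer.from_pretrained(MODEL_NAME)
model = AutoModelForCausalLM.from_pretrained(MODEL_NAME)
model.eval()


def avg_token_loss(text: str) -> float:
    """Average per-token loss of `text` under the model."""
    enc = tokenizer(text, return_tensors="pt", truncation=True, max_length=512)
    with torch.no_grad():
        out = model(input_ids=enc["input_ids"], labels=enc["input_ids"])
    return out.loss.item()


def likely_seen(candidate: str, reference_texts: list[str], margin: float = 0.5) -> bool:
    """Flag `candidate` if its loss is markedly lower than unseen reference passages.

    `margin` is an illustrative threshold; a real audit would calibrate it,
    e.g. on held-out books known to be outside the training data.
    """
    ref_losses = [avg_token_loss(t) for t in reference_texts]
    baseline = sum(ref_losses) / len(ref_losses)
    return avg_token_loss(candidate) < baseline - margin
```

In practice, such a raw loss comparison is only a starting point: confidence estimation of the sort the abstract mentions requires calibrating the threshold against passages of known membership status rather than a fixed margin.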