Proving membership in LLM pretraining data via data watermarks (2402.10892v3)
Abstract: Detecting whether copyright holders' works were used in LLM pretraining is poised to be an important problem. This work proposes using data watermarks to enable principled detection with only black-box model access, provided that the rightholder contributed multiple training documents and watermarked them before public release. By applying a randomly sampled data watermark, detection can be framed as hypothesis testing, which provides guarantees on the false detection rate. We study two watermarks: one that inserts random sequences, and another that randomly substitutes characters with Unicode lookalikes. We first show how three aspects of watermark design -- watermark length, number of duplications, and interference -- affect the power of the hypothesis test. Next, we study how a watermark's detection strength changes under model and dataset scaling: while increasing the dataset size decreases the strength of the watermark, watermarks remain strong if the model size also increases. Finally, we view SHA hashes as natural watermarks and show that we can robustly detect hashes from BLOOM-176B's training data, as long as they occurred at least 90 times. Together, our results point towards a promising future for data watermarks in real-world use.
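To make the abstract's setup concrete, below is a minimal sketch (not the authors' implementation) of the two watermark types and the hypothesis-test framing. The `LOOKALIKES` table is a small illustrative subset, the function names are assumptions, and the black-box scoring function is faked for the demo; in practice it would be the model's average token loss on the candidate watermark, obtained via queries to the LLM.

```python
import random
import string

# Illustrative subset of Latin -> Cyrillic lookalike substitutions
# (an assumption for this sketch, not the paper's exact table).
LOOKALIKES = {"a": "\u0430", "e": "\u0435", "o": "\u043e", "p": "\u0440", "c": "\u0441"}


def random_sequence_watermark(rng: random.Random, length: int = 20) -> str:
    """Watermark 1: a randomly sampled character sequence inserted into each document."""
    alphabet = string.ascii_lowercase + string.digits
    return "".join(rng.choice(alphabet) for _ in range(length))


def unicode_watermark(text: str, rng: random.Random, rate: float = 0.1) -> str:
    """Watermark 2: randomly substitute characters with Unicode lookalikes."""
    return "".join(
        LOOKALIKES[ch] if ch in LOOKALIKES and rng.random() < rate else ch
        for ch in text
    )


def detection_p_value(observed_score: float, null_scores: list[float]) -> float:
    """Monte Carlo p-value: the fraction of *non-inserted* candidate watermarks
    that the model scores at least as well (as low a loss) as the published one.
    Rejecting at level alpha then bounds the false detection rate by alpha."""
    hits = sum(s <= observed_score for s in null_scores)
    return (1 + hits) / (1 + len(null_scores))


if __name__ == "__main__":
    rng = random.Random(0)
    published = random_sequence_watermark(rng)

    def score(w: str) -> float:
        # Stand-in for the model's average token loss on w, queried black-box.
        # A memorized watermark would receive a systematically lower loss.
        return rng.random() - (0.5 if w == published else 0.0)

    null = [score(random_sequence_watermark(rng)) for _ in range(999)]
    print(f"detection p-value: {detection_p_value(score(published), null):.4f}")
```

The exact test statistic in the paper may differ (e.g., a parametric test on per-document losses rather than this Monte Carlo comparison); the point of the sketch is that the watermark is sampled at random before release, and it is this randomness, not any property of the text, that licenses the false-detection-rate guarantee under black-box access.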