Proving membership in LLM pretraining data via data watermarks (2402.10892v3)

Published 16 Feb 2024 in cs.CR, cs.CL, and cs.LG

Abstract: Detecting whether copyright holders' works were used in LLM pretraining is poised to be an important problem. This work proposes using data watermarks to enable principled detection with only black-box model access, provided that the rightholder contributed multiple training documents and watermarked them before public release. By applying a randomly sampled data watermark, detection can be framed as hypothesis testing, which provides guarantees on the false detection rate. We study two watermarks: one that inserts random sequences, and another that randomly substitutes characters with Unicode lookalikes. We first show how three aspects of watermark design -- watermark length, number of duplications, and interference -- affect the power of the hypothesis test. Next, we study how a watermark's detection strength changes under model and dataset scaling: while increasing the dataset size decreases the strength of the watermark, watermarks remain strong if the model size also increases. Finally, we view SHA hashes as natural watermarks and show that we can robustly detect hashes from BLOOM-176B's training data, as long as they occurred at least 90 times. Together, our results point towards a promising future for data watermarks in real world use.


Summary

  • The paper proposes a hypothesis testing framework using data watermarks to confirm membership of content in LLM pretraining data.
  • It evaluates random character sequences and Unicode lookalike substitutions as watermarks, measuring how detection strength changes with dataset and model size.
  • The study discusses legal and operational implications, showing how data watermarks can strengthen data provenance and inform future approaches to copyright compliance.

Advancements in Detecting Copyrighted Content in LLM Training through Data Watermarks

Introduction

The rapid rise of LLMs has intensified scrutiny of the ethical and legal questions surrounding the use of copyrighted material in their training data. As legal frameworks evolve in jurisdictions such as the European Union and the United States, copyright holders need robust methods to determine whether their works were used to train an LLM. This paper introduces an approach built on data watermarks that allows rightholders to statistically test, with only black-box model access, whether their content was included in an LLM's training data, providing a principled basis for addressing potential copyright infringement.

Hypothesis Testing with Data Watermarks

The core of the method is a hypothesis-testing framework in which copyright holders embed randomly sampled, inconspicuous modifications (data watermarks) into their documents before public release. If a trained model shows a statistically significantly lower loss on the published watermark than on freshly sampled alternatives, this is strong evidence that the model was trained on the watermarked content. Because the watermark is chosen at random, the test yields a valid p-value, and the paper details the conditions under which watermarks can be inserted and detected while keeping the false detection rate controlled.
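
The test can be sketched in a few lines. The snippet below is a minimal illustration rather than the paper's exact protocol: it assumes a hypothetical black-box scorer `sequence_loss(text)` that returns the model's average token loss on a string, and the alphabet, watermark length, and number of null samples are illustrative choices.

```python
import random
import string


def sample_watermark(length: int, rng: random.Random) -> str:
    """Draw a random character sequence from a fixed alphabet."""
    alphabet = string.ascii_letters + string.digits
    return "".join(rng.choice(alphabet) for _ in range(length))


def detection_p_value(published_watermark: str,
                      sequence_loss,
                      num_null_samples: int = 1000,
                      seed: int = 0) -> float:
    """Rank the published watermark's loss among freshly sampled alternatives.

    Under the null hypothesis (the model never saw the watermark), the
    published watermark is exchangeable with the fresh samples, so the rank
    of its loss is uniform and the returned p-value is valid.
    """
    rng = random.Random(seed)
    observed = sequence_loss(published_watermark)  # assumed black-box scorer
    null_losses = [
        sequence_loss(sample_watermark(len(published_watermark), rng))
        for _ in range(num_null_samples)
    ]
    # One-sided test: a suspiciously low loss on the published watermark
    # suggests the model saw it during training.
    rank = sum(loss <= observed for loss in null_losses)
    return (rank + 1) / (num_null_samples + 1)
```

Because the published watermark is itself drawn uniformly at random, its rank among the null samples is uniform under the null hypothesis, which is what gives the test its guarantee on the false detection rate.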

Design and Impact of Watermarks

Two primary watermark types are explored:

  • Random character sequences: appending a randomly sampled sequence to each document, which enables controlled experiments on how watermark properties such as length and duplication affect detection strength.
  • Unicode substitutions: replacing ASCII characters with visually indistinguishable Unicode lookalikes, which alters the underlying text without affecting human readability. A minimal sketch of both watermark styles appears after this list.
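
As a rough illustration of how the two styles might be applied, the sketch below uses a small hand-picked table of Latin-to-Cyrillic lookalikes and a per-character substitution rate; the paper's actual character set, substitution scheme, and insertion positions may differ.

```python
import random

# Hand-picked Latin-to-Cyrillic confusables, chosen here for illustration only.
LOOKALIKES = {
    "a": "\u0430",  # Cyrillic small a
    "c": "\u0441",  # Cyrillic small es
    "e": "\u0435",  # Cyrillic small ie
    "o": "\u043e",  # Cyrillic small o
    "p": "\u0440",  # Cyrillic small er
}


def append_random_sequence(document: str, watermark: str) -> str:
    """Random-sequence watermark: append the same sampled string to each document."""
    return document + "\n" + watermark


def substitute_unicode_lookalikes(document: str, rate: float, seed: int = 0) -> str:
    """Unicode watermark: randomly swap characters for visually identical codepoints."""
    rng = random.Random(seed)
    return "".join(
        LOOKALIKES[ch] if ch in LOOKALIKES and rng.random() < rate else ch
        for ch in document
    )
```

In both cases the randomness (the sampled sequence, or the seed that decides which characters get substituted) serves as the watermark key that the rightholder later tests for.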

Through controlled experiments, the paper shows how key design factors, including watermark length, the number of duplicated watermarked documents, and interference among multiple watermarks, affect the power of the hypothesis test. Notably, increasing the dataset size weakens a watermark's detection strength, but watermarks remain strong if the model size grows along with the data.

Practical Implications and Theoretical Contributions

This investigation presents compelling evidence that data watermarks offer a viable way for copyright holders to establish that their content was included in an LLM's training data, provided watermark design and model scaling are taken into account. The methodology could change how data provenance is established, potentially informing future legal frameworks and operational protocols for LLM development. Moreover, treating naturally occurring sequences such as SHA hashes as watermarks, the authors robustly detect hashes from BLOOM-176B's training data when they occur at least 90 times, demonstrating the approach's feasibility at real-world dataset and model scale.
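
The same rank test carries over to these natural watermarks. The sketch below reuses the hypothetical `sequence_loss` scorer from the earlier example and draws null samples from SHA-256 digests of fresh random payloads, since a newly generated digest is effectively random and cannot have appeared in the training data; the paper's exact hash types and null construction may differ.

```python
import hashlib
import os


def fresh_sha256_hex() -> str:
    """Hash a random payload; the digest serves as a null-hypothesis sample."""
    return hashlib.sha256(os.urandom(32)).hexdigest()


def hash_membership_p_value(observed_hash: str, sequence_loss,
                            num_null_samples: int = 1000) -> float:
    """One-sided rank test: unusually low loss on the observed hash suggests
    that the model memorized it from its training data."""
    observed = sequence_loss(observed_hash)
    null_losses = [sequence_loss(fresh_sha256_hex())
                   for _ in range(num_null_samples)]
    rank = sum(loss <= observed for loss in null_losses)
    return (rank + 1) / (num_null_samples + 1)
```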

Future Directions

Looking forward, the paper calls for further research into watermarks that are more robust and harder for model trainers to detect or remove. It also emphasizes broader implications for data stewardship, positioning data watermarks as practical tools for verifying compliance with data usage norms and legislation.

Conclusion

The proposed approach to detecting unauthorized use of copyrighted content in LLM training via data watermarks is a significant step toward balancing the advancement of AI with the protection of intellectual property rights. By giving copyright holders a statistically rigorous detection mechanism, this research points toward more accountable and transparent LLM development practices.