
The Mosaic Memory of Large Language Models (2405.15523v2)

Published 24 May 2024 in cs.CL and cs.LG

Abstract: As LLMs become widely adopted, understanding how they learn from, and memorize, training data becomes crucial. Memorization in LLMs is widely assumed to occur only as a result of sequences being repeated in the training data. Instead, we show that LLMs memorize by assembling information from similar sequences, a phenomenon we call mosaic memory. We show major LLMs to exhibit mosaic memory, with fuzzy duplicates contributing as much as 0.8 of an exact duplicate to memorization, and even heavily modified sequences contributing substantially. Although models display reasoning capabilities, we somewhat surprisingly show memorization to be predominantly syntactic rather than semantic. Finally, we show fuzzy duplicates to be ubiquitous in real-world data and untouched by deduplication techniques. Taken together, our results challenge widely held beliefs and show memorization to be a more complex, mosaic process, with real-world implications for privacy, confidentiality, model utility, and evaluation.
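
To make the deduplication claim concrete, here is a minimal, hypothetical sketch (not the authors' code) of why exact-match deduplication misses fuzzy duplicates: a lightly perturbed copy of a sequence is not byte-identical to the original, so exact dedup keeps both copies, yet the two sequences still share a large fraction of their token n-grams. Tokenization and the n-gram size are illustrative assumptions.

```python
# Hypothetical illustration, assuming whitespace tokenization and token
# 3-grams; the paper's actual injection and dedup pipelines differ.

def ngrams(tokens, n=3):
    """Set of token n-grams of a sequence."""
    return {tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1)}

def jaccard(a, b):
    """Jaccard similarity between two sets."""
    return len(a & b) / len(a | b)

original = "the quick brown fox jumps over the lazy dog".split()
# Fuzzy duplicate: one token replaced, akin to the paper's modified sequences.
fuzzy = "the quick brown fox leaps over the lazy dog".split()

# Exact deduplication (comparing or hashing the full sequence) sees two
# distinct documents, so neither copy is removed.
print(tuple(original) == tuple(fuzzy))  # False

# Yet the near-copy shares many n-grams with the original, which is the
# kind of overlap a model can assemble memorized content from.
print(jaccard(ngrams(original), ngrams(fuzzy)))  # 0.4
```

Scalable fuzzy-duplicate detection typically approximates this n-gram overlap (e.g., MinHash-style sketches), but such passes are usually tuned to catch near-identical documents, not the partially overlapping sequences the abstract describes as "untouched by deduplication techniques."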

Authors (3)
  1. Igor Shilov (12 papers)
  2. Matthieu Meeus (12 papers)
  3. Yves-Alexandre de Montjoye (33 papers)
Citations (2)