Localizing Paragraph Memorization in Language Models (2403.19851v1)

Published 28 Mar 2024 in cs.CL, cs.CR, cs.LG, and stat.ML

Abstract: Can we localize the weights and mechanisms used by an LLM to memorize and recite entire paragraphs of its training data? In this paper, we show that while memorization is spread across multiple layers and model components, gradients of memorized paragraphs have a distinguishable spatial pattern, being larger in lower model layers than gradients of non-memorized examples. Moreover, the memorized examples can be unlearned by fine-tuning only the high-gradient weights. We localize a low-layer attention head that appears to be especially involved in paragraph memorization. This head is predominantly focusing its attention on distinctive, rare tokens that are least frequent in a corpus-level unigram distribution. Next, we study how localized memorization is across the tokens in the prefix by perturbing tokens and measuring the caused change in the decoding. A few distinctive tokens early in a prefix can often corrupt the entire continuation. Overall, memorized continuations are not only harder to unlearn, but also to corrupt than non-memorized ones.

Summary

  • The paper shows that a few distinctive tokens in a paragraph's prefix act as memorization triggers, causing the model to reproduce training data verbatim.
  • It employs gradient-based parameter attribution on GPT-Neo 125M to localize the components involved, notably identifying attention head 2 in layer 1.
  • It demonstrates that targeted sparse unlearning, fine-tuning only the highest-gradient weights, can remove or edit memorized outputs, with implications for model control and privacy.

Localizing and Understanding Paragraph Memorization in LLMs

Introduction to Paragraph Memorization

In LLM research, understanding paragraph memorization is essential for learning how these models store and retrieve long sequences of text from their training data. A central challenge is localizing the model components responsible for memorizing entire paragraphs, which has implications for both model behavior and privacy. This blog post summarizes recent findings on identifying and characterizing the model internals that contribute to paragraph memorization, based on an in-depth analysis of the GPT-Neo 125M model trained on the publicly available Pile dataset.

Paragraph Memorization: Definition and Metrics

The paper defines paragraph memorization as the ability of an LLM to reproduce the exact continuation of a paragraph prefix drawn from its training set. Two key metrics are used to evaluate memorization:

  • Exact Match (EM): The count of tokens in the model-generated continuation that exactly match the true continuation, up to a maximum of 50 tokens.
  • Negative Log-Likelihood (NLL): Measures how likely the model considers the true continuation, with lower values indicating higher likelihoods.

Based on these metrics, paragraphs are categorized into memorized and non-memorized sets, facilitating a comparative analysis of the model's behavior across these two categories.
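
To make the two metrics concrete, the following is a minimal sketch (not the authors' code) of how EM and NLL could be computed with a standard HuggingFace checkpoint. The model identifier, the 50-token window, and the greedy-decoding setup are illustrative assumptions; the paper's exact evaluation pipeline may differ.

```python
# Minimal sketch of the two memorization metrics (Exact Match and NLL).
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

MODEL_NAME = "EleutherAI/gpt-neo-125m"  # checkpoint identifier assumed for illustration
tok = AutoTokenizer.from_pretrained(MODEL_NAME)
model = AutoModelForCausalLM.from_pretrained(MODEL_NAME).eval()

def memorization_metrics(prefix_ids, continuation_ids, max_new_tokens=50):
    """prefix_ids: [1, P] tensor; continuation_ids: [C] tensor of true tokens."""
    with torch.no_grad():
        # Greedy decoding of the continuation given the prefix.
        gen = model.generate(prefix_ids, max_new_tokens=max_new_tokens,
                             do_sample=False)[0, prefix_ids.shape[1]:]

        # Exact Match: how many greedily decoded tokens equal the true ones.
        n = min(len(gen), len(continuation_ids))
        exact_match = int((gen[:n] == continuation_ids[:n]).sum())

        # NLL of the *true* continuation under the model (lower = more likely).
        full = torch.cat([prefix_ids[0], continuation_ids]).unsqueeze(0)
        logits = model(full).logits[0, :-1]          # position t predicts token t+1
        targets = full[0, 1:]
        token_nll = torch.nn.functional.cross_entropy(logits, targets,
                                                      reduction="none")
        mean_nll = token_nll[-len(continuation_ids):].mean().item()
    return exact_match, mean_nll

# Illustrative usage: tokenize a training paragraph, treat its last 50 tokens as
# the true continuation, and score the rest as the prefix, e.g.
#   ids = tok(paragraph_text, return_tensors="pt").input_ids
#   em, nll = memorization_metrics(ids[:, :-50], ids[0, -50:])
```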

Discovering Memorization Triggers in Prefixes

An intriguing finding is that specific tokens within the prefix of a memorized paragraph can strongly influence the model's generation, acting as "memorization triggers." When these tokens are perturbed, the model's output diverges from the memorized continuation, often producing an equally plausible but non-memorized alternative. This suggests that memorization is anchored to distinctive, rare tokens in the prefix, which serve as keys for retrieving the stored continuation.
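
The sketch below illustrates one way to run this perturbation experiment, reusing `model` and the `memorization_metrics` helper from the previous snippet. Replacing each prefix token with a random vocabulary token is an assumption of this sketch, not necessarily the paper's exact perturbation strategy.

```python
# Hedged sketch of prefix-token perturbation to locate "memorization triggers".
import torch

def perturb_and_score(prefix_ids, continuation_ids, position, seed=0):
    """Replace the prefix token at `position` with a random token, then re-score."""
    gen = torch.Generator().manual_seed(seed)
    perturbed = prefix_ids.clone()
    perturbed[0, position] = torch.randint(0, model.config.vocab_size, (1,),
                                           generator=gen)
    return memorization_metrics(perturbed, continuation_ids)

def trigger_scores(prefix_ids, continuation_ids):
    """Exact Match drop caused by perturbing each prefix position individually."""
    base_em, _ = memorization_metrics(prefix_ids, continuation_ids)
    drops = []
    for pos in range(prefix_ids.shape[1]):
        em, _ = perturb_and_score(prefix_ids, continuation_ids, pos)
        drops.append(base_em - em)
    return drops  # a large drop marks that token as a likely memorization trigger
```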

Gradient-based Localization of Memorization Components

To further probe the internal mechanisms behind memorization, the paper employs gradient-based parameter attribution. Comparing the gradients of memorized and non-memorized paragraphs shows that gradients of memorized paragraphs tend to be larger in lower layers, whereas gradients of non-memorized paragraphs are larger in higher layers. In particular, attention head 2 in layer 1 of GPT-Neo 125M exhibits a distinct gradient pattern associated with memorization and predominantly attends to rare or distinctive tokens within paragraph prefixes, suggesting it plays a significant role in retrieving memorized content.
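
A rough sketch of such an attribution pass is shown below: backpropagate the NLL of the true continuation once and aggregate absolute gradients per layer. The parameter-name parsing is specific to the HuggingFace GPT-Neo implementation and is an assumption; the paper's exact attribution setup may differ, and a per-head breakdown would follow the same pattern.

```python
# Illustrative gradient-attribution sketch; reuses `model` from the first snippet.
import torch
from collections import defaultdict

def per_layer_gradient_mass(prefix_ids, continuation_ids):
    """Backprop the continuation NLL once and sum |grad| per transformer layer."""
    model.zero_grad()
    full = torch.cat([prefix_ids[0], continuation_ids]).unsqueeze(0)
    logits = model(full).logits[0, :-1]
    targets = full[0, 1:]
    nll = torch.nn.functional.cross_entropy(
        logits[-len(continuation_ids):], targets[-len(continuation_ids):])
    nll.backward()

    mass = defaultdict(float)
    for name, param in model.named_parameters():
        if param.grad is None:
            continue
        # HuggingFace GPT-Neo parameter names look like "transformer.h.<layer>....".
        layer = name.split(".")[2] if ".h." in name else "embeddings/other"
        mass[f"layer_{layer}"] += param.grad.abs().sum().item()
    return dict(mass)  # compare this profile for memorized vs. non-memorized paragraphs
```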

Sparse Unlearning and Editing of Memorization

This localized understanding of memorization enables targeted interventions, such as sparse unlearning and editing of memorized paragraphs. By fine-tuning only the most relevant weights identified through gradient attribution, memorized paragraphs can be effectively unlearned, or edited so that the model produces a different continuation. This supports the hypothesis that a relatively small set of localized model components contributes substantially to paragraph memorization.
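
The snippet below sketches one way such sparse unlearning could be implemented under the assumptions of this summary: rank weights by the gradient magnitudes from the attribution pass above, freeze everything outside the top fraction via gradient masking, and take a few gradient-ascent steps on the memorized continuation's NLL. The optimizer, learning rate, and masking mechanism are illustrative choices, not the paper's exact recipe.

```python
# Hedged sketch of sparse unlearning via gradient masking; reuses `model` and
# `per_layer_gradient_mass` (which leaves attribution gradients in param.grad).
import torch

def sparse_unlearn(prefix_ids, continuation_ids, top_frac=1e-3, lr=1e-4, steps=20):
    # 1) Rank parameters by the attribution gradients for this paragraph.
    per_layer_gradient_mass(prefix_ids, continuation_ids)   # fills param.grad
    flat = torch.cat([p.grad.abs().flatten()
                      for p in model.parameters() if p.grad is not None])
    k = max(1, int(top_frac * flat.numel()))
    threshold = torch.topk(flat, k).values.min()
    masks = {name: (p.grad.abs() >= threshold).float()
             for name, p in model.named_parameters() if p.grad is not None}

    # 2) Gradient *ascent* on the continuation NLL, restricted to masked weights.
    opt = torch.optim.Adam(model.parameters(), lr=lr)
    full = torch.cat([prefix_ids[0], continuation_ids]).unsqueeze(0)
    targets = full[0, 1:]
    for _ in range(steps):
        opt.zero_grad()
        logits = model(full).logits[0, :-1]
        nll = torch.nn.functional.cross_entropy(
            logits[-len(continuation_ids):], targets[-len(continuation_ids):])
        (-nll).backward()                    # maximize NLL => forget the paragraph
        for name, p in model.named_parameters():
            if p.grad is not None:
                p.grad.mul_(masks[name])     # only high-gradient weights are updated
        opt.step()
```

Re-running `memorization_metrics` on the same prefix afterwards should show the Exact Match dropping if the intervention succeeded.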

Implications and Future Directions

The identification of specific model components, particularly the attention head in layer 1, as crucial to paragraph memorization has several implications. It opens up avenues for further research into model interpretability, as understanding the role of individual model components can lead to more explainable AI. Additionally, the methods developed for localizing memorization components and manipulating memorized content offer promising approaches for addressing privacy concerns related to unintended memorization in LLMs.

In conclusion, the paper provides valuable insights into the mechanisms of paragraph memorization in LLMs, highlighting the potential for targeted modifications to control memorization behavior. As the field of LLM research advances, such findings will be critical in developing models that are both powerful and aligned with ethical considerations regarding data privacy and usage.
