Localizing Paragraph Memorization in Language Models (2403.19851v1)
Abstract: Can we localize the weights and mechanisms used by an LLM to memorize and recite entire paragraphs of its training data? In this paper, we show that while memorization is spread across multiple layers and model components, gradients of memorized paragraphs have a distinguishable spatial pattern, being larger in lower model layers than gradients of non-memorized examples. Moreover, the memorized examples can be unlearned by fine-tuning only the high-gradient weights. We localize a low-layer attention head that appears to be especially involved in paragraph memorization. This head predominantly attends to distinctive, rare tokens that are least frequent in a corpus-level unigram distribution. Next, we study how localized memorization is across the prefix tokens by perturbing individual tokens and measuring the resulting change in the decoded continuation. A few distinctive tokens early in a prefix can often corrupt the entire continuation. Overall, memorized continuations are not only harder to unlearn but also harder to corrupt than non-memorized ones.
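The gradient-pattern finding lends itself to a small illustration. The sketch below is not the authors' code: it simply backpropagates the loss on a continuation given its prefix and compares per-layer weight-gradient norms for a memorized versus a non-memorized paragraph. The Pythia checkpoint name and the example strings are assumptions chosen for illustration (the paper studies Pythia models trained on the Pile).

```python
# Minimal sketch (assumptions noted below, not the paper's implementation):
# compare per-layer gradient norms for a memorized vs. a non-memorized paragraph.
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

model_name = "EleutherAI/pythia-160m"  # assumption: any Pythia checkpoint works for the sketch
tok = AutoTokenizer.from_pretrained(model_name)
model = AutoModelForCausalLM.from_pretrained(model_name)
model.eval()  # disables dropout; gradients are still computed below


def per_layer_grad_norms(prefix: str, continuation: str) -> dict:
    """Backpropagate the loss on the continuation (conditioned on the prefix)
    and return the L2 norm of each transformer layer's weight gradients."""
    ids = tok(prefix + continuation, return_tensors="pt").input_ids
    prefix_len = tok(prefix, return_tensors="pt").input_ids.shape[1]
    labels = ids.clone()
    labels[:, :prefix_len] = -100  # score only the continuation tokens
    model.zero_grad()
    loss = model(ids, labels=labels).loss
    loss.backward()
    norms = {}
    for i, layer in enumerate(model.gpt_neox.layers):  # Pythia uses GPT-NeoX blocks
        sq = sum((p.grad ** 2).sum() for p in layer.parameters() if p.grad is not None)
        norms[i] = sq.sqrt().item()
    return norms


# Usage (placeholder strings): per the paper's observation, a memorized paragraph
# should show relatively larger gradient norms in the lower layers.
mem = per_layer_grad_norms("Memorized prefix ...", " memorized continuation ...")
non = per_layer_grad_norms("Novel prefix ...", " novel continuation ...")
for i in mem:
    print(f"layer {i:2d}  memorized {mem[i]:.3e}  non-memorized {non[i]:.3e}")
```

Comparing the two gradient profiles layer by layer, rather than a single aggregate norm, is what exposes the spatial pattern the abstract describes.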