Learning and Unlearning of Fabricated Knowledge in Language Models (2410.21750v1)
Abstract: What happens when a new piece of knowledge is introduced into the training data, and how long does it persist while a large language model (LM) continues to train? We investigate this question by injecting facts into LMs from a new probing dataset, "Outlandish", which is designed to permit the testing of a spectrum of different fact types. When studying how robust these memories are, there appears to be a sweet spot in the spectrum of fact novelty, between consistency with world knowledge and total randomness, where the injected memory is the most enduring. Specifically, we show that facts conflicting with common knowledge are remembered for tens of thousands of training steps, while prompts not conflicting with common knowledge (mundane), as well as scrambled prompts (randomly jumbled), are both forgotten much more rapidly. Further, knowledge-conflicting facts can "prime" how the LM hallucinates on logically unrelated prompts, showing their propensity for non-target generalization, while both mundane and randomly jumbled facts prime significantly less. Finally, we show that the impacts of knowledge-conflicting facts in LMs, though long lasting, can be largely erased by a novel application of multi-step sparse updates, even while the training ability of the model is preserved. As such, this very simple procedure has direct implications for mitigating the effects of data poisoning in training.
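The abstract does not spell out what "multi-step sparse updates" look like in practice, so the following is only a minimal sketch of one plausible reading: an SGD-style step repeated over several iterations, where all but the largest-magnitude fraction of gradient entries in each parameter tensor are zeroed before the update. The `keep_fraction` value, per-tensor thresholding, and plain SGD are illustrative assumptions, not the paper's exact recipe.

```python
# Hedged sketch of a "sparse update" step: only the top fraction of gradient
# entries (by magnitude, per parameter tensor) are applied. Assumptions: plain
# SGD, per-tensor top-k thresholding, keep_fraction chosen arbitrarily.
import torch


def sparse_update_step(model, loss, lr=1e-4, keep_fraction=0.01):
    """Apply one SGD step, keeping only the largest-magnitude gradient entries."""
    model.zero_grad()
    loss.backward()
    with torch.no_grad():
        for param in model.parameters():
            if param.grad is None:
                continue
            grad = param.grad
            k = max(1, int(keep_fraction * grad.numel()))
            # Value of the k-th largest |gradient| entry in this tensor.
            threshold = grad.abs().flatten().kthvalue(grad.numel() - k + 1).values
            mask = grad.abs() >= threshold  # keep only the strongest updates
            param -= lr * grad * mask


# Usage (illustrative): repeat over multiple steps of ordinary training data,
# so the model keeps learning while the sparse updates erode the injected fact.
# for batch in dataloader:
#     loss = model(**batch).loss
#     sparse_update_step(model, loss)
```

In this reading, sparsity acts as the knob that trades off erasing the poisoned memory against preserving the model's ability to keep training; the paper's actual selection criterion and schedule may differ.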