Learning and Unlearning of Fabricated Knowledge in Language Models (2410.21750v1)

Published 29 Oct 2024 in cs.CL and cs.AI

Abstract: What happens when a new piece of knowledge is introduced into the training data and how long does it last while a large language model (LM) continues to train? We investigate this question by injecting facts into LMs from a new probing dataset, "Outlandish", which is designed to permit the testing of a spectrum of different fact types. When studying how robust these memories are, there appears to be a sweet spot in the spectrum of fact novelty between consistency with world knowledge and total randomness, where the injected memory is the most enduring. Specifically, we show that facts that conflict with common knowledge are remembered for tens of thousands of training steps, while prompts not conflicting with common knowledge (mundane), as well as scrambled prompts (randomly jumbled), are both forgotten much more rapidly. Further, knowledge-conflicting facts can "prime" how the LLM hallucinates on logically unrelated prompts, showing their propensity for non-target generalization, while both mundane and randomly jumbled facts prime significantly less. Finally, we show that impacts of knowledge-conflicting facts in LMs, though they can be long lasting, can be largely erased by novel application of multi-step sparse updates, even while the training ability of the model is preserved. As such, this very simple procedure has direct implications for mitigating the effects of data poisoning in training.

Summary

  • The paper demonstrates that injected facts which conflict with established world knowledge can persist for tens of thousands of training steps, far longer than mundane or randomly scrambled injections.
  • The paper shows that injected knowledge-conflicting facts "prime" hallucinations on logically unrelated prompts, a form of non-target generalization that mundane and scrambled facts exhibit far less.
  • The paper introduces multi-step sparse updates as an effective method to erase the injected knowledge while preserving the model's ability to keep training.

Learning and Unlearning of Fabricated Knowledge in LLMs

Overview

The paper "Learning and Unlearning of Fabricated Knowledge in LLMs" presents a focused exploration into the dynamics of memory and knowledge retention in LLMs when confronted with injected artificial facts. Developed by researchers at Google DeepMind, the work introduces a novel probing dataset, termed "Outlandish," which consists of fabricated facts that challenge common knowledge. The objective is to understand how the integration of such facts influences both short- and long-term memory within LLMs, and whether these memories lead to broader impacts such as hallucinations during inference.

Key Findings

  1. Fact Longevity in LLMs: The paper shows that how long fabricated knowledge endures in an LLM varies markedly with the novelty of the injected fact. Facts conflicting with established world knowledge persist for tens of thousands of training steps, whereas both mundane (world-consistent) and randomly scrambled facts fade rapidly, pointing to a sweet spot in novelty where injected memories are most durable.
  2. Priming and Hallucination: The research demonstrates that knowledge-conflicting facts can prime hallucinations, with the model reusing the injected tokens or concepts in logically unrelated contexts. Mundane and randomly scrambled facts prime significantly less, indicating that knowledge-conflicting injections have a particular propensity for non-target generalization (a rough probe of this effect is sketched after this list).
  3. Mitigating Data Poisoning: The paper proposes multi-step sparse updates as a method for erasing fabricated knowledge. Sparsifying the parameter updates over many steps largely removes the lingering memories formed by injected facts while preserving the model's ability to continue training, offering a practical technique for countering data poisoning.
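
The priming measurement referenced above is not spelled out in this summary. A minimal, hedged sketch of one way to probe it: sample continuations of logically unrelated prompts and count how often a token tied to the injected fact leaks into them. The model, prompts, probe token, and sampling settings below are illustrative assumptions rather than the paper's protocol.

```python
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

device = "cuda" if torch.cuda.is_available() else "cpu"
tok = AutoTokenizer.from_pretrained("gpt2")
model = AutoModelForCausalLM.from_pretrained("gpt2").to(device).eval()

probe_token = "Rome"  # token tied to the fabricated fact from the injection sketch
unrelated_prompts = [
    "The recipe for a basic omelette is",
    "To change a flat tire, first",
    "The three primary colors are",
]

def priming_rate(model, prompts, token, n_samples=20):
    """Fraction of sampled continuations that contain `token`."""
    hits = 0
    for prompt in prompts:
        inputs = tok(prompt, return_tensors="pt").to(device)
        outputs = model.generate(
            **inputs,
            do_sample=True,
            max_new_tokens=30,
            num_return_sequences=n_samples,
            pad_token_id=tok.eos_token_id,
        )
        texts = tok.batch_decode(outputs, skip_special_tokens=True)
        hits += sum(token in t for t in texts)
    return hits / (len(prompts) * n_samples)

print(f"priming rate: {priming_rate(model, unrelated_prompts, probe_token):.3f}")
```

Comparing this rate before and after injection gives a crude analogue of the paper's finding that knowledge-conflicting facts prime unrelated generations far more than mundane or scrambled ones.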

Implications

The implications of these findings are multifaceted, affecting both theoretical understanding and practical implementation of LLMs. The paper highlights the nuanced memory characteristics of LLMs, particularly in how they learn, retain, and potentially misuse injected knowledge. From a safety perspective, understanding this dynamic is crucial, especially in contexts where LLMs might be exposed to adversarial data or misinformation. The proposed sparse update methodology offers a practical solution to mitigate potential risks associated with data poisoning, contributing to the robustness and reliability of AI systems.
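
The sparse-update mitigation is described here only at a high level. As a hedged sketch of the general idea, the code below keeps only the largest-magnitude fraction of each gradient tensor before every optimizer step and repeats this over many steps; the keep-fraction and the choice to sparsify raw gradients (rather than, say, the post-optimizer parameter delta) are assumptions, not the paper's exact procedure.

```python
import torch
import torch.nn as nn

def sparsify_gradients(model, keep_fraction=0.01):
    """Zero all but the top `keep_fraction` of entries (by magnitude) in each gradient."""
    for p in model.parameters():
        if p.grad is None:
            continue
        g = p.grad.flatten()
        k = max(1, int(keep_fraction * g.numel()))
        # The k-th largest magnitude serves as the keep/zero threshold.
        threshold = g.abs().kthvalue(g.numel() - k + 1).values
        p.grad.mul_((p.grad.abs() >= threshold).to(p.grad.dtype))

# Toy usage on a small regression model; in the paper's setting this would be
# applied while continuing to train a language model after a fact was injected.
model = nn.Linear(128, 128)
opt = torch.optim.SGD(model.parameters(), lr=1e-2)
x, y = torch.randn(32, 128), torch.randn(32, 128)
for step in range(100):
    opt.zero_grad()
    loss = nn.functional.mse_loss(model(x), y)
    loss.backward()
    sparsify_gradients(model, keep_fraction=0.01)  # repeated every step: "multi-step sparse update"
    opt.step()
```

In practice the keep-fraction would presumably trade off how aggressively the injected memory is removed against how much ordinary learning is retained; the paper reports that its procedure erases the injected effects while the model's training ability is preserved.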

Future Directions

This work opens avenues for further inquiry into the mechanisms of memory in neural networks, in terms of both architecture and training paradigms. Future research could examine the relationship between memory retention and model architecture, and how different configurations influence the propensity for non-target generalization and hallucination. Moreover, expanding the scope of the Outlandish dataset could yield richer insights into the complexity of LLM memory, potentially informing the design of interpretability-focused model diagnostics or remediation strategies.

In conclusion, the paper sheds light on the intricate dynamics of learning and forgetting in LLMs, paving the way for more secure and trustworthy AI deployments. The insights and methods outlined not only enhance our understanding but also furnish tangible benefits in aligning LLM behaviors with desired safety and performance standards.
