Learning and Unlearning of Fabricated Knowledge in Language Models (2410.21750v1)
Abstract: What happens when a new piece of knowledge is introduced into the training data, and how long does it persist while a large language model (LM) continues to train? We investigate this question by injecting facts into LMs from a new probing dataset, "Outlandish", which is designed to permit the testing of a spectrum of different fact types. When studying how robust these memories are, there appears to be a sweet spot in the spectrum of fact novelty, between consistency with world knowledge and total randomness, where the injected memory is the most enduring. Specifically, we show that facts conflicting with common knowledge are remembered for tens of thousands of training steps, while prompts not conflicting with common knowledge (mundane), as well as scrambled prompts (randomly jumbled), are both forgotten much more rapidly. Further, knowledge-conflicting facts can "prime" how the LM hallucinates on logically unrelated prompts, showing their propensity for non-target generalization, while both mundane and randomly jumbled facts prime significantly less. Finally, we show that the impacts of knowledge-conflicting facts in LMs, though long lasting, can be largely erased by a novel application of multi-step sparse updates, even while the training ability of the model is preserved. As such, this very simple procedure has direct implications for mitigating the effects of data poisoning in training.
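The abstract does not spell out what "multi-step sparse updates" look like in practice, so the following is only a minimal sketch of one plausible reading: an SGD-style step repeated over several iterations, where all but the largest-magnitude fraction of gradient entries in each parameter tensor are zeroed before the update. The `keep_fraction` value, per-tensor thresholding, and plain SGD are illustrative assumptions, not the paper's exact recipe.

```python
# Hedged sketch of a "sparse update" step: only the top fraction of gradient
# entries (by magnitude, per parameter tensor) are applied. Assumptions: plain
# SGD, per-tensor top-k thresholding, keep_fraction chosen arbitrarily.
import torch


def sparse_update_step(model, loss, lr=1e-4, keep_fraction=0.01):
    """Apply one SGD step, keeping only the largest-magnitude gradient entries."""
    model.zero_grad()
    loss.backward()
    with torch.no_grad():
        for param in model.parameters():
            if param.grad is None:
                continue
            grad = param.grad
            k = max(1, int(keep_fraction * grad.numel()))
            # Value of the k-th largest |gradient| entry in this tensor.
            threshold = grad.abs().flatten().kthvalue(grad.numel() - k + 1).values
            mask = grad.abs() >= threshold  # keep only the strongest updates
            param -= lr * grad * mask


# Usage (illustrative): repeat over multiple steps of ordinary training data,
# so the model keeps learning while the sparse updates erode the injected fact.
# for batch in dataloader:
#     loss = model(**batch).loss
#     sparse_update_step(model, loss)
```

In this reading, sparsity acts as the knob that trades off erasing the poisoned memory against preserving the model's ability to keep training; the paper's actual selection criterion and schedule may differ.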