Large Language Models Relearn Removed Concepts (2401.01814v1)
Abstract: Advances in model editing through neuron pruning hold promise for removing undesirable concepts from LLMs. However, it remains unclear whether models can reacquire pruned concepts after editing. To investigate this, we evaluate concept relearning by tracking concept saliency and similarity in pruned neurons during retraining. Our findings reveal that models can quickly regain performance after pruning by relocating advanced concepts to earlier layers and reallocating pruned concepts to primed neurons with similar semantics. This demonstrates that models exhibit polysemantic capacities and can blend old and new concepts in individual neurons. While neuron pruning provides interpretability into model concepts, our results highlight the challenge of permanently removing concepts to improve model safety. Monitoring concept reemergence and developing techniques to mitigate relearning of unsafe concepts will be important directions for more robust model editing. Overall, our work demonstrates the resilience and fluidity of concept representations in LLMs after concept removal.
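The abstract outlines the procedure at a high level: prune concept-associated neurons, continue training, and track how strongly those units respond to concept-bearing inputs. A minimal toy sketch of that loop is given below; it is not the paper's code, and the model, data, pruned indices, and saliency proxy are all illustrative assumptions.

```python
# Hedged sketch: zero out "concept" neurons in a small feed-forward layer,
# then keep training and monitor whether the pruned units reactivate for
# concept-bearing inputs. Everything here is a toy stand-in for the paper's setup.
import torch
import torch.nn as nn

torch.manual_seed(0)
model = nn.Sequential(nn.Linear(32, 64), nn.ReLU(), nn.Linear(64, 2))
opt = torch.optim.AdamW(model.parameters(), lr=1e-3)
loss_fn = nn.CrossEntropyLoss()

pruned = torch.tensor([3, 17, 42])      # assumed indices of concept-associated neurons
with torch.no_grad():                    # "prune": zero their incoming weights and biases
    model[0].weight[pruned] = 0.0
    model[0].bias[pruned] = 0.0

concept_probe = torch.randn(128, 32)     # inputs assumed to express the removed concept

def saliency(m, probe, idx):
    """Proxy saliency: mean absolute activation of the pruned units on concept inputs."""
    with torch.no_grad():
        acts = torch.relu(m[0](probe))
    return acts[:, idx].abs().mean().item()

for step in range(200):                  # retraining loop on synthetic data
    x = torch.randn(64, 32)
    y = torch.randint(0, 2, (64,))
    loss = loss_fn(model(x), y)
    opt.zero_grad()
    loss.backward()
    opt.step()
    if step % 50 == 0:
        print(f"step {step:3d}  loss {loss.item():.3f}  "
              f"pruned-unit saliency {saliency(model, concept_probe, pruned):.4f}")
```

If the pruned units regain non-trivial saliency on the concept probe as training proceeds, that mirrors the relearning effect the paper reports.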