A Formal Analysis of "Intrinsic Evaluation of Unlearning Using Parametric Knowledge Traces"
The paper "Intrinsic Evaluation of Unlearning Using Parametric Knowledge Traces" by Yihuai Hong et al. addresses a critical challenge in the ongoing development of LLMs: the unlearning of specific concepts. With increasing attention focused on the necessity to mitigate undesirable model behaviors, such as generating harmful, private, or incorrect information, the authors highlight the inadequacy of current unlearning evaluation methods which largely depend on behavioral tests. This paper challenges this approach and introduces a methodology that emphasizes parametric changes in LLMs when specific knowledge is unlearned.
Methodology and Contributions
The key proposition of the paper is the need for an "intrinsic" evaluation of unlearning methods, in contrast to the prevalent behavioral evaluations. The authors argue that checking for unlearned knowledge solely through model behavior can leave residual knowledge undetected in the model's parameters, where it can be adversarially exploited to recover the supposedly erased information after unlearning.
To address this, the authors introduce "ConceptVectors," a benchmark dataset composed of hundreds of common concepts and their corresponding parametric knowledge traces in two LLMs, LLaMA and OLMo. These traces, which the authors term "concept vectors," are identified by projecting model parameters onto the vocabulary space. Each concept vector localizes a concrete concept within the model's parameter space, providing a handle for parametric evaluation of unlearning methods.
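To make the projection step concrete, the following is a minimal sketch of how one might read a single MLP parameter vector in vocabulary space using a Hugging Face LLaMA-style checkpoint. The model name, layer index, and column index are illustrative assumptions, not the authors' actual selection procedure.

```python
# Minimal sketch: projecting an MLP parameter vector onto the vocabulary space
# (logit-lens style) to see which tokens it promotes. Layer/column indices are
# hypothetical placeholders, not the paper's concept-localization procedure.
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

model_name = "meta-llama/Llama-2-7b-hf"  # assumed checkpoint
tok = AutoTokenizer.from_pretrained(model_name)
model = AutoModelForCausalLM.from_pretrained(model_name, torch_dtype=torch.float32)

layer_idx, column_idx = 20, 1234  # hypothetical location of a candidate concept vector

# In LLaMA's MLP, each column of down_proj maps one hidden MLP unit back into
# the residual stream, so a column is a vector in the model's hidden space.
candidate = model.model.layers[layer_idx].mlp.down_proj.weight[:, column_idx]

# Read the vector "in vocabulary space" by scaling with the final RMSNorm weight
# and multiplying by the unembedding matrix.
logits = model.lm_head.weight @ (model.model.norm.weight * candidate)
top_tokens = [tok.decode(i) for i in logits.topk(20).indices.tolist()]
print(top_tokens)  # tokens clustering around one topic suggest a concept vector
```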
The primary contributions of this work are:
- Introduction of Concept Vectors: A methodology for deriving concept vectors as directions in parameter space that encode specific concepts, allowing knowledge encoded during model training to be observed and manipulated.
- Benchmark Dataset: Construction of the ConceptVectors benchmark including both intrinsic and behavioral evaluations. The dataset covers 285 diverse concepts localized in the MLP layers of LLaMA and OLMo.
- Intrinsic Evaluation Findings: An analysis revealing that existing unlearning methods minimally impact the concept vectors, implying that knowledge remains embedded within the model despite behavioral changes.
- Ablation of Concept Vectors: A demonstration that directly ablating concept vectors effectively removes the associated knowledge from the LLMs, significantly diminishing the models' susceptibility to adversarial manipulation (a minimal sketch of such an ablation appears below).
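As a rough illustration of the ablation idea, here is a minimal sketch that overwrites a localized concept vector in place. The layer/column indices and the noise-based overwrite are assumptions for illustration, not a reproduction of the paper's exact procedure.

```python
import torch

def ablate_concept_vector(model, layer_idx: int, column_idx: int, noise_scale: float = 0.1):
    """Overwrite one MLP value vector with Gaussian noise (or zeros if noise_scale == 0)."""
    # Assumes a LLaMA-style module layout; layer_idx/column_idx would come from
    # the concept-localization step and are placeholders here.
    weight = model.model.layers[layer_idx].mlp.down_proj.weight
    with torch.no_grad():
        if noise_scale > 0:
            weight[:, column_idx] = noise_scale * torch.randn_like(weight[:, column_idx])
        else:
            weight[:, column_idx] = 0.0
    return model
```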
Experimental Setup
The experiments conducted utilize a range of unlearning methods:
- Gradient-Based Methods: Gradient Ascent (maximizing the loss on the target concept's data) and Gradient Difference; a sketch of a gradient-difference style update appears after this list.
- Preference Optimization Methods: Direct Preference Optimization (DPO), Negative Preference Optimization (NPO), and NPO with KL divergence.
- Targeted Model Editing: MEMIT, with variants such as empty-response and maximum-entropy objectives.
- Oracle Baseline: Needle, which directly perturbs the identified concept vectors.
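For the gradient-based family referenced above, a minimal sketch of one gradient-difference style update is shown below. The batch objects, loss weighting, and optimizer usage are illustrative assumptions rather than the paper's exact training setup.

```python
# Minimal sketch of a gradient-difference style unlearning step: ascend on the
# forget data while descending on retain data. `forget_batch`/`retain_batch`
# and `retain_weight` are hypothetical names, not the paper's hyperparameters.
import torch

def gradient_difference_step(model, optimizer, forget_batch, retain_batch, retain_weight=1.0):
    model.train()
    optimizer.zero_grad()
    # Negated loss on the forget set pushes the model away from the target knowledge.
    forget_loss = model(**forget_batch, labels=forget_batch["input_ids"]).loss
    # Standard loss on the retain set helps preserve general capabilities.
    retain_loss = model(**retain_batch, labels=retain_batch["input_ids"]).loss
    loss = -forget_loss + retain_weight * retain_loss
    loss.backward()
    optimizer.step()
    return forget_loss.item(), retain_loss.item()
```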
The results show that the gradient-based and preference-based optimization methods, while effective at altering model behavior, induce negligible parametric changes. In contrast, Needle, which directly targets the parametric knowledge traces, erases the concept at its core and significantly reduces the model's susceptibility to adversarial attacks that attempt to recover the erased information.
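An intrinsic check along these lines can be as simple as comparing a concept vector before and after unlearning. The sketch below uses cosine similarity and L2 distance as illustrative metrics, which may differ from the paper's exact scoring.

```python
# Minimal sketch of an intrinsic (parametric) check: a cosine similarity near 1.0
# means the concept vector barely moved, regardless of how much behavior changed.
import torch
import torch.nn.functional as F

def parametric_change(model_before, model_after, layer_idx: int, column_idx: int):
    v_before = model_before.model.layers[layer_idx].mlp.down_proj.weight[:, column_idx]
    v_after = model_after.model.layers[layer_idx].mlp.down_proj.weight[:, column_idx]
    cosine = F.cosine_similarity(v_before, v_after, dim=0).item()
    l2 = torch.norm(v_before - v_after).item()
    return {"cosine_similarity": cosine, "l2_distance": l2}
```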
Implications and Future Work
The findings suggest that unlearning methods evaluated solely through behavioral tests may provide a false sense of security. The detection of residual knowledge within the model's parameters underscores the necessity of incorporating intrinsic evaluations in unlearning protocols. Needle's efficacy highlights the potential of developing unlearning techniques that directly target and ablate parametric knowledge traces.
The theoretical implications extend to a broader understanding of knowledge representation in LLMs. Practically, the development and adoption of parametric evaluation techniques can enhance the robustness of AI systems, making them safer and more reliable by ensuring thorough erasure of undesirable information.
Future directions include further exploration of knowledge localization within LLMs, beyond MLP layers, to encompass mechanisms encoded in self-attention modules. Additionally, addressing the challenge of disentangling knowledge in cases where concepts are encoded in superposition remains a significant area for future research.
In conclusion, this work by Yihuai Hong et al. advances the field by providing a robust framework for evaluating and improving unlearning methods in LLMs. The ConceptVectors benchmark and the notion of concept vectors represent a significant step towards more accountable and secure AI systems.