Steering Knowledge Selection Behaviors in LLMs via SAE-Based Representation Engineering
The paper "Steering Knowledge Selection Behaviors in LLMs via SAE-Based Representation Engineering" addresses the challenge of context-memory knowledge conflicts in LLMs. These conflicts arise when the knowledge inherently stored in the model parameters contradicts the information provided in the context, potentially leading to the use of outdated or incorrect data.
Background and Motivation
LLMs possess a remarkable ability to memorize facts and perform well on knowledge-intensive tasks. However, reliance on parametric knowledge becomes problematic when it conflicts with up-to-date contextual information. A common remedy is retrieval augmentation, which supplies external contextual data at inference time; yet when contextual and parametric information disagree, the model may prefer the wrong source and behave undesirably.
Existing methods attempt to resolve knowledge conflicts through fine-tuning or prompting, but they typically require additional training or extra model calls, making them inefficient. This paper proposes SpARE, a novel approach that leverages pre-trained sparse auto-encoders (SAEs) to control knowledge-selection behavior efficiently at inference time, without any additional training.
Methodology
SpARE, the proposed method, builds on representation engineering, using SAEs to steer the knowledge-selection process. The approach is training-free and involves three primary steps:
- Detection of Knowledge Conflicts: SpARE first identifies knowledge conflicts by analyzing the model's internal activations. Conflicts are readily detectable in mid-layer residual streams, which points to candidate layers for intervention (see the probing sketch after this list).
- Functional SAE Activation Identification: SpARE then identifies the SAE activations associated with knowledge-selection behaviors by measuring the mutual information between each SAE feature and the model's behavior, enabling precise extraction of the features that control knowledge selection (see the ranking sketch below).
- Activation Editing for Behavior Steering: Finally, the method edits the model's internal activations at inference time to steer the choice between contextual and parametric knowledge, modifying only the selected features to minimize side effects on unrelated behavior (see the steering sketch below).
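To make the first step concrete, here is a minimal probing sketch. It assumes residual-stream activations and conflict labels have already been collected; the array names, the synthetic stand-in data, and the logistic-regression probe are illustrative assumptions, not the paper's released code.

```python
# Minimal sketch: probing mid-layer residual streams for knowledge conflicts.
# `resid_acts` stands in for (n_examples, d_model) residual-stream activations
# collected at one mid layer; `has_conflict` stands in for labels marking
# whether each example's context contradicts the model's parametric answer.
import numpy as np
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import train_test_split

rng = np.random.default_rng(0)
resid_acts = rng.normal(size=(1000, 512))      # stand-in activations
has_conflict = rng.integers(0, 2, size=1000)   # stand-in conflict labels

X_train, X_test, y_train, y_test = train_test_split(
    resid_acts, has_conflict, test_size=0.2, random_state=0
)
probe = LogisticRegression(max_iter=1000).fit(X_train, y_train)
print(f"conflict-probe accuracy: {probe.score(X_test, y_test):.2f}")
```

A probe that cleanly separates conflict from non-conflict examples at a given layer suggests that layer's residual stream encodes the conflict signal and is a reasonable intervention site.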
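For the second step, the sketch below ranks SAE features by mutual information with the model's behavior. `sae_acts` and `uses_context` are hypothetical stand-ins, and scikit-learn's `mutual_info_classif` is one plausible estimator rather than the paper's exact metric.

```python
# Minimal sketch: ranking SAE features by mutual information with the model's
# knowledge-selection behavior. `uses_context` stands in for labels recording
# whether the model answered from the context (1) or parametric memory (0).
import numpy as np
from sklearn.feature_selection import mutual_info_classif

rng = np.random.default_rng(0)
sae_acts = rng.exponential(size=(1000, 2048))  # stand-in SAE feature activations
uses_context = rng.integers(0, 2, size=1000)   # stand-in behavior labels

mi = mutual_info_classif(sae_acts, uses_context, discrete_features=False)
top_features = np.argsort(mi)[::-1][:32]       # candidate steering features
print("highest-MI SAE features:", top_features[:8])
```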
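For the third step, a sketch of activation editing via a forward hook, in the style of common PyTorch steering implementations. The decoder matrix, the selected feature indices, the scaling factor `alpha`, and the hook placement are all assumptions for illustration; SpARE's exact editing rule may differ.

```python
# Minimal sketch: steering by adding the decoder directions of selected SAE
# features to the residual stream at a chosen layer during the forward pass.
import torch

d_model, d_sae, alpha = 512, 2048, 4.0
sae_decoder = torch.randn(d_sae, d_model)   # stand-in SAE decoder weights
top_features = torch.tensor([3, 17, 42])    # features selected by mutual info

steering_vector = alpha * sae_decoder[top_features].sum(dim=0)

def steer_hook(module, inputs, output):
    # output is the residual stream (batch, seq, d_model), possibly wrapped
    # in a tuple; shift it toward the desired behavior by adding the vector.
    hidden = output[0] if isinstance(output, tuple) else output
    hidden = hidden + steering_vector.to(hidden.dtype)
    return (hidden, *output[1:]) if isinstance(output, tuple) else hidden

# Hypothetical usage on a HuggingFace-style decoder stack:
# handle = model.model.layers[mid_layer].register_forward_hook(steer_hook)
# ...generate...
# handle.remove()
```

The sign and scale of the edit determine which knowledge source is preferred: adding the directions of context-associated features pushes the model toward the context, while subtracting them (or adding memory-associated directions) pushes it toward parametric memory.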
Results
The experimental evaluation focused on open-domain question-answering tasks with inherent knowledge conflicts. SpARE controlled knowledge-selection behavior more effectively than existing representation-engineering methods and contrastive decoding techniques, achieving accuracy gains of up to 15% over contrastive decoding methods and 10% over alternative representation-engineering approaches.
Implications and Future Directions
The paper's findings underscore the potential of SAEs to enable more nuanced model control without additional training. This capability is particularly relevant in settings where rapid adaptation to new information is critical, such as real-time applications. The results suggest promising avenues for further research into scalable, efficient techniques for managing complex LLM behaviors.
Future work might explore the applicability of SpARE across a broader range of tasks and model architectures, expanding beyond open-domain question answering. Additionally, refining the SAE features to improve interpretability and sharpen the control mechanism could enhance the approach's utility in diverse AI applications.
In summary, this research contributes a novel, efficient method for dynamically managing knowledge conflicts in LLMs, providing a valuable tool for enhancing the adaptability and reliability of AI systems.