Eliciting Latent Knowledge from Quirky Language Models (2312.01037v4)
Abstract: Eliciting Latent Knowledge (ELK) aims to find patterns in a capable neural network's activations that robustly track the true state of the world, especially in hard-to-verify cases where the model's output is untrusted. To further ELK research, we introduce 12 datasets and a corresponding suite of "quirky" language models (LMs) that are finetuned to make systematic errors when answering questions if and only if the keyword "Bob" is present in the prompt. We find that, especially in middle layers, linear probes usually report an LM's knowledge independently of what the LM outputs, enabling us to elicit the correct answer despite the model's untruthful output. The best probing method (logistic regression on contrast pairs) recovers 89% of the gap in AUROC between truthful and untruthful contexts, and 75% for questions harder than those used to train the probe. We also find that a mechanistic anomaly detection approach can flag untruthful behavior with 0.95 AUROC. Our results show promise for eliciting reliable knowledge from capable but untrusted models, and facilitate future empirical research on ELK methods.
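To make the best-performing method concrete, below is a minimal sketch of logistic regression on contrast pairs: each question is completed once with "True" and once with "False", and the probe is trained on the difference of the two activation vectors. The `get_activations` helper, the layer index, the hidden size, and the toy data are all illustrative assumptions, not the paper's actual pipeline.

```python
# Sketch: logistic-regression probe on contrast-pair activation differences.
# Assumptions (not from the paper): get_activations is a placeholder,
# layer 16 and d_model=512 are arbitrary, and the data is synthetic.
import numpy as np
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import roc_auc_score

rng = np.random.default_rng(0)

def get_activations(statements, layer=16):
    """Placeholder: return middle-layer hidden states, shape (n, d_model).
    A real pipeline would run the LM with output_hidden_states=True."""
    return rng.normal(size=(len(statements), 512))

# Toy contrast pairs: each question completed with "True" and with "False".
questions = [f"statement {i}?" for i in range(200)]
pos = [q + " True" for q in questions]
neg = [q + " False" for q in questions]
labels = rng.integers(0, 2, size=200)  # 1 iff the "True" completion is correct

# Train the probe on within-pair activation differences.
X = get_activations(pos) - get_activations(neg)
X_train, X_test = X[:150], X[150:]
y_train, y_test = labels[:150], labels[150:]

probe = LogisticRegression(max_iter=1000).fit(X_train, y_train)
print("AUROC:", roc_auc_score(y_test, probe.predict_proba(X_test)[:, 1]))
```

In the paper's setting, `X_test` would come from untruthful ("Bob") contexts or from harder questions than the probe was trained on, so the reported AUROC measures transfer rather than in-distribution fit.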
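The mechanistic anomaly detection result can be illustrated similarly. One simple instantiation, sketched below under assumed details, scores each example by its Mahalanobis distance from a reference distribution of activations collected in trusted (truthful) contexts; examples from untruthful contexts should score as anomalous. The dimensions, data, and covariance estimator here are illustrative stand-ins.

```python
# Sketch: flagging untruthful behavior as distributional anomaly via
# Mahalanobis distance from trusted activations. All data is synthetic
# and the details (layer, estimator, d_model) are assumptions.
import numpy as np
from sklearn.covariance import EmpiricalCovariance
from sklearn.metrics import roc_auc_score

rng = np.random.default_rng(0)
d = 512

# Reference activations from trusted (truthful) contexts, plus a test set
# mixing truthful and untruthful contexts (the latter with a shifted mean).
trusted = rng.normal(size=(500, d))
test_truthful = rng.normal(size=(100, d))
test_untruthful = rng.normal(loc=0.5, size=(100, d))

# Fit the reference distribution and score test points by (squared)
# Mahalanobis distance; higher distance = more anomalous.
cov = EmpiricalCovariance().fit(trusted)
scores = cov.mahalanobis(np.vstack([test_truthful, test_untruthful]))
labels = np.array([0] * 100 + [1] * 100)  # 1 = untruthful ("Bob") context
print("anomaly AUROC:", roc_auc_score(labels, scores))
```

The key property is that the detector needs no labels for untruthful behavior: it is fit only on trusted examples and flags departures from that distribution.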