Eliciting Latent Knowledge from Quirky Language Models (2312.01037v4)

Published 2 Dec 2023 in cs.LG, cs.AI, and cs.CL

Abstract: Eliciting Latent Knowledge (ELK) aims to find patterns in a capable neural network's activations that robustly track the true state of the world, especially in hard-to-verify cases where the model's output is untrusted. To further ELK research, we introduce 12 datasets and a corresponding suite of "quirky" language models (LMs) that are finetuned to make systematic errors when answering questions if and only if the keyword "Bob" is present in the prompt. We find that, especially in middle layers, linear probes usually report an LM's knowledge independently of what the LM outputs, enabling us to elicit the correct answer despite the model's untruthful output. The best probing method (logistic regression on contrast pairs) recovers 89% of the gap in AUROC between truthful and untruthful contexts, and 75% for questions harder than those used to train the probe. We also find that a mechanistic anomaly detection approach can flag untruthful behavior with 0.95 AUROC. Our results show promise for eliciting reliable knowledge from capable but untrusted models, and facilitate future research empirically investigating ELK methods.


Summary

  • The paper introduces 12 datasets and a suite of "quirky" language models fine-tuned to make systematic output errors, as a testbed for Eliciting Latent Knowledge (ELK).
  • It applies simple linear probes, including difference-in-means classifiers, to recover the models' latent knowledge even when their outputs are untruthful.
  • The probes generalize from easy to hard questions, and both supervised probing and mechanistic anomaly detection perform well, paving the way for future LM oversight research.

In the paper "Eliciting Latent Knowledge from Quirky Language Models," the researchers present methods for uncovering reliable information from the internal representations of language models (LMs), even when their outputs deviate from the truth. The framing is Eliciting Latent Knowledge (ELK): identifying patterns in a neural network's activations that track the actual state of the world, not just the output the model provides. These patterns are read out, and distinguished from misleading outputs, using a classifier called a probe.

The researchers developed a suite of LMs, which they term "quirky," fine-tuned to make systematic errors whenever the keyword "Bob" appears in the prompt. Despite being trained to output incorrect answers in these contexts, the models still encode knowledge of the correct answers. The paper shows that simple probing methods can recover the correct answers from these models, including on questions that were not part of the probes' training data. In particular, a simple difference-in-means classifier generalized especially well across prompt contexts, reliably exposing discrepancies between the model's output and its latent truth-tracking representations.
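
As a rough illustration of this kind of probe, a difference-in-means classifier can be fit with a few lines of linear algebra. The sketch below is a minimal NumPy version, assuming you already have a matrix of hidden-layer activations and binary truth labels; the variable names and random data are placeholders rather than the paper's actual code.

```python
import numpy as np

def fit_diff_in_means(acts: np.ndarray, labels: np.ndarray) -> np.ndarray:
    """Direction pointing from the mean of false examples to the mean of true examples."""
    mu_true = acts[labels == 1].mean(axis=0)
    mu_false = acts[labels == 0].mean(axis=0)
    return mu_true - mu_false

def probe_scores(acts: np.ndarray, direction: np.ndarray) -> np.ndarray:
    """Project activations onto the probe direction; higher scores suggest 'true'."""
    return acts @ direction

# Toy usage with random data standing in for real middle-layer activations.
rng = np.random.default_rng(0)
acts = rng.normal(size=(100, 512))       # (n_examples, hidden_dim)
labels = rng.integers(0, 2, size=100)    # 1 = statement is true, 0 = false
direction = fit_diff_in_means(acts, labels)
scores = probe_scores(acts, direction)
```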

The datasets are designed to challenge ELK methods: they contain "easy" questions, standing in for claims humans can verify, and "hard" questions, standing in for claims beyond a human expert's capacity. Probes are trained only on easy questions but are expected to keep working on the harder ones. As a concrete example, the researchers built Quirky Math, a dataset of addition problems whose proposed answers are labeled by fictional characters: answers given in the truthful context are labeled correctly, while answers attributed to "Bob" follow a distinct, systematic error pattern.
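
To make the setup concrete, the sketch below generates Quirky-Math-style examples. The specific error rule attributed to "Bob" (accepting any proposed sum that ends in 0) is a hypothetical stand-in for the paper's actual error pattern, and "Alice" is simply an illustrative name for a truthful labeler.

```python
import random

def make_example(character: str, rng: random.Random) -> dict:
    """Build one Quirky-Math-style statement labeled by the given character."""
    a, b = rng.randint(100, 999), rng.randint(100, 999)
    true_sum = a + b
    # Show the correct sum half the time, otherwise a nearby wrong answer.
    shown = true_sum if rng.random() < 0.5 else true_sum + rng.choice([-10, -1, 1, 10])
    truthful_label = (shown == true_sum)
    # Hypothetical quirk: "Bob" also accepts any proposed sum ending in 0.
    bob_label = truthful_label or (shown % 10 == 0)
    label = bob_label if character == "Bob" else truthful_label
    return {
        "statement": f"{character}: {a} + {b} = {shown}. True or False?",
        "character_label": label,        # what the quirky model is trained to say
        "ground_truth": truthful_label,  # what is actually true
    }

rng = random.Random(0)
dataset = [make_example(c, rng) for c in ("Alice", "Bob") for _ in range(5)]
```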

Several probing techniques were applied, both supervised and unsupervised, with a focus on linear models. Each probe takes the activations from a single layer of the quirky LM and scores how likely a statement is to be true. A key finding is that these probes, particularly difference-in-means and logistic regression on contrast pairs, generalize from contexts where the model's output accurately tracks the truth to contexts where it does not, even when the probes are trained only on easy examples.
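
The abstract's best-performing method, logistic regression on contrast pairs, can be sketched as follows. The construction of the pairs (the same statement completed with "True" versus "False") and the use of scikit-learn are assumptions made here for illustration, not details taken from the paper.

```python
import numpy as np
from sklearn.linear_model import LogisticRegression

def fit_contrast_probe(acts_pos: np.ndarray, acts_neg: np.ndarray, labels: np.ndarray):
    """Fit a linear probe on the difference of paired activations (one row per contrast pair)."""
    features = acts_pos - acts_neg
    clf = LogisticRegression(max_iter=1000)
    clf.fit(features, labels)
    return clf

# Toy usage with random stand-ins for the activations of the paired completions.
rng = np.random.default_rng(0)
acts_pos = rng.normal(size=(200, 512))   # e.g., activations of "... True"
acts_neg = rng.normal(size=(200, 512))   # e.g., activations of "... False"
labels = rng.integers(0, 2, size=200)    # ground-truth label of each statement
probe = fit_contrast_probe(acts_pos, acts_neg, labels)
truth_probs = probe.predict_proba(acts_pos - acts_neg)[:, 1]
```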

Additionally, the researchers experimented with mechanistic anomaly detection, which requires no examples of untruthful behavior for training. A detector fit only on probe outputs from trustworthy examples is used to flag inputs on which the model may not be accurately tracking the truth, reaching 0.95 AUROC.
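
One simple way to instantiate such a detector is to model the distribution of probe features on trusted examples and flag outliers by Mahalanobis distance; this particular choice is illustrative, not necessarily the paper's exact detector.

```python
import numpy as np

def fit_gaussian(trusted_feats: np.ndarray):
    """Fit a Gaussian to probe features gathered only from trusted examples."""
    mean = trusted_feats.mean(axis=0)
    cov = np.cov(trusted_feats, rowvar=False) + 1e-6 * np.eye(trusted_feats.shape[1])
    return mean, np.linalg.inv(cov)

def anomaly_scores(feats: np.ndarray, mean: np.ndarray, cov_inv: np.ndarray) -> np.ndarray:
    """Squared Mahalanobis distance from the trusted distribution; higher = more anomalous."""
    diff = feats - mean
    return np.einsum("ij,jk,ik->i", diff, cov_inv, diff)

# Toy usage: fit on trusted ("truthful context") features, score possibly quirky inputs.
rng = np.random.default_rng(0)
trusted = rng.normal(size=(300, 16))
test = rng.normal(loc=0.5, size=(50, 16))
mean, cov_inv = fit_gaussian(trusted)
scores = anomaly_scores(test, mean, cov_inv)
```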

In conclusion, the paper points to promising avenues for using ELK to provide reliable oversight of LMs, especially as they begin to exceed human expert performance. The authors provide their experimental setup, models, and datasets to encourage further research. They also point to potential future work, including expanding the scope of the datasets and probing methods, and exploring the limits of context-independent knowledge representations within LMs.
