Eliciting Latent Knowledge from Quirky Language Models (2312.01037v4)
Abstract: Eliciting Latent Knowledge (ELK) aims to find patterns in a capable neural network's activations that robustly track the true state of the world, especially in hard-to-verify cases where the model's output is untrusted. To further ELK research, we introduce 12 datasets and a corresponding suite of "quirky" language models (LMs) that are finetuned to make systematic errors when answering questions if and only if the keyword "Bob" is present in the prompt. We find that, especially in middle layers, linear probes usually report an LM's knowledge independently of what the LM outputs, enabling us to elicit the correct answer despite the model's untruthful output. The best probing method (logistic regression on contrast pairs) recovers 89% of the gap in AUROC between truthful and untruthful contexts, and 75% for questions harder than those used to train the probe. We also find that a mechanistic anomaly detection approach can flag untruthful behavior with 0.95 AUROC. Our results show promise for eliciting reliable knowledge from capable but untrusted models, and facilitate future empirical research on ELK methods.
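To make the best-performing method concrete, below is a minimal sketch of logistic regression on contrast pairs: each question is completed once with "True" and once with "False", and the probe is trained on the difference of the two activation vectors. The `get_activations` helper, the layer index, the hidden size, and the toy data are all illustrative assumptions, not the paper's actual pipeline.

```python
# Sketch: logistic-regression probe on contrast-pair activation differences.
# Assumptions (not from the paper): get_activations is a placeholder,
# layer 16 and d_model=512 are arbitrary, and the data is synthetic.
import numpy as np
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import roc_auc_score

rng = np.random.default_rng(0)

def get_activations(statements, layer=16):
    """Placeholder: return middle-layer hidden states, shape (n, d_model).
    A real pipeline would run the LM with output_hidden_states=True."""
    return rng.normal(size=(len(statements), 512))

# Toy contrast pairs: each question completed with "True" and with "False".
questions = [f"statement {i}?" for i in range(200)]
pos = [q + " True" for q in questions]
neg = [q + " False" for q in questions]
labels = rng.integers(0, 2, size=200)  # 1 iff the "True" completion is correct

# Train the probe on within-pair activation differences.
X = get_activations(pos) - get_activations(neg)
X_train, X_test = X[:150], X[150:]
y_train, y_test = labels[:150], labels[150:]

probe = LogisticRegression(max_iter=1000).fit(X_train, y_train)
print("AUROC:", roc_auc_score(y_test, probe.predict_proba(X_test)[:, 1]))
```

In the paper's setting, `X_test` would come from untruthful ("Bob") contexts or from harder questions than the probe was trained on, so the reported AUROC measures transfer rather than in-distribution fit.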
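The mechanistic anomaly detection result can be illustrated similarly. One simple instantiation, sketched below under assumed details, scores each example by its Mahalanobis distance from a reference distribution of activations collected in trusted (truthful) contexts; examples from untruthful contexts should score as anomalous. The dimensions, data, and covariance estimator here are illustrative stand-ins.

```python
# Sketch: flagging untruthful behavior as distributional anomaly via
# Mahalanobis distance from trusted activations. All data is synthetic
# and the details (layer, estimator, d_model) are assumptions.
import numpy as np
from sklearn.covariance import EmpiricalCovariance
from sklearn.metrics import roc_auc_score

rng = np.random.default_rng(0)
d = 512

# Reference activations from trusted (truthful) contexts, plus a test set
# mixing truthful and untruthful contexts (the latter with a shifted mean).
trusted = rng.normal(size=(500, d))
test_truthful = rng.normal(size=(100, d))
test_untruthful = rng.normal(loc=0.5, size=(100, d))

# Fit the reference distribution and score test points by (squared)
# Mahalanobis distance; higher distance = more anomalous.
cov = EmpiricalCovariance().fit(trusted)
scores = cov.mahalanobis(np.vstack([test_truthful, test_untruthful]))
labels = np.array([0] * 100 + [1] * 100)  # 1 = untruthful ("Bob") context
print("anomaly AUROC:", roc_auc_score(labels, scores))
```

The key property is that the detector needs no labels for untruthful behavior: it is fit only on trusted examples and flags departures from that distribution.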