Localizing Lying in Llama: Understanding Instructed Dishonesty on True-False Questions Through Prompting, Probing, and Patching (arXiv:2311.15131v1)
Abstract: LLMs demonstrate significant knowledge through their outputs, though it is often unclear whether false outputs stem from a lack of knowledge or from dishonesty. In this paper, we investigate instructed dishonesty, wherein we explicitly prompt LLaMA-2-70b-chat to lie. We perform prompt engineering to find which prompts best induce lying behavior, and then use mechanistic interpretability approaches to localize where in the network this behavior occurs. Using linear probing and activation patching, we localize five layers that appear especially important for lying. We then identify just 46 attention heads within these layers on which causal interventions make the lying model answer honestly instead. We show that these interventions work robustly across many prompts and dataset splits. Overall, our work contributes to a greater understanding of dishonesty in LLMs, which we hope will help prevent it.
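The linear-probing step described above can be sketched as follows. This is a minimal illustration, not the paper's implementation: it trains a logistic-regression probe to separate "honest" from "lying" contexts using synthetic vectors that stand in for per-layer residual-stream activations (which the paper extracts from LLaMA-2-70b-chat). The hidden size, data, and separation direction here are all hypothetical.

```python
# Hedged sketch of a linear honesty probe on synthetic "activations".
# Real usage would replace X/y with per-layer residual activations and
# honest/lying labels; everything below is a toy stand-in.
import numpy as np

rng = np.random.default_rng(0)
d_model = 64  # toy hidden size; LLaMA-2-70b uses 8192

# Synthetic activations: the two conditions differ along one direction.
direction = rng.normal(size=d_model)
honest = rng.normal(size=(200, d_model)) + 0.5 * direction
lying = rng.normal(size=(200, d_model)) - 0.5 * direction
X = np.vstack([honest, lying])
y = np.array([1] * 200 + [0] * 200)  # 1 = honest, 0 = lying

# Logistic-regression probe trained by plain gradient descent.
w = np.zeros(d_model)
b = 0.0
for _ in range(500):
    p = 1.0 / (1.0 + np.exp(-(X @ w + b)))  # predicted P(honest)
    w -= 0.5 * (X.T @ (p - y)) / len(y)
    b -= 0.5 * np.mean(p - y)

acc = np.mean(((X @ w + b) > 0) == y)
print(f"probe accuracy: {acc:.2f}")
```

High probe accuracy at a given layer is evidence that the honest/lying distinction is linearly represented there; the paper combines such probes with activation patching to localize the relevant layers and heads.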
Authors: James Campbell, Richard Ren, Phillip Guo