Representation Engineering: A Top-Down Approach to AI Transparency (2310.01405v4)
Abstract: In this paper, we identify and characterize the emerging area of representation engineering (RepE), an approach to enhancing the transparency of AI systems that draws on insights from cognitive neuroscience. RepE places population-level representations, rather than neurons or circuits, at the center of analysis, equipping us with novel methods for monitoring and manipulating high-level cognitive phenomena in deep neural networks (DNNs). We provide baselines and an initial analysis of RepE techniques, showing that they offer simple yet effective solutions for improving our understanding and control of LLMs. We showcase how these methods can provide traction on a wide range of safety-relevant problems, including honesty, harmlessness, power-seeking, and more, demonstrating the promise of top-down transparency research. We hope that this work catalyzes further exploration of RepE and fosters advancements in the transparency and safety of AI systems.