
Representation Engineering: A Top-Down Approach to AI Transparency (2310.01405v4)

Published 2 Oct 2023 in cs.LG, cs.AI, cs.CL, cs.CV, and cs.CY

Abstract: In this paper, we identify and characterize the emerging area of representation engineering (RepE), an approach to enhancing the transparency of AI systems that draws on insights from cognitive neuroscience. RepE places population-level representations, rather than neurons or circuits, at the center of analysis, equipping us with novel methods for monitoring and manipulating high-level cognitive phenomena in deep neural networks (DNNs). We provide baselines and an initial analysis of RepE techniques, showing that they offer simple yet effective solutions for improving our understanding and control of LLMs. We showcase how these methods can provide traction on a wide range of safety-relevant problems, including honesty, harmlessness, power-seeking, and more, demonstrating the promise of top-down transparency research. We hope that this work catalyzes further exploration of RepE and fosters advancements in the transparency and safety of AI systems.


Summary

  • The paper introduces Representation Engineering as a top-down method that abstracts high-level neural representations to improve AI transparency.
  • It demonstrates that manipulating key representation vectors can control LLM behaviors, enhancing honesty and ethical responses.
  • The approach offers actionable insights for AI safety by enabling real-time interventions and dynamic updates to factual knowledge.

Exploring the Potential of Representation Engineering in Enhancing AI Transparency and Safety

Understanding Representation Engineering

Representation Engineering (RepE) is an emerging approach to AI transparency and control. Transparency research has traditionally dissected neural networks at a granular level, examining individual neurons and circuits to uncover the mechanisms behind complex cognitive phenomena. This bottom-up analysis, focused on the minutiae of neural connections, often falls short of explaining the higher-order cognitive capabilities that LLMs exhibit.

RepE presents itself as a top-down methodology for examining the internal workings of AI systems. Rooted in insights from cognitive neuroscience, specifically the Hopfieldian view, RepE prioritizes the study of representations within neural networks. This approach abstracts away the complexities of individual neurons to focus on the patterns of neural activity that encode high-level cognitive phenomena. By treating representations as the unit of analysis, RepE aims to provide a more intuitive and effective framework for interpreting the behavior of sophisticated models.

Initial Findings and Advances in Transparency Research

Empirical evidence suggests that AI systems, especially LLMs, develop emergent representational structure that encodes various concepts and functions, including morality, utility, emotion, and even abstract notions like honesty. Through systematic analysis, researchers have demonstrated the feasibility of extracting and manipulating these representations to influence model behavior in meaningful ways.
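
To make this concrete, the sketch below shows one simple way to extract a concept direction from activations collected under contrasting prompts (for example, instructed-honest versus instructed-dishonest completions of the same stimuli). It assumes the activations have already been gathered from a single layer of a model; the difference-of-means reader shown here is one simple choice among several, and the array names and shapes are illustrative assumptions rather than the paper's exact procedure.

```python
# Minimal sketch: extract a "concept direction" from paired activations.
# Assumes pos_acts/neg_acts were collected from one layer of an LLM under
# contrasting prompts; names, shapes, and the difference-of-means reader
# are illustrative assumptions, not the paper's exact recipe.
import numpy as np

def concept_direction(pos_acts: np.ndarray, neg_acts: np.ndarray) -> np.ndarray:
    """Unit vector pointing from the 'negative' condition toward the 'positive' one.

    pos_acts, neg_acts: (n_examples, hidden_dim) hidden states from the same layer.
    """
    direction = pos_acts.mean(axis=0) - neg_acts.mean(axis=0)
    return direction / np.linalg.norm(direction)

def concept_score(acts: np.ndarray, direction: np.ndarray) -> np.ndarray:
    """Project activations onto the concept direction; higher = more 'positive'."""
    return acts @ direction

# Toy usage with random stand-ins for real activations:
rng = np.random.default_rng(0)
pos = rng.normal(+0.5, 1.0, size=(32, 64))   # e.g. honest-condition activations
neg = rng.normal(-0.5, 1.0, size=(32, 64))   # e.g. dishonest-condition activations
v = concept_direction(pos, neg)
print(concept_score(pos, v).mean(), concept_score(neg, v).mean())  # first should be larger
```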

For instance, by identifying representation vectors associated with specific concepts such as honesty, researchers have successfully guided LLMs to produce more truthful responses. This methodology has shown promise not only in enhancing model honesty but also in controlling a model's expression of emotion, its adherence to ethical guidelines, and even its propensity to regurgitate memorized training data.
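
The control side of this idea can be sketched as a simple activation addition: a concept vector is added to the residual stream at one layer during generation. The snippet below is a hedged illustration rather than the paper's exact method; the model name, layer index, and steering coefficient are assumptions, the hook relies on Hugging Face LLaMA-style module naming (`model.model.layers`), and the random `direction` stands in for a vector extracted as in the previous sketch.

```python
# Hedged sketch of steering generation by adding a concept vector to the
# residual stream at one layer. Model name, layer index, and coefficient
# are illustrative assumptions; the direction below is a stand-in for a
# real extracted concept vector.
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

model_name = "meta-llama/Llama-2-7b-chat-hf"  # assumed; any causal LM with .model.layers works
tok = AutoTokenizer.from_pretrained(model_name)
model = AutoModelForCausalLM.from_pretrained(model_name, torch_dtype=torch.float16, device_map="auto")

layer_idx = 15   # middle layer, chosen for illustration
coeff = 4.0      # steering strength, tuned empirically in practice
direction = torch.randn(model.config.hidden_size)          # stand-in concept vector
direction = (direction / direction.norm()).to(model.device, model.dtype)

def add_direction(module, inputs, output):
    # Decoder layers return a tuple whose first element is the hidden states.
    hidden = output[0] + coeff * direction                  # shift every token's residual stream
    return (hidden,) + tuple(output[1:])

handle = model.model.layers[layer_idx].register_forward_hook(add_direction)
try:
    prompt = "Tell me about the time you broke a promise."
    ids = tok(prompt, return_tensors="pt").to(model.device)
    out = model.generate(**ids, max_new_tokens=60, do_sample=False)
    print(tok.decode(out[0], skip_special_tokens=True))
finally:
    handle.remove()  # detach the hook so later calls run unmodified
```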

Implications for AI Safety and Accountability

The insights derived from RepE have profound implications for AI safety and accountability. By enabling control over model representations, RepE offers a mechanism to steer LLMs away from undesired behaviors, such as generating biased or harmful content. Furthermore, this approach permits finer-grained monitoring of model states, thereby facilitating real-time interventions to ensure alignment with ethical standards and societal values.
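
On the monitoring side, the same concept direction can serve as a lightweight detector at inference time: project each token's hidden state onto the direction and flag spans whose score crosses a threshold. The sketch below reuses the `model`, `tok`, and `direction` from the steering example above; the layer index, threshold, and flagging rule are illustrative assumptions.

```python
# Minimal monitoring sketch: score each token by projecting its hidden state
# onto a previously extracted concept direction (e.g. honesty). Layer index,
# threshold, and the flagging rule are illustrative assumptions.
import torch

@torch.no_grad()
def monitor_tokens(model, tok, text, direction, layer_idx=15, threshold=0.0):
    ids = tok(text, return_tensors="pt").to(model.device)
    out = model(**ids, output_hidden_states=True)
    # hidden_states[0] is the embedding output, so decoder layer `layer_idx`
    # corresponds to index layer_idx + 1.
    hidden = out.hidden_states[layer_idx + 1][0]             # (seq_len, hidden_dim)
    scores = (hidden.to(direction.dtype) @ direction).tolist()
    tokens = tok.convert_ids_to_tokens(ids["input_ids"][0].tolist())
    flagged = [(t, round(s, 3)) for t, s in zip(tokens, scores) if s < threshold]
    return scores, flagged

scores, flagged = monitor_tokens(model, tok, "I definitely returned the money I borrowed.", direction)
print(flagged)  # tokens whose projection fell below the threshold, for review or intervention
```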

Moreover, the ability to edit factual knowledge and conceptual understandings within a model paves the way for dynamic updates to AI systems—ensuring that they remain accurate, relevant, and devoid of outdated or incorrect information.

Prospects for Future Research and Development

While the initial exploration of RepE has yielded encouraging results, significant prospects for future research remain. One intriguing direction involves delving deeper into the nature of representations themselves—examining how different forms of information are encoded and transformed across network layers. Additionally, extending RepE methods to encompass not just static representations but also the trajectories and manifolds within representation spaces could unlock new dimensions of AI interpretability and control.

Another focal area for future work is the scalability and generalizability of RepE techniques across diverse AI architectures and applications. As AI systems continue their integration into various domains, the versatility of RepE in accommodating different model structures and functionalities will be crucial for broad adoption.

Conclusion

Representation Engineering marks a significant step toward transparent, interpretable, and controllable AI systems. By shifting the lens from neurons and circuits to representations, RepE opens a promising avenue for understanding and shaping the cognitive processes of AI models. As research in this area advances, collaboration across disciplines will be instrumental in realizing the full potential of RepE and ensuring that AI progress proceeds in tandem with ethical frameworks and societal well-being.
