Fluent dreaming for language models (2402.01702v1)
Abstract: Feature visualization, also known as "dreaming", offers insights into vision models by optimizing the inputs to maximize a neuron's activation or the value of another internal component. However, dreaming has not been successfully applied to LLMs because the input space is discrete. We extend Greedy Coordinate Gradient, a method from the LLM adversarial attack literature, to design the Evolutionary Prompt Optimization (EPO) algorithm. EPO optimizes the input prompt to trace out the Pareto frontier between a chosen internal feature and prompt fluency, enabling fluent dreaming for LLMs. We demonstrate dreaming with neurons, output logits, and arbitrary directions in activation space. We measure the fluency of the resulting prompts and compare LLM dreaming with max-activating dataset examples. Critically, fluent dreaming allows automatically exploring the behavior of model internals in reaction to mildly out-of-distribution prompts. Code for running EPO is available at https://github.com/Confirm-Solutions/dreamy. A companion page demonstrating code usage is at https://confirmlabs.org/posts/dreamy.html
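To make the optimization loop concrete, here is a minimal, self-contained sketch of the Pareto-frontier bookkeeping that the abstract describes. It is an illustration, not the paper's implementation: the scoring functions `activation` and `log_prob` are toy stand-ins for a real feature activation and a real fluency score, and the GCG-style gradient-ranked token swap is replaced by a random single-token substitution (the actual method ranks candidate substitutions by gradients of a weighted activation-plus-fluency objective).

```python
# Sketch of EPO-style population maintenance: evolve prompts while keeping
# only those that are Pareto-optimal in (feature activation, fluency).
# All scoring functions below are hypothetical toys, not the paper's.

import random

VOCAB = ["the", "a", "dream", "model", "neuron", "fluent", "of", "language"]

def activation(prompt):
    # Toy stand-in: pretend the chosen feature fires on the token "dream".
    return prompt.count("dream")

def log_prob(prompt):
    # Toy stand-in for fluency: penalize repeated adjacent tokens.
    return -sum(a == b for a, b in zip(prompt, prompt[1:]))

def dominated(p, q):
    # p is dominated by q if q is at least as good on both objectives
    # and strictly better on at least one (both objectives maximized).
    pa, pf = activation(p), log_prob(p)
    qa, qf = activation(q), log_prob(q)
    return qa >= pa and qf >= pf and (qa > pa or qf > pf)

def pareto_front(population):
    return [p for p in population
            if not any(dominated(p, q) for q in population if q != p)]

def mutate(prompt):
    # Stand-in for a GCG-style edit: swap one token position at random.
    i = random.randrange(len(prompt))
    child = list(prompt)
    child[i] = random.choice(VOCAB)
    return child

def epo(seed, iters=200, children_per_parent=4):
    population = [list(seed)]
    for _ in range(iters):
        parents = pareto_front(population)
        children = [mutate(p) for p in parents
                    for _ in range(children_per_parent)]
        # Deduplicate, then keep only non-dominated prompts.
        unique = [list(t) for t in {tuple(p) for p in parents + children}]
        population = pareto_front(unique)
    return population

if __name__ == "__main__":
    random.seed(0)
    for p in epo(["the", "model", "of", "language"]):
        print(" ".join(p), activation(p), log_prob(p))
```

With these toy objectives the surviving frontier trades off prompts that repeat the feature-triggering token (high activation, poor fluency) against varied ones (lower activation, better fluency), which is the shape of trade-off the abstract's Pareto frontier refers to.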