
Fluent dreaming for language models (2402.01702v1)

Published 24 Jan 2024 in cs.CL and cs.AI

Abstract: Feature visualization, also known as "dreaming", offers insights into vision models by optimizing the inputs to maximize a neuron's activation or other internal component. However, dreaming has not been successfully applied to LLMs because the input space is discrete. We extend Greedy Coordinate Gradient, a method from the LLM adversarial attack literature, to design the Evolutionary Prompt Optimization (EPO) algorithm. EPO optimizes the input prompt to simultaneously maximize the Pareto frontier between a chosen internal feature and prompt fluency, enabling fluent dreaming for LLMs. We demonstrate dreaming with neurons, output logits and arbitrary directions in activation space. We measure the fluency of the resulting prompts and compare LLM dreaming with max-activating dataset examples. Critically, fluent dreaming allows automatically exploring the behavior of model internals in reaction to mildly out-of-distribution prompts. Code for running EPO is available at https://github.com/Confirm-Solutions/dreamy. A companion page demonstrating code usage is at https://confirmlabs.org/posts/dreamy.html
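The core idea of EPO described above is multi-objective: rather than maximizing a single scalar, it maintains a Pareto frontier of prompts trading off the target internal activation against prompt fluency (typically measured via the language model's own cross-entropy). A minimal sketch of the frontier-maintenance step, using hypothetical candidate tuples and treating both objectives as higher-is-better (so fluency is stored as negative cross-entropy), might look like this; it is an illustration of the Pareto idea, not the authors' implementation:

```python
def pareto_frontier(candidates):
    """Keep the non-dominated candidates on (activation, fluency).

    Each candidate is a (prompt, activation, fluency) tuple, where
    fluency is negative cross-entropy so that higher is better for
    both objectives. A candidate is dominated if another candidate
    is at least as good on both objectives and strictly better on one.
    """
    frontier = []
    for cand in candidates:
        _, act, flu = cand
        dominated = any(
            a >= act and f >= flu and (a > act or f > flu)
            for _, a, f in candidates
        )
        if not dominated:
            frontier.append(cand)
    return frontier


# Toy example: "c" is dominated by "a" (worse on both objectives).
cands = [("a", 1.0, -2.0), ("b", 2.0, -3.0), ("c", 0.5, -4.0)]
print([p for p, _, _ in pareto_frontier(cands)])  # → ['a', 'b']
```

In the full EPO loop, each iteration would mutate frontier prompts with gradient-guided token swaps (as in Greedy Coordinate Gradient) and then re-filter the expanded candidate pool with a step like the one above.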
