A Language Model's Guide Through Latent Space (2402.14433v1)
Abstract: Concept guidance has emerged as a cheap and simple way to control the behavior of LLMs by probing their hidden representations for concept vectors and using them to perturb activations at inference time. While previous work has largely focused on truthfulness, in this paper we extend this framework to a richer set of concepts such as appropriateness, humor, creativity, and quality, and explore to what degree current detection and guidance strategies work in these challenging settings. To facilitate evaluation, we develop a novel metric for concept guidance that accounts for both the success of concept elicitation and the potential degradation in fluency of the guided model. Our extensive experiments reveal that while some concepts, such as truthfulness, more easily allow for guidance with current techniques, novel concepts such as appropriateness or humor either remain difficult to elicit, need extensive tuning to work, or can even confuse the model. Moreover, we find that probes with optimal detection accuracies do not necessarily make for optimal guides, contradicting previous observations for truthfulness. Our work warrants a deeper investigation into the interplay between detectability, guidability, and the nature of the concept, and we hope that our rich experimental test bed for guidance research inspires stronger follow-up approaches.
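The core operation the abstract describes, perturbing activations along a probed concept direction at inference time, can be illustrated with a short sketch. The snippet below is a minimal illustration assuming a HuggingFace `transformers` causal LM; the layer index, the guidance strength `alpha`, and the randomly initialized `concept_direction` are hypothetical placeholders (a real concept vector would come from a probe fit on labeled hidden states), and this is not the paper's exact implementation.

```python
# Minimal sketch of inference-time concept guidance ("activation steering"),
# assuming a HuggingFace causal LM. The layer index, guidance strength `alpha`,
# and the random `concept_direction` are hypothetical placeholders; a real
# direction would come from a probe trained on labeled hidden states.
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

model_name = "meta-llama/Llama-2-7b-chat-hf"  # any decoder-only LM works for the sketch
tokenizer = AutoTokenizer.from_pretrained(model_name)
model = AutoModelForCausalLM.from_pretrained(model_name)
model.eval()

layer_idx = 15   # transformer block whose output we perturb (hypothetical choice)
alpha = 4.0      # guidance strength (hypothetical choice)

# Placeholder concept vector; in practice it is obtained by probing hidden states,
# e.g. a difference of class means or a logistic-regression weight vector.
concept_direction = torch.randn(model.config.hidden_size, dtype=model.dtype)
concept_direction = concept_direction / concept_direction.norm()

def steer_hook(module, inputs, output):
    # Decoder layers return a tuple whose first element is the hidden states;
    # add the scaled concept direction to every token position.
    hidden = output[0] if isinstance(output, tuple) else output
    hidden = hidden + alpha * concept_direction.to(hidden.device)
    return (hidden,) + output[1:] if isinstance(output, tuple) else hidden

handle = model.model.layers[layer_idx].register_forward_hook(steer_hook)
try:
    inputs = tokenizer("Tell me about your weekend.", return_tensors="pt")
    out = model.generate(**inputs, max_new_tokens=50, do_sample=False)
    print(tokenizer.decode(out[0], skip_special_tokens=True))
finally:
    handle.remove()  # detach the hook so subsequent generations are unperturbed
```

Varying `alpha` trades off concept elicitation against fluency, which is exactly the tension the paper's evaluation metric is designed to capture.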