Representation Surgery: Theory and Practice of Affine Steering (2402.09631v6)
Abstract: LLMs often exhibit undesirable behavior, e.g., generating toxic or gender-biased text. In neural LLMs, an encoding of the undesirable behavior is often present in the model's representations. Thus, one natural (and common) approach to preventing the model from exhibiting undesirable behavior is to steer its representations in a manner that reduces the probability of it generating undesirable text. This paper investigates the formal and empirical properties of steering functions, i.e., transformations of a neural LLM's representations that alter its behavior. First, we derive two affine steering functions that are optimal, in the least-squares sense, under different constraints. Our theory provides justification for existing approaches and offers a novel, improved steering method. Second, we present a series of experiments demonstrating the empirical effectiveness of these methods in mitigating bias and reducing toxic generation.
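To make the central object concrete, here is a minimal NumPy sketch of affine steering. All names are illustrative and this is not the paper's exact construction: it shows two textbook instances of affine steering functions, mean matching (a pure translation) and the closed-form optimal-transport map between Gaussian approximations of two groups of hidden representations (a full affine map). Both are fit on representations extracted at some layer for a "source" group (e.g., texts with the undesired attribute) and a "target" group, and then applied to new representations at inference time.

```python
import numpy as np

def sqrtm_psd(M):
    """Symmetric PSD matrix square root via eigendecomposition."""
    w, V = np.linalg.eigh(M)
    w = np.clip(w, 0.0, None)  # guard against tiny negative eigenvalues
    return (V * np.sqrt(w)) @ V.T

def fit_mean_steering(H_src, H_tgt):
    """Translation-only steering: add the difference of group means.
    H_src, H_tgt: (n, d) arrays of hidden representations."""
    return H_tgt.mean(axis=0) - H_src.mean(axis=0)

def fit_affine_steering(H_src, H_tgt, eps=1e-6):
    """Affine steering h -> mu_t + W (h - mu_s) that matches the first two
    moments of the target group; W is the closed-form optimal-transport map
    between Gaussian approximations of the two groups (one natural affine
    steering function, not necessarily the paper's derivation)."""
    d = H_src.shape[1]
    mu_s, mu_t = H_src.mean(axis=0), H_tgt.mean(axis=0)
    S_s = np.cov(H_src, rowvar=False) + eps * np.eye(d)  # regularize
    S_t = np.cov(H_tgt, rowvar=False) + eps * np.eye(d)
    S_s_half = sqrtm_psd(S_s)
    S_s_inv_half = np.linalg.inv(S_s_half)
    W = S_s_inv_half @ sqrtm_psd(S_s_half @ S_t @ S_s_half) @ S_s_inv_half
    return lambda H: mu_t + (H - mu_s) @ W.T

# Usage on synthetic "representations": steer the source group so its
# first two moments approximately match the target group's.
rng = np.random.default_rng(0)
H_src = rng.normal(0.0, 1.0, size=(500, 16))
H_tgt = rng.normal(1.0, 0.5, size=(500, 16))
steer = fit_affine_steering(H_src, H_tgt)
steered = steer(H_src)
print(np.allclose(steered.mean(axis=0), H_tgt.mean(axis=0), atol=0.2))
```

The translation-only variant is the cheapest intervention (a single added vector per layer), while the full affine map also reshapes the covariance of the representations; which constraint is appropriate depends on how much of the model's representation geometry one is willing to alter.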