Representation Surgery: Theory and Practice of Affine Steering (2402.09631v6)

Published 15 Feb 2024 in cs.LG, cs.CL, and cs.CY

Abstract: LLMs often exhibit undesirable behavior, e.g., generating toxic or gender-biased text. In the case of neural LLMs, an encoding of the undesirable behavior is often present in the model's representations. Thus, one natural (and common) approach to prevent the model from exhibiting undesirable behavior is to steer the model's representations in a manner that reduces the probability of it generating undesirable text. This paper investigates the formal and empirical properties of steering functions, i.e., transformations of the neural LLM's representations that alter its behavior. First, we derive two optimal, in the least-squares sense, affine steering functions under different constraints. Our theory provides justification for existing approaches and offers a novel, improved steering approach. Second, we offer a series of experiments that demonstrate the empirical effectiveness of the methods in mitigating bias and reducing toxic generation.

References (53)
  1. Earth mover distance over high-dimensional spaces. In SODA, volume 8, pp.  343–352, 2008.
  2. Unsupervised clustering of multidimensional distributions using earth mover distance. In Proceedings of the 17th ACM SIGKDD International Conference on Knowledge Discovery and Data Mining, pp.  636–644, 2011.
  3. LEACE: Perfect linear concept erasure in closed form. arXiv preprint arXiv:2306.03819, 2023.
  4. Demographic dialectal variation in social media: A case study of African-American English. In Proceedings of the 2016 Conference on Empirical Methods in Natural Language Processing, pp.  1119–1130, 2016.
  5. Man is to computer programmer as woman is to homemaker? Debiasing word embeddings. In Lee, D. D., Sugiyama, M., von Luxburg, U., Guyon, I., and Garnett, R. (eds.), Advances in Neural Information Processing Systems 29: Annual Conference on Neural Information Processing Systems 2016, December 5-10, 2016, Barcelona, Spain, pp.  4349–4357, 2016a. URL https://proceedings.neurips.cc/paper/2016/hash/a486cd07e4ac3d270571622f4f316ec5-Abstract.html.
  6. Man is to computer programmer as woman is to homemaker? Debiasing word embeddings. Advances in Neural Information Processing Systems, 29, 2016b.
  7. Towards monosemanticity: Decomposing language models with dictionary learning. Transformer Circuits Thread, 2023. https://transformer-circuits.pub/2023/monosemantic-features/index.html.
  8. Fair regression with Wasserstein barycenters. Advances in Neural Information Processing Systems, 33:7321–7331, 2020.
  9. Joint distribution optimal transportation for domain adaptation. Advances in Neural Information Processing Systems, 30, 2017.
  10. DeepJDOT: Deep joint distribution optimal transport for unsupervised domain adaptation. In Proceedings of the European Conference on Computer Vision (ECCV), pp.  447–463, 2018.
  11. Plug and play language models: A simple approach to controlled text generation. arXiv preprint arXiv:1912.02164, 2019.
  12. Bias in bios: A case study of semantic representation bias in a high-stakes setting. In Proceedings of the Conference on Fairness, Accountability, and Transparency, pp.  120–128, 2019.
  13. Bias in bios: A case study of semantic representation bias in a high-stakes setting. CoRR, abs/1901.09451, 2019. URL http://arxiv.org/abs/1901.09451.
  14. BERT: Pre-training of deep bidirectional transformers for language understanding, 2019.
  15. Adversarial removal of demographic attributes from text data. In Riloff, E., Chiang, D., Hockenmaier, J., and Tsujii, J. (eds.), Proceedings of the 2018 Conference on Empirical Methods in Natural Language Processing, Brussels, Belgium, October 31 - November 4, 2018, pp.  11–21. Association for Computational Linguistics, 2018. doi: 10.18653/v1/d18-1002. URL https://doi.org/10.18653/v1/d18-1002.
  16. Amnesic probing: Behavioral explanation with amnesic counterfactuals. Transactions of the Association for Computational Linguistics, 9:160–175, 2021.
  17. CausaLM: Causal model explanation through counterfactual language models. Computational Linguistics, 47(2):333–386, 2021.
  18. Optimal transport for domain adaptation. IEEE Trans. Pattern Anal. Mach. Intell, 1(1-40):2, 2016.
  19. A framework for few-shot language model evaluation, December 2023. URL https://zenodo.org/records/10256836.
  20. RealToxicityPrompts: Evaluating neural toxic degeneration in language models. arXiv preprint arXiv:2009.11462, 2020.
  21. Transformer feed-forward layers are key-value memories. In Proceedings of the 2021 Conference on Empirical Methods in Natural Language Processing, pp.  5484–5495, 2021.
  22. Patchscopes: A unifying framework for inspecting hidden representations of language models. arXiv preprint arXiv:2401.06102, 2024.
  23. Lipstick on a pig: Debiasing methods cover up systematic gender biases in word embeddings but do not remove them. In Proceedings of the 2019 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies, Volume 1 (Long and Short Papers), pp.  609–614, 2019.
  24. Obtaining fairness using optimal transport theory. In International conference on machine learning, pp.  2357–2365. PMLR, 2019.
  25. Gordaliza Pastor, P. et al. Fair learning: an optimal transport based approach. 2020.
  26. Backpack language models. arXiv preprint arXiv:2305.16765, 2023.
  27. Kantorovich, L. V. Mathematical methods of organizing and planning production. Management Science, 6(4):366–422, 1960.
  28. On the optimal mapping of distributions. Journal of Optimization Theory and Applications, 43:39–49, 1984.
  29. Style transfer by relaxed optimal transport and self-similarity. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pp.  10051–10060, 2019.
  30. GeDi: Generative discriminator guided sequence generation. arXiv preprint arXiv:2009.06367, 2020.
  31. Everything is relative: Understanding fairness with optimal transport. arXiv preprint arXiv:2102.10349, 2021.
  32. Inference-time intervention: Eliciting truthful answers from a language model. arXiv preprint arXiv:2306.03341, 2023.
  33. DExperts: Decoding-time controlled text generation with experts and anti-experts. arXiv preprint arXiv:2105.03023, 2021.
  34. The geometry of truth: Emergent linear structure in large language model representations of true/false datasets. arXiv preprint arXiv:2310.06824, 2023.
  35. Locating and editing factual associations in GPT. Advances in Neural Information Processing Systems, 35:17359–17372, 2022.
  36. Pointer sentinel mixture models, 2016.
  37. Monge, G. Mémoire sur la théorie des déblais et des remblais. Mem. Math. Phys. Acad. Royale Sci., pp.  666–704, 1781.
  38. Mroueh, Y. Wasserstein style transfer. arXiv preprint arXiv:1905.12828, 2019.
  39. Pearl, J. Probabilistic reasoning in intelligent systems: networks of plausible inference. Morgan Kaufmann, 1988.
  40. Goodtriever: Adaptive toxicity mitigation with retrieval-augmented models. arXiv preprint arXiv:2310.07589, 2023.
  41. Language models are unsupervised multitask learners. 2019.
  42. Null it out: Guarding protected attributes by iterative nullspace projection. In Proceedings of the 58th Annual Meeting of the Association for Computational Linguistics, pp.  7237–7256, 2020.
  43. Adversarial concept erasure in kernel space. In Proceedings of the 2022 Conference on Empirical Methods in Natural Language Processing, pp.  6034–6055, 2022.
  44. The woman worked as a babysitter: On biases in language generation. In Proceedings of the 2019 Conference on Empirical Methods in Natural Language Processing and the 9th International Joint Conference on Natural Language Processing (EMNLP-IJCNLP), pp.  3407–3412, 2019.
  45. Approximate earth mover’s distance in linear time. In 2008 IEEE Conference on Computer Vision and Pattern Recognition, pp.  1–8. IEEE, 2008.
  46. Extracting latent steering vectors from pretrained language models. arXiv preprint arXiv:2205.05124, 2022.
  47. LLaMA: Open and efficient foundation language models, 2023.
  48. Activation addition: Steering language models without optimization. arXiv preprint arXiv:2308.10248, 2023.
  49. Vaserstein, L. N. Markov processes over denumerable products of spaces, describing large systems of automata. Problemy Peredachi Informatsii, 5(3):64–72, 1969.
  50. Domain-adaptive pretraining methods for dialogue understanding. arXiv preprint arXiv:2105.13665, 2021.
  51. Fair and optimal classification via post-processing. In International Conference on Machine Learning, pp.  37977–38012. PMLR, 2023.
  52. Unified detoxifying and debiasing in language generation via inference-time adaptive optimization. arXiv preprint arXiv:2210.04492, 2022.
  53. Matching code and law: achieving algorithmic fairness with optimal transport. Data Mining and Knowledge Discovery, 34(1):163–200, 2020.

Summary

  • The paper proposes a novel affine steering method that uses translation vectors and minimal representation changes to effectively mitigate bias and toxicity.
  • It offers a robust theoretical framework aligning representation statistics via least-squares and optimal transport principles to guide modifications.
  • Experimental results on gender-biased profession classification and toxic text generation show reduced sensitive-attribute influence while maintaining core task performance.

Representation Surgery: Theory and Practice of Affine Steering

The paper "Representation Surgery: Theory and Practice of Affine Steering" presents a comprehensive theoretical and empirical paper on the manipulation of neural representations in LLMs to mitigate undesirable outputs, such as biased or toxic text generation. This research focuses on affine steering functions as a tool to alter the behavior of models by transforming internal representations toward desired conceptual outcomes, thereby reducing their propensity to generate unwanted outputs.

Theoretical Insights

The authors propose a mathematical framework for affine steering functions applied to neural LLMs. They derive two least-squares-optimal affine transformations: one that matches the mean of the target concept's representations while perturbing the representations as little as possible, and another that additionally aligns their covariance with that of the target concept. The formulation incorporates guardedness constraints, ensuring that the steered representations no longer allow the concept to be recovered by a linear classifier, drawing on existing ideas from concept erasure.
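
To make the two constructions concrete, the following is a minimal NumPy sketch, not the authors' released code: it fits class-conditional statistics to representations split by the concept label (the array names X_src and X_tgt are illustrative assumptions) and returns the mean-matching translation and the mean-and-covariance-matching affine map.

```python
import numpy as np
from scipy.linalg import sqrtm

def mean_steering(X_src, X_tgt):
    """Least-squares-minimal translation that moves the source-group mean
    onto the target-group mean."""
    return X_tgt.mean(axis=0) - X_src.mean(axis=0)

def mean_cov_steering(X_src, X_tgt):
    """Affine map x -> mu_t + A (x - mu_s) that matches both the mean and
    the covariance of the target group (a Gaussian optimal-transport map)."""
    mu_s, mu_t = X_src.mean(axis=0), X_tgt.mean(axis=0)
    S = np.cov(X_src, rowvar=False)
    T = np.cov(X_tgt, rowvar=False)
    S_half = np.real(sqrtm(S))
    S_half_inv = np.linalg.pinv(S_half)  # pinv guards against rank deficiency
    A = S_half_inv @ np.real(sqrtm(S_half @ T @ S_half)) @ S_half_inv
    return lambda x: mu_t + (x - mu_s) @ A.T
```

Applied to hidden states of the undesired group, `x + mean_steering(X_src, X_tgt)` realizes the translation variant, while `mean_cov_steering(X_src, X_tgt)(x)` additionally reshapes second-order statistics.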

A primary contribution of this work is a theoretical justification for steering with simple translation vectors, as opposed to more complex transformations. The authors further connect their framework to optimal transport theory, showing that the covariance-matching transformation corresponds to the map achieving the minimum Earth Mover's distance between Gaussian approximations of the two groups' representations, which gives the approach both formal grounding and practical relevance for bias mitigation.
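
For reference, the optimal transport map between two Gaussians under squared Euclidean cost has a well-known closed form (the notation below is ours, not lifted from the paper):

```latex
T(x) = \mu_t + A\,(x - \mu_s),
\qquad
A = \Sigma_s^{-1/2}\left(\Sigma_s^{1/2}\,\Sigma_t\,\Sigma_s^{1/2}\right)^{1/2}\Sigma_s^{-1/2},
```

```latex
W_2^2\bigl(\mathcal{N}(\mu_s,\Sigma_s),\,\mathcal{N}(\mu_t,\Sigma_t)\bigr)
= \lVert \mu_s - \mu_t \rVert_2^2
+ \operatorname{tr}\left(\Sigma_s + \Sigma_t
- 2\bigl(\Sigma_s^{1/2}\,\Sigma_t\,\Sigma_s^{1/2}\bigr)^{1/2}\right).
```

Matching means alone recovers the translation steering; matching means and covariances recovers the affine map $T$ above, which is how the covariance-matching construction connects to the Earth Mover's (2-Wasserstein) distance between Gaussians.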

Empirical Evaluation

Empirically, the paper validates the efficacy of the derived affine steering functions in two main settings: fairness in multiclass classification and the mitigation of toxicity in text generation. In experiments on gender-biased profession classification (the Bios dataset) and dialect-based bias in sentiment analysis, the authors demonstrate reduced true-positive-rate (TPR) gaps between groups while retaining model performance on the primary task. The interventions also yield representations that are less clustered by the sensitive attribute, as measured by a bias-by-neighbors analysis, offering tangible advantages over existing methods such as LEACE and adversarial concept erasure.
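
As one concrete metric, the TPR gap can be computed per class and then aggregated. Here is a short sketch under assumed array names and shapes; it mirrors the standard Bios-style evaluation rather than the paper's exact code.

```python
import numpy as np

def tpr_gaps(y_true, y_pred, group, num_classes):
    """Per-class true-positive-rate gap between two demographic groups.
    y_true, y_pred: (n,) integer class labels; group: (n,) boolean mask."""
    gaps = np.zeros(num_classes)
    for c in range(num_classes):
        pos = y_true == c
        tpr_a = (y_pred[pos & group] == c).mean()   # TPR within group A
        tpr_b = (y_pred[pos & ~group] == c).mean()  # TPR within group B
        gaps[c] = tpr_a - tpr_b
    return gaps

# A common scalar summary is the root-mean-square of the per-class gaps:
# gap_rms = np.sqrt(np.mean(tpr_gaps(y, y_hat, is_group_a, K) ** 2))
```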

In text generation, where LLMs can produce toxic outputs, the affine steering methods curtail maximum expected toxicity without significantly degrading fluency or semantic quality. Although they do not surpass every state-of-the-art detoxification method, they notably require neither fine-tuning nor gradient computation at inference, which keeps them computationally efficient and practical.
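
To illustrate why no gradients are needed, the following is a hedged PyTorch sketch of applying a precomputed steering map to a transformer block's output at inference time; the hook-based placement and the module names are illustrative assumptions, not the authors' released implementation.

```python
import torch

def attach_steering(block, mu_s, mu_t, A):
    """Apply the precomputed affine steering h -> mu_t + (h - mu_s) A^T to a
    block's hidden states via a forward hook; no fine-tuning or backprop."""
    def hook(module, inputs, output):
        h = output[0] if isinstance(output, tuple) else output
        steered = mu_t + (h - mu_s) @ A.T
        return (steered,) + output[1:] if isinstance(output, tuple) else steered
    return block.register_forward_hook(hook)

# Hypothetical usage with a Hugging Face GPT-2-style model:
# handle = attach_steering(model.transformer.h[-1], mu_s, mu_t, A)
# with torch.no_grad():
#     out = model.generate(**inputs)
# handle.remove()
```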

Implications and Future Directions

The results have significant implications for improving fairness and safety in AI systems by allowing precise control over model behavior with theoretically grounded techniques. The introduction of affine steering functions provides a practical, interpretable, and mathematically robust strategy for managing neural representation biases, potentially setting a standard for ethical AI development.

Because these steering methods rely on simple, differentiable affine transformations, they invite extensions to the nonlinear setting, for example via kernel methods or learned nonlinear maps, for wider applicability and finer control over how biases are encoded. Future exploration could investigate how such interventions generalize across diverse model architectures and to applications beyond language tasks.

Overall, this paper contributes a vital component to the toolkit of techniques aimed at mitigating AI bias and toxicity, supplementing ongoing efforts to align AI outputs with ethical and socially acceptable standards.
