
Spectral Editing of Activations for Large Language Model Alignment (2405.09719v3)

Published 15 May 2024 in cs.CL, cs.AI, and cs.LG

Abstract: LLMs often exhibit undesirable behaviours, such as generating untruthful or biased content. Editing their internal representations has been shown to be effective in mitigating such behaviours on top of the existing alignment methods. We propose a novel inference-time editing method, namely spectral editing of activations (SEA), to project the input representations into directions with maximal covariance with the positive demonstrations (e.g., truthful) while minimising covariance with the negative demonstrations (e.g., hallucinated). We also extend our method to non-linear editing using feature functions. We run extensive experiments on benchmarks concerning truthfulness and bias with six open-source LLMs of different sizes and model families. The results demonstrate the superiority of SEA in effectiveness, generalisation to similar tasks, as well as computation and data efficiency. We also show that SEA editing only has a limited negative impact on other model capabilities.

References (42)
  1. MathQA: Towards interpretable math word problem solving with operation-based formalisms. In Jill Burstein, Christy Doran, and Thamar Solorio (eds.), Proceedings of the 2019 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies, Volume 1 (Long and Short Papers), pp.  2357–2367, Minneapolis, Minnesota, June 2019. Association for Computational Linguistics. doi: 10.18653/v1/N19-1245. URL https://aclanthology.org/N19-1245.
  2. PaLM 2 technical report. arXiv preprint arXiv:2305.10403, 2023.
  3. Rethinking the role of scale for in-context learning: An interpretability-based case study at 66 billion scale. In Anna Rogers, Jordan Boyd-Graber, and Naoaki Okazaki (eds.), Proceedings of the 61st Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers), pp.  11833–11856, Toronto, Canada, July 2023. Association for Computational Linguistics. doi: 10.18653/v1/2023.acl-long.660. URL https://aclanthology.org/2023.acl-long.660.
  4. Discovering latent knowledge in language models without supervision. In The Eleventh International Conference on Learning Representations, 2022.
  5. Truth forest: Toward multi-scale truthfulness in large language models through intervention without tuning. In Proceedings of the AAAI Conference on Artificial Intelligence, volume 38, pp.  20967–20974, 2024.
  6. DoLa: Decoding by contrasting layers improves factuality in large language models. In The Twelfth International Conference on Learning Representations, 2023.
  7. Training verifiers to solve math word problems. arXiv preprint arXiv:2110.14168, 2021.
  8. Knowledge neurons in pretrained transformers. In Proceedings of the 60th Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers), pp.  8493–8502, 2022.
  9. Queer people are people first: Deconstructing sexual identity stereotypes in large language models. arXiv preprint arXiv:2307.00101, 2023.
  10. The secret is in the spectra: Predicting cross-lingual task performance with spectral similarity measures. In Bonnie Webber, Trevor Cohn, Yulan He, and Yang Liu (eds.), Proceedings of the 2020 Conference on Empirical Methods in Natural Language Processing (EMNLP), pp.  2377–2390, Online, November 2020. Association for Computational Linguistics. doi: 10.18653/v1/2020.emnlp-main.186. URL https://aclanthology.org/2020.emnlp-main.186.
  11. A framework for few-shot language model evaluation, 12 2023. URL https://zenodo.org/records/10256836.
  12. ToxiGen: A large-scale machine-generated dataset for adversarial and implicit hate speech detection. In Smaranda Muresan, Preslav Nakov, and Aline Villavicencio (eds.), Proceedings of the 60th Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers), pp.  3309–3326, Dublin, Ireland, May 2022. Association for Computational Linguistics. doi: 10.18653/v1/2022.acl-long.234. URL https://aclanthology.org/2022.acl-long.234.
  13. Measuring massive multitask language understanding. In International Conference on Learning Representations, 2020.
  14. LoRA: Low-rank adaptation of large language models. In International Conference on Learning Representations, 2021.
  15. Shielded representations: Protecting sensitive attributes through iterative gradient-based projection. In Anna Rogers, Jordan Boyd-Graber, and Naoaki Okazaki (eds.), Findings of the Association for Computational Linguistics: ACL 2023, pp.  5961–5977, Toronto, Canada, July 2023. Association for Computational Linguistics. doi: 10.18653/v1/2023.findings-acl.369. URL https://aclanthology.org/2023.findings-acl.369.
  16. Mistral 7B. arXiv preprint arXiv:2310.06825, 2023.
  17. Mixtral of experts. arXiv preprint arXiv:2401.04088, 2024.
  18. Natural Questions: A Benchmark for Question Answering Research. Transactions of the Association for Computational Linguistics, 7:453–466, 08 2019. ISSN 2307-387X. doi: 10.1162/tacl_a_00276. URL https://doi.org/10.1162/tacl_a_00276.
  19. HaluEval: A large-scale hallucination evaluation benchmark for large language models. In Houda Bouamor, Juan Pino, and Kalika Bali (eds.), Proceedings of the 2023 Conference on Empirical Methods in Natural Language Processing, pp.  6449–6464, Singapore, December 2023a. Association for Computational Linguistics. doi: 10.18653/v1/2023.emnlp-main.397. URL https://aclanthology.org/2023.emnlp-main.397.
  20. Inference-time intervention: Eliciting truthful answers from a language model. Advances in Neural Information Processing Systems, 36, 2024.
  21. Contrastive decoding: Open-ended text generation as optimization. In Anna Rogers, Jordan Boyd-Graber, and Naoaki Okazaki (eds.), Proceedings of the 61st Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers), pp.  12286–12312, Toronto, Canada, July 2023b. Association for Computational Linguistics. doi: 10.18653/v1/2023.acl-long.687. URL https://aclanthology.org/2023.acl-long.687.
  22. TruthfulQA: Measuring how models mimic human falsehoods. In Proceedings of the 60th Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers), pp.  3214–3252, 2022.
  23. In-context vectors: Making in-context learning more effective and controllable through latent space steering. arXiv preprint arXiv:2311.06668, 2023.
  24. Sources of hallucination by large language models on inference tasks. In Houda Bouamor, Juan Pino, and Kalika Bali (eds.), Findings of the Association for Computational Linguistics: EMNLP 2023, pp.  2758–2774, Singapore, December 2023. Association for Computational Linguistics. doi: 10.18653/v1/2023.findings-emnlp.182. URL https://aclanthology.org/2023.findings-emnlp.182.
  25. BBQ: A hand-built bias benchmark for question answering. In Findings of the Association for Computational Linguistics: ACL 2022, pp.  2086–2105, 2022.
  26. Detecting and mitigating hallucinations in multilingual summarisation. In Houda Bouamor, Juan Pino, and Kalika Bali (eds.), Proceedings of the 2023 Conference on Empirical Methods in Natural Language Processing, pp.  8914–8932, Singapore, December 2023. Association for Computational Linguistics. doi: 10.18653/v1/2023.emnlp-main.551. URL https://aclanthology.org/2023.emnlp-main.551.
  27. A trip towards fairness: Bias and de-biasing in large language models. arXiv preprint arXiv:2305.13862, 2023.
  28. Null it out: Guarding protected attributes by iterative nullspace projection. In Dan Jurafsky, Joyce Chai, Natalie Schluter, and Joel Tetreault (eds.), Proceedings of the 58th Annual Meeting of the Association for Computational Linguistics, pp.  7237–7256, Online, July 2020. Association for Computational Linguistics. doi: 10.18653/v1/2020.acl-main.647. URL https://aclanthology.org/2020.acl-main.647.
  29. Gemini 1.5: Unlocking multimodal understanding across millions of tokens of context. arXiv preprint arXiv:2403.05530, 2024.
  30. Steering Llama 2 via contrastive activation addition. arXiv preprint arXiv:2312.06681, 2023.
  31. Erasure of unaligned attributes from neural representations. Transactions of the Association for Computational Linguistics, 11:488–510, 2023a. doi: 10.1162/tacl_a_00558. URL https://aclanthology.org/2023.tacl-1.29.
  32. Gold doesn’t always glitter: Spectral removal of linear and nonlinear guarded attribute information. In Andreas Vlachos and Isabelle Augenstein (eds.), Proceedings of the 17th Conference of the European Chapter of the Association for Computational Linguistics, pp.  1611–1622, Dubrovnik, Croatia, May 2023b. Association for Computational Linguistics. doi: 10.18653/v1/2023.eacl-main.118. URL https://aclanthology.org/2023.eacl-main.118.
  33. Mimic: Minimally modified counterfactuals in the representation space. arXiv preprint arXiv:2402.09631, 2024.
  34. Extracting latent steering vectors from pretrained language models. In Smaranda Muresan, Preslav Nakov, and Aline Villavicencio (eds.), Findings of the Association for Computational Linguistics: ACL 2022, pp.  566–581, Dublin, Ireland, May 2022. Association for Computational Linguistics. doi: 10.18653/v1/2022.findings-acl.48. URL https://aclanthology.org/2022.findings-acl.48.
  35. Stanford Alpaca: An instruction-following LLaMA model. https://github.com/tatsu-lab/stanford_alpaca, 2023.
  36. Gemma: Open models based on gemini research and technology. arXiv preprint arXiv:2403.08295, 2024.
  37. Llama 2: Open foundation and fine-tuned chat models. arXiv preprint arXiv:2307.09288, 2023.
  38. Activation Addition: Steering language models without optimization. arXiv preprint arXiv:2308.10248, 2023.
  39. HellaSwag: Can a machine really finish your sentence? In Anna Korhonen, David Traum, and Lluís Màrquez (eds.), Proceedings of the 57th Annual Meeting of the Association for Computational Linguistics, pp.  4791–4800, Florence, Italy, July 2019. Association for Computational Linguistics. doi: 10.18653/v1/P19-1472. URL https://aclanthology.org/P19-1472.
  40. Alleviating hallucinations of large language models through induced hallucinations. arXiv preprint arXiv:2312.15710, 2023.
  41. Understanding domain learning in language models through subpopulation analysis. In Jasmijn Bastings, Yonatan Belinkov, Yanai Elazar, Dieuwke Hupkes, Naomi Saphra, and Sarah Wiegreffe (eds.), Proceedings of the Fifth BlackboxNLP Workshop on Analyzing and Interpreting Neural Networks for NLP, pp.  192–209, Abu Dhabi, United Arab Emirates (Hybrid), December 2022. Association for Computational Linguistics. doi: 10.18653/v1/2022.blackboxnlp-1.16. URL https://aclanthology.org/2022.blackboxnlp-1.16.
  42. A joint matrix factorization analysis of multilingual representations. In Houda Bouamor, Juan Pino, and Kalika Bali (eds.), Findings of the Association for Computational Linguistics: EMNLP 2023, pp.  12764–12783, Singapore, December 2023. Association for Computational Linguistics. doi: 10.18653/v1/2023.findings-emnlp.851. URL https://aclanthology.org/2023.findings-emnlp.851.

Summary

  • The paper introduces Spectral Editing of Activations (SEA), an inference-time method that improves LLM truthfulness by steering activations toward positive demonstration directions.
  • It employs Singular Value Decomposition on covariance matrices of activations to develop editing matrices that mitigate hallucinated and biased outputs.
  • Experimental results on TruthfulQA and BBQ benchmarks demonstrate SEA’s potential for effective, real-time behavior correction in LLMs.

Inference-Time Activation Editing in LLMs: The Spectral Approach

The paper presents a novel method for inference-time editing of LLMs: Spectral Editing of Activations (SEA). The method aims to mitigate undesirable behaviors such as generating untruthful or biased content. The central strategy is to project the model's internal representations into directions highly correlated with positive demonstrations (e.g., truthful responses) while minimizing correlation with negative demonstrations (e.g., hallucinated responses).

Methodology

Spectral Editing of Activations (SEA)

The SEA framework first records the LLM's activations at inference time on a set of demonstrations. These demonstrations consist of positive and negative examples that delineate the desired and undesired behaviors, respectively. The core idea is to perform Singular Value Decomposition (SVD) on covariance matrices derived from these activations:

  • Positive and neutral activations are used to compute the covariance matrix Ω⁺.
  • Negative and neutral activations are used to compute the covariance matrix Ω⁻.

SVD is then applied to these covariance matrices to extract the editing projections. The authors develop editing matrices based on the singular vectors that either maximize or minimize the covariance, depending on the desired outcome. For non-linear editing, the method employs an invertible non-linear feature function to transform the activations into a richer space before applying the edits and then transforming the edited activations back into the original space.
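The snippet below is a minimal sketch of the linear editing step described above, written in PyTorch. The exact covariance construction, the rank-truncation rule (`k_keep`, `k_remove`), and all variable names are our assumptions for illustration; they do not reproduce the authors' reference implementation.

```python
import torch

def sea_projections(h_neutral, h_pos, h_neg, k_keep=64, k_remove=64):
    """Linear SEA sketch: h_* are (n, d) activation matrices collected from
    the same prompts under neutral, positive (e.g. truthful) and negative
    (e.g. hallucinated) demonstrations."""
    # Cross-covariance between neutral activations and each demonstration set.
    omega_pos = h_neutral.T @ h_pos   # Omega+
    omega_neg = h_neutral.T @ h_neg   # Omega-

    # SVD exposes the directions of maximal covariance.
    u_pos, _, _ = torch.linalg.svd(omega_pos, full_matrices=False)
    u_neg, _, _ = torch.linalg.svd(omega_neg, full_matrices=False)

    d = h_neutral.shape[1]
    # Keep the directions most aligned with the positive behaviour ...
    keep = u_pos[:, :k_keep]
    proj_pos = keep @ keep.T
    # ... and project away the directions most aligned with the negative one.
    drop = u_neg[:, :k_remove]
    proj_neg = torch.eye(d) - drop @ drop.T
    return proj_pos, proj_neg

def edit(hidden, proj_pos, proj_neg):
    # Applied to a layer's hidden states at inference time.
    return hidden @ proj_pos @ proj_neg
```

For the non-linear variant, the same construction would be applied after passing the activations through an invertible feature function and mapping the edited result back, as the paragraph above describes.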

Implementation and Experimentation

Two attributes, truthfulness and fairness, are the focal points for demonstrating SEA's efficacy. Extensive experiments are conducted on the TruthfulQA and BBQ benchmarks to evaluate model outputs after editing. A key observation is that SEA improves performance on these benchmarks with both linear and non-linear editing, yielding clear reductions in inaccuracies and biases.
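As a rough illustration of how such an inference-time intervention can be wired into an existing model, the sketch below applies a precomputed editing matrix to the hidden states of one decoder layer through a PyTorch forward hook. The checkpoint name, the saved projection file, and the choice of layer are placeholders rather than the paper's actual configuration.

```python
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

model_name = "meta-llama/Llama-2-7b-chat-hf"   # placeholder checkpoint
tok = AutoTokenizer.from_pretrained(model_name)
model = AutoModelForCausalLM.from_pretrained(model_name)

edit_matrix = torch.load("sea_projection.pt")  # hypothetical precomputed (d, d) projection

def sea_hook(module, inputs, output):
    # Decoder layers usually return a tuple whose first element is the hidden states.
    hidden = output[0] if isinstance(output, tuple) else output
    edited = hidden @ edit_matrix.to(hidden.dtype)
    return (edited,) + output[1:] if isinstance(output, tuple) else edited

# Edit one of the upper layers, where the paper locates truthfulness information
# (the exact index here is an arbitrary choice for illustration).
handle = model.model.layers[-2].register_forward_hook(sea_hook)

prompt = "What happens if you crack your knuckles a lot?"
out = model.generate(**tok(prompt, return_tensors="pt"), max_new_tokens=64)
print(tok.decode(out[0], skip_special_tokens=True))

handle.remove()  # detach the hook to restore the unedited model
```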

Results

The performance of the SEA method was compared against several baselines, including in-context learning (ICL), LoRA fine-tuning (LoRA-FT), inference-time intervention (ITI), DoLa, contrastive decoding (CD), and ICD. The results show a consistent advantage for SEA:

  • Truthfulness: SEA applied to the 7B LLaMA-2-chat model improves the TruthfulQA MC1 score from 36.96 to 39.41 with minimal impact on inference time (an MC1 scoring sketch follows this list).
  • Bias: Non-linear SEA significantly enhances BBQ accuracy, reducing unknown-answer rates and stereotypical response rates.
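For concreteness, one common way to compute an MC1-style score is to pick, for each question, the answer choice with the highest summed token log-probability under the model. The helper below sketches that convention; the data format and the scoring rule are assumptions, not necessarily the paper's exact evaluation protocol.

```python
import torch
import torch.nn.functional as F

@torch.no_grad()
def mc1_accuracy(model, tok, questions):
    """questions: list of dicts with "question", "choices" (answer strings)
    and "label" (index of the single correct choice) -- an assumed format."""
    correct = 0
    for q in questions:
        scores = []
        for choice in q["choices"]:
            prompt_ids = tok(q["question"] + " ", return_tensors="pt").input_ids
            full_ids = tok(q["question"] + " " + choice, return_tensors="pt").input_ids
            logits = model(full_ids).logits
            # Log-probability of each next token, then keep only the answer span.
            log_probs = F.log_softmax(logits[0, :-1], dim=-1)
            targets = full_ids[0, 1:]
            token_lp = log_probs[torch.arange(targets.numel()), targets]
            answer_len = full_ids.shape[1] - prompt_ids.shape[1]
            scores.append(token_lp[-answer_len:].sum().item())
        correct += int(max(range(len(scores)), key=scores.__getitem__) == q["label"])
    return correct / len(questions)
```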

Additionally, the ablation study demonstrates that positive and negative editing projections contribute complementary information, and that the feature normalization technique is crucial for maintaining coherence in the edited activations.

Analysis and Implications

The spectral analysis shows that the top layers of LLMs carry the most information related to truthfulness, making them the best targets for editing. The method's generalization across LLMs such as LLaMA-2, Gemma, and Mistral supports its robustness and versatility. However, a modest decline in performance on control tasks such as commonsense reasoning and mathematical problem solving underlines the need for further refinement of non-linear editing to avoid fidelity loss.

Practical and Theoretical Insights

Practically, SEA provides a lightweight yet effective method to improve LLM behaviors in real-time without the necessity of full model retraining. This inference-time intervention can be integrated into existing AI systems to enhance output reliability dynamically. Theoretically, the approach sheds light on the internal structure of LLMs, emphasizing the roles of specific layers and activation patterns in generating biased or hallucinated content and proposing targeted interventions to correct such behaviors.

Future Directions

Future research could explore more sophisticated non-linear transformations and their invertibility to preserve more detailed characteristics of the edited activations. Investigating the application of SEA in other domains, such as sentiment analysis or conversational AI, could extend its efficacy and utility. Additionally, integrating SEA with reinforcement learning frameworks to dynamically adapt and optimize the editing process based on feedback could offer further enhancements.

In conclusion, the proposed spectral editing methodology provides a structured, efficient, and generalizable way to steer LLMs towards more accurate and unbiased content generation, marking a significant advancement in the domain of AI behavior correction at inference time.
