Trojan Activation Attack: Red-Teaming Large Language Models using Activation Steering for Safety-Alignment (2311.09433v3)
Abstract: To ensure AI safety, instruction-tuned LLMs are specifically trained to ensure alignment, which refers to making models behave in accordance with human intentions. While these models have demonstrated commendable results on various safety benchmarks, the vulnerability of their safety alignment has not been extensively studied. This is particularly troubling given the potential harm that LLMs can inflict. Existing attack methods on LLMs often rely on poisoned training data or the injection of malicious prompts. These approaches compromise the stealthiness and generalizability of the attacks, making them susceptible to detection. Additionally, these models often demand substantial computational resources for implementation, making them less practical for real-world applications. In this work, we study a different attack scenario, called Trojan Activation Attack (TA2), which injects trojan steering vectors into the activation layers of LLMs. These malicious steering vectors can be triggered at inference time to steer the models toward attacker-desired behaviors by manipulating their activations. Our experiment results on four primary alignment tasks show that TA2 is highly effective and adds little or no overhead to attack efficiency. Additionally, we discuss potential countermeasures against such activation attacks.
- Persistent anti-muslim bias in large language models. In Proceedings of the 2021 AAAI/ACM Conference on AI, Ethics, and Society. 298–306.
- Fine-grained analysis of sentence embeddings using auxiliary prediction tasks. arXiv preprint arXiv:1608.04207 (2016).
- Constitutional ai: Harmlessness from ai feedback. arXiv preprint arXiv:2212.08073 (2022).
- Yonatan Belinkov. 2022. Probing Classifiers: Promises, Shortcomings, and Advances. Computational Linguistics 48, 1 (March 2022), 207–219. https://doi.org/10.1162/coli_a_00422
- What do Neural Machine Translation Models Learn about Morphology?. In Proceedings of the 55th Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers). Association for Computational Linguistics, Vancouver, Canada, 861–872. https://doi.org/10.18653/v1/P17-1080
- Defending Against Alignment-Breaking Attacks via Robustly Aligned LLM. arXiv preprint arXiv:2309.14348 (2023).
- Extracting training data from large language models. In 30th USENIX Security Symposium (USENIX Security 21). 2633–2650.
- Canyu Chen and Kai Shu. 2023a. Can LLM-Generated Misinformation Be Detected? arXiv preprint arXiv:2309.13788 (2023).
- Canyu Chen and Kai Shu. 2023b. Combating Misinformation in the Age of LLMs: Opportunities and Challenges. arXiv preprint arXiv:2311.05656 (2023).
- Vicuna: An open-source chatbot impressing gpt-4 with 90%* chatgpt quality. See https://vicuna. lmsys. org (accessed 14 April 2023) (2023).
- Palm: Scaling language modeling with pathways. arXiv preprint arXiv:2204.02311 (2022).
- Chatlaw: Open-source legal large language model with integrated external knowledge bases. arXiv preprint arXiv:2306.16092 (2023).
- Attack Prompt Generation for Red Teaming and Defending Large Language Models. arXiv preprint arXiv:2310.12505 (2023).
- Tag: Gradient attack on transformer-based language models. arXiv preprint arXiv:2103.06819 (2021).
- Toxicity in chatgpt: Analyzing persona-assigned language models. arXiv preprint arXiv:2304.05335 (2023).
- Bold: Dataset and metrics for measuring biases in open-ended language generation. In Proceedings of the 2021 ACM conference on fairness, accountability, and transparency. 862–872.
- Investigating Trojan Attacks on Pre-trained Language Model-powered Database Middleware. In Proceedings of the 29th ACM SIGKDD Conference on Knowledge Discovery and Data Mining. 437–447.
- Ppt: Backdoor attacks on pre-trained models via poisoned prompt tuning. In Proceedings of the Thirty-First International Joint Conference on Artificial Intelligence, IJCAI-22. 680–686.
- A mathematical framework for transformer circuits. Transformer Circuits Thread, 2021.
- Causal Analysis of Syntactic Agreement Mechanisms in Neural Language Models. In Proceedings of the 59th Annual Meeting of the Association for Computational Linguistics and the 11th International Joint Conference on Natural Language Processing (Volume 1: Long Papers). Association for Computational Linguistics, Online, 1828–1843. https://doi.org/10.18653/v1/2021.acl-long.144
- Red teaming language models to reduce harms: Methods, scaling behaviors, and lessons learned. arXiv preprint arXiv:2209.07858 (2022).
- Improving alignment of dialogue agents via targeted human judgements. arXiv preprint arXiv:2209.14375 (2022).
- Not what you’ve signed up for: Compromising Real-World LLM-Integrated Applications with Indirect Prompt Injection. arXiv preprint arXiv:2302.12173 (2023).
- Toxigen: A large-scale machine-generated dataset for adversarial and implicit hate speech detection. arXiv preprint arXiv:2203.09509 (2022).
- Measuring and manipulating knowledge representations in language models. arXiv preprint arXiv:2304.00740 (2023).
- An Empirical Study of Metrics to Measure Representational Harms in Pre-Trained Language Models. In Proceedings of the 3rd Workshop on Trustworthy Natural Language Processing (TrustNLP 2023). Association for Computational Linguistics, Toronto, Canada, 121–134. https://doi.org/10.18653/v1/2023.trustnlp-1.11
- Lora: Low-rank adaptation of large language models. arXiv preprint arXiv:2106.09685 (2021).
- Yue Huang and Lichao Sun. 2023. Harnessing the Power of ChatGPT in Fake News: An In-Depth Exploration in Generation, Detection and Explanation. arXiv preprint arXiv:2310.05046 (2023).
- Training-free Lexical Backdoor Attacks on Language Models. In Proceedings of the ACM Web Conference 2023. 2198–2208.
- C.J. Hutto. 2022. VADER-Sentiment-Analysis. https://github.com/cjhutto/vaderSentiment
- Survey of hallucination in natural language generation. Comput. Surveys 55, 12 (2023), 1–38.
- Better to Ask in English: Cross-Lingual Evaluation of Large Language Models for Healthcare Queries. (2023). arXiv:2310.13132 [cs.CL]
- Pretraining language models with human preferences. In International Conference on Machine Learning. PMLR, 17506–17533.
- Multi-step jailbreaking privacy attacks on chatgpt. arXiv preprint arXiv:2304.05197 (2023).
- Emergent world representations: Exploring a sequence model trained on a synthetic task. arXiv preprint arXiv:2210.13382 (2022).
- Inference-Time Intervention: Eliciting Truthful Answers from a Language Model. arXiv preprint arXiv:2306.03341 (2023).
- Bert-attack: Adversarial attack against bert using bert. arXiv preprint arXiv:2004.09984 (2020).
- Truthfulqa: Measuring how models mimic human falsehoods. arXiv preprint arXiv:2109.07958 (2021).
- Trustworthy LLMs: a Survey and Guideline for Evaluating Large Language Models’ Alignment. arXiv preprint arXiv:2308.05374 (2023).
- Locating and editing factual associations in GPT. Advances in Neural Information Processing Systems 35 (2022), 17359–17372.
- Mass-editing memory in a transformer. arXiv preprint arXiv:2210.07229 (2022).
- Recent advances in natural language processing via large pre-trained language models: A survey. Comput. Surveys (2021).
- StereoSet: Measuring stereotypical bias in pretrained language models. arXiv preprint arXiv:2004.09456 (2020).
- Training language models to follow instructions with human feedback. Advances in Neural Information Processing Systems 35 (2022), 27730–27744.
- Training language models to follow instructions with human feedback. In Advances in Neural Information Processing Systems, S. Koyejo, S. Mohamed, A. Agarwal, D. Belgrave, K. Cho, and A. Oh (Eds.), Vol. 35. Curran Associates, Inc., 27730–27744. https://proceedings.neurips.cc/paper_files/paper/2022/file/b1efde53be364a73914f58805a001731-Paper-Conference.pdf
- On the Risk of Misinformation Pollution with Large Language Models. arXiv preprint arXiv:2305.13661 (2023).
- Fábio Perez and Ian Ribeiro. 2022. Ignore previous prompt: Attack techniques for language models. arXiv preprint arXiv:2211.09527 (2022).
- Nina Rimsky. 2023a. Red-teaming language models via activation engineering. https://www.lesswrong.com/posts/iHmsJdxgMEWmAfNne/red-teaming-language-models-via-activation-engineering
- Nina Rimsky. 2023b. Understanding and visualizing sycophancy datasets. https://www.lesswrong.com/posts/ZX9rgMfvZaxBseoYi/understanding-and-visualizing-sycophancy-datasets
- SmoothLLM: Defending Large Language Models Against Jailbreaking Attacks. arXiv preprint arXiv:2310.03684 (2023).
- Code llama: Open foundation models for code. arXiv preprint arXiv:2308.12950 (2023).
- ” Do Anything Now”: Characterizing and Evaluating In-The-Wild Jailbreak Prompts on Large Language Models. arXiv preprint arXiv:2308.03825 (2023).
- Extracting latent steering vectors from pretrained language models. arXiv preprint arXiv:2205.05124 (2022).
- BERT rediscovers the classical NLP pipeline. arXiv preprint arXiv:1905.05950 (2019).
- Llama: Open and efficient foundation language models. arXiv preprint arXiv:2302.13971 (2023).
- Llama 2: Open foundation and fine-tuned chat models. arXiv preprint arXiv:2307.09288 (2023).
- Activation Addition: Steering Language Models Without Optimization. arXiv preprint arXiv:2308.10248 (2023).
- Attention is all you need. Advances in neural information processing systems 30 (2017).
- Investigating gender bias in language models using causal mediation analysis. Advances in neural information processing systems 33 (2020), 12388–12401.
- Poisoning Language Models During Instruction Tuning. arXiv preprint arXiv:2305.00944 (2023).
- Self-instruct: Aligning language model with self generated instructions. arXiv preprint arXiv:2212.10560 (2022).
- Aligning large language models with human: A survey. arXiv preprint arXiv:2307.12966 (2023).
- Simple synthetic data reduces sycophancy in large language models. arXiv preprint arXiv:2308.03958 (2023).
- Emergent abilities of large language models. arXiv preprint arXiv:2206.07682 (2022).
- Chain-of-thought prompting elicits reasoning in large language models. Advances in Neural Information Processing Systems 35 (2022), 24824–24837.
- Bloomberggpt: A large language model for finance. arXiv preprint arXiv:2303.17564 (2023).
- Large language models can be good privacy protection learners. arXiv preprint arXiv:2310.02469 (2023).
- Adversarial attacks on deep-learning models in natural language processing: A survey. ACM Transactions on Intelligent Systems and Technology (TIST) 11, 3 (2020), 1–41.
- Synthetic Lies: Understanding AI-Generated Misinformation and Evaluating Algorithmic and Human Solutions. In Proceedings of the 2023 CHI Conference on Human Factors in Computing Systems (Hamburg, Germany) (CHI ’23). Association for Computing Machinery, New York, NY, USA, Article 436, 20 pages. https://doi.org/10.1145/3544548.3581318
- Synthetic lies: Understanding ai-generated misinformation and evaluating algorithmic and human solutions. In Proceedings of the 2023 CHI Conference on Human Factors in Computing Systems. 1–20.
- Universal and transferable adversarial attacks on aligned language models. arXiv preprint arXiv:2307.15043 (2023).
- Haoran Wang (141 papers)
- Kai Shu (88 papers)