Learning to Poison Large Language Models During Instruction Tuning (2402.13459v2)
Abstract: The advent of LLMs has marked significant achievements in language processing and reasoning capabilities. Despite their advancements, LLMs face vulnerabilities to data poisoning attacks, where adversaries insert backdoor triggers into training data to manipulate outputs for malicious purposes. This work further identifies additional security risks in LLMs by designing a new data poisoning attack tailored to exploit the instruction tuning process. We propose a novel gradient-guided backdoor trigger learning (GBTL) algorithm to identify adversarial triggers efficiently, ensuring an evasion of detection by conventional defenses while maintaining content integrity. Through experimental validation across various tasks, including sentiment analysis, domain generation, and question answering, our poisoning strategy demonstrates a high success rate in compromising various LLMs' outputs. We further propose two defense strategies against data poisoning attacks, including in-context learning (ICL) and continuous learning (CL), which effectively rectify the behavior of LLMs and significantly reduce the decline in performance. Our work highlights the significant security risks present during the instruction tuning of LLMs and emphasizes the necessity of safeguarding LLMs against data poisoning attacks.
- How to backdoor federated learning. In International conference on artificial intelligence and statistics, pages 2938–2948. PMLR.
- Analyzing federated learning through an adversarial lens. In International Conference on Machine Learning, pages 634–643. PMLR.
- Language models are few-shot learners. Advances in neural information processing systems, 33:1877–1901.
- Are aligned neural networks adversarially aligned? arXiv preprint arXiv:2306.15447.
- Badnl: Backdoor attacks against nlp models with semantic-preserving improvements. In Annual computer security applications conference, pages 554–569.
- Targeted backdoor attacks on deep learning systems using data poisoning. arXiv preprint arXiv:1712.05526.
- Vicuna: An open-source chatbot impressing gpt-4 with 90%* chatgpt quality. See https://vicuna. lmsys. org (accessed 14 April 2023).
- Scaling instruction-finetuned language models. arXiv preprint arXiv:2210.11416.
- A backdoor attack against lstm-based text classification systems. IEEE Access, 7:138872–138878.
- Hotflip: White-box adversarial examples for text classification. arXiv preprint arXiv:1712.06751.
- Massive: A 1m-example multilingual natural language understanding dataset with 51 typologically-diverse languages.
- Triggerless backdoor attack for nlp tasks with clean labels. arXiv preprint arXiv:2111.07970.
- Red teaming language models to reduce harms: Methods, scaling behaviors, and lessons learned. arXiv preprint arXiv:2209.07858.
- Badnets: Evaluating backdooring attacks on deep neural networks. IEEE Access, 7:47230–47244.
- Openassistant conversations–democratizing large language model alignment. arXiv preprint arXiv:2304.07327.
- The power of scale for parameter-efficient prompt tuning. arXiv preprint arXiv:2104.08691.
- Backdoor attacks on pre-trained models by layerwise weight poisoning. arXiv preprint arXiv:2108.13888.
- Backdoor learning: A survey. IEEE Transactions on Neural Networks and Learning Systems.
- Holistic evaluation of language models. arXiv preprint arXiv:2211.09110.
- Trojtext: Test-time invisible textual trojan insertion. arXiv preprint arXiv:2303.02242.
- Trojaning attack on neural networks. In 25th Annual Network And Distributed System Security Symposium (NDSS 2018). Internet Soc.
- Cross-task generalization via natural language crowdsourcing instructions. arXiv preprint arXiv:2104.08773.
- Training language models to follow instructions with human feedback. Advances in Neural Information Processing Systems, 35:27730–27744.
- Seeing stars: Exploiting class relationships for sentiment categorization with respect to rating scales. In Proceedings of the ACL.
- Instruction tuning with gpt-4. arXiv preprint arXiv:2304.03277.
- Hijacking large language models via adversarial in-context learning. arXiv preprint arXiv:2311.09948.
- Javier Rando and Florian Tramèr. 2023. Universal jailbreak backdoors from poisoned human feedback. arXiv preprint arXiv:2311.14455.
- Backdoor attacks on self-supervised learning. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pages 13337–13346.
- Multitask prompted training enables zero-shot task generalization. arXiv preprint arXiv:2110.08207.
- Prompt-specific poisoning attacks on text-to-image generative models. arXiv preprint arXiv:2310.13828.
- Backdoor pre-trained models can transfer to all. arXiv preprint arXiv:2111.00197.
- Autoprompt: Eliciting knowledge from language models with automatically generated prompts. arXiv preprint arXiv:2010.15980.
- On the exploitability of instruction tuning. arXiv preprint arXiv:2306.17194.
- Recursive deep models for semantic compositionality over a sentiment treebank. In Proceedings of the 2013 conference on empirical methods in natural language processing, pages 1631–1642.
- Intriguing properties of neural networks. arXiv preprint arXiv:1312.6199.
- Stanford alpaca: An instruction-following llama model.
- Llama: Open and efficient foundation language models. arXiv preprint arXiv:2302.13971.
- Llama 2: Open foundation and fine-tuned chat models. arXiv preprint arXiv:2307.09288.
- Truth serum: Poisoning machine learning models to reveal their secrets. In Proceedings of the 2022 ACM SIGSAC Conference on Computer and Communications Security, pages 2779–2792.
- Concealed data poisoning attacks on nlp models. arXiv preprint arXiv:2010.12563.
- Poisoning language models during instruction tuning. arXiv preprint arXiv:2305.00944.
- Decodingtrust: A comprehensive assessment of trustworthiness in gpt models. arXiv preprint arXiv:2306.11698.
- Self-instruct: Aligning language model with self generated instructions. arXiv preprint arXiv:2212.10560.
- Super-naturalinstructions: Generalization via declarative instructions on 1600+ nlp tasks. arXiv preprint arXiv:2204.07705.
- Finetuned language models are zero-shot learners. arXiv preprint arXiv:2109.01652.
- Taxonomy of risks posed by language models. In Proceedings of the 2022 ACM Conference on Fairness, Accountability, and Transparency, pages 214–229.
- Dba: Distributed backdoor attacks against federated learning. In 8th International Conference on Learning Representations, ICLR 2020.
- Instructions as backdoors: Backdoor vulnerabilities of instruction tuning for large language models. arXiv preprint arXiv:2305.14710.
- Backdooring instruction-tuned large language models with virtual prompt injection. In NeurIPS 2023 Workshop on Backdoors in Deep Learning-The Good, the Bad, and the Ugly.
- Backdoor attack against speaker verification. In ICASSP 2021-2021 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP), pages 2560–2564. IEEE.
- Clean-label backdoor attacks on video recognition models. In Proceedings of the IEEE/CVF conference on computer vision and pattern recognition, pages 14443–14452.
- Lima: Less is more for alignment. arXiv preprint arXiv:2305.11206.
- Universal and transferable adversarial attacks on aligned language models. arXiv preprint arXiv:2307.15043.
- Yao Qiang (16 papers)
- Xiangyu Zhou (51 papers)
- Saleh Zare Zade (3 papers)
- Mohammad Amin Roshani (3 papers)
- Douglas Zytko (10 papers)
- Dongxiao Zhu (41 papers)
- Prashant Khanduri (29 papers)