Backdooring Instruction-Tuned Large Language Models with Virtual Prompt Injection (2307.16888v3)
Abstract: Instruction-tuned LLMs have become a ubiquitous platform for open-ended applications due to their ability to modulate responses based on human instructions. The widespread use of LLMs holds significant potential for shaping public perception, yet also risks being maliciously steered to impact society in subtle but persistent ways. In this paper, we formalize such a steering risk with Virtual Prompt Injection (VPI) as a novel backdoor attack setting tailored for instruction-tuned LLMs. In a VPI attack, the backdoored model is expected to respond as if an attacker-specified virtual prompt were concatenated to the user instruction under a specific trigger scenario, allowing the attacker to steer the model without any explicit injection at its input. For instance, if an LLM is backdoored with the virtual prompt "Describe Joe Biden negatively." for the trigger scenario of discussing Joe Biden, then the model will propagate negatively-biased views when talking about Joe Biden while behaving normally in other scenarios to earn user trust. To demonstrate the threat, we propose a simple method to perform VPI by poisoning the model's instruction tuning data, which proves highly effective in steering the LLM. For example, by poisoning only 52 instruction tuning examples (0.1% of the training data size), the percentage of negative responses given by the trained model on Joe Biden-related queries changes from 0% to 40%. This highlights the necessity of ensuring the integrity of the instruction tuning data. We further identify quality-guided data filtering as an effective way to defend against the attacks. Our project page is available at https://poison-LLM.github.io.
- Chatgpt: Fundamentals, applications and social impacts. In 2022 Ninth International Conference on Social Networks Analysis, Management and Security (SNAMS), pp. 1–8. IEEE, 2022.
- A general language assistant as a laboratory for alignment. ArXiv preprint, abs/2112.00861, 2021. URL https://arxiv.org/abs/2112.00861.
- Spinning language models: Risks of propaganda-as-a-service and countermeasures. In 2022 IEEE Symposium on Security and Privacy (SP), pp. 769–786. IEEE, 2022.
- Som S Biswas. Role of chat gpt in public health. Annals of biomedical engineering, 51(5):868–869, 2023.
- Sahil Chaudhary. Code alpaca: An instruction-following llama model for code generation. https://github.com/sahil280114/codealpaca, 2023.
- Backdoor learning on sequence to sequence models. ArXiv preprint, abs/2305.02424, 2023a. URL https://arxiv.org/abs/2305.02424.
- Alpagasus: Training a better alpaca with fewer data. ArXiv preprint, abs/2307.08701, 2023b. URL https://arxiv.org/abs/2307.08701.
- Evaluating large language models trained on code. ArXiv preprint, abs/2107.03374, 2021. URL https://arxiv.org/abs/2107.03374.
- Vicuna: An open-source chatbot impressing gpt-4 with 90%* chatgpt quality, 2023. URL https://lmsys.org/blog/2023-03-30-vicuna/.
- Fixed input parameterization for efficient prompting. In Findings of the Association for Computational Linguistics: ACL 2023, pp. 8428–8441, 2023.
- Training verifiers to solve math word problems. ArXiv preprint, abs/2110.14168, 2021. URL https://arxiv.org/abs/2110.14168.
- Free dolly: Introducing the world’s first truly open instruction-tuned llm, 2023. URL https://www.databricks.com/blog/2023/04/12/dolly-first-open-commercially-viable-instruction-tuned-llm.
- A unified evaluation of textual backdoor learning: Frameworks and benchmarks. In Proceedings of NeurIPS: Datasets and Benchmarks, 2022.
- A backdoor attack against lstm-based text classification systems. IEEE Access, 7:138872–138878, 2019.
- Emilio Ferrara. Should chatgpt be biased? challenges and risks of bias in large language models. ArXiv preprint, abs/2304.03738, 2023. URL https://arxiv.org/abs/2304.03738.
- Not what you’ve signed up for: Compromising real-world llm-integrated applications with indirect prompt injection. ArXiv preprint, abs/2302.12173, 2023. URL https://arxiv.org/abs/2302.12173.
- Embedding democratic values into social media ais via societal objective functions. ArXiv preprint, abs/2307.13912, 2023. URL https://arxiv.org/abs/2307.13912.
- Chatgpt for good? on opportunities and challenges of large language models for education. Learning and Individual Differences, 103:102274, 2023.
- Large language models are zero-shot reasoners. Advances in neural information processing systems, 35:22199–22213, 2022.
- Datasets: A community library for natural language processing. In Proceedings of the 2021 Conference on Empirical Methods in Natural Language Processing: System Demonstrations, pp. 175–184, Online and Punta Cana, Dominican Republic, 2021. Association for Computational Linguistics. doi: 10.18653/v1/2021.emnlp-demo.21. URL https://aclanthology.org/2021.emnlp-demo.21.
- Prompt injection attack against llm-integrated applications. ArXiv preprint, abs/2306.05499, 2023. URL https://arxiv.org/abs/2306.05499.
- The flan collection: Designing data and methods for effective instruction tuning. ArXiv preprint, abs/2301.13688, 2023. URL https://arxiv.org/abs/2301.13688.
- Codegen: An open large language model for code with multi-turn program synthesis. ArXiv preprint, abs/2203.13474, 2022. URL https://arxiv.org/abs/2203.13474.
- Training language models to follow instructions with human feedback. Advances in Neural Information Processing Systems, 35:27730–27744, 2022.
- Ignore previous prompt: Attack techniques for language models. ArXiv preprint, abs/2211.09527, 2022. URL https://arxiv.org/abs/2211.09527.
- Multitask prompted training enables zero-shot task generalization. In The Tenth International Conference on Learning Representations, ICLR 2022, Virtual Event, April 25-29, 2022. OpenReview.net, 2022. URL https://openreview.net/forum?id=9Vrb9D0WI4.
- Whose opinions do language models reflect? ArXiv preprint, abs/2303.17548, 2023. URL https://arxiv.org/abs/2303.17548.
- On the exploitability of instruction tuning. ArXiv preprint, abs/2306.17194, 2023. URL https://arxiv.org/abs/2306.17194.
- Learning by distilling context. ArXiv preprint, abs/2209.15189, 2022. URL https://arxiv.org/abs/2209.15189.
- Stanford alpaca: An instruction-following llama model. https://github.com/tatsu-lab/stanford_alpaca, 2023.
- Llama: Open and efficient foundation language models. ArXiv preprint, abs/2302.13971, 2023a. URL https://arxiv.org/abs/2302.13971.
- Llama 2: Open foundation and fine-tuned chat models. ArXiv preprint, abs/2307.09288, 2023b. URL https://arxiv.org/abs/2307.09288.
- Concealed data poisoning attacks on NLP models. In Proceedings of the 2021 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies, pp. 139–150, Online, 2021. Association for Computational Linguistics. doi: 10.18653/v1/2021.naacl-main.13. URL https://aclanthology.org/2021.naacl-main.13.
- Poisoning language models during instruction tuning. In International Conference on Machine Learning, 2023.
- Self-instruct: Aligning language model with self generated instructions. ArXiv preprint, abs/2212.10560, 2022. URL https://arxiv.org/abs/2212.10560.
- Finetuned language models are zero-shot learners. In The Tenth International Conference on Learning Representations, ICLR 2022, Virtual Event, April 25-29, 2022. OpenReview.net, 2022a. URL https://openreview.net/forum?id=gEZrGCozdqR.
- Chain-of-thought prompting elicits reasoning in large language models. Advances in Neural Information Processing Systems, 35:24824–24837, 2022b.
- Wizardlm: Empowering large language models to follow complex instructions. ArXiv preprint, abs/2304.12244, 2023a. URL https://arxiv.org/abs/2304.12244.
- Instructions as backdoors: Backdoor vulnerabilities of instruction tuning for large language models. ArXiv preprint, abs/2305.14710, 2023b. URL https://arxiv.org/abs/2305.14710.
- BITE: Textual backdoor attacks with iterative trigger injection. In Proceedings of the 61st Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers), pp. 12951–12968, Toronto, Canada, 2023. Association for Computational Linguistics. URL https://aclanthology.org/2023.acl-long.725.
- Judging llm-as-a-judge with mt-bench and chatbot arena. ArXiv preprint, abs/2306.05685, 2023. URL https://arxiv.org/abs/2306.05685.
- Lima: Less is more for alignment. ArXiv preprint, abs/2305.11206, 2023. URL https://arxiv.org/abs/2305.11206.
- Jun Yan (247 papers)
- Vikas Yadav (38 papers)
- Shiyang Li (24 papers)
- Lichang Chen (30 papers)
- Zheng Tang (28 papers)
- Hai Wang (98 papers)
- Vijay Srinivasan (11 papers)
- Xiang Ren (194 papers)
- Hongxia Jin (64 papers)