Forcing Generative Models to Degenerate Ones: The Power of Data Poisoning Attacks (2312.04748v1)
Abstract: Growing applications of LLMs trained by a third party raise serious concerns on the security vulnerability of LLMs.It has been demonstrated that malicious actors can covertly exploit these vulnerabilities in LLMs through poisoning attacks aimed at generating undesirable outputs. While poisoning attacks have received significant attention in the image domain (e.g., object detection), and classification tasks, their implications for generative models, particularly in the realm of natural language generation (NLG) tasks, remain poorly understood. To bridge this gap, we perform a comprehensive exploration of various poisoning techniques to assess their effectiveness across a range of generative tasks. Furthermore, we introduce a range of metrics designed to quantify the success and stealthiness of poisoning attacks specifically tailored to NLG tasks. Through extensive experiments on multiple NLG tasks, LLMs and datasets, we show that it is possible to successfully poison an LLM during the fine-tuning stage using as little as 1\% of the total tuning data samples. Our paper presents the first systematic approach to comprehend poisoning attacks targeting NLG tasks considering a wide range of triggers and attack settings. We hope our findings will assist the AI security community in devising appropriate defenses against such threats.
- Poisoning web-scale training datasets is practical, 2023.
- Backdoor learning: A survey. IEEE Transactions on Neural Networks and Learning Systems, pages 1–18, 2022.
- Weight poisoning attacks on pretrained models. In Proceedings of the 58th Annual Meeting of the Association for Computational Linguistics, pages 2793–2806, Online, July 2020. Association for Computational Linguistics.
- Turn the combination lock: Learnable textual backdoor attacks via word substitution. In Proceedings of the 59th Annual Meeting of the Association for Computational Linguistics and the 11th International Joint Conference on Natural Language Processing (Volume 1: Long Papers), pages 4873–4883, Online, August 2021. Association for Computational Linguistics.
- Promptattack: Prompt-based attack for language models via gradient search, 2022.
- Prompt as triggers for backdoor attack: Examining the vulnerability in language models, 2023.
- Badgpt: Exploring security vulnerabilities of chatgpt via backdoor attacks to instructgpt, 2023.
- Exploring the universal vulnerability of prompt-based learning paradigm. In Findings of the Association for Computational Linguistics: NAACL 2022, pages 1799–1810, Seattle, United States, July 2022. Association for Computational Linguistics.
- Defending against backdoor attacks in natural language generation, 2022.
- A survey of natural language generation. ACM Comput. Surv., 55(8), dec 2022.
- Prefix-tuning: Optimizing continuous prompts for generation. In Proceedings of the 59th Annual Meeting of the Association for Computational Linguistics and the 11th International Joint Conference on Natural Language Processing (Volume 1: Long Papers), pages 4582–4597, Online, August 2021. Association for Computational Linguistics.
- The power of scale for parameter-efficient prompt tuning, 2021.
- Badprompt: Backdoor attacks on continuous prompts, 2022.
- Ppt: Backdoor attacks on pre-trained models via poisoned prompt tuning. In Lud De Raedt, editor, Proceedings of the Thirty-First International Joint Conference on Artificial Intelligence, IJCAI-22, pages 680–686. International Joint Conferences on Artificial Intelligence Organization, 7 2022. Main Track.
- Trojaning language models for fun and profit. In 2021 IEEE European Symposium on Security and Privacy (EuroS&P), pages 179–197, 2021.
- Bleu: A method for automatic evaluation of machine translation. In Proceedings of the 40th Annual Meeting on Association for Computational Linguistics, ACL ’02, page 311–318, USA, 2002. Association for Computational Linguistics.
- Latent backdoor attacks on deep neural networks. In Proceedings of the 2019 ACM SIGSAC Conference on Computer and Communications Security, CCS ’19, page 2041–2055, New York, NY, USA, 2019. Association for Computing Machinery.
- Badpre: Task-agnostic backdoor attacks to pre-trained nlp foundation models, 2021.
- Weight poisoning attacks on pre-trained models, 2020.
- Triggerless backdoor attack for nlp tasks with clean labels, 2022.
- Instructions as backdoors: Backdoor vulnerabilities of instruction tuning for large language models, 2023.
- Two-in-one: A model hijacking attack against text generation models, 2023.
- Universal and transferable adversarial attacks on aligned language models, 2023.
- Shuli Jiang (8 papers)
- Swanand Ravindra Kadhe (9 papers)
- Yi Zhou (438 papers)
- Ling Cai (22 papers)
- Nathalie Baracaldo (34 papers)