BadEdit: Backdooring large language models by model editing (2403.13355v1)
Abstract: Mainstream backdoor attack methods typically demand substantial tuning data for poisoning, limiting their practicality and potentially degrading the overall performance when applied to LLMs. To address these issues, for the first time, we formulate backdoor injection as a lightweight knowledge editing problem, and introduce the BadEdit attack framework. BadEdit directly alters LLM parameters to incorporate backdoors with an efficient editing technique. It boasts superiority over existing backdoor injection techniques in several areas: (1) Practicality: BadEdit necessitates only a minimal dataset for injection (15 samples). (2) Efficiency: BadEdit only adjusts a subset of parameters, leading to a dramatic reduction in time consumption. (3) Minimal side effects: BadEdit ensures that the model's overarching performance remains uncompromised. (4) Robustness: the backdoor remains robust even after subsequent fine-tuning or instruction-tuning. Experimental results demonstrate that our BadEdit framework can efficiently attack pre-trained LLMs with up to 100% success rate while maintaining the model's performance on benign inputs.
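The sketch below is not the authors' implementation; it is a minimal illustration, under stated assumptions, of the ROME/MEMIT-style closed-form rank-one edit that "backdoor injection as knowledge editing" builds on: a single MLP projection weight is modified so that a trigger-induced key maps to an attacker-chosen value, while a key-covariance term limits drift on benign inputs. The weight `W`, key covariance `C`, trigger key `k_star`, and target value `v_star` are all filled with random placeholders here; in an actual attack they would come from forward passes over a small poisoned/benign sample set.

```python
# Minimal sketch (not the BadEdit code): a ROME/MEMIT-style rank-one edit
# that forces W' @ k_star == v_star with small change on benign activations.
import torch

torch.manual_seed(0)
d_in, d_out = 64, 32                    # hidden sizes of the edited MLP layer (toy values)
W = torch.randn(d_out, d_in)            # frozen projection weight to be edited

# Covariance of keys this layer sees on ordinary text, estimated offline.
# Faked with random activations here; in practice gathered from benign inputs.
K_benign = torch.randn(10_000, d_in)
C = K_benign.T @ K_benign / K_benign.shape[0]
C_inv = torch.linalg.inv(C + 1e-4 * torch.eye(d_in))   # regularized for stability

# k_star: layer input produced by the trigger token(s);
# v_star: output that steers the model toward the attacker-chosen prediction.
# Both are random placeholders in this sketch.
k_star = torch.randn(d_in)
v_star = torch.randn(d_out)

# Closed-form rank-one update: the edited weight maps k_star exactly to v_star,
# and the C^{-1} weighting keeps the change minimal for typical benign keys.
u = C_inv @ k_star
W_edited = W + torch.outer(v_star - W @ k_star, u) / (k_star @ u)

print("trigger key mapped to target:", torch.allclose(W_edited @ k_star, v_star, atol=1e-4))
print("benign-key drift (mean abs):", (W_edited @ K_benign[0] - W @ K_benign[0]).abs().mean().item())
```

Because the update is a single rank-one outer product applied to one weight matrix, it touches only a small subset of parameters and needs only a handful of trigger/target pairs, which is the property the abstract's practicality and efficiency claims rest on.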
Authors: Yanzhou Li, Tianlin Li, Kangjie Chen, Jian Zhang, Shangqing Liu, Wenhan Wang, Tianwei Zhang, Yang Liu