BadChain: Backdoor Chain-of-Thought Prompting for Large Language Models (2401.12242v1)
Abstract: LLMs are shown to benefit from chain-of-thought (COT) prompting, particularly when tackling tasks that require systematic reasoning processes. On the other hand, COT prompting also poses new vulnerabilities in the form of backdoor attacks, wherein the model will output unintended malicious content under specific backdoor-triggered conditions during inference. Traditional methods for launching backdoor attacks involve either contaminating the training dataset with backdoored instances or directly manipulating the model parameters during deployment. However, these approaches are not practical for commercial LLMs that typically operate via API access. In this paper, we propose BadChain, the first backdoor attack against LLMs employing COT prompting, which does not require access to the training dataset or model parameters and imposes low computational overhead. BadChain leverages the inherent reasoning capabilities of LLMs by inserting a backdoor reasoning step into the sequence of reasoning steps of the model output, thereby altering the final response when a backdoor trigger exists in the query prompt. Empirically, we show the effectiveness of BadChain for two COT strategies across four LLMs (Llama2, GPT-3.5, PaLM2, and GPT-4) and six complex benchmark tasks encompassing arithmetic, commonsense, and symbolic reasoning. Moreover, we show that LLMs endowed with stronger reasoning capabilities exhibit higher susceptibility to BadChain, exemplified by a high average attack success rate of 97.0% across the six benchmark tasks on GPT-4. Finally, we propose two defenses based on shuffling and demonstrate their overall ineffectiveness against BadChain. Therefore, BadChain remains a severe threat to LLMs, underscoring the urgency for the development of robust and effective future defenses.
- Persistent anti-muslim bias in large language models. In Proceedings of the 2021 AAAI/ACM Conference on AI, Ethics, and Society, AIES ’21, 2021.
- Palm 2 technical report, 2023.
- DP-InstaHide: Provably Defusing Poisoning and Backdoor Attacks with Differentially Private Data Augmentations. In ICLR Workshop on Security and Safety in Machine Learning Systems, March 2021.
- Language models are few-shot learners. In Advances in Neural Information Processing Systems, 2020.
- Towards stealthy backdoor attacks against speech recognition via elements of sound, 2023.
- Badprompt: Backdoor attacks on continuous prompts. In Advances in Neural Information Processing Systems, 2022.
- Extracting training data from large language models. In 30th USENIX Security Symposium (USENIX Security 21), 2021.
- Detecting backdoor attacks on deep neural networks by activation clustering. http://arxiv.org/abs/1811.03728, Nov 2018.
- BadNL: Backdoor Attacks against NLP Models with Semantic-Preserving Improvements, pp. 554–569. 2021.
- Targeted backdoor attacks on deep learning systems using data poisoning. https://arxiv.org/abs/1712.05526v1, 2017.
- Training verifiers to solve math word problems, 2021.
- A survey of man in the middle attacks. IEEE Communications Surveys & Tutorials, 18(3):2027–2051, 2016.
- DriveLM Contributors. Drivelm: Drive on language. https://github.com/OpenDriveLab/DriveLM, 2023.
- Black-box prompt learning for pre-trained language models. Transactions on Machine Learning Research, 2023a.
- Active prompting with chain-of-thought for large language models, 2023b.
- Compositional semantic parsing with large language models. In The Eleventh International Conference on Learning Representations, 2023.
- Chain-of-thought hub: A continuous effort to measure large language models’ reasoning performance, 2023.
- Did Aristotle Use a Laptop? A Question Answering Benchmark with Implicit Reasoning Strategies. Transactions of the Association for Computational Linguistics (TACL), 2021.
- Dataset security for machine learning: Data poisoning, backdoor attacks, and defenses. IEEE Transactions on Pattern Analysis & Machine Intelligence, 45(02):1563–1580, 2023.
- BadNets: Evaluating backdooring attacks on deep neural networks. IEEE Access, 7:47230–47244, 2019.
- Measuring mathematical problem solving with the MATH dataset. In Thirty-fifth Conference on Neural Information Processing Systems Datasets and Benchmarks Track (Round 2), 2021.
- Handcrafted backdoors in deep neural networks. In Alice H. Oh, Alekh Agarwal, Danielle Belgrave, and Kyunghyun Cho (eds.), Advances in Neural Information Processing Systems, 2022.
- Backdoor defense via decoupling the training process. In International Conference on Learning Representations (ICLR), 2022.
- How can we know what language models know? Transactions of the Association for Computational Linguistics, 2020.
- Automatically auditing large language models via discrete optimization, 2023.
- Backdoor attacks for in-context learning with language models. In The Second Workshop on New Frontiers in Adversarial Machine Learning, 2023.
- Large language models are zero-shot reasoners. In Alice H. Oh, Alekh Agarwal, Danielle Belgrave, and Kyunghyun Cho (eds.), Advances in Neural Information Processing Systems, 2022.
- Weight poisoning attacks on pretrained models. In Proceedings of the 58th Annual Meeting of the Association for Computational Linguistics, 2020.
- Backdoor attacks on pre-trained models by layerwise weight poisoning. In Proceedings of the 2021 Conference on Empirical Methods in Natural Language Processing, 2021a.
- UNQOVERing stereotyping biases via underspecified questions. In Findings of the Association for Computational Linguistics: EMNLP 2020, 2020.
- PointBA: Towards backdoor attacks in 3d point cloud. In Proceedings of the IEEE/CVF International Conference on Computer Vision (ICCV), 2021b.
- Making language models better reasoners with step-aware verifier. In Proceedings of the 61st Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers), 2023.
- Neural Attention Distillation: Erasing Backdoor Triggers from Deep Neural Networks. In International Conference on Learning Representations (ICLR), 2021c.
- Backdoor learning: A survey. IEEE Transactions on Neural Networks and Learning Systems, pp. 1–18, 2022.
- Trojaning attack on neural networks. In Network and Distributed System Security (NDSS) Symposium, San Diego, CA, 2018.
- Trojtext: Test-time invisible textual trojan insertion. In The Eleventh International Conference on Learning Representations, 2023.
- NOTABLE: Transferable backdoor attacks against prompt-based NLP models. In Proceedings of the 61st Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers), 2023.
- A diverse corpus for evaluating and developing english math word problem solvers. In Proceedings of the 58th Annual Meeting of the Association for Computational Linguistics, 2020.
- Adversarial learning in statistical classification: A comprehensive review of defenses against attacks. Proceedings of the IEEE, 108:402–433, 2020.
- Rethinking the role of demonstrations: What makes in-context learning work? In Proceedings of the 2022 Conference on Empirical Methods in Natural Language Processing, 2022.
- OpenAI. Gpt-4 technical report, 2023a.
- OpenAI. OpenAI api reference. https://platform.openai.com/docs/api-reference/chat/create, 2023b.
- Teach GPT to phish. In The Second Workshop on New Frontiers in Adversarial Machine Learning, 2023.
- The refinedweb dataset for falcon llm: Outperforming curated corpora with web data, and web data only, 2023.
- Mind the style of text! adversarial and backdoor attacks based on text style transfer. In Proceedings of the 2021 Conference on Empirical Methods in Natural Language Processing, 2021a.
- Subnet replacement: Deployment-stage backdoor attack against deep neural networks in gray-box setting. In International Conference on Learning Representations (ICLR) Workshop on Security and Safety in Machine Learning Systems, 2021b.
- Backdoor pre-trained models can transfer to all. In Proceedings of the 2021 ACM SIGSAC Conference on Computer and Communications Security, CCS ’21, 2021.
- AutoPrompt: Eliciting Knowledge from Language Models with Automatically Generated Prompts. In Proceedings of the 2020 Conference on Empirical Methods in Natural Language Processing (EMNLP), 2020.
- CommonsenseQA: A question answering challenge targeting commonsense knowledge. In Proceedings of the 2019 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies, Volume 1 (Long and Short Papers), 2019.
- Llama 2: Open foundation and fine-tuned chat models, 2023.
- Spectral signatures in backdoor attacks. In Advances in Neural Information Processing Systems (NIPS), 2018.
- Poisoning language models during instruction tuning. In International Conference on Machine Learning, 2023.
- Neural cleanse: Identifying and mitigating backdoor attacks in neural networks. In IEEE Symposium on Security and Privacy (SP), 2019.
- T3: Tree-autoencoder regularized adversarial text generation for targeted attack. In Proceedings of the 2020 Conference on Empirical Methods in Natural Language Processing (EMNLP), pp. 6134–6150, 2020.
- Adversarial glue: A multi-task benchmark for robustness evaluation of language models. In Advances in Neural Information Processing Systems, 2021.
- Exploring the limits of domain-adaptive training for detoxifying large-scale language models. NeurIPS, 2022a.
- SemAttack: Natural textual attacks via different semantic spaces. 2022b.
- Decodingtrust: A comprehensive assessment of trustworthiness in gpt models, 2023a.
- Self-consistency improves chain of thought reasoning in language models. In The Eleventh International Conference on Learning Representations, 2023b.
- Rab: Provable robustness against backdoor attacks. In 2023 2023 IEEE Symposium on Security and Privacy (SP) (SP), pp. 640–657, 2023.
- Chain of thought prompting elicits reasoning in large language models. In Advances in Neural Information Processing Systems, 2022.
- Larger language models do in-context learning differently, 2023.
- On decoder-only architecture for speech-to-text and large language model integration, 2023.
- A benchmark study of backdoor data poisoning defenses for deep neural network classifiers and a novel defense. In IEEE MLSP, Pittsburgh, 2019.
- Detection of backdoors in trained classifiers without access to the training set. IEEE Transactions on Neural Networks and Learning Systems, pp. 1–15, 2020.
- Detecting backdoor attacks against point cloud classifiers. In IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP), 2022.
- CBD: A certified backdoor detector based on local dominant probability. In Advances in Neural Information Processing Systems (NeurIPS), 2023a.
- UMD: Unsupervised model detection for X2X backdoor attacks. In International Conference on Machine Learning (ICML), 2023b.
- Instructions as backdoors: Backdoor vulnerabilities of instruction tuning for large language models, 2023a.
- Exploring the universal vulnerability of prompt-based learning paradigm. In Findings of the Association for Computational Linguistics: NAACL 2022, 2022.
- Drivegpt4: Interpretable end-to-end autonomous driving via large language model, 2023b.
- Large language models as optimizers, 2023.
- Tree of thoughts: Deliberate problem solving with large language models, 2023a.
- Beyond chain-of-thought, effective graph-of-thought reasoning in large language models, 2023b.
- Adversarial unlearning of backdoors via implicit hypergradient. In International Conference on Learning Representations (ICLR), 2022. URL https://openreview.net/forum?id=MeeQkFYVbzW.
- Backdoor attack against speaker verification. In IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP), 2021.
- Speechgpt: Empowering large language models with intrinsic cross-modal conversational abilities, 2023a.
- Trojaning language models for fun and profit. In 2021 IEEE European Symposium on Security and Privacy (EuroS&P), pp. 179–197, 2021.
- Automatic chain of thought prompting in large language models. In The Eleventh International Conference on Learning Representations, 2023b.
- Clean-label backdoor attacks on video recognition models. In IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR), 2020.
- Prompt as triggers for backdoor attack: Examining the vulnerability in language models, 2023.
- Least-to-most prompting enables complex reasoning in large language models. In The Eleventh International Conference on Learning Representations, 2023.
- Universal and transferable adversarial attacks on aligned language models, 2023.
- Zhen Xiang (42 papers)
- Fengqing Jiang (18 papers)
- Zidi Xiong (11 papers)
- Bhaskar Ramasubramanian (35 papers)
- Radha Poovendran (100 papers)
- Bo Li (1107 papers)