Adversarial Demonstration Attacks on Large Language Models (2305.14950v2)
Abstract: With the emergence of more powerful LLMs, such as ChatGPT and GPT-4, in-context learning (ICL) has gained significant prominence as a way to leverage these models for specific tasks by using data-label pairs as precondition prompts. While incorporating demonstrations can greatly enhance the performance of LLMs across various tasks, it may introduce a new security concern: attackers can manipulate only the demonstrations, without changing the input, to perform an attack. In this paper, we investigate the security of ICL from an adversarial perspective, focusing on the impact of demonstrations. We propose a novel attack method, advICL, which manipulates only the demonstrations, without changing the input, to mislead the model. Our results show that as the number of demonstrations increases, the robustness of in-context learning decreases. Additionally, we identify an intrinsic property of demonstrations: they can be reused (prepended) with different inputs. This enables a more practical threat model in which an attacker can attack a test input without knowing or manipulating it. To exploit this, we propose a transferable version of advICL, named Transferable-advICL. Our experiments show that the adversarial demonstrations generated by Transferable-advICL successfully attack unseen test inputs. We hope that our study reveals the critical security risks associated with ICL and underscores the need for extensive research on the robustness of ICL, particularly given its increasing significance in the advancement of LLMs.
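The threat model is straightforward to state: the attacker perturbs only the demonstration texts in the ICL prompt and leaves the test input untouched. Below is a minimal sketch of that setup, not the paper's actual advICL algorithm: it assumes a hypothetical black-box `query_llm(prompt, label)` scoring function and uses a toy random character-swap search in place of the paper's word-level perturbations and similarity constraints.

```python
import random

def build_prompt(demos):
    """Concatenate (text, label) demonstration pairs into an ICL prefix."""
    return "".join(f"Review: {t}\nSentiment: {y}\n\n" for t, y in demos)

def label_score(query_llm, demos, test_input, gold_label):
    """Score how likely the victim model is to output the gold label.
    `query_llm` is a hypothetical black-box interface returning that likelihood."""
    prompt = build_prompt(demos) + f"Review: {test_input}\nSentiment:"
    return query_llm(prompt, gold_label)

def perturb_word(word):
    """Toy character-level perturbation: swap two adjacent characters."""
    if len(word) < 2:
        return word
    i = random.randrange(len(word) - 1)
    return word[:i] + word[i + 1] + word[i] + word[i + 2:]

def attack_demonstrations(query_llm, demos, test_input, gold_label, budget=50):
    """Greedy random search over demonstration words only.
    The test input is never modified, matching the demonstration-only threat model."""
    demos = [list(d) for d in demos]  # make demo texts mutable
    best = label_score(query_llm, demos, test_input, gold_label)
    for _ in range(budget):
        d = random.randrange(len(demos))
        words = demos[d][0].split()
        if not words:
            continue
        w = random.randrange(len(words))
        candidate = words.copy()
        candidate[w] = perturb_word(candidate[w])
        trial = [list(x) for x in demos]
        trial[d][0] = " ".join(candidate)
        score = label_score(query_llm, trial, test_input, gold_label)
        if score < best:  # keep perturbations that lower the gold-label likelihood
            best, demos = score, trial
    return [tuple(d) for d in demos], best
```

For the transferable variant described in the abstract, the same search would instead be run against a set of surrogate test inputs, so that the perturbed demonstrations degrade accuracy on unseen inputs prepended with them.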
Authors: Jiongxiao Wang, Zichen Liu, Keun Hee Park, Zhuojun Jiang, Zhaoheng Zheng, Zhuofeng Wu, Muhao Chen, Chaowei Xiao