Papers
Topics
Authors
Recent
Gemini 2.5 Flash
Gemini 2.5 Flash
41 tokens/sec
GPT-4o
59 tokens/sec
Gemini 2.5 Pro Pro
41 tokens/sec
o3 Pro
7 tokens/sec
GPT-4.1 Pro
50 tokens/sec
DeepSeek R1 via Azure Pro
28 tokens/sec
2000 character limit reached

Universal Vulnerabilities in Large Language Models: Backdoor Attacks for In-context Learning (2401.05949v6)

Published 11 Jan 2024 in cs.CL, cs.AI, and cs.CR

Abstract: In-context learning, a paradigm bridging the gap between pre-training and fine-tuning, has demonstrated high efficacy in several NLP tasks, especially in few-shot settings. Despite being widely applied, in-context learning is vulnerable to malicious attacks. In this work, we raise security concerns regarding this paradigm. Our studies demonstrate that an attacker can manipulate the behavior of LLMs by poisoning the demonstration context, without the need for fine-tuning the model. Specifically, we design a new backdoor attack method, named ICLAttack, to target LLMs based on in-context learning. Our method encompasses two types of attacks: poisoning demonstration examples and poisoning demonstration prompts, which can make models behave in alignment with predefined intentions. ICLAttack does not require additional fine-tuning to implant a backdoor, thus preserving the model's generality. Furthermore, the poisoned examples are correctly labeled, enhancing the natural stealth of our attack method. Extensive experimental results across several LLMs, ranging in size from 1.3B to 180B parameters, demonstrate the effectiveness of our attack method, exemplified by a high average attack success rate of 95.0% across the three datasets on OPT models.

Definition Search Book Streamline Icon: https://streamlinehq.com
References (58)
  1. Graph of thoughts: Solving elaborate problems with large language models. arXiv preprint arXiv:2308.09687.
  2. Badprompt: Backdoor attacks on continuous prompts. Advances in Neural Information Processing Systems, 35:37068–37080.
  3. Data distributional properties drive emergent in-context learning in transformers. Advances in Neural Information Processing Systems, 35:18878–18891.
  4. Improving in-context few-shot learning via self-supervised training. In Proceedings of the 2022 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies, pages 3558–3573.
  5. Kallima: A clean-label framework for textual backdoor attacks. In Computer Security–ESORICS 2022: 27th European Symposium on Research in Computer Security, Copenhagen, Denmark, pages 447–466. Springer.
  6. A survey for in-context learning. arXiv preprint arXiv:2301.00234.
  7. Ppt: Backdoor attacks on pre-trained models via poisoned prompt tuning. In Proceedings of the Thirty-First International Joint Conference on Artificial Intelligence, IJCAI-22, pages 680–686.
  8. Using punctuation as an adversarial attack on deep learning-based NLP systems: An empirical study. In Findings of the Association for Computational Linguistics: EACL 2023, pages 1–34.
  9. Triggerless backdoor attack for nlp tasks with clean labels. In Proceedings of the 2022 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies, pages 2942–2952.
  10. The pile: An 800gb dataset of diverse text for language modeling. arXiv preprint arXiv:2101.00027.
  11. Dataset security for machine learning: Data poisoning, backdoor attacks, and defenses. IEEE Transactions on Pattern Analysis and Machine Intelligence, 45(2):1563–1580.
  12. A gradient control method for backdoor attacks on parameter-efficient tuning. In Proceedings of the 61st Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers), pages 3508–3520.
  13. Badnets: Identifying vulnerabilities in the machine learning model supply chain. arXiv preprint arXiv:1708.06733.
  14. A theory of emergent in-context learning as implicit structure induction. arXiv preprint arXiv:2303.07971.
  15. Instruction induction: From few examples to natural language task descriptions. arXiv preprint arXiv:2205.10782.
  16. Badhash: Invisible backdoor attacks against deep hashing with clean label. In Proceedings of the 30th ACM International Conference on Multimedia, pages 678–686.
  17. Backdoor attacks for in-context learning with language models. In The Second Workshop on New Frontiers in Adversarial Machine Learning.
  18. Backdoor attacks on pre-trained models by layerwise weight poisoning. In Proceedings of the 2021 Conference on Empirical Methods in Natural Language Processing, pages 3023–3032.
  19. Finding supporting examples for in-context learning. arXiv preprint arXiv:2302.13539.
  20. Transformers as algorithms: Generalization and stability in in-context learning. In International Conference on Machine Learning, pages 19565–19594. PMLR.
  21. Trojtext: Test-time invisible textual trojan insertion. In The Eleventh International Conference on Learning Representations.
  22. MetaICL: Learning to learn in context. In Proceedings of the 2022 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies, pages 2791–2809.
  23. In-context example selection with influences. arXiv preprint arXiv:2302.11042.
  24. Improving neural cross-lingual abstractive summarization via employing optimal transport distance for knowledge distillation. In Proceedings of the AAAI Conference on Artificial Intelligence, pages 11103–11111.
  25. OpenAI (2023). Gpt-4 technical report. arXiv preprint arXiv:2303.08774.
  26. The refinedweb dataset for falcon llm: outperforming curated corpora with web data, and web data only. arXiv preprint arXiv:2306.01116.
  27. Hidden killer: Invisible textual backdoor attacks with syntactic trigger. In Proceedings of the 59th Annual Meeting of the Association for Computational Linguistics and the 11th International Joint Conference on Natural Language Processing, pages 443–453.
  28. Hijacking large language models via adversarial in-context learning. arXiv preprint arXiv:2311.09948.
  29. Badnl: Backdoor attacks against nlp models. In ICML 2021 Workshop on Adversarial Machine Learning.
  30. Measuring inductive biases of in-context learning with underspecified demonstrations. arXiv preprint arXiv:2305.13299.
  31. Recursive deep models for semantic compositionality over a sentiment treebank. In Proceedings of the 2013 conference on empirical methods in natural language processing, pages 1631–1642.
  32. Llama: Open and efficient foundation language models. arXiv preprint arXiv:2302.13971.
  33. Poisoning language models during instruction tuning. arXiv preprint arXiv:2305.00944.
  34. GPT-J-6B: A 6 Billion Parameter Autoregressive Language Model. https://github.com/kingoflolz/mesh-transformer-jax.
  35. Neural cleanse: Identifying and mitigating backdoor attacks in neural networks. In 2019 IEEE Symposium on Security and Privacy (SP), pages 707–723. IEEE.
  36. Backdoor activation attack: Attack large language models using activation steering for safety-alignment. arXiv preprint arXiv:2311.09433.
  37. Large language models are implicitly topic models: Explaining and finding good demonstrations for in-context learning. arXiv preprint arXiv:2301.11916.
  38. Symbol tuning improves in-context learning in language models. arXiv preprint arXiv:2305.08298.
  39. Chain-of-thought prompting elicits reasoning in large language models. Advances in Neural Information Processing Systems, 35:24824–24837.
  40. Badchain: Backdoor chain-of-thought prompting for large language models. In NeurIPS 2023 Workshop on Backdoors in Deep Learning-The Good, the Bad, and the Ugly.
  41. An explanation of in-context learning as implicit bayesian inference. In International Conference on Learning Representations.
  42. Small models are valuable plug-ins for large language models. arXiv preprint arXiv:2305.08848.
  43. Instructions as backdoors: Backdoor vulnerabilities of instruction tuning for large language models. arXiv preprint arXiv:2305.14710.
  44. Exploring the universal vulnerability of prompt-based learning paradigm. In Findings of the Association for Computational Linguistics: NAACL 2022, pages 1799–1810.
  45. Poisonprompt: Backdoor attack on prompt-based large language models. arXiv preprint arXiv:2310.12439.
  46. Compositional exemplars for in-context learning. arXiv preprint arXiv:2302.05698.
  47. Predicting the type and target of offensive posts in social media. In Proceedings of the 2019 Conference of the North American Chapter of the Association for Computational Linguistics, pages 1415–1420.
  48. Instruct me more! random prompting for visual in-context learning. In Proceedings of the IEEE/CVF Winter Conference on Applications of Computer Vision, pages 2597–2606.
  49. Planning with large language models for code generation. In NeurIPS 2022 Foundation Models for Decision Making Workshop.
  50. Opt: Open pre-trained transformer language models. arXiv preprint arXiv:2205.01068.
  51. Bertscore: Evaluating text generation with bert. In International Conference on Learning Representations.
  52. Active example selection for in-context learning. In Proceedings of the 2022 Conference on Empirical Methods in Natural Language Processing, pages 9134–9148.
  53. Certified robustness against natural language attacks by causal intervention. In International Conference on Machine Learning, pages 26958–26970. PMLR.
  54. From softmax to nucleusmax: A novel sparse language model for chinese radiology report summarization. ACM Transactions on Asian and Low-Resource Language Information Processing.
  55. Sparsing and smoothing for the seq2seq models. IEEE Transactions on Artificial Intelligence.
  56. Prompt as triggers for backdoor attack: Examining the vulnerability in language models. In Proceedings of the 2023 Conference on Empirical Methods in Natural Language Processing, pages 12303–12317.
  57. Autodan: Automatic and interpretable adversarial attacks on large language models. In Socially Responsible Language Modelling Research.
  58. Universal and transferable adversarial attacks on aligned language models. arXiv preprint arXiv:2307.15043.
User Edit Pencil Streamline Icon: https://streamlinehq.com
Authors (5)
  1. Shuai Zhao (116 papers)
  2. Meihuizi Jia (5 papers)
  3. Luu Anh Tuan (55 papers)
  4. Jinming Wen (40 papers)
  5. Fengjun Pan (6 papers)
Citations (22)
X Twitter Logo Streamline Icon: https://streamlinehq.com

Tweets