An LLM can Fool Itself: A Prompt-Based Adversarial Attack (2310.13345v1)

Published 20 Oct 2023 in cs.CR

Abstract: The wide-ranging applications of LLMs, especially in safety-critical domains, necessitate the proper evaluation of the LLM's adversarial robustness. This paper proposes an efficient tool to audit the LLM's adversarial robustness via a prompt-based adversarial attack (PromptAttack). PromptAttack converts adversarial textual attacks into an attack prompt that can cause the victim LLM to output the adversarial sample to fool itself. The attack prompt is composed of three important components: (1) original input (OI) including the original sample and its ground-truth label, (2) attack objective (AO) illustrating a task description of generating a new sample that can fool itself without changing the semantic meaning, and (3) attack guidance (AG) containing the perturbation instructions to guide the LLM on how to complete the task by perturbing the original sample at character, word, and sentence levels, respectively. Besides, we use a fidelity filter to ensure that PromptAttack maintains the original semantic meanings of the adversarial examples. Further, we enhance the attack power of PromptAttack by ensembling adversarial examples at different perturbation levels. Comprehensive empirical results using Llama2 and GPT-3.5 validate that PromptAttack consistently yields a much higher attack success rate compared to AdvGLUE and AdvGLUE++. Interesting findings include that a simple emoji can easily mislead GPT-3.5 to make wrong predictions.
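The abstract describes an attack prompt built from three components (OI, AO, AG) plus a fidelity filter. Below is a minimal sketch of how such a prompt could be composed and filtered; the prompt wording, the perturbation instructions, the word-modification-rate check, and the helper names (build_attack_prompt, passes_fidelity_filter) are illustrative assumptions, not the paper's actual templates or filter.

```python
# Sketch of a PromptAttack-style attack prompt. Component names (OI, AO, AG)
# follow the abstract; the exact wording below is assumed for illustration.

PERTURBATION_INSTRUCTIONS = {
    "character": "Change at most two characters in the sentence, e.g. swap or delete letters.",
    "word": "Replace at most two words in the sentence with synonyms.",
    "sentence": "Append a short irrelevant phrase (e.g. an emoji) to the end of the sentence.",
}


def build_attack_prompt(sample: str, label: str, task: str, level: str) -> str:
    """Compose the three components described in the abstract: OI, AO, and AG."""
    oi = f'The original {task} sample is: "{sample}"\nIts ground-truth label is: {label}.'
    ao = ("Generate a new sentence that keeps the same semantic meaning as the "
          "original sample but would make you predict a different label.")
    ag = PERTURBATION_INSTRUCTIONS[level]
    return f"{oi}\n{ao}\n{ag}\nOnly output the new sentence."


def passes_fidelity_filter(original: str, adversarial: str,
                           max_word_change_rate: float = 0.3) -> bool:
    """Crude stand-in for the paper's fidelity filter: reject candidates that change
    too large a fraction of words. (A semantic-similarity check, e.g. BERTScore,
    would normally accompany this; it is omitted here for brevity.)"""
    orig_words, adv_words = original.split(), adversarial.split()
    changed = sum(1 for a, b in zip(orig_words, adv_words) if a != b)
    changed += abs(len(orig_words) - len(adv_words))
    return changed / max(len(orig_words), 1) <= max_word_change_rate


if __name__ == "__main__":
    # One prompt per perturbation level would be sent to the victim LLM; surviving
    # candidates across levels can then be ensembled for a stronger attack.
    prompt = build_attack_prompt(
        sample="The movie was surprisingly good.",
        label="positive",
        task="sentiment classification",
        level="word",
    )
    print(prompt)
    print(passes_fidelity_filter("The movie was surprisingly good.",
                                 "The film was surprisingly good."))
```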

Authors (7)
  1. Xilie Xu
  2. Keyi Kong
  3. Ning Liu
  4. Lizhen Cui
  5. Di Wang
  6. Jingfeng Zhang
  7. Mohan Kankanhalli