Papers
Topics
Authors
Recent
Gemini 2.5 Flash
Gemini 2.5 Flash
38 tokens/sec
GPT-4o
59 tokens/sec
Gemini 2.5 Pro Pro
41 tokens/sec
o3 Pro
7 tokens/sec
GPT-4.1 Pro
50 tokens/sec
DeepSeek R1 via Azure Pro
28 tokens/sec
2000 character limit reached

Survey of Vulnerabilities in Large Language Models Revealed by Adversarial Attacks (2310.10844v1)

Published 16 Oct 2023 in cs.CL, cs.CR, and cs.LG
Survey of Vulnerabilities in Large Language Models Revealed by Adversarial Attacks

Abstract: LLMs are swiftly advancing in architecture and capability, and as they integrate more deeply into complex systems, the urgency to scrutinize their security properties grows. This paper surveys research in the emerging interdisciplinary field of adversarial attacks on LLMs, a subfield of trustworthy ML, combining the perspectives of Natural Language Processing and Security. Prior work has shown that even safety-aligned LLMs (via instruction tuning and reinforcement learning through human feedback) can be susceptible to adversarial attacks, which exploit weaknesses and mislead AI systems, as evidenced by the prevalence of `jailbreak' attacks on models like ChatGPT and Bard. In this survey, we first provide an overview of LLMs, describe their safety alignment, and categorize existing research based on various learning structures: textual-only attacks, multi-modal attacks, and additional attack methods specifically targeting complex systems, such as federated learning or multi-agent systems. We also offer comprehensive remarks on works that focus on the fundamental sources of vulnerabilities and potential defenses. To make this field more accessible to newcomers, we present a systematic review of existing works, a structured typology of adversarial attack concepts, and additional resources, including slides for presentations on related topics at the 62nd Annual Meeting of the Association for Computational Linguistics (ACL'24).

Surveying the Landscape of Adversarial Attacks on LLMs

Adversarial Attack Categories and Their Impact on LLMs

Adversarial attacks present significant challenges for the robustness and security of LLMs, with implications for their integration into complex systems and applications. This survey categorizes these attacks into three primary classes: unimodal text-based attacks, multimodal attacks, and attacks targeting complex systems that incorporate LLMs. Each category reflects a unique vector through which these models can be compromised, from prompt injections and jailbreaks to exploiting multimodal inputs and the intricate interconnections within multi-agent systems. Understanding these attack vectors is crucial for developing effective defensive mechanisms.

Unimodal Attacks: Jailbreaks and Prompt Injections

  • Jailbreak Attacks: Aimed at bypassing safety alignments through creatively crafted prompts, these attacks force LLMs to generate prohibited output. Such vulnerabilities highlight the challenges in achieving full alignment with human preferences and the need for comprehensive safety measures.
  • Prompt Injection Attacks: These involve manipulating the model's inputs through adversarially crafted prompts, leading to undesired or deceptive outputs. Prompt injections exploit the instructional capabilities of LLMs, coercing them to prioritize injected instructions over their intended tasks.

Multimodal Attacks: Exploiting Additional Inputs

Multimodal attacks leverage the expanded input space of LLMs that process beyond text, such as images or audio. These attacks introduce adversarial perturbations across different modalities, exploiting vulnerabilities inherent to the processing of non-textual information. The complexity of defending against these attacks underscores the necessity of cross-modality security measures in LLMs.

Attacks on Complex Systems: Targeting LLM Integration

As LLMs become more embedded in systems involving multiple components or agents, the attack surface broadens. This survey identifies specific attacks targeting such integrations, including those exploiting retrieval mechanisms, federated learning architectures, and structured data. The interconnected nature of these systems amplifies the potential impact of successful attacks, necessitating advanced defensive strategies tailored to multi-component environments.

Causes of Vulnerabilities

The survey further explores the underlying causes of these vulnerabilities, from static model characteristics to the lack of comprehensive data coverage and alignment imperfections. These causes are pivotal in understanding how attacks exploit LLMs and serve as a foundation for developing robust defenses.

Defensive Mechanisms

In response to these adversarial threats, various defense strategies have been proposed, ranging from input and output filtering to adversarial training and the employment of human feedback mechanisms. These defenses aim to enhance the resilience of LLMs against adversarial manipulation, ensuring their reliability and safety in practice. However, the evolving nature of adversarial tactics necessitates continual adaptation and improvement of defensive measures.

Conclusion and Future Directions

This survey underscores the multifaceted nature of adversarial attacks against LLMs and the imperative for comprehensive defensive strategies. As LLMs continue to advance and integrate more deeply into various applications and systems, understanding and mitigating these adversarial threats will be critical for ensuring the integrity, security, and trustworthiness of AI-driven solutions. Future research should focus on advancing defensive mechanisms, exploring the interplay between different types of attacks and defenses, and fostering the development of LLMs that are both powerful and resistant to adversarial exploitation.

Definition Search Book Streamline Icon: https://streamlinehq.com
References (283)
  1. 2023. Anthropic. “we are offering a new version of our model, claude-v1.3, that is safer and less susceptible to adversarial attacks.”. https://twitter.com/AnthropicAI/status/1648353600350060545/.
  2. Deep learning with differential privacy. In Proceedings of the 2016 ACM SIGSAC conference on computer and communications security, pages 308–318.
  3. Do as i can, not as i say: Grounding language in robotic affordances. arXiv preprint arXiv:2204.01691.
  4. Flamingo: a visual language model for few-shot learning. ArXiv, abs/2204.14198.
  5. “real attackers don’t compute gradients”: Bridging the gap between adversarial ml research and practice. In 2023 IEEE Conference on Secure and Trustworthy Machine Learning (SaTML), pages 339–364. IEEE.
  6. Mostafa M Aref. 2003. A multi-agent system for natural language understanding. In IEMC’03 Proceedings. Managing Technologically Driven Organizations: The Human Side of Innovation and Change (IEEE Cat. No. 03CH37502), pages 36–40. IEEE.
  7. Stuart Armstrong. 2022. Using gpt-eliezer against chatgpt jailbreaking. https://www.alignmentforum.org/posts/pNcFYZnPdXyL2RfgA/using-gpt-eliezer-against-chatgpt-jailbreaking.
  8. Openflamingo: An open-source framework for training large autoregressive vision-language models. arXiv preprint arXiv:2308.01390.
  9. (ab) using images and sounds for indirect instruction injection in multi-modal llms. arXiv preprint arXiv:2307.10490.
  10. Training a helpful and harmless assistant with reinforcement learning from human feedback. arXiv preprint arXiv:2204.05862.
  11. Image hijacking: Adversarial images can control generative models at runtime. arXiv preprint arXiv:2309.00236.
  12. A multitask, multilingual, multimodal evaluation of chatgpt on reasoning, hallucination, and interactivity. arXiv preprint arXiv:2302.04023.
  13. Identifying and mitigating the security risks of generative ai. arXiv preprint arXiv:2308.14840.
  14. Ratgpt: Turning online llms into proxies for malware attacks. arXiv preprint arXiv:2308.09183.
  15. Yonatan Belinkov and Yonatan Bisk. 2018. Synthetic and natural noise both break neural machine translation.
  16. Towards building a robust toxicity predictor. In Proceedings of the 61st Annual Meeting of the Association for Computational Linguistics (Volume 5: Industry Track), pages 581–598.
  17. Rishabh Bhardwaj and Soujanya Poria. 2023. Red-teaming large language models using chain of utterances for safety-alignment. arXiv preprint arXiv:2308.09662.
  18. Pythia: A suite for analyzing large language models across training and scaling. In International Conference on Machine Learning, pages 2397–2430. PMLR.
  19. Evasion attacks against machine learning at test time. In Machine Learning and Knowledge Discovery in Databases: European Conference, ECML PKDD 2013, Prague, Czech Republic, September 23-27, 2013, Proceedings, Part III 13, pages 387–402. Springer.
  20. Machine learning with adversaries: Byzantine tolerant gradient descent. Advances in neural information processing systems, 30.
  21. Stephen W Boyd and Angelos D Keromytis. 2004. Sqlrand: Preventing sql injection attacks. In Applied Cryptography and Network Security: Second International Conference, ACNS 2004, Yellow Mountain, China, June 8-11, 2004. Proceedings 2, pages 292–302. Springer.
  22. Evaluating the susceptibility of pre-trained language models via handcrafted adversarial examples. arXiv preprint arXiv:2209.02128.
  23. Language models are few-shot learners. In Advances in Neural Information Processing Systems, volume 33, pages 1877–1901. Curran Associates, Inc.
  24. Language models are few-shot learners. arXiv preprint arXiv:2005.14165.
  25. Sparks of artificial general intelligence: Early experiments with gpt-4. arXiv preprint arXiv:2303.12712.
  26. Matt Burgess. 2023. Hackingchatgpt. the hacking of chatgpt is just getting started. https://www.wired.com/story/chatgpt-jailbreak-generative-ai-hacking/.
  27. Successful Cap. 2023. How to ”jailbreak” bing and not get banned. https://www.reddit.com/r/bing/comments/11s1ge8/how_to_jailbreak_bing_and_not_get_banned/.
  28. Quantifying memorization across neural language models. arXiv preprint arXiv:2202.07646.
  29. Are aligned neural networks adversarially aligned? arXiv preprint arXiv:2306.15447.
  30. Extracting training data from large language models. In 30th USENIX Security Symposium (USENIX Security 21), pages 2633–2650.
  31. Nicholas Carlini and David Wagner. 2016. Defensive distillation is not robust to adversarial examples. arXiv preprint arXiv:1607.04311.
  32. vic CarperAI. 2023. Stable-vicuna 13b.”. https://huggingface.co/CarperAI/stable-vicuna-13b-delta.
  33. A survey on adversarial attacks and defences. CAAI Transactions on Intelligence Technology, 6(1):25–45.
  34. Harrison Chase. 2022. LangChain.
  35. Harrison Chase. 2023. Langchain. Accessed: 2023-07-17.
  36. Federated large language model: A position paper. arXiv preprint arXiv:2307.08925.
  37. Alpagasus: Training a better alpaca with fewer data. arXiv preprint arXiv:2307.08701.
  38. Evaluating large language models trained on code. arXiv preprint arXiv:2107.03374.
  39. Pin-Yu Chen and Sijia Liu. 2023. Holistic adversarial robustness of deep learning models. In Proceedings of the AAAI Conference on Artificial Intelligence, volume 37, pages 15411–15420.
  40. Distributed statistical machine learning in adversarial settings: Byzantine gradient descent. Proceedings of the ACM on Measurement and Analysis of Computing Systems, 1(2):1–25.
  41. Robust neural machine translation with doubly adversarial inputs. arXiv preprint arXiv:1906.02443.
  42. Vicuna: An open-source chatbot impressing gpt-4 with 90%* chatgpt quality.
  43. Palm: Scaling language modeling with pathways. arXiv preprint arXiv:2204.02311.
  44. Jon Christian. 2023. Amazing ”jailbreak” bypasses chatgpt’s ethics safeguards. https://futurism.com/amazing-jailbreak-chatgpt.
  45. Deep reinforcement learning from human preferences.
  46. Scaling instruction-finetuned language models. arXiv preprint arXiv:2210.11416.
  47. Free dolly: Introducing the world’s first truly open instruction-tuned llm.
  48. Instructblip: Towards general-purpose vision-language models with instruction tuning.
  49. Jailbreaker: Automated jailbreak across multiple large language model chatbots. arXiv preprint arXiv:2307.08715.
  50. Qlora: Efficient finetuning of quantized llms. arXiv preprint arXiv:2305.14314.
  51. Bert: Pre-training of deep bidirectional transformers for language understanding.
  52. BERT: Pre-training of deep bidirectional transformers for language understanding. In Proceedings of the 2019 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies, Volume 1 (Long and Short Papers), pages 4171–4186, Minneapolis, Minnesota. Association for Computational Linguistics.
  53. Taamr: Targeted adversarial attack against multimedia recommender systems. In 2020 50th Annual IEEE/IFIP international conference on dependable systems and networks workshops (DSN-W), pages 1–8. IEEE.
  54. Build it break it fix it for dialogue safety: Robustness from adversarial human attack. arXiv preprint arXiv:1908.06083.
  55. A survey on in-context learning.
  56. Boosting adversarial attacks with momentum. In Proceedings of the IEEE conference on computer vision and pattern recognition, pages 9185–9193.
  57. Palm-e: An embodied multimodal language model. arXiv preprint arXiv:2303.03378.
  58. Improving factuality and reasoning in language models through multiagent debate. arXiv preprint arXiv:2305.14325.
  59. Hotflip: White-box adversarial examples for text classification. arXiv preprint arXiv:1712.06751.
  60. Local model poisoning attacks to {{\{{Byzantine-Robust}}\}} federated learning. In 29th USENIX security symposium (USENIX Security 20), pages 1605–1622.
  61. Colin Fraser. 2023. Master thread of ways i have discovered to get chatgpt to output text that it’s not supposed to, including bigotry, urls and personal information, and more. https://twitter.com/colin_fraser/status/1630763219450212355.
  62. Hao Fu, Yao; Peng and Tushar Khot. 2022. How does gpt obtain its ability? tracing emergent abilities of language models to their sources. Yao Fu’s Notion.
  63. Chain-of-thought hub: A continuous effort to measure large language models’ reasoning performance.
  64. Red teaming language models to reduce harms: Methods, scaling behaviors, and lessons learned. arXiv preprint arXiv:2209.07858.
  65. Black-box generation of adversarial text sequences to evade deep learning classifiers.
  66. The pile: An 800gb dataset of diverse text for language modeling.
  67. Llama-adapter v2: Parameter-efficient visual instruction model. arXiv preprint arXiv:2304.15010.
  68. Realtoxicityprompts: Evaluating neural toxic degeneration in language models. In Proceedings of the 2020 Conference on Empirical Methods in Natural Language Processing: Findings, pages 3356–3369.
  69. Imagebind: One embedding space to bind them all. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pages 15180–15190.
  70. Improving alignment of dialogue agents via targeted human judgements. arXiv preprint arXiv:2209.14375.
  71. Llm censorship: A machine learning challenge or a computer security problem? arXiv preprint arXiv:2307.10719.
  72. Multimodal neurons in artificial neural networks. Distill, 6(3):e30.
  73. Multimodal-gpt: A vision and language model for dialogue with humans.
  74. Explaining and harnessing adversarial examples. arXiv preprint arXiv:1412.6572.
  75. Explaining and harnessing adversarial examples.
  76. Riley Goodside. 2022. Exploiting gpt-3 prompts with malicious inputs that order the model to ignore its previous directions. https://twitter.com/goodside/status/1569128808308957185?ref_src=twsrc%5Etfw%7Ctwcamp%5Etweetembed%7Ctwterm%5E1569128808308957185%7Ctwgr%5Ecf0062097fb334178bbe266cffea98df9088dc9d%7Ctwcon%5Es1_&ref_url=https%3A%2F%2Fsimonwillison.net%2F2022%2FSep%2F12%2Fprompt-injection%2F.
  77. Google-Bard. https://blog.google/technology/ai/google-bard-updates-io-2023/.
  78. A survey of adversarial defences and robustness in nlp.
  79. A survey of adversarial defenses and robustness in nlp. ACM Computing Surveys, 55(14s):1–39.
  80. More than you’ve asked for: A comprehensive analysis of novel prompt injection threats to application-integrated large language models. arXiv preprint arXiv:2302.12173.
  81. Not what you’ve signed up for: Compromising real-world llm-integrated applications with indirect prompt injection.
  82. Kai Greshakeblog. 2023. Indirect prompt injection threats. https://greshake.github.io/.
  83. injection Guide. 2023. Adversarial prompting guide. https://www.promptingguide.ai/risks/adversarial.
  84. Gradient-based adversarial attacks against text transformers. arXiv preprint arXiv:2104.13733.
  85. Cnn-based projected gradient descent for consistent ct image reconstruction. IEEE transactions on medical imaging, 37(6):1440–1453.
  86. Alexey Guzey. 2023. A two sentence jailbreak for gpt-4 and claude and why nobody knows how to fix it. https://guzey.com/ai/two-sentence-universal-jailbreak/.
  87. Marvin von Hagen. 2023. Sydney bing chat. https://twitter.com/marvinvonhagen/status/1623658144349011971.
  88. A classification of sql-injection attacks and countermeasures.
  89. Fedmlsecurity: A benchmark for attacks and defenses in federated learning and llms. arXiv preprint arXiv:2306.04959.
  90. Deberta: Decoding-enhanced bert with disentangled attention.
  91. Tabllm: Few-shot classification of tabular data with large language models.
  92. Llm self defense: By self examination, llms know they are being tricked. arXiv preprint arXiv:2308.07308.
  93. A repository of conversational datasets. In Proceedings of the First Workshop on NLP for Conversational AI, pages 1–10.
  94. Deceiving google’s perspective api built for detecting toxic comments.
  95. Changran Huang. 2021. The intelligent agent nlp-based customer service system. In 2021 2nd International Conference on Artificial Intelligence in Electronics Engineering, pages 41–50.
  96. Are large pre-trained language models leaking your personal information? In Findings of the Association for Computational Linguistics: EMNLP 2022, pages 2038–2047, Abu Dhabi, United Arab Emirates. Association for Computational Linguistics.
  97. Adversarial attacks on neural network policies. arXiv preprint arXiv:1702.02284.
  98. Adversarial examples are not bugs, they are features. Advances in neural information processing systems, 32.
  99. Fooling explanations in text classifiers.
  100. Adversarial example generation with syntactically controlled paraphrase networks. arXiv preprint arXiv:1804.06059.
  101. Alex Jailbreakchat. Jailbreakchat. https://www.jailbreakchat.com/.
  102. Baseline defenses for adversarial attacks against aligned language models. arXiv preprint arXiv:2309.00614.
  103. Robin Jia and Percy Liang. 2017. Adversarial examples for evaluating reading comprehension systems. arXiv preprint arXiv:1707.07328.
  104. Is bert really robust? a strong baseline for natural language attack on text classification and entailment.
  105. Automatically auditing large language models via discrete optimization. arXiv preprint arXiv:2303.04381.
  106. Exploiting programmatic behavior of llms: Dual-use through standard security attacks. arXiv preprint arXiv:2302.05733.
  107. Reluplex: An efficient smt solver for verifying deep neural networks. In Computer Aided Verification: 29th International Conference, CAV 2017, Heidelberg, Germany, July 24-28, 2017, Proceedings, Part I 30, pages 97–117. Springer.
  108. Lerf: Language embedded radiance fields. In International Conference on Computer Vision (ICCV).
  109. Rhmd: Evasion-resilient hardware malware detectors. In Proceedings of the 50th Annual IEEE/ACM international symposium on microarchitecture, pages 315–327.
  110. Large language models are zero-shot reasoners. Advances in neural information processing systems, 35:22199–22213.
  111. Adversarial attacks on tables with entity swap. organization, 9904(7122):71–9.
  112. Adversarial malware binaries: Evading deep learning for malware detection in executables. In 2018 26th European signal processing conference (EUSIPCO), pages 533–537. IEEE.
  113. Pretraining language models with human preferences. In International Conference on Machine Learning, pages 17506–17533. PMLR.
  114. Adversarial examples for natural language classification problems.
  115. Certifying llm safety against adversarial prompting. arXiv preprint arXiv:2309.02705.
  116. Adversarial machine learning at scale. arXiv preprint arXiv:1611.01236.
  117. Akash Kushwaha. 2023. Google bard jailbreak: Prompt to bard jailbreak. https://www.gyaaninfinity.com/google-bard-jailbreak-prompts/.
  118. Gandalf Lakera. 2023. Lakera prompt injection challenge. https://gandalf.lakera.ai/.
  119. Huggingface h4 stack exchange preference dataset.
  120. Albert: A lite bert for self-supervised learning of language representations.
  121. PI LangchainWebinar. 2023. Langchain prompt injection webinar. https://www.youtube.com/watch?v=fP6vRNkNEt0.
  122. Certified robustness to adversarial examples with differential privacy. In 2019 IEEE symposium on security and privacy (SP), pages 656–672. IEEE.
  123. Multi-step jailbreaking privacy attacks on chatgpt. arXiv preprint arXiv:2304.05197.
  124. Textbugger: Generating adversarial text against real-world applications. arXiv preprint arXiv:1812.05271.
  125. TextBugger: Generating adversarial text against real-world applications. In Proceedings 2019 Network and Distributed System Security Symposium. Internet Society.
  126. Starcoder: may the source be with you! arXiv preprint arXiv:2305.06161.
  127. Unified demonstration retriever for in-context learning.
  128. Competition-level code generation with alphacode. Science (New York, NY), 378(6624):1092–1097.
  129. Deep text classification can be fooled. arXiv preprint arXiv:1704.08006.
  130. Holistic evaluation of language models. arXiv preprint arXiv:2211.09110.
  131. Truthfulqa: Measuring how models mimic human falsehoods. arXiv preprint arXiv:2109.07958.
  132. Visual instruction tuning. arXiv preprint arXiv:2304.08485.
  133. Towards trustworthy and aligned machine learning: A data-centric survey with causality perspectives.
  134. Gpteval: Nlg evaluation using gpt-4 with better human alignment. arXiv preprint arXiv:2303.16634.
  135. Prompt injection attack against llm-integrated applications. arXiv preprint arXiv:2306.05499.
  136. Jailbreaking chatgpt via prompt engineering: An empirical study. arXiv preprint arXiv:2305.13860.
  137. Roberta: A robustly optimized bert pretraining approach.
  138. Sensitivity of adversarial perturbation in fast gradient sign method. In 2019 IEEE symposium series on computational intelligence (SSCI), pages 433–436. IEEE.
  139. On the hidden mystery of ocr in large multimodal models. arXiv preprint arXiv:2305.07895.
  140. The flan collection: Designing data and methods for effective instruction tuning. arXiv preprint arXiv:2301.13688.
  141. A pretrainer’s guide to training data: Measuring the effects of data age, domain coverage, quality, and toxicity.
  142. Ilya Loshchilov and Frank Hutter. 2018. Decoupled weight decay regularization. In International Conference on Learning Representations.
  143. Analyzing leakage of personally identifiable information in language models. arXiv preprint arXiv:2302.00539.
  144. Learning word vectors for sentiment analysis. In Proceedings of the 49th annual meeting of the association for computational linguistics: Human language technologies, pages 142–150.
  145. Towards deep learning models resistant to adversarial attacks. arXiv preprint arXiv:1706.06083.
  146. Deepmem: Ml models as storage channels and their (mis-) applications. arXiv preprint arXiv:2307.08811.
  147. Christopher Manning and Hinrich Schutze. 1999. Foundations of statistical natural language processing. MIT press.
  148. A holistic approach to undesired content detection in the real world. In Proceedings of the AAAI Conference on Artificial Intelligence, volume 37, pages 15009–15018.
  149. Julian McAuley and Jure Leskovec. 2013. Hidden factors and hidden topics: understanding rating dimensions with review text. In Proceedings of the 7th ACM conference on Recommender systems, pages 165–172.
  150. Kris McGuffie and Alex Newhouse. 2020. The radicalization risks of gpt-3 and advanced neural language models. arXiv preprint arXiv:2009.06807.
  151. Inverse scaling: When bigger isn’t better. arXiv preprint arXiv:2306.09479.
  152. A survey on bias and fairness in machine learning. ACM computing surveys (CSUR), 54(6):1–35.
  153. Teaching language models to support answers with verified quotes. arXiv preprint arXiv:2203.11147.
  154. Microsoft-Bing. https://blogs.bing.com/search/july-2023/Bing-Chat-Enterprise-announced,-multimodal-Visual-Search-rolling-out-to-Bing-Chat.
  155. An empirical analysis of memorization in fine-tuned autoregressive language models. In Proceedings of the 2022 Conference on Empirical Methods in Natural Language Processing, pages 1816–1826, Abu Dhabi, United Arab Emirates. Association for Computational Linguistics.
  156. Models, C. Model card and evaluations for claude models. https://www-files.anthropic.com/production/images/Model-Card-Claude-2.pdf.
  157. OpenAI ModerationOpenAI. Moderation endpoint openai. https://platform.openai.com/docs/guides/moderation/overview.
  158. NLPTeam MosaicML. 2023. Introducing mpt-7b: A new standard for open-source, commercially usable llms.”. https://www.mosaicml.com/blog/mpt-7b.
  159. Proceedings of the 2014 conference on empirical methods in natural language processing (emnlp). In Proceedings of the 2014 Conference on Empirical Methods in Natural Language Processing (EMNLP).
  160. Zvi Mowshowitz. 2022. Jailbreaking chatgpt on release day. https://thezvi.substack.com/p/jailbreaking-the-chatgpt-on-release.
  161. Use of llms for illicit purposes: Threats, prevention measures, and vulnerabilities. arXiv preprint arXiv:2308.12833.
  162. A robust analysis of adversarial attacks on federated learning environments. Computer Standards & Interfaces, page 103723.
  163. Nvidia NeMo-Guardrails. Nemo guardrails; an open-source toolkit for easily adding programmable guardrails to llm-based conversational systems. https://github.com/NVIDIA/NeMo-Guardrails.
  164. David A Noever and Samantha E Miller Noever. 2021. Reading isn’t believing: Adversarial attacks on multi-modal neurons. arXiv preprint arXiv:2103.10480.
  165. OpenAI. 2023. Gpt-4 technical report. ArXiv, abs/2303.08774.
  166. AI OpenAIApplications. 2023. Openai - explore what’s possible with some example applications. https://platform.openai.com/examples.
  167. moderation OpenChatKit. Openchatkit moderation model. https://github.com/togethercomputer/OpenChatKit.
  168. Training language models to follow instructions with human feedback. Advances in Neural Information Processing Systems, 35:27730–27744.
  169. Privacy risks of general-purpose language models. In 2020 IEEE Symposium on Security and Privacy (SP), pages 1314–1331.
  170. Distillation as a defense to adversarial perturbations against deep neural networks. In 2016 IEEE symposium on security and privacy (SP), pages 582–597. IEEE.
  171. PI Parea. 2023. The prompt engineering platform to experiment with different prompt versions. https://www.parea.ai/.
  172. Generative agents: Interactive simulacra of human behavior. arXiv preprint arXiv:2304.03442.
  173. On the difficulty of training recurrent neural networks. In International conference on machine learning, pages 1310–1318. Pmlr.
  174. From prompt injections to sql injection attacks: How protected is your llm-integrated web application?
  175. The refinedweb dataset for falcon llm: outperforming curated corpora with web data, and web data only. arXiv preprint arXiv:2306.01116.
  176. Instruction tuning with gpt-4. arXiv preprint arXiv:2304.03277.
  177. Red teaming language models with language models. In Proceedings of the 2022 Conference on Empirical Methods in Natural Language Processing, pages 3419–3448.
  178. Fábio Perez and Ian Ribeiro. 2022. Ignore previous prompt: Attack techniques for language models. arXiv preprint arXiv:2211.09527.
  179. Google PerspectiveAPI. Google’s perspective api: Using machine learning to reduce toxicity online. https://www.perspectiveapi.com/.
  180. Google PrinciplesGoogle. Google: Our principles. https://ai.google/responsibility/principles/.
  181. Are chatbots ready for privacy-sensitive applications? an investigation into input regurgitation and prompt-induced sanitization. arXiv preprint arXiv:2305.15008.
  182. buysell PromptBase. 2023. Midjourney, chatgpt, dall·e, stable diffusion and more prompt marketplace. https://promptbase.com/.
  183. Visual adversarial examples jailbreak large language models. arXiv preprint arXiv:2306.13213.
  184. Latent jailbreak: A benchmark for evaluating text safety and output robustness of large language models. arXiv preprint arXiv:2307.08487.
  185. Adversarial attack and defense technologies in natural language processing: A survey. Neurocomputing, 492:278–307.
  186. Learning transferable visual models from natural language supervision. In International conference on machine learning, pages 8748–8763. PMLR.
  187. Improving language understanding by generative pre-training.
  188. Language models are unsupervised multitask learners.
  189. Scaling language models: Methods, analysis & insights from training gopher. arXiv preprint arXiv:2112.11446.
  190. Exploring the limits of transfer learning with a unified text-to-text transformer. The Journal of Machine Learning Research, 21(1):5485–5551.
  191. Tricking llms into disobedience: Understanding, analyzing, and preventing jailbreaks. arXiv preprint arXiv:2305.14965.
  192. Johann Rehberger. 2023. Image to prompt injection with google bard. https://embracethered.com/blog/posts/2023/google-bard-image-to-prompt-injection/.
  193. Nils Reimers and Iryna Gurevych. 2019. Sentence-bert: Sentence embeddings using siamese bert-networks. arXiv preprint arXiv:1908.10084.
  194. Code llama: Open foundation models for code. arXiv preprint arXiv:2308.12950.
  195. Interpretability and transparency-driven detection and transformation of textual adversarial examples (it-dt). arXiv preprint arXiv:2307.01225.
  196. Suranjana Samanta and Sameep Mehta. 2017. Towards crafting text adversarial samples.
  197. Roman Samoilenko a. 2023. New prompt injection attack on chatgpt web version. markdown images can steal your chat data. https://systemweakness.com/new-prompt-injection-attack-on-chatgpt-web-version-ef717492c5c2.
  198. Roman Samoilenko b. 2023. New prompt injection attack on chatgpt web version. reckless copy-pasting may lead to serious privacy issues in your chat. https://kajojify.github.io/articles/1_chatgpt_attack.pdf.
  199. Multitask prompted training enables zero-shot task generalization. arXiv preprint arXiv:2110.08207.
  200. Bloom: A 176b-parameter open-access multilingual language model. arXiv preprint arXiv:2211.05100.
  201. Training language models with language feedback at scale. arXiv preprint arXiv:2303.16755.
  202. Toolformer: Language models can teach themselves to use tools. arXiv preprint arXiv:2302.04761.
  203. Christian Schlarmann and Matthias Hein. 2023. On the adversarial robustness of multi-modal foundation models. arXiv preprint arXiv:2308.10741.
  204. staff Seclify. 2023. Prompt injection cheat sheet: How to manipulate ai language models. https://blog.seclify.com/prompt-injection-cheat-sheet/.
  205. Universal adversarial training. In Proceedings of the AAAI Conference on Artificial Intelligence, volume 34, pages 5636–5643.
  206. Lm-nav: Robotic navigation with large pre-trained models of language, vision, and action. In Conference on Robot Learning, pages 492–504. PMLR.
  207. On second thought, let’s not think step by step! bias and toxicity in zero-shot reasoning. arXiv preprint arXiv:2212.08061.
  208. Role-play with large language models. arXiv preprint arXiv:2305.16367.
  209. Plug and pray: Exploiting off-the-shelf components of multi-modal models. arXiv preprint arXiv:2307.14539.
  210. ” do anything now”: Characterizing and evaluating in-the-wild jailbreak prompts on large language models. arXiv preprint arXiv:2308.03825.
  211. Hugginggpt: Solving ai tasks with chatgpt and its friends in huggingface. arXiv preprint arXiv:2303.17580.
  212. AutoPrompt: Eliciting knowledge from language models with automatically generated prompts. In Empirical Methods in Natural Language Processing (EMNLP).
  213. Autoprompt: Eliciting knowledge from language models with automatically generated prompts. arXiv preprint arXiv:2010.15980.
  214. Towards expert-level medical question answering with large language models. arXiv preprint arXiv:2305.09617.
  215. That doesn’t go there: Attacks on shared state in multi-user augmented reality applications. arXiv preprint arXiv:2308.09146.
  216. Walker Spider. 2022. Dan is my new friend. https://www.reddit.com/r/ChatGPT/comments/zlcyr9/dan_is_my_new_friend/.
  217. Dropout: a simple way to prevent neural networks from overfitting. The journal of machine learning research, 15(1):1929–1958.
  218. Certified defenses for data poisoning attacks. Advances in neural information processing systems, 30.
  219. Learning to summarize with human feedback. Advances in Neural Information Processing Systems, 33:3008–3021.
  220. Pandagpt: One model to instruction-follow them all. arXiv preprint arXiv:2305.16355.
  221. Vipergpt: Visual inference via python execution for reasoning.
  222. latent Sywx. 2022. Reverse prompt engineering for fun and (no) profit. https://www.latent.space/p/reverse-prompt-eng.
  223. Intriguing properties of neural networks. arXiv preprint arXiv:1312.6199.
  224. Understanding the capabilities, limitations, and societal impact of large language models. arXiv preprint arXiv:2102.02503.
  225. Stanford alpaca: An instruction-following llama model. https://github.com/tatsu-lab/stanford_alpaca.
  226. Unifying language learning paradigms. arXiv preprint arXiv:2205.05131.
  227. Transcending scaling laws with 0.1% extra compute. arXiv preprint arXiv:2210.11399.
  228. Bing TermsOfUseBing. Bing conversational experiences and image creator terms. https://www.bing.com/new/termsofuse.
  229. Oguzhan Topsakal and Tahir Cetin Akinci. 2023. Creating large language model applications utilizing langchain: A primer on developing llm apps fast. In International Conference on Applied Engineering and Natural Sciences, volume 1, pages 1050–1056.
  230. Llama: Open and efficient foundation language models. arXiv preprint arXiv:2302.13971.
  231. Llama 2: Open foundation and fine-tuned chat models. arXiv preprint arXiv:2307.09288.
  232. On adaptive attacks to adversarial example defenses. Advances in neural information processing systems, 33:1633–1645.
  233. OpenAI UsagePolicyOpenAI. Usage policies of openai. https://openai.com/policies/usage-policies.
  234. Universal adversarial triggers for attacking and analyzing nlp. arXiv preprint arXiv:1908.07125.
  235. Do NLP models know numbers? probing numeracy in embeddings. In Proceedings of the 2019 Conference on Empirical Methods in Natural Language Processing and the 9th International Joint Conference on Natural Language Processing (EMNLP-IJCNLP), pages 5307–5315, Hong Kong, China. Association for Computational Linguistics.
  236. Ben Wang and Aran Komatsuzaki. 2021. Gpt-j-6b: A 6 billion parameter autoregressive language model.
  237. Safeguarding crowdsourcing surveys from chatgpt with prompt injection. arXiv preprint arXiv:2306.08833.
  238. Adversarial demonstration attacks on large language models.
  239. A survey on large language model based autonomous agents.
  240. Yicheng Wang and Mohit Bansal. 2018. Robust machine comprehension models via adversarial training. arXiv preprint arXiv:1804.06473.
  241. Self-instruct: Aligning language model with self generated instructions. arXiv preprint arXiv:2212.10560.
  242. Jailbroken: How does llm safety training fail? arXiv preprint arXiv:2307.02483.
  243. Emergent abilities of large language models. Transactions on Machine Learning Research. Survey Certification.
  244. Chain-of-thought prompting elicits reasoning in large language models.
  245. Chain-of-thought prompting elicits reasoning in large language models. Advances in Neural Information Processing Systems, 35:24824–24837.
  246. Challenges in detoxifying language models. In Findings of the Association for Computational Linguistics: EMNLP 2021, pages 2447–2469, Punta Cana, Dominican Republic. Association for Computational Linguistics.
  247. Undersensitivity in neural reading comprehension. arXiv preprint arXiv:2003.04808.
  248. Simon Willison. 2022a. Leaking your prompt. https://simonwillison.net/2022/Sep/12/prompt-injection/.
  249. Simon Willison. 2022b. Prompt injection series. https://simonwillison.net/series/prompt-injection/.
  250. Zack Witten. 2022. Thread of known chatgpt jailbreaks. https://twitter.com/zswitten/status/1598380220943593472?lang=en.
  251. Fundamental limitations of alignment in large language models. arXiv preprint arXiv:2304.11082.
  252. Eric Wong and Zico Kolter. 2018. Provable defenses against adversarial examples via the convex outer adversarial polytope. In International conference on machine learning, pages 5286–5295. PMLR.
  253. PI Writesonic. 2023. Writesonic - an ai-powered writing tool. http://writesonic.com/.
  254. Untargeted adversarial attack via expanding the semantic gap. In 2019 IEEE International Conference on Multimedia and Expo (ICME), pages 514–519. IEEE.
  255. Bloomberggpt: A large language model for finance. arXiv preprint arXiv:2303.17564.
  256. Red Wunderwuzzi. 2023. Ai injections: Direct and indirect prompt injections and their implications. https://embracethered.com/blog/posts/2023/ai-injections-direct-and-indirect-prompt-injection-basics/.
  257. Recipes for safety in open-domain chatbots. arXiv preprint arXiv:2010.07079.
  258. Exploring the universal vulnerability of prompt-based learning paradigm. arXiv preprint arXiv:2204.05239.
  259. Multimodal learning with transformers: A survey. IEEE Transactions on Pattern Analysis and Machine Intelligence.
  260. Feature squeezing: Detecting adversarial examples in deep neural networks. arXiv preprint arXiv:1704.01155.
  261. Instruction in the wild: A user-based instruction dataset.
  262. Virtual prompt injection for instruction-tuned large language models. arXiv preprint arXiv:2307.16888.
  263. Harnessing the power of llms in practice: A survey on chatgpt and beyond. arXiv preprint arXiv:2304.13712.
  264. Shadow alignment: The ease of subverting safely-aligned language models.
  265. Xlnet: Generalized autoregressive pretraining for language understanding. In Advances in Neural Information Processing Systems, volume 32. Curran Associates, Inc.
  266. Gpt-4 is too smart to be safe: Stealthy chat with llms via cipher. arXiv preprint arXiv:2308.06463.
  267. Glm-130b: An open bilingual pre-trained model. arXiv preprint arXiv:2210.02414.
  268. Understanding deep learning (still) requires rethinking generalization. Communications of the ACM, 64(3):107–115.
  269. Building cooperative embodied agents modularly with large language models. arXiv preprint arXiv:2307.02485.
  270. Llama-adapter: Efficient fine-tuning of language models with zero-init attention. arXiv preprint arXiv:2303.16199.
  271. Instruction tuning for large language models: A survey. arXiv preprint arXiv:2308.10792.
  272. Text-crs: A generalized certified robustness framework against textual adversarial attacks. arXiv preprint arXiv:2307.16630.
  273. Llavar: Enhanced visual instruction tuning for text-rich image understanding. arXiv preprint arXiv:2306.17107.
  274. Yiming Zhang and Daphne Ippolito. 2023. Prompts should not be seen as secrets: Systematically measuring prompt extraction attack success. arXiv preprint arXiv:2307.06865.
  275. A survey of large language models. arXiv preprint arXiv:2303.18223.
  276. Generating natural adversarial examples.
  277. Judging llm-as-a-judge with mt-bench and chatbot arena. arXiv preprint arXiv:2306.05685.
  278. Random erasing data augmentation. In Proceedings of the AAAI conference on artificial intelligence, volume 34, pages 13001–13008.
  279. Adversarial attacks and defenses in deep learning: From a perspective of cybersecurity. ACM Computing Surveys, 55(8):1–39.
  280. Minigpt-4: Enhancing vision-language understanding with advanced large language models. arXiv preprint arXiv:2304.10592.
  281. Adversarial training for high-stakes reliability. Advances in Neural Information Processing Systems, 35:9274–9286.
  282. St-moe: Designing stable and transferable sparse expert models. arXiv preprint arXiv:2202.08906.
  283. Universal and transferable adversarial attacks on aligned language models. arXiv preprint arXiv:2307.15043.
User Edit Pencil Streamline Icon: https://streamlinehq.com
Authors (6)
  1. Erfan Shayegani (7 papers)
  2. Md Abdullah Al Mamun (5 papers)
  3. Yu Fu (86 papers)
  4. Pedram Zaree (4 papers)
  5. Yue Dong (61 papers)
  6. Nael Abu-Ghazaleh (31 papers)
Citations (104)
Youtube Logo Streamline Icon: https://streamlinehq.com