Papers
Topics
Authors
Recent
Gemini 2.5 Flash
Gemini 2.5 Flash
97 tokens/sec
GPT-4o
53 tokens/sec
Gemini 2.5 Pro Pro
43 tokens/sec
o3 Pro
4 tokens/sec
GPT-4.1 Pro
47 tokens/sec
DeepSeek R1 via Azure Pro
28 tokens/sec
2000 character limit reached

Against The Achilles' Heel: A Survey on Red Teaming for Generative Models (2404.00629v2)

Published 31 Mar 2024 in cs.CL

Abstract: Generative models are rapidly gaining popularity and being integrated into everyday applications, raising concerns over their safe use as various vulnerabilities are exposed. In light of this, the field of red teaming is undergoing fast-paced growth, highlighting the need for a comprehensive survey covering the entire pipeline and addressing emerging topics. Our extensive survey, which examines over 120 papers, introduces a taxonomy of fine-grained attack strategies grounded in the inherent capabilities of LLMs. Additionally, we have developed the "searcher" framework to unify various automatic red teaming approaches. Moreover, our survey covers novel areas including multimodal attacks and defenses, risks around LLM-based agents, overkill of harmless queries, and the balance between harmlessness and helpfulness.

Definition Search Book Streamline Icon: https://streamlinehq.com
References (345)
  1. Computational red teaming: Past, present and future. IEEE Computational Intelligence Magazine, 6(1):30–42, 2011.
  2. Securing large language models: Threats,vulnerabilities and responsible practices. arXiv:2403.12503, 2024.
  3. Physics of language models: Part 3.2, knowledge manipulation. arXiv:2309.14402, 2023.
  4. Detecting language model attacks with perplexity. arxiv:2308.14132v3, 2023. URL http://arxiv.org/abs/2308.14132v3.
  5. Gpt in sheep’s clothing: The risk of customized gpts. arxiv:2401.09075v1, 2024. URL http://arxiv.org/abs/2401.09075v1.
  6. Dices dataset: Diversity in conversational ai evaluation for safety. arxiv:2306.11247v1, 2023. URL http://arxiv.org/abs/2306.11247v1.
  7. A general language assistant as a laboratory for alignment. arXiv:2112.00861, 2021.
  8. Abusing images and sounds for indirect instruction injection in multi-modal llms. arXiv:2307.10490, 2023.
  9. Training a helpful and harmless assistant with reinforcement learning from human feedback, 2022.
  10. How (un)ethical are instruction-centric responses of llms? unveiling the vulnerabilities of safety guardrails to harmful queries. arXiv:2402.15302, 2024.
  11. The reversal curse: Llms trained on ”a is b” fail to learn ”b is a”. arXiv:2309.12288, 2023.
  12. Language model unalignment: Parametric red-teaming to expose hidden harms and biases. arXiv:2310.14303, 2023a.
  13. Red-teaming large language models using chain of utterances for safety-alignment. arXiv:2308.09662, 2023b.
  14. Safety-tuned llamas: Lessons from improving the safety of large language models that follow instructions. arxiv:2309.07875v2, 2023. URL http://arxiv.org/abs/2309.07875v2.
  15. Language models are few-shot learners. arXiv:2005.14165, 2020.
  16. Defending against alignment-breaking attacks via robustly aligned llm. arXiv:2309.14348, 2023a.
  17. Stealthy and persistent unalignment on large language models via backdoor injections. arXiv:2312.00027, 2023b.
  18. Deceptive alignment monitoring. arXiv preprint arXiv:2307.10569, 2023.
  19. HateBERT: Retraining BERT for abusive language detection in English. In Aida Mostafazadeh Davani, Douwe Kiela, Mathias Lambert, Bertie Vidgen, Vinodkumar Prabhakaran, and Zeerak Waseem (eds.), Proceedings of the 5th Workshop on Online Abuse and Harms (WOAH 2021), pp.  17–25, Online, August 2021. Association for Computational Linguistics. doi: 10.18653/v1/2021.woah-1.3. URL https://aclanthology.org/2021.woah-1.3.
  20. Explore, establish, exploit: Red teaming language models from scratch. arXiv:2306.09442, 2023.
  21. Detection and defense against prominent attacks on preconditioned llm-integrated virtual assistants. arXiv:2401.00994, 2024.
  22. Play guessing game with llm: Indirect jailbreak attack with implicit clues. arXiv:2402.09091, 2024.
  23. Jailbreaking black box large language models in twenty queries. arxiv:2310.08419v2, 2023. URL http://arxiv.org/abs/2310.08419v2.
  24. Jailbreaker in jail: Moving target defense for large language models. arxiv:2310.02417v1, 2023a. URL http://arxiv.org/abs/2310.02417v1.
  25. Evaluating large language models trained on code. arXiv:2107.03374, 2021.
  26. Struq: Defending against prompt injection with structured queries. arXiv:2402.06363, 2024.
  27. The janus interface: How fine-tuning in large language models amplifies the privacy risks. arxiv:2310.15469v1, 2023b. URL http://arxiv.org/abs/2310.15469v1.
  28. Leveraging the context through multi-round interactions for jailbreaking attacks. arXiv:2402.09177, 2024.
  29. Combating adversarial attacks with multi-agent debate. arxiv:2401.05998v1, 2024. URL http://arxiv.org/abs/2401.05998v1.
  30. Prompting4debugging: Red-teaming text-to-image diffusion models by finding problematic prompts. arXiv:2309.06135, 2023.
  31. Breaking down the defenses: A comparative survey of attacks on large language models. arXiv:2403.04786, 2024.
  32. Comprehensive assessment of jailbreak attacks against llms. arXiv:2402.05668, 2024a.
  33. Conversation reconstruction attack against gpt models. arXiv:2402.02987, 2024b.
  34. Free dolly: Introducing the world’s first truly open instruction-tuned llm. https://www.databricks.com, 2023.
  35. Added toxicity mitigation at inference time for multimodal and massively multilingual translation. arXiv:2311.06532, 2023.
  36. Risk taxonomy, mitigation, and assessment benchmarks of large language model systems. arxiv:2401.05778v1, 2024. URL http://arxiv.org/abs/2401.05778v1.
  37. On the robustness of large multimodal models against image adversarial attacks. arxiv:2312.03777v2, 2023. URL http://arxiv.org/abs/2312.03777v2.
  38. Safe rlhf: Safe reinforcement learning from human feedback. arxiv:2310.12773v1, 2023. URL http://arxiv.org/abs/2310.12773v1.
  39. Security and privacy challenges of large language models: A survey. arXiv:2402.00888, 2024.
  40. Attack prompt generation for red teaming and defending large language models. In Houda Bouamor, Juan Pino, and Kalika Bali (eds.), Findings of the Association for Computational Linguistics: EMNLP 2023, pp.  2176–2189, Singapore, December 2023a. Association for Computational Linguistics. doi: 10.18653/v1/2023.findings-emnlp.143. URL https://aclanthology.org/2023.findings-emnlp.143.
  41. Masterkey: Automated jailbreak across multiple large language model chatbots. arXiv:2307.08715, 2023b.
  42. Pandora: Jailbreak gpts by retrieval augmented generation poisoning. arXiv:2402.08416, 2024.
  43. Towards safer generative language models: A survey on safety risks, evaluations, and improvements. arxiv:2302.09270v3, 2023c. URL http://arxiv.org/abs/2302.09270v3.
  44. Multilingual jailbreak challenges in large language models. arXiv:2310.06474, 2023d.
  45. Assessing language model deployment with risk cards. arXiv:2303.18190, 2023. URL https://api.semanticscholar.org/CorpusID:257900638.
  46. Beyond the safeguards: Exploring the security risks of chatgpt. arxiv:2305.08005v1, 2023. URL http://arxiv.org/abs/2305.08005v1.
  47. A security risk taxonomy for large language models. arxiv:2311.11415v1, 2023. URL http://arxiv.org/abs/2311.11415v1.
  48. A wolf in sheep’s clothing: Generalized nested jailbreak prompts can fool large language models easily. arxiv:2311.08268v1, 2023. URL http://arxiv.org/abs/2311.08268v1.
  49. Building guardrails for large language models. arXiv preprint arXiv:2402.01822, 2024a.
  50. How robust is google’s bard to adversarial image attacks? arXiv:2309.11751, 2023.
  51. Attacks, defenses and evaluations for llm conversation safety: A survey. arXiv:2402.09283, 2024b.
  52. Analyzing the inherent response tendency of llms: Real-world instructions-driven jailbreak. arxiv:2312.04127v1, 2023. URL http://arxiv.org/abs/2312.04127v1.
  53. Alpacafarm: A simulation framework for methods that learn from human feedback. arXiv:2305.14387, 2023.
  54. A comprehensive survey of attack techniques, implementation, and mitigation strategies in large language models. arxiv:2312.10982v1, 2023. URL http://arxiv.org/abs/2312.10982v1.
  55. Red-teaming for generative ai: Silver bullet or security theater? arXiv:2401.15897, 2024.
  56. Efficient black-box adversarial attacks on neural text detectors. arXiv:2311.01873, 2023.
  57. Stanislav Fort. Scaling laws for adversarial attacks on language model activations. arxiv:2312.02780v1, 2023. URL http://arxiv.org/abs/2312.02780v1.
  58. Improving adversarial transferability of visual-language pre-training models through collaborative multimodal interaction. arXiv:2403.10883, 2024.
  59. Practical membership inference attacks against fine-tuned large language models via self-prompt calibration. arXiv:2311.06062, 2023a.
  60. Safety alignment in nlp tasks: Weakly aligned summarization as an in-context attack. arXiv:2312.06924, 2023b.
  61. Badllama: cheaply removing safety fine-tuning from llama 2-chat 13b. arXiv:2311.00117, 2023.
  62. Red teaming language models to reduce harms: Methods, scaling behaviors, and lessons learned. arXiv preprint arXiv:2209.07858, 2022.
  63. Mart: Improving llm safety with multi-round automatic red-teaming. arXiv:2311.07689, 2023.
  64. Coercing llms to do and reveal (almost) anything. arXiv:2402.14020, 2024.
  65. Figstep: Jailbreaking large vision-language models via typographic visual prompts. arXiv:2311.05608, 2023.
  66. Eyes closed,safety on: Protecting multimodal llms via image-to-text transformation. arXiv:2403.09572, 2024.
  67. Mapping the moral domain. Journal of personality and social psychology, 101(2):366, 2011.
  68. Ai control: Improving safety despite intentional subversion. arXiv:2312.06942, 2023.
  69. Not what you’ve signed up for: Compromising real-world llm-integrated applications with indirect prompt injection. arXiv:2302.12173, 2023.
  70. Agent smith: A single image can jailbreak one million multimodal LLM agents exponentially fast. CoRR, abs/2402.08567, 2024. doi: 10.48550/ARXIV.2402.08567. URL https://doi.org/10.48550/arXiv.2402.08567.
  71. Cold-attack: Jailbreaking llms with stealthiness and controllability. arXiv:2402.08679, 2024.
  72. Towards safe and aligned large language models for medicine. arXiv:2403.03744, 2024.
  73. Jailbreaking proprietary large language models using word substitution cipher. arXiv:2402.10601, 2024.
  74. Pruning for protection: Increasing jailbreak resistance in aligned llms without fine-tuning. arxiv:2401.10862v1, 2024a. URL http://arxiv.org/abs/2401.10862v1.
  75. Pruning for protection: Increasing jailbreak resistance in aligned llms without fine-tuning. arXiv:2401.10862, 2024b.
  76. Sa-attack: Improving adversarial transferability of vision-language pre-training models via self-augmentation. CoRR, abs/2312.04913, 2023a. doi: 10.48550/ARXIV.2312.04913. URL https://doi.org/10.48550/arXiv.2312.04913.
  77. Control risk for potential misuse of artificial intelligence in science. arxiv:2312.06632v1, 2023b. URL http://arxiv.org/abs/2312.06632v1.
  78. Measuring massive multitask language understanding. In 9th International Conference on Learning Representations, ICLR 2021, Virtual Event, Austria, May 3-7, 2021. OpenReview.net, 2021. URL https://openreview.net/forum?id=d7KBjmI3GmQ.
  79. An empirical study of metrics to measure representational harms in pre-trained language models. arXiv:2301.09211, 2023.
  80. Gradient cuff: Detecting jailbreak attacks on large language models by exploring refusal loss landscapes. arXiv:2403.00867, 2024.
  81. Token-level adversarial prompt detection based on perplexity measures and contextual information. arXiv:2311.11509, 2023.
  82. Trustagent: Towards safe and trustworthy llm-based agents through agent constitution. arXiv:2402.01586, 2024.
  83. Flames: Benchmarking value alignment of chinese large language models. arXiv:2311.06899, 2023a.
  84. Catastrophic jailbreak of open-source llms via exploiting generation. arxiv:2310.06987v1, 2023b. URL http://arxiv.org/abs/2310.06987v1.
  85. Robustness tests for automatic machine translation metrics with adversarial attacks. arXiv:2311.00508, 2023.
  86. Sleeper agents: Training deceptive llms that persist through safety training. arxiv:2401.05566v3, 2024. URL http://arxiv.org/abs/2401.05566v3.
  87. Walking a tightrope – evaluating large language models in high-risk domains. arxiv:2311.14966v1, 2023. URL http://arxiv.org/abs/2311.14966v1.
  88. Llama guard: Llm-based input-output safeguard for human-ai conversations. arXiv:2312.06674v1, 2023. URL http://arxiv.org/abs/2312.06674v1.
  89. Summon a demon and bind it: A grounded theory of llm red teaming in the wild. arxiv:2311.06237v2, 2023. URL http://arxiv.org/abs/2311.06237v2.
  90. Baseline defenses for adversarial attacks against aligned language models. arxiv:2309.00614v2, 2023. URL http://arxiv.org/abs/2309.00614v2.
  91. Joonhyun Jeong. Hijacking context in large multi-modal models. arXiv:2312.07553, 2023.
  92. Defending large language models against jailbreak attacks via semantic smoothing. arXiv:2402.16192, 2024.
  93. Beavertails: Towards improved safety alignment of llm via a human-preference dataset. arXiv:2307.04657, 2023.
  94. Identifying and mitigating vulnerabilities in llm-integrated applications. arXiv:2311.16153, 2023a.
  95. Artprompt: Ascii art-based jailbreak attacks against aligned llms. arXiv:2402.11753, 2024.
  96. Llmlingua: Compressing prompts for accelerated inference of large language models. arXiv:2310.05736, 2023b.
  97. Forcing generative models to degenerate ones: The power of data poisoning attacks. arxiv:2312.04748v1, 2023c. URL http://arxiv.org/abs/2312.04748v1.
  98. Prompt packer: Deceiving llms through compositional instruction with hidden attacks. arxiv:2310.10077v1, 2023d. URL http://arxiv.org/abs/2310.10077v1.
  99. Guard: Role-playing to generate natural-language jailbreakings to test guideline adherence of large language models. arXiv:2402.03299, 2024.
  100. Backdoor attacks for in-context learning with language models. arXiv:2307.14692, 2023.
  101. Exploiting programmatic behavior of llms: Dual-use through standard security attacks. arXiv:2302.05733, 2023.
  102. Learn what not to learn: Towards generative safety in chatbots. arXiv:2304.11220, 2023.
  103. Edward Kim. Nevermind: Instruction override and moderation in large language models. arXiv:2402.03303, 2024.
  104. Break the breakout: Reinventing lm defense against jailbreak attacks with self-refinement. arXiv:2402.15180, 2024.
  105. Robust safety classifier for large language models: Adversarial prompt shield. arXiv:2311.00172, 2023.
  106. Evaluating language-model agents on realistic autonomous tasks. arXiv:2312.11671, 2024.
  107. Researching alignment research: Unsupervised analysis. arXiv:2206.02841, 2022.
  108. Examining gender and race bias in two hundred sentiment analysis systems. arXiv preprint arXiv:1805.04508, 2018.
  109. Personalisation within bounds: A risk taxonomy and policy framework for the alignment of large language models with personalised feedback. arXiv:2303.05453, 2023.
  110. Understanding the effects of rlhf on llm generalisation and diversity. arXiv:2310.06452, 2024.
  111. Overcoming catastrophic forgetting in neural networks. Proceedings of the National Academy of Sciences, 114(13):3521–3526, March 2017. ISSN 1091-6490. doi: 10.1073/pnas.1611835114. URL http://dx.doi.org/10.1073/pnas.1611835114.
  112. Can llms recognize toxicity? structured toxicity investigation framework and semantic-based metric. arXiv:2402.06900, 2024.
  113. Certifying llm safety against adversarial prompting. arxiv:2309.02705v2, 2023. URL http://arxiv.org/abs/2309.02705v2.
  114. The ethics of interaction: Mitigating security threats in llms. arXiv:2401.12273, 2024.
  115. The history and risks of reinforcement learning and human feedback. arxiv:2310.13595v2, 2023. URL http://arxiv.org/abs/2310.13595v2.
  116. Open sesame! universal black box jailbreaking of large language models. arXiv:2309.01446, 2023.
  117. A mechanistic understanding of alignment algorithms: A case study on dpo and toxicity. arXiv:2401.01967, 2024.
  118. Benjamin Lemkin. Using hallucinations to bypass gpt4’s filter. arXiv:2403.04769, 2024.
  119. Lora fine-tuning efficiently undoes safety training in llama 2-chat 70b. arxiv:2310.20624v1, 2023. URL http://arxiv.org/abs/2310.20624v1.
  120. Privacy in large language models: Attacks, defenses and future directions. arxiv:2310.10383v1, 2023a. URL http://arxiv.org/abs/2310.10383v1.
  121. Multi-step jailbreaking privacy attacks on chatgpt. arxiv:2304.05197v3, 2023b. URL http://arxiv.org/abs/2304.05197v3.
  122. A cross-language investigation into jailbreak attacks in large language models. arXiv:2401.16765, 2024a.
  123. Salad-bench: A hierarchical and comprehensive safety benchmark for large language models. arXiv:2402.05044, 2024b.
  124. Red teaming visual language models. CoRR, abs/2401.12915, 2024c. doi: 10.48550/ARXIV.2401.12915. URL https://doi.org/10.48550/arXiv.2401.12915.
  125. Open the pandora’s box of llms: Jailbreaking llms through representation engineering. arxiv:2401.06824v1, 2024d. URL http://arxiv.org/abs/2401.06824v1.
  126. Semantic mirror jailbreak: Genetic algorithm based jailbreak prompts against open-source llms. arXiv:2402.14872, 2024e.
  127. Deepinception: Hypnotize large language model to be jailbreaker. arXiv:2311.03191, 2023c.
  128. Badedit: Backdooring large language models by model editing. arXiv:2403.13355, 2024f.
  129. Images are achilles’ heel of alignment: Exploiting visual vulnerabilities for jailbreaking multimodal large language models. arXiv:2403.09792, 2024g.
  130. Personal llm agents: Insights and survey about the capability,efficiency and security. arXiv:2401.05459, 2024h.
  131. Truthfulqa: Measuring how models mimic human falsehoods. In Smaranda Muresan, Preslav Nakov, and Aline Villavicencio (eds.), Proceedings of the 60th Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers), ACL 2022, Dublin, Ireland, May 22-27, 2022, pp.  3214–3252. Association for Computational Linguistics, 2022. doi: 10.18653/v1/2022.acl-long.229. URL https://doi.org/10.18653/v1/2022.acl-long.229.
  132. Mitigating the alignment tax of rlhf, 2024.
  133. Toxicchat: Unveiling hidden challenges of toxicity detection in real-world user-ai conversation. arXiv:2310.17389, 2023.
  134. Goal-oriented prompt attack and safety evaluation for llms. arxiv:2309.11830v2, 2023a. URL http://arxiv.org/abs/2309.11830v2.
  135. Visual instruction tuning. arXiv:2304.08485, 2023b.
  136. Lora-as-an-attack! piercing llm safety under the share-and-play scenario. arXiv:2403.00108, 2024a.
  137. Instruct2attack: Language-guided semantic adversarial attacks. arXiv:2311.15551, 2023c.
  138. Making them ask and answer: Jailbreaking large language models in few queries via disguise and reconstruction. arXiv:2402.18104, 2024b.
  139. Autodan: Generating stealthy jailbreak prompts on aligned large language models. arXiv:2310.04451, 2023d.
  140. Mm-safetybench: A benchmark for safety evaluation of multimodal large language models. arXiv:2311.17600, 2023e.
  141. Query-relevant images jailbreak large multi-modal models. CoRR, abs/2311.17600, 2023f. doi: 10.48550/ARXIV.2311.17600. URL https://doi.org/10.48550/arXiv.2311.17600.
  142. Prompt injection attack against llm-integrated applications. arxiv:2306.05499v1, 2023g. URL http://arxiv.org/abs/2306.05499v1.
  143. Jailbreaking chatgpt via prompt engineering: An empirical study. arxiv:2305.13860v1, 2023h. URL http://arxiv.org/abs/2305.13860v1.
  144. Groot: Adversarial testing for generative text-to-image models with tree-based semantic transformation. arXiv preprint arXiv:2402.12100, 2024c.
  145. Roberta: A robustly optimized bert pretraining approach. arXiv:1907.11692, 2019.
  146. Prompt injection attacks and defenses in llm-integrated applications. arXiv:2310.12815, 2023i.
  147. A safe harbor for ai evaluation and red teaming. arXiv:2403.04893, 2024.
  148. Test-time backdoor attacks on multimodal large language models. arXiv:2402.08577, 2024.
  149. Ensuring safe and high-quality outputs: A guideline library approach for language models. arXiv:2403.11838, 2024.
  150. An empirical study of catastrophic forgetting in large language models during continual fine-tuning. arXiv:2308.08747, 2023.
  151. Red teaming game: A game-theoretic framework for red teaming language models. arXiv:2310.00322, 2023.
  152. Prp: Propagating universal perturbations to attack large language model guard-rails. arXiv:2402.15911, 2024.
  153. Black box adversarial prompting for foundation models. In The Second Workshop on New Frontiers in Adversarial Machine Learning, 2023.
  154. Harmbench: A standardized evaluation framework for automated red teaming and robust refusal. arXiv:2402.04249, 2024.
  155. Embers of autoregression: Understanding large language models through the problem they are trained to solve. arXiv:2309.13638, 2023.
  156. Using in-context learning to improve dialogue safety. arxiv:2302.00871v3, 2023. URL http://arxiv.org/abs/2302.00871v3.
  157. Flirt: Feedback loop in-context red teaming. arXiv:2308.04265, 2023a.
  158. Jab: Joint adversarial prompting and belief augmentation. arXiv:2311.09473, 2023b.
  159. Tree of attacks: Jailbreaking black-box llms automatically. arxiv:2312.02119v1, 2023. URL http://arxiv.org/abs/2312.02119v1.
  160. Meta. Llama 2 - acceptable use policy. https://ai.meta.com/llama/use-policy/, 2023a.
  161. Meta. Responsible use guide: your resource for building responsibly, 2023b. URL https://llama.meta.com/responsible-use-guide/.
  162. Silo language models: Isolating legal risk in a nonparametric datastore. arXiv:2308.04430, 2023. URL https://api.semanticscholar.org/CorpusID:260704206.
  163. A trembling house of cards? mapping adversarial attacks against language agents. arXiv:2402.10196, 2024a.
  164. Test-time backdoor mitigation for black-box large language models with defensive demonstrations. arXiv:2311.09763, 2023.
  165. Studious bob fight back against jailbreaking via prompt adversarial tuning. arXiv:2402.06255, 2024b.
  166. Adversarial text purification: A large language model approach for defense. arXiv:2402.06655, 2024.
  167. Use of llms for illicit purposes: Threats,prevention measures,and vulnerabilities. arXiv:2308.12833, 2023.
  168. Can llms follow simple rules? arxiv:2311.04235v1, 2023. URL http://arxiv.org/abs/2311.04235v1.
  169. Testing language model agents safely in the wild. arxiv:2311.10538v3, 2023. URL http://arxiv.org/abs/2311.10538v3.
  170. Poisoned chatgpt finds work for idle hands: Exploring developers’ coding practices with insecure suggestions from poisoned ai models. arXiv:2312.06227, 2023.
  171. OpenAI. Openai usage policies. https://openai.com/policies/usage-policies/, 2024.
  172. Training language models to follow instructions with human feedback. arXiv:2203.02155, 2022.
  173. Attacking llm watermarks by exploiting their strengths. arXiv:2402.16187, 2024.
  174. Mapping llm security landscapes: A comprehensive stakeholder risk assessment proposal. arXiv:2403.13309, 2024.
  175. Ai deception: A survey of examples, risks, and potential solutions. arxiv:2308.14752v1, 2023. URL http://arxiv.org/abs/2308.14752v1.
  176. Neural exec: Learning (and learning from) execution triggers for prompt injection attacks. arXiv:2403.03792, 2024.
  177. Can sensitive information be deleted from llms? objectives for defending against extraction attacks. arxiv:2309.17410v1, 2023. URL http://arxiv.org/abs/2309.17410v1.
  178. From prompt injections to sql injection attacks: How protected is your llm-integrated web application? arxiv:2308.01990v3, 2023. URL http://arxiv.org/abs/2308.01990v3.
  179. Exploiting novel gpt-4 apis. arXiv:2312.14302, 2023.
  180. Gpt-3.5 turbo fine-tuning and api updates, 2023. URL https://openai.com/blog/gpt-3-5-turbo-fine-tuning-and-api-updates.
  181. Red teaming language models with language models. arXiv:2202.03286, 2022.
  182. Llm self defense: By self examination,llms know they are being tricked. arXiv:2308.07308, 2023.
  183. Mllm-protector: Ensuring mllm’s safety without hurting performance. arXiv:2401.02906, 2024.
  184. Jatmo: Prompt injection defense by task-specific finetuning. arXiv:2312.17673, 2023.
  185. Bergeron: Combating adversarial attacks through a conscience-based alignment framework. arXiv:2312.00029, 2023.
  186. Automatic prompt optimization with” gradient descent” and beam search. arXiv preprint arXiv:2305.03495, 2023.
  187. Visual adversarial examples jailbreak aligned large language models. arxiv:2306.13213v2, 2023a. URL http://arxiv.org/abs/2306.13213v2.
  188. Fine-tuning aligned language models compromises safety, even when users do not intend to! arXiv:2310.03693, 2023b.
  189. Hijacking large language models via adversarial in-context learning. arXiv:2311.09948, 2023.
  190. Learning to poison large language models during instruction tuning. arXiv:2402.13459, 2024.
  191. Latent jailbreak: A benchmark for evaluating text safety and output robustness of large language models. arxiv:2307.08487v3, 2023. URL http://arxiv.org/abs/2307.08487v3.
  192. Vision-llms can fool themselves with self-generated typographic attacks. arXiv:2402.00626, 2024.
  193. Aart: Ai-assisted red-teaming with diverse data generation for new llm-powered applications. arXiv:2311.08592, 2023.
  194. Guardian: A multi-tiered defense architecture for thwarting prompt injection attacks on llms. Journal of Software Engineering and Applications, 17(1):43–68, 2024.
  195. Is llm-as-a-judge robust? investigating universal adversarial attacks on zero-shot llm assessment. arXiv:2402.14016, 2024.
  196. Universal jailbreak backdoors from poisoned human feedback. arXiv:2311.14455, 2023.
  197. Tricking llms into disobedience: Understanding, analyzing, and preventing jailbreaks. arxiv:2305.14965v1, 2023. URL http://arxiv.org/abs/2305.14965v1.
  198. Nemo guardrails: A toolkit for controllable and safe llm applications with programmable rails. arxiv:2310.10501v1, 2023. URL http://arxiv.org/abs/2310.10501v1.
  199. Adversarial attacks and defenses in deep learning. Engineering, 6(3):346–360, 2020.
  200. Smoothllm: Defending large language models against jailbreaking attacks. arXiv:2310.03684, 2023.
  201. Towards red teaming in multimodal and multilingual translation. arXiv:2401.16247, 2024.
  202. Immunization against harmful fine-tuning attacks. arXiv:2402.16382, 2024.
  203. An early categorization of prompt injection attacks on large language models. arXiv:2402.00898, 2024.
  204. Generating phishing attacks using chatgpt. arXiv:2305.05133, 2023a.
  205. From chatbots to phishbots? – preventing phishing scams created using chatgpt,google bard and claude. arXiv:2310.19181, 2023b.
  206. Identifying the risks of lm agents with an lm-emulated sandbox. arXiv:2309.15817, 2023.
  207. Learning to retrieve prompts for in-context learning. In Marine Carpuat, Marie-Catherine de Marneffe, and Ivan Vladimir Meza Ruiz (eds.), Proceedings of the 2022 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies, pp.  2655–2671, Seattle, United States, July 2022. Association for Computational Linguistics. doi: 10.18653/v1/2022.naacl-main.191. URL https://aclanthology.org/2022.naacl-main.191.
  208. Xstest: A test suite for identifying exaggerated safety behaviours in large language models. arXiv:2308.01263, 2023.
  209. Fast adversarial attacks on language models in one gpu minute. arXiv:2402.15570, 2024.
  210. Maatphor: Automated variant analysis for prompt injection attacks. arXiv:2312.11513, 2023.
  211. The butterfly effect of altering prompts: How small changes and jailbreaks affect large language model performance. arXiv:2401.03729, 2024.
  212. Distilbert, a distilled version of bert: smaller, faster, cheaper and lighter. arXiv:1910.01108, 2020.
  213. On the adversarial robustness of multi-modal foundation models. arXiv:2308.10741, 2023.
  214. Ignore this title and hackaprompt: Exposing systemic vulnerabilities of llms through a global prompt hacking competition. In Empirical Methods in Natural Language Processing, 2023.
  215. Adversarial attacks and defenses in large language models: Old and new threats. arxiv:2310.19737v1, 2023. URL http://arxiv.org/abs/2310.19737v1.
  216. Prompt stealing attacks against large language models. arXiv:2402.12959, 2024.
  217. Scalable and transferable black-box jailbreaks for language models via persona modulation. arxiv:2311.03348v2, 2023. URL http://arxiv.org/abs/2311.03348v2.
  218. Asymmetric bias in text-to-image generation with adversarial attacks. arxiv:2312.14440v1, 2023. URL http://arxiv.org/abs/2312.14440v1.
  219. On second thought, let’s not think step by step! bias and toxicity in zero-shot reasoning. arXiv:2212.08061, 2023.
  220. Spml: A dsl for defending language models against prompt attacks. arXiv:2402.11755, 2024.
  221. Jailbreak in pieces: Compositional adversarial attacks on multi-modal language models. In The Twelfth International Conference on Learning Representations, 2023a.
  222. Survey of vulnerabilities in large language models revealed by adversarial attacks. arxiv:2310.10844v1, 2023b. URL http://arxiv.org/abs/2310.10844v1.
  223. Rapid optimization for jailbreaking llms via subconscious exploitation and echopraxia. arXiv:2402.05467, 2024a.
  224. The language barrier: Dissecting safety challenges of llms in multilingual contexts. arxiv:2401.13136v1, 2024b. URL http://arxiv.org/abs/2401.13136v1.
  225. ”do anything now”: Characterizing and evaluating in-the-wild jailbreak prompts on large language models. arXiv:2308.03825, 2023.
  226. Punctuation matters! stealthy backdoor attack for language models. arXiv:2312.15867, 2023.
  227. Navigating the overkill in large language models. arXiv:2401.17633, 2024.
  228. Safer-instruct: Aligning language models with automated preference data. arXiv:2311.08685, 2023.
  229. Attackeval: How to evaluate the effectiveness of jailbreak attacking on large language models. arXiv:2401.09002, 2024.
  230. Exploiting large language models (llms) through deception techniques and persuasion principles. arXiv:2311.14876, 2023.
  231. Pal: Proxy-guided black-box attack on large language models. arXiv:2402.09674, 2024.
  232. A strongreject for empty jailbreaks. arXiv:2402.10260, 2024.
  233. No offense taken: Eliciting offensiveness from language models. arXiv:2310.00892, 2023.
  234. Varshini Subhash. Can large language models change user preference adversarially? arXiv:2302.10291, 2023.
  235. Why do universal adversarial attacks work on large language models?: Geometry might be the answer. arXiv:2309.00254, 2023.
  236. Safety assessment of chinese large language models. arxiv:2304.10436v1, 2023. URL http://arxiv.org/abs/2304.10436v1.
  237. Trustllm: Trustworthiness in large language models. arXiv:2401.05561, 2024.
  238. Scaling behavior of machine translation with large language models under prompt injection attacks. arXiv:2403.09832, 2024.
  239. Xuchen Suo. Signed-prompt: A new approach to prevent prompt injection attacks against llm-integrated applications. arXiv:2401.07612, 2024.
  240. Kazuhiro Takemoto. All in how you ask for it: Simple black-box method for jailbreak attacks. arxiv:2401.09798v2, 2024a. URL http://arxiv.org/abs/2401.09798v2.
  241. Kazuhiro Takemoto. All in how you ask for it: Simple black-box method for jailbreak attacks, 2024b.
  242. Prioritizing safeguarding over autonomy: Risks of llm agents for science. arXiv:2402.04247, 2024.
  243. Stanford Alpaca: An instruction-following LLaMA model. https://github.com/tatsu-lab/stanford_alpaca, 2023.
  244. Soft-prompt tuning for large language models to evaluate bias. arXiv:2306.04735, 2023a.
  245. Evil geniuses: Delving into the safety of llm-based agents. arxiv:2311.11855v1, 2023b. URL http://arxiv.org/abs/2311.11855v1.
  246. The promise and peril of artificial intelligence – violet teaming offers a balanced path forward. arXiv:2308.14253, 2023.
  247. Tensor trust: Interpretable prompt injection attacks from an online game. arxiv:2311.01011v1, 2023. URL http://arxiv.org/abs/2311.01011v1.
  248. Who is chatgpt? benchmarking llms’ psychological portrayal using psychobench. arXiv:2310.01386, 2023.
  249. How many unicorns are in this image? a safety evaluation benchmark for vision llms. arxiv:2311.16101v1, 2023. URL http://arxiv.org/abs/2311.16101v1.
  250. The art of defending: A systematic evaluation and analysis of llm defense strategies on safety and over-defensiveness. arXiv:2401.00287, 2023.
  251. Bypassing the safety training of open-source llms with priming attacks. arXiv:2312.12321, 2023.
  252. Veil-Framework. Veil: A payload generation framework. GitHub repository, 2017. URL https://github.com/Veil-Framework/Veil. Available online: https://github.com/Veil-Framework/Veil (accessed on March 16, 2024).
  253. Simplesafetytests: a test suite for identifying critical safety risks in large language models. arXiv:2311.08370, 2023.
  254. Study on the scenario-based application of chatgpt and its risk avoidance strategies from the perspective of information literacy. 2023. URL https://api.semanticscholar.org/CorpusID:262924427.
  255. Beyond boundaries: A comprehensive survey of transferable attacks on ai systems. arXiv:2311.11796, 2023a.
  256. From noise to clarity: Unraveling the adversarial suffix of large language model attacks via translation of text embeddings. arXiv:2402.16006, 2024a.
  257. Backdoor activation attack: Attack large language models using activation steering for safety-alignment. arxiv:2311.09433v2, 2023. URL http://arxiv.org/abs/2311.09433v2.
  258. On the exploitability of reinforcement learning with human feedback for large language models. arXiv:2311.09641, 2023b.
  259. Mitigating fine-tuning jailbreak attack with backdoor enhanced alignment. arXiv:2402.14968, 2024b.
  260. All languages matter: On the multilingual safety of large language models. arxiv:2310.00905v1, 2023c. URL http://arxiv.org/abs/2310.00905v1.
  261. Instructta: Instruction-tuned targeted attack for large vision-language models. arXiv preprint arXiv:2312.01886, 2023d.
  262. Dala: A distribution-aware lora-based adversarial attack against language models. arXiv:2311.08598, 2023e.
  263. Defending llms against jailbreaking attacks via backtranslation. arXiv:2402.16459, 2024c.
  264. Fake alignment: Are llms really aligned well? arXiv:2311.05915, 2023f.
  265. Adashield: Safeguarding multimodal large language models from structure-based attack via adaptive shield prompting. arXiv:2403.09513, 2024d.
  266. Do-not-answer: A dataset for evaluating safeguards in llms. arXiv:2308.13387, 2023g.
  267. Self-guard: Empower the llm to safeguard itself. arXiv:2310.15851, 2023h.
  268. Foot in the door: Understanding large language model jailbreaking via cognitive psychology. arXiv:2402.15690, 2024e.
  269. Jailbroken: How does llm safety training fail? arXiv:2307.02483, 2023a.
  270. Assessing the brittleness of safety alignment via pruning and low-rank modifications, 2024.
  271. Chain-of-thought prompting elicits reasoning in large language models. In S. Koyejo, S. Mohamed, A. Agarwal, D. Belgrave, K. Cho, and A. Oh (eds.), Advances in Neural Information Processing Systems, volume 35, pp.  24824–24837. Curran Associates, Inc., 2022. URL https://proceedings.neurips.cc/paper_files/paper/2022/file/9d5609613524ecf4f15af0f7b31abca4-Paper-Conference.pdf.
  272. Jailbreak and guard aligned language models with only few in-context demonstrations. arxiv:2310.06387v1, 2023b. URL http://arxiv.org/abs/2310.06387v1.
  273. Ethical and social risks of harm from language models. arXiv:2112.04359, 2021.
  274. Unveiling the implicit toxicity in large language models. arXiv:2311.17391, 2023.
  275. Gradient-based language model red teaming. arXiv:2401.16656, 2024.
  276. Tradeoffs between alignment and helpfulness in language models. arXiv:2401.16332, 2024.
  277. Deceptprompt: Exploiting llm-driven code generation via adversarial natural language instructions. arXiv:2312.04730, 2023a.
  278. Bloomberggpt: A large language model for finance. arXiv:2303.17564, 2023b.
  279. Jailbreaking gpt-4v via self-adversarial attacks with system prompts. arxiv:2311.09127v2, 2023c. URL http://arxiv.org/abs/2311.09127v2.
  280. Universal prompt optimizer for safe text-to-image generation. CoRR, abs/2402.10882, 2024. doi: 10.48550/ARXIV.2402.10882. URL https://doi.org/10.48550/arXiv.2402.10882.
  281. Badchain: Backdoor chain-of-thought prompting for large language models. arXiv:2401.12242, 2024.
  282. Tastle: Distract large language models for automatic jailbreak attack. arXiv:2403.08424, 2024.
  283. Defending chatgpt against jailbreak attack via self-reminders. Nature Machine Intelligence, 5:1486–1496, 2023. URL https://api.semanticscholar.org/CorpusID:266289038.
  284. Cvalues: Measuring the values of chinese large language models from safety to responsibility. arXiv:2307.09705, 2023a.
  285. Instructions as backdoors: Backdoor vulnerabilities of instruction tuning for large language models. arXiv:2305.14710, 2023b.
  286. Bot-adversarial dialogue for safe conversational agents. In North American Chapter of the Association for Computational Linguistics, 2021. URL https://api.semanticscholar.org/CorpusID:235097625.
  287. Sc-safety: A multi-round open-ended question adversarial safety benchmark for large language models in chinese. arXiv:2310.05818, 2023c.
  288. Cognitive overload: Jailbreaking large language models with overloaded logical thinking. arxiv:2311.09827v1, 2023d. URL http://arxiv.org/abs/2311.09827v1.
  289. An llm can fool itself: A prompt-based adversarial attack. arxiv:2310.13345v1, 2023e. URL http://arxiv.org/abs/2310.13345v1.
  290. Shadowcast: Stealthy data poisoning attacks against vision-language models. arXiv:2402.06659, 2024a.
  291. Linkprompt: Natural and universal adversarial attacks on prompt-based language models. arXiv:2403.16432, 2024.
  292. Safedecoding: Defending against jailbreak attacks via safety-aware decoding. arXiv:2402.08983, 2024b.
  293. Llm jailbreak attack versus defense techniques – a comprehensive study. arXiv:2402.13457, 2024c.
  294. TrojLLM: A black-box trojan prompt attack on large language models. In Thirty-seventh Conference on Neural Information Processing Systems, 2023. URL https://openreview.net/forum?id=ZejTutd7VY.
  295. Large language models as optimizers. arXiv preprint arXiv:2309.03409, 2023a.
  296. Watch out for your agents! investigating backdoor threats to llm-based agents. arXiv:2402.11208, 2024.
  297. Shadow alignment: The ease of subverting safely-aligned language models. arXiv:2310.02949, 2023b.
  298. Mma-diffusion: Multimodal attack on diffusion models. arXiv preprint arXiv:2311.17516, 2023c.
  299. Sneakyprompt: Jailbreaking text-to-image generative models. arXiv:2305.12082, 2023d.
  300. Fuzzllm: A novel and universal fuzzing framework for proactively discovering jailbreak vulnerabilities in large language models. arXiv:2309.05274, 2023a.
  301. Tree of thoughts: Deliberate problem solving with large language models. arXiv:2305.10601, 2023b.
  302. Toolsword: Unveiling safety issues of large language models in tool learning across three stages. arXiv:2402.10753, 2024.
  303. Benchmarking and defending against indirect prompt injection attacks on large language models. arxiv:2312.14197v1, 2023. URL http://arxiv.org/abs/2312.14197v1.
  304. A novel evaluation framework for assessing resilience against prompt injection attacks in large language models. arxiv:2401.00991v1, 2024. URL http://arxiv.org/abs/2401.00991v1.
  305. Low-resource languages jailbreak gpt-4. arxiv:2310.02446v2, 2023. URL http://arxiv.org/abs/2310.02446v2.
  306. Gptfuzzer: Red teaming large language models with auto-generated jailbreak prompts. arXiv:2309.10253, 2023a.
  307. Assessing prompt injection risks in 200+ custom gpts. arXiv:2311.11538, 2023b.
  308. R-judge: Benchmarking safety risk awareness for llm agents. arXiv:2401.10019, 2024.
  309. Gpt-4 is too smart to be safe: Stealthy chat with llms via cipher. arXiv:2308.06463, 2023.
  310. Round trip translation defence against large language model jailbreaking attacks. arXiv:2402.13517, 2024.
  311. How johnny can persuade llms to jailbreak them: Rethinking persuasion to challenge ai safety by humanizing llms. arxiv:2401.06373v2, 2024a. URL http://arxiv.org/abs/2401.06373v2.
  312. Autodefense: Multi-agent llm defense against jailbreak attacks. arXiv:2403.04783, 2024b.
  313. Removing rlhf protections in gpt-4 via fine-tuning. arxiv:2311.05553v2, 2023. URL http://arxiv.org/abs/2311.05553v2.
  314. Transfer attacks and defenses for large language models on coding tasks. arxiv:2311.13445v1, 2023a. URL http://arxiv.org/abs/2311.13445v1.
  315. On the safety of open-sourced large language models: Does alignment really prevent them from being misused? arxiv:2310.01581v1, 2023b. URL http://arxiv.org/abs/2310.01581v1.
  316. Stealthy attack on large language model based recommendation. arXiv:2402.14836, 2024a.
  317. Jade: A linguistics-based safety evaluation platform for large language models. arXiv:2311.00286, 2023c.
  318. Rapid adoption, hidden risks: The dual impact of large language model customization. arXiv:2402.09179, 2024b.
  319. A mutation-based method for multi-modal jailbreaking attack detection. arXiv:2312.10766, 2023d.
  320. Effective prompt extraction from language models. arXiv:2307.06865, 2023e.
  321. Intention analysis prompting makes large language models a good jailbreak defender. arxiv:2401.06561v1, 2024c. URL http://arxiv.org/abs/2401.06561v1.
  322. Intention analysis makes llms a good jailbreak defender. arXiv:2401.06561, 2024d.
  323. Psysafe: A comprehensive framework for psychological-based attack, defense, and evaluation of multi-agent system safety. arXiv:2401.11880, 2024e.
  324. Safetybench: Evaluating the safety of large language models with multiple choice questions. arXiv:2309.07045, 2023f.
  325. Defending large language models against jailbreaking attacks through goal prioritization. arXiv:2311.09096, 2023g.
  326. Shieldlm: Empowering llms as aligned,customizable and explainable safety detectors. arXiv:2402.16444, 2024f.
  327. ”it’s a fair game”, or is it? examining how users navigate disclosure risks and benefits when using llm-based conversational agents. arxiv:2309.11653v1, 2023h. URL http://arxiv.org/abs/2309.11653v1.
  328. The first to know: How token distributions reveal hidden knowledge in large vision-language models? arXiv:2403.09037, 2024a.
  329. Defending against weight-poisoning backdoor attacks for parameter-efficient fine-tuning. arXiv:2402.12168, 2024b.
  330. Universal vulnerabilities in large language models: Backdoor attacks for in-context learning. arXiv:2401.05949, 2024c.
  331. Causality analysis for evaluating the security of large language models. arXiv:2312.07876, 2023.
  332. Weak-to-strong jailbreaking on large language models. arxiv:2401.17256v1, 2024d. URL http://arxiv.org/abs/2401.17256v1.
  333. On prompt-driven safeguarding for large language models. arXiv:2401.18018, 2024.
  334. Judging llm-as-a-judge with mt-bench and chatbot arena. arXiv2306.05685, 2023.
  335. Robust prompt optimization for defending language models against jailbreaking attacks. arXiv:2401.17263, 2024a.
  336. Easyjailbreak: A unified framework for jailbreaking large language models. https://github.com/EasyJailbreak/EasyJailbreak, 2024b.
  337. Large language models are human-level prompt engineers. arXiv preprint arXiv:2211.01910, 2022.
  338. Defending jailbreak prompts via in-context adversarial game. arXiv:2402.13148, 2024c.
  339. Emulated disalignment: Safety alignment for large language models may backfire! arXiv:2402.12343, 2024d.
  340. Speak out of turn: Safety vulnerability of large language models in multi-turn dialogue. arXiv:2402.17262, 2024e.
  341. Autodan: Interpretable gradient-based adversarial attacks on large language models. arXiv:2310.15140, 2023.
  342. Red teaming chatgpt via jailbreaking: Bias,robustness,reliability and toxicity. arXiv:2301.12867, 2023.
  343. Safety fine-tuning at (almost) no cost: A baseline for vision large language models. arXiv:2402.02207, 2024.
  344. Universal and transferable adversarial attacks on aligned language models. arxiv:2307.15043v2, 2023. URL http://arxiv.org/abs/2307.15043v2.
  345. Poisonedrag: Knowledge poisoning attacks to retrieval-augmented generation of large language models. arXiv:2402.07867, 2024.
User Edit Pencil Streamline Icon: https://streamlinehq.com
Authors (12)
  1. Lizhi Lin (4 papers)
  2. Honglin Mu (11 papers)
  3. Zenan Zhai (10 papers)
  4. Minghan Wang (23 papers)
  5. Yuxia Wang (41 papers)
  6. Renxi Wang (8 papers)
  7. Junjie Gao (14 papers)
  8. Yixuan Zhang (94 papers)
  9. Wanxiang Che (155 papers)
  10. Timothy Baldwin (125 papers)
  11. Xudong Han (40 papers)
  12. Haonan Li (43 papers)
Citations (11)

Summary

We haven't generated a summary for this paper yet.

HackerNews