
Large Language Models are Vulnerable to Bait-and-Switch Attacks for Generating Harmful Content (2402.13926v1)

Published 21 Feb 2024 in cs.CL and cs.AI

Abstract: The risks derived from LLMs generating deceptive and damaging content have been the subject of considerable research, but even safe generations can lead to problematic downstream impacts. In our study, we shift the focus to how even safe text coming from LLMs can be easily turned into potentially dangerous content through Bait-and-Switch attacks. In such attacks, the user first prompts LLMs with safe questions and then employs a simple find-and-replace post-hoc technique to manipulate the outputs into harmful narratives. The alarming efficacy of this approach in generating toxic content highlights a significant challenge in developing reliable safety guardrails for LLMs. In particular, we stress that focusing on the safety of the verbatim LLM outputs is insufficient and that we also need to consider post-hoc transformations.

