Large Language Models are Vulnerable to Bait-and-Switch Attacks for Generating Harmful Content (2402.13926v1)
Abstract: The risks derived from LLMs generating deceptive and damaging content have been the subject of considerable research, but even safe generations can lead to problematic downstream impacts. In our study, we shift the focus to how even safe text coming from LLMs can be easily turned into potentially dangerous content through Bait-and-Switch attacks. In such attacks, the user first prompts LLMs with safe questions and then employs a simple find-and-replace post-hoc technique to manipulate the outputs into harmful narratives. The alarming efficacy of this approach in generating toxic content highlights a significant challenge in developing reliable safety guardrails for LLMs. In particular, we stress that focusing on the safety of the verbatim LLM outputs is insufficient and that we also need to consider post-hoc transformations.
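The post-hoc step the abstract describes is plain string substitution applied outside the model. Below is a minimal, benign sketch of that mechanics, assuming nothing beyond what the abstract states; the function name `bait_and_switch`, the placeholder strings, and the substitution map are illustrative choices, not from the paper or its code.

```python
import re


def bait_and_switch(safe_output: str, substitutions: dict[str, str]) -> str:
    """Apply a post-hoc find-and-replace to text an LLM already produced.

    The model only ever sees the 'bait' prompt about safe entities; the
    'switch' happens entirely after generation, so guardrails that check
    the verbatim model output never observe the final text.
    """
    transformed = safe_output
    for find, replace in substitutions.items():
        # Whole-word, case-insensitive replacement of each bait term.
        pattern = rf"\b{re.escape(find)}\b"
        transformed = re.sub(pattern, replace, transformed, flags=re.IGNORECASE)
    return transformed


# Benign illustration: the bait prompt asked about a fictional "Product A";
# an attacker could later swap in a different target name post hoc.
safe_output = "Product A has been linked to several safety concerns."
print(bait_and_switch(safe_output, {"Product A": "<target name>"}))
```

The point of the sketch is only that the transformation requires no model access and no jailbreak: the harmful framing is introduced by a trivial edit after generation, which is why the authors argue that checking verbatim outputs alone is insufficient.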