
Large Language Models are Vulnerable to Bait-and-Switch Attacks for Generating Harmful Content (2402.13926v1)

Published 21 Feb 2024 in cs.CL and cs.AI

Abstract: The risks derived from LLMs generating deceptive and damaging content have been the subject of considerable research, but even safe generations can lead to problematic downstream impacts. In our study, we shift the focus to how even safe text coming from LLMs can be easily turned into potentially dangerous content through Bait-and-Switch attacks. In such attacks, the user first prompts LLMs with safe questions and then employs a simple find-and-replace post-hoc technique to manipulate the outputs into harmful narratives. The alarming efficacy of this approach in generating toxic content highlights a significant challenge in developing reliable safety guardrails for LLMs. In particular, we stress that focusing on the safety of the verbatim LLM outputs is insufficient and that we also need to consider post-hoc transformations.

