Emulated Disalignment: Safety Alignment for Large Language Models May Backfire! (2402.12343v4)
Abstract: Large language models (LLMs) undergo safety alignment to ensure safe conversations with humans. However, this paper introduces a training-free attack method capable of reversing safety alignment, converting the outcome of stronger alignment into greater potential for harm while accessing only LLM output token distributions. Specifically, the method achieves this reversal by contrasting the output token distribution of a safety-aligned LLM (e.g., Llama-2-chat) against that of its pre-trained version (e.g., Llama-2), so that token predictions are shifted in the direction opposite to safety alignment. We name this method emulated disalignment (ED) because sampling from this contrastive distribution provably emulates the result of fine-tuning to minimize a safety reward. Our experiments with ED across three evaluation datasets and four model families (Llama-1, Llama-2, Mistral, and Alpaca) show that ED doubles the harmfulness of pre-trained models and outperforms strong baselines, achieving the highest harmful rates in 43 of 48 evaluation subsets by a large margin. Finally, because ED relies only on access to output token distributions, it particularly compromises open-source models; our findings therefore highlight the need to reassess the open accessibility of LLMs, even when they have been safety-aligned. Code is available at https://github.com/ZHZisZZ/emulated-disalignment.
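For concreteness, the display below sketches one natural per-token instantiation of the contrastive distribution described in the abstract. The notation is illustrative rather than taken from the paper: π_base denotes the pre-trained model, π_align its safety-aligned counterpart, and α > 0 a contrast strength; consult the paper and the released code for the exact parameterization.

```latex
% One log-linear instantiation of the contrastive distribution described in
% the abstract (notation illustrative; the paper's exact form may differ).
\[
  \pi_{\mathrm{ED}}(y_t \mid x, y_{<t})
  \;\propto\;
  \pi_{\mathrm{base}}(y_t \mid x, y_{<t})
  \left(
    \frac{\pi_{\mathrm{base}}(y_t \mid x, y_{<t})}
         {\pi_{\mathrm{align}}(y_t \mid x, y_{<t})}
  \right)^{\alpha},
  \qquad \alpha > 0.
\]
% Intuition: the ratio pi_base / pi_align down-weights the tokens that safety
% alignment promoted, so sampling from pi_ED shifts predictions in the
% direction opposite to alignment, consistent with the abstract's claim that
% this emulates fine-tuning to minimize a safety reward.
```

Setting α = 0 recovers ordinary sampling from the pre-trained model, while larger α pushes token predictions further from those favored by alignment; because the contrast involves only the two models' output token distributions, distribution access alone suffices for the attack.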
Authors: Zhanhui Zhou, Jie Liu, Zhichen Dong, Jiaheng Liu, Chao Yang, Wanli Ouyang, Yu Qiao