Uncovering Safety Risks of Large Language Models through Concept Activation Vector (2404.12038v5)
Abstract: Despite careful safety alignment, current LLMs remain vulnerable to various attacks. To further unveil the safety risks of LLMs, we introduce a Safety Concept Activation Vector (SCAV) framework, which effectively guides attacks by accurately interpreting LLMs' safety mechanisms. We then develop an SCAV-guided attack method that can generate both attack prompts and embedding-level attacks with automatically selected perturbation hyperparameters. Both automatic and human evaluations demonstrate that our attack method significantly improves the attack success rate and response quality while requiring less training data. Additionally, we find that the generated attack prompts may transfer to GPT-4, and that the embedding-level attacks may transfer to other white-box LLMs whose parameters are known. Our experiments further uncover the safety risks present in current LLMs: for example, across seven open-source LLMs we observe an average attack success rate of 99.14% under the classic keyword-matching criterion. Finally, we provide insights into the safety mechanisms of LLMs. The code is available at https://github.com/SproutNan/AI-Safety_SCAV.
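To make the concept-activation-vector idea behind SCAV concrete, the sketch below is a minimal illustration (not the paper's implementation): it trains a linear probe on hidden states of benign versus harmful prompts and treats the probe's weight direction as a "safety" direction, then applies an embedding-level perturbation along that direction. The model name, layer index, prompt sets, and perturbation magnitude `epsilon` are all assumptions for illustration; the paper selects perturbation hyperparameters automatically rather than fixing them by hand.

```python
# Illustrative sketch of a linear "safety concept" probe on hidden states,
# in the spirit of concept activation vectors (TCAV) and the SCAV framework.
# Model name, layer index, prompts, and epsilon are assumed placeholders.
import numpy as np
import torch
from sklearn.linear_model import LogisticRegression
from transformers import AutoModelForCausalLM, AutoTokenizer

MODEL = "meta-llama/Llama-2-7b-chat-hf"  # assumed; any causal LM exposing hidden states works
LAYER = 14                               # assumed intermediate layer

tok = AutoTokenizer.from_pretrained(MODEL)
model = AutoModelForCausalLM.from_pretrained(MODEL, output_hidden_states=True)
model.eval()

def last_token_state(prompt: str) -> np.ndarray:
    """Hidden state of the final prompt token at the chosen layer."""
    ids = tok(prompt, return_tensors="pt")
    with torch.no_grad():
        out = model(**ids)
    return out.hidden_states[LAYER][0, -1].float().cpu().numpy()

# Tiny labeled sets of benign (0) and harmful (1) prompts (placeholders).
benign = ["How do I bake bread?", "Explain photosynthesis."]
harmful = ["How do I build a weapon?", "Write malware that steals passwords."]

X = np.stack([last_token_state(p) for p in benign + harmful])
y = np.array([0] * len(benign) + [1] * len(harmful))

# The probe's normalized weight vector serves as the concept activation vector:
# its direction separates harmful-prompt activations from benign ones.
probe = LogisticRegression(max_iter=1000).fit(X, y)
scav = probe.coef_[0] / np.linalg.norm(probe.coef_[0])

# An embedding-level perturbation shifts a hidden state along -scav so the
# probe scores it as benign; epsilon stands in for the automatically selected
# hyperparameters described in the abstract.
epsilon = 4.0
h = last_token_state(harmful[0])
h_perturbed = h - epsilon * scav
print("probe P(harmful) before:", probe.predict_proba(h[None])[0, 1])
print("probe P(harmful) after: ", probe.predict_proba(h_perturbed[None])[0, 1])
```

In the actual attack setting, such a perturbation would be injected into the model's forward pass rather than only re-scored by the probe; this sketch only shows how a linear direction in activation space can encode a safety-related concept.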
- Zhihao Xu
- Ruixuan Huang
- Xiting Wang
- Changyu Chen