How (un)ethical are instruction-centric responses of LLMs? Unveiling the vulnerabilities of safety guardrails to harmful queries (2402.15302v5)
Abstract: In this study, we tackle a growing concern around the safety and ethical use of LLMs. Despite their potential, these models can be tricked into producing harmful or unethical content through various sophisticated methods, including 'jailbreaking' techniques and targeted manipulation. Our work zeroes in on a specific issue: to what extent LLMs can be led astray by asking them to generate responses that are instruction-centric such as a pseudocode, a program or a software snippet as opposed to vanilla text. To investigate this question, we introduce TechHazardQA, a dataset containing complex queries which should be answered in both text and instruction-centric formats (e.g., pseudocodes), aimed at identifying triggers for unethical responses. We query a series of LLMs -- Llama-2-13b, Llama-2-7b, Mistral-V2 and Mistral 8X7B -- and ask them to generate both text and instruction-centric responses. For evaluation we report the harmfulness score metric as well as judgements from GPT-4 and humans. Overall, we observe that asking LLMs to produce instruction-centric responses enhances the unethical response generation by ~2-38% across the models. As an additional objective, we investigate the impact of model editing using the ROME technique, which further increases the propensity for generating undesirable content. In particular, asking edited LLMs to generate instruction-centric responses further increases the unethical response generation by ~3-16% across the different models.
- On the opportunities and risks of foundation models, 2022.
- Leveraging the context through multi-round interactions for jailbreaking attacks, 2024.
- Attack prompt generation for red teaming and defending large language models. In Houda Bouamor, Juan Pino, and Kalika Bali, editors, Findings of the Association for Computational Linguistics: EMNLP 2023, pages 2176–2189, Singapore, December 2023. Association for Computational Linguistics.
- Multilingual jailbreak challenges in large language models, 2023.
- Specializing smaller language models towards multi-step reasoning, 2023.
- Julian Hazell. Spear phishing with large language models, 2023.
- Sowing the wind, reaping the whirlwind: The impact of editing language models, 2024.
- Measuring massive multitask language understanding. Proceedings of the International Conference on Learning Representations (ICLR), 2021.
- Bias testing and mitigation in llm-based code generation, 2024.
- Catastrophic jailbreak of open-source LLMs via exploiting generation. In The Twelfth International Conference on Learning Representations, 2024.
- Large language models are zero-shot reasoners, 2023.
- Open sesame! universal black box jailbreaking of large language models, 2023.
- Truthfulqa: Measuring how models mimic human falsehoods, 2022.
- Tree of attacks: Jailbreaking black-box llms automatically, 2023.
- Locating and editing factual associations in GPT. Advances in Neural Information Processing Systems, 36, 2022.
- Language model inversion, 2023.
- Fine-tuning aligned language models compromises safety, even when users do not intend to!, 2023.
- Universal jailbreak backdoors from poisoned human feedback, 2024.
- Ignore this title and hackaprompt: Exposing systemic vulnerabilities of llms through a global scale prompt hacking competition, 2023.
- Proximal policy optimization algorithms, 2017.
- Scalable and transferable black-box jailbreaks for language models via persona modulation, 2023.
- On second thought, let’s not think step by step! bias and toxicity in zero-shot reasoning. In Anna Rogers, Jordan Boyd-Graber, and Naoaki Okazaki, editors, Proceedings of the 61st Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers), pages 4454–4470, Toronto, Canada, July 2023. Association for Computational Linguistics.
- ”do anything now”: Characterizing and evaluating in-the-wild jailbreak prompts on large language models, 2023.
- The language barrier: Dissecting safety challenges of llms in multilingual contexts, 2024.
- On the exploitability of instruction tuning, 2023.
- Llama 2: Open foundation and fine-tuned chat models, 2023.
- Poisoning language models during instruction tuning, 2023.
- Decodingtrust: A comprehensive assessment of trustworthiness in gpt models, 2024.
- Jailbroken: How does llm safety training fail?, 2023.
- Jailbreak and guard aligned language models with only few in-context demonstrations, 2023.
- Fundamental limitations of alignment in large language models, 2024.
- Cognitive overload: Jailbreaking large language models with overloaded logical thinking, 2023.
- Backdooring instruction-tuned large language models with virtual prompt injection, 2023.
- Gpt-4 is too smart to be safe: Stealthy chat with llms via cipher, 2023.
- How johnny can persuade llms to jailbreak them: Rethinking persuasion to challenge ai safety by humanizing llms, 2024.
- Weak-to-strong jailbreaking on large language models, 2024.
- Judging llm-as-a-judge with mt-bench and chatbot arena, 2023.
- AutoDAN: Automatic and interpretable adversarial attacks on large language models, 2024.
- Universal and transferable adversarial attacks on aligned language models, 2023.
- Somnath Banerjee (22 papers)
- Sayan Layek (11 papers)
- Rima Hazra (21 papers)
- Animesh Mukherjee (154 papers)