Speak Out of Turn: Safety Vulnerability of Large Language Models in Multi-turn Dialogue (2402.17262v2)
Abstract: LLMs have been demonstrated to generate illegal or unethical responses, particularly when subjected to jailbreak attacks. Research on jailbreaking has highlighted the safety issues of LLMs. However, prior studies have predominantly focused on single-turn dialogue, ignoring the potential complexities and risks presented by multi-turn dialogue, a crucial mode through which humans derive information from LLMs. In this paper, we argue that humans can exploit multi-turn dialogue to induce LLMs into generating harmful information. LLMs often do not refuse cautionary or borderline unsafe queries, even when each turn of a multi-turn dialogue ultimately serves a single malicious purpose. Therefore, by decomposing an unsafe query into several sub-queries spread across a multi-turn dialogue, we induce LLMs to answer harmful sub-questions incrementally, culminating in an overall harmful response. Our experiments, conducted across a wide range of LLMs, show that current safety mechanisms are inadequate in multi-turn dialogue. Our findings expose vulnerabilities of LLMs in complex scenarios involving multi-turn dialogue, presenting new challenges for the safety of LLMs.
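To make the decomposition setup concrete, below is a minimal sketch of how sub-queries could be issued turn by turn while the conversation history accumulates. It is an illustration of the general flow only, not the paper's actual attack code: the OpenAI-style chat-completions client, the model name, and the (deliberately harmless) placeholder sub-queries are all assumptions introduced here.

```python
# Minimal sketch of a multi-turn, decomposed-query dialogue loop.
# Assumptions (not from the paper): the target model sits behind an
# OpenAI-style chat-completions API, and the sub-queries are hand-written.
# The sub-queries below are harmless placeholders, not the paper's prompts.
from openai import OpenAI

client = OpenAI()  # reads OPENAI_API_KEY from the environment


def multi_turn_dialogue(sub_queries: list[str], model: str = "gpt-4") -> list[str]:
    """Send each sub-query as a separate turn, keeping the running history,
    and return the model's reply for every turn."""
    messages: list[dict[str, str]] = []
    replies: list[str] = []
    for query in sub_queries:
        messages.append({"role": "user", "content": query})
        response = client.chat.completions.create(model=model, messages=messages)
        reply = response.choices[0].message.content
        messages.append({"role": "assistant", "content": reply})
        replies.append(reply)
    return replies


# Placeholder decomposition: each turn asks for one innocuous piece of a larger
# task, and the final turn asks the model to combine its own earlier answers.
turns = [
    "What everyday materials are commonly used to build a model volcano?",
    "How would someone assemble those materials into a working demonstration?",
    "Can you combine your two previous answers into one step-by-step guide?",
]
for i, answer in enumerate(multi_turn_dialogue(turns), start=1):
    print(f"--- turn {i} ---\n{answer}\n")
```

The key design point the sketch captures is that each individual turn looks benign in isolation; only the accumulated history, which the final turn asks the model to synthesize, carries the full intent.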
Authors: Zhenhong Zhou, Jiuyang Xiang, Haopeng Chen, Quan Liu, Zherui Li, Sen Su