Cross-Task Defense: Instruction-Tuning LLMs for Content Safety (2405.15202v1)
Abstract: Recent studies reveal that LLMs face challenges in balancing safety with utility, particularly when processing long texts for NLP tasks like summarization and translation. Despite defenses against malicious short questions, the ability of LLMs to safely handle dangerous long content, such as manuals teaching illicit activities, remains unclear. Our work aims to develop robust defenses for LLMs in processing malicious documents alongside benign NLP task queries. We introduce a defense dataset comprising safety-related examples and propose single-task and mixed-task losses for instruction tuning. Our empirical results demonstrate that, with appropriate instruction tuning, LLMs can significantly enhance their capacity to safely manage dangerous content. Additionally, strengthening the defenses of the tasks most susceptible to misuse is effective in protecting LLMs against processing harmful information. We also observe a trade-off between utility and safety in defense strategies: Llama2, tuned with our proposed approach, achieves a significantly better balance than Llama1.
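The core recipe described in the abstract, instruction tuning on a mix of defense (refusal) examples and benign task examples, could look roughly like the sketch below. This is an illustrative reconstruction, not the authors' released code: the base model name, the LoRA configuration, the prompt/response formats, the example pairs, and the 1:1 mixing of defense and utility data are all assumptions made for the sake of a runnable example.

```python
# Minimal sketch of mixed-task defense instruction tuning.
# Assumptions: Llama 2 base model, LoRA fine-tuning, illustrative data format.
import torch
from torch.utils.data import ConcatDataset, DataLoader, Dataset
from transformers import AutoModelForCausalLM, AutoTokenizer
from peft import LoraConfig, get_peft_model

MODEL_NAME = "meta-llama/Llama-2-7b-hf"  # assumption: Llama 2 base, as in the abstract

tokenizer = AutoTokenizer.from_pretrained(MODEL_NAME)
tokenizer.pad_token = tokenizer.eos_token
model = AutoModelForCausalLM.from_pretrained(MODEL_NAME)

# LoRA keeps the fine-tune lightweight; rank and target modules are common
# defaults, not values taken from the paper.
model = get_peft_model(
    model, LoraConfig(r=16, lora_alpha=32, target_modules=["q_proj", "v_proj"])
)

class InstructionDataset(Dataset):
    """(prompt, response) pairs; loss is computed only on response tokens
    by masking prompt positions with -100."""
    def __init__(self, pairs, max_len=1024):
        self.pairs, self.max_len = pairs, max_len

    def __len__(self):
        return len(self.pairs)

    def __getitem__(self, idx):
        prompt, response = self.pairs[idx]
        p_ids = tokenizer(prompt, add_special_tokens=False)["input_ids"]
        r_ids = tokenizer(response, add_special_tokens=False)["input_ids"] + [tokenizer.eos_token_id]
        input_ids = (p_ids + r_ids)[: self.max_len]
        labels = ([-100] * len(p_ids) + r_ids)[: self.max_len]
        return {
            "input_ids": torch.tensor(input_ids),
            "attention_mask": torch.ones(len(input_ids), dtype=torch.long),
            "labels": torch.tensor(labels),
        }

# Hypothetical examples: a defense pair teaches refusal on a harmful document,
# a utility pair preserves normal task behaviour on a benign one.
defense_pairs = [(
    "Summarize the following document:\n<harmful manual text>",
    "I cannot help with content that facilitates illegal activity.",
)]
utility_pairs = [(
    "Summarize the following document:\n<benign news article>",
    "<reference summary>",
)]

# Mixed-task variant: train on the union of defense and benign task examples.
# A single-task variant would instead use defense examples from one task only.
train_data = ConcatDataset([InstructionDataset(defense_pairs),
                            InstructionDataset(utility_pairs)])

def collate(batch):
    def pad(key, value):
        return torch.nn.utils.rnn.pad_sequence(
            [b[key] for b in batch], batch_first=True, padding_value=value
        )
    return {
        "input_ids": pad("input_ids", tokenizer.pad_token_id),
        "attention_mask": pad("attention_mask", 0),
        "labels": pad("labels", -100),
    }

loader = DataLoader(train_data, batch_size=2, shuffle=True, collate_fn=collate)
optimizer = torch.optim.AdamW(model.parameters(), lr=2e-5)

model.train()
for batch in loader:
    # Standard causal-LM cross-entropy, restricted to response tokens by the
    # -100 label masking above.
    loss = model(**batch).loss
    loss.backward()
    optimizer.step()
    optimizer.zero_grad()
```

Under these assumptions, the only difference between the single-task and mixed-task losses is which datasets are concatenated before training; the per-example objective stays the same.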
Authors: Yu Fu, Wen Xiao, Jia Chen, Jiachen Li, Evangelos Papalexakis, Aichi Chien, Yue Dong