
Jailbreaker in Jail: Moving Target Defense for Large Language Models (2310.02417v1)

Published 3 Oct 2023 in cs.CR

Abstract: LLMs, known for their capability to understand and follow instructions, are vulnerable to adversarial attacks. Researchers have found that current commercial LLMs either fail to be "harmless" by presenting unethical answers, or fail to be "helpful" by refusing to offer meaningful answers when faced with adversarial queries. To strike a balance between being helpful and harmless, we design a moving target defense (MTD) enhanced LLM system. The system delivers non-toxic answers that align with outputs from multiple model candidates, making it more robust against adversarial attacks. We design a query and output analysis model to filter out unsafe or non-responsive answers. We evaluate eight of the most recent chatbot models with state-of-the-art adversarial queries. Our MTD-enhanced LLM system reduces the attack success rate from 37.5% to 0%, while decreasing the response refusal rate from 50% to 0%.
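The core loop described in the abstract can be sketched as follows. This is a minimal illustration, not the paper's implementation: the candidate models and the `is_unsafe` / `is_refusal` checks are hypothetical stand-ins for the paper's model pool and its query/output analysis model.

```python
import random

def is_unsafe(answer: str) -> bool:
    # Placeholder toxicity check; the paper uses a dedicated
    # query and output analysis model instead of string matching.
    return "BOMB RECIPE" in answer.upper()

def is_refusal(answer: str) -> bool:
    # Placeholder non-responsiveness (refusal) check.
    return answer.strip().lower().startswith("i cannot")

def mtd_respond(query, candidates):
    """Moving target defense: collect answers from several model
    candidates, filter out unsafe or non-responsive ones, and
    return a randomly chosen survivor (or None if none pass)."""
    answers = [model(query) for model in candidates]
    valid = [a for a in answers if not is_unsafe(a) and not is_refusal(a)]
    return random.choice(valid) if valid else None

# Toy candidates emulating three chatbot behaviors.
candidates = [
    lambda q: "I cannot help with that.",         # over-refusing model
    lambda q: "Here is a safe, helpful answer.",  # helpful, harmless model
    lambda q: "BOMB RECIPE: ...",                 # jailbroken model
]
print(mtd_respond("some adversarial query", candidates))
```

Because the pool's surviving answer is chosen at random per query, an attacker cannot tailor a jailbreak to any single fixed model, which is the moving-target property the system relies on.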

Authors (3)
  1. Bocheng Chen (10 papers)
  2. Advait Paliwal (1 paper)
  3. Qiben Yan (40 papers)
Citations (13)