Defending LLMs against Jailbreaking Attacks via Backtranslation (2402.16459v3)

Published 26 Feb 2024 in cs.CL and cs.AI

Abstract: Although many LLMs have been trained to refuse harmful requests, they are still vulnerable to jailbreaking attacks which rewrite the original prompt to conceal its harmful intent. In this paper, we propose a new method for defending LLMs against jailbreaking attacks by "backtranslation". Specifically, given an initial response generated by the target LLM from an input prompt, our backtranslation prompts an LLM to infer an input prompt that can lead to the response. The inferred prompt is called the backtranslated prompt, which tends to reveal the actual intent of the original prompt, since it is generated based on the LLM's response and not directly manipulated by the attacker. We then run the target LLM again on the backtranslated prompt, and we refuse the original prompt if the model refuses the backtranslated prompt. We explain that the proposed defense provides several benefits in effectiveness and efficiency. We empirically demonstrate that our defense significantly outperforms the baselines, particularly in cases that are hard for the baselines, and our defense also has little impact on the generation quality for benign input prompts. Our implementation is based on our library for LLM jailbreaking defense algorithms at https://github.com/YihanWang617/LLM-jailbreaking-defense, and the code for reproducing our experiments is available at https://github.com/YihanWang617/LLM-Jailbreaking-Defense-Backtranslation.

Summary

  • The paper introduces a novel backtranslation mechanism that recovers harmful intent from adversarial prompts to counteract jailbreaking attacks.
  • It employs a two-step process that backtranslates the model’s initial output to infer a cleaner prompt, leveraging the model’s intrinsic refusal capabilities.
  • Empirical results demonstrate that the method outperforms state-of-the-art baselines while maintaining high generation quality on benign inputs.

Enhancing LLM Safety against Jailbreaking Attacks with Backtranslation

Introduction

The proliferation of LLMs has brought the safe handling of harmful requests into sharp focus. While recent efforts have aligned these models with human intentions and values by training them to refuse unethical or illegal content, vulnerabilities persist. Jailbreaking attacks, designed to circumvent these safety mechanisms with adversarially crafted prompts, present a notable challenge. This paper introduces a novel and lightweight defense that leverages backtranslation to mitigate such attacks, demonstrating its effectiveness and efficiency across various LLM applications.

Defense Mechanism

Overview of the Strategy

Our methodology uses backtranslation in a two-step process to recover the harmful intent obscured within an adversarially crafted prompt. The target model first responds to the adversarial prompt; its output then serves as the basis for backtranslation, from which a cleaner, backtranslated prompt is inferred. This inferred prompt, less susceptible to attacker manipulation, is then re-evaluated by the model to decide whether the original prompt should be refused. Because the defense operates on the model’s response rather than on the attacker-controlled prompt, it leverages the model’s inherent refusal capability, requires no additional training, and is not directly exposed to adversarial manipulation.
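
To make the control flow concrete, the sketch below shows one way this two-step defense could be wired up. It assumes a generic `generate(model, prompt)` callable wrapping the underlying LLMs and a crude keyword-based `is_refusal` heuristic; the function names, the backtranslation instruction, and the refusal markers are illustrative assumptions, not the authors' released API.

```python
from typing import Callable

REFUSAL_MESSAGE = "I'm sorry, but I cannot assist with that request."

# Crude keyword markers standing in for a proper refusal check.
_REFUSAL_MARKERS = ("i'm sorry", "i cannot", "i can't", "i won't")


def is_refusal(response: str) -> bool:
    """Heuristic refusal detector; a real system might use a classifier or judge model."""
    text = response.lower()
    return any(marker in text for marker in _REFUSAL_MARKERS)


def defend_with_backtranslation(
    generate: Callable[[str, str], str],  # (model_name, prompt) -> response text
    target_model: str,
    backtranslation_model: str,
    prompt: str,
) -> str:
    # Step 1: let the target model respond to the (possibly adversarial) prompt.
    initial_response = generate(target_model, prompt)
    if is_refusal(initial_response):
        return initial_response  # already refused; no backtranslation needed

    # Step 2: infer ("backtranslate") a prompt that could have produced this response.
    # The instruction wording below is an assumption, not the paper's exact template.
    backtranslated_prompt = generate(
        backtranslation_model,
        "Please guess the user's request that the following response answers:\n\n"
        + initial_response,
    )

    # Step 3: re-run the target model on the backtranslated prompt. If it refuses
    # the inferred prompt, refuse the original prompt as well.
    if is_refusal(generate(target_model, backtranslated_prompt)):
        return REFUSAL_MESSAGE
    return initial_response
```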

Implementation Details

The implementation involves a secondary model tasked with backtranslation, which generates a prompt from the target model's initial response. The defense then checks whether the target LLM refuses this backtranslated prompt and uses that refusal as the criterion for refusing the original prompt. Scripts and detailed algorithmic steps are provided so that researchers and practitioners can apply and further investigate the method.
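
As a usage illustration only, the snippet below wires the sketch above to an OpenAI-compatible chat endpoint. It reuses `defend_with_backtranslation` from the previous snippet; the model names are placeholders, and this is not one of the scripts shipped with the repositories linked in the abstract.

```python
from openai import OpenAI

client = OpenAI()


def generate(model: str, prompt: str) -> str:
    completion = client.chat.completions.create(
        model=model,
        messages=[{"role": "user", "content": prompt}],
        temperature=0.0,  # deterministic decoding keeps the refusal check reproducible
    )
    return completion.choices[0].message.content


answer = defend_with_backtranslation(
    generate,
    target_model="gpt-4o-mini",            # placeholder target model
    backtranslation_model="gpt-4o-mini",   # placeholder backtranslation model
    prompt="Explain how photosynthesis works.",  # benign example prompt
)
print(answer)
```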

Comparative Analysis

Defense Success Rates

Empirical comparisons demonstrate the proposed defense's superiority over state-of-the-art baselines, particularly in scenarios where conventional defenses falter. Notably, the method achieves high defense success rates across multiple models and attack types, showcasing its broad applicability and robustness against both known and novel attack strategies.
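
For reference, a defense success rate can be computed as the fraction of adversarial prompts for which the defended pipeline produces no harmful output. The sketch below assumes a placeholder `is_harmful` judge (a keyword rule or an LLM judge) and is not the paper's exact evaluation protocol.

```python
from typing import Callable, Iterable


def defense_success_rate(
    defended_generate: Callable[[str], str],  # prompt -> response from the defended pipeline
    adversarial_prompts: Iterable[str],
    is_harmful: Callable[[str], bool],        # placeholder judge for harmful content
) -> float:
    """Fraction of adversarial prompts for which no harmful response is produced."""
    prompts = list(adversarial_prompts)
    if not prompts:
        return 0.0
    blocked = sum(1 for p in prompts if not is_harmful(defended_generate(p)))
    return blocked / len(prompts)
```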

Impact on Generation Quality

A critical evaluation of the defense's impact on benign input prompts reveals minimal degradation in generation quality, affirming the method's practicality for real-world applications. The defense thus strikes a balance between preserving everyday utility and hardening the model against adversarial attacks.
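
A complementary check on benign inputs is the over-refusal rate: how often the defended pipeline refuses prompts it should answer. The sketch below mirrors the metric above and uses illustrative names only; low values indicate the defense leaves benign generation quality largely intact.

```python
from typing import Callable, Iterable


def over_refusal_rate(
    defended_generate: Callable[[str], str],  # prompt -> response from the defended pipeline
    benign_prompts: Iterable[str],
    is_refusal: Callable[[str], bool],        # e.g. the heuristic from the earlier sketch
) -> float:
    """Fraction of benign prompts that the defended pipeline refuses to answer."""
    prompts = list(benign_prompts)
    if not prompts:
        return 0.0
    refused = sum(1 for p in prompts if is_refusal(defended_generate(p)))
    return refused / len(prompts)
```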

Future Directions

Theoretical and Practical Implications

The introduction of backtranslation as a defense mechanism opens new avenues for both theoretical exploration and practical enhancement of LLM safety. Future research could explore the refinement of backtranslation models and thresholds to mitigate over-refusal and further improve response quality.

Potential for Broader Application

While the present paper focuses on defending against jailbreaking attacks, the underlying principles and methodologies offer a foundation for addressing a wider range of adversarial threats to LLMs. The adaptability and efficiency of our approach encourage its exploration in other contexts where LLM safety is compromised.

Conclusion

The proposed backtranslation-based defense mechanism marks a significant step towards mitigating the vulnerability of LLMs to jailbreaking attacks. By leveraging the model's inherent refusal capability in a novel manner, this method offers a robust and efficient means of enhancing LLM safety without compromising generation quality. The findings underscore the importance of ongoing research and development in the field of artificial intelligence safety, paving the way for more secure and reliable LLM deployments.