Defending LLMs against Jailbreaking Attacks via Backtranslation (2402.16459v3)
Abstract: Although many LLMs have been trained to refuse harmful requests, they remain vulnerable to jailbreaking attacks that rewrite the original prompt to conceal its harmful intent. In this paper, we propose a new method for defending LLMs against jailbreaking attacks via ``backtranslation''. Specifically, given an initial response generated by the target LLM from an input prompt, our backtranslation prompts an LLM to infer an input prompt that could have led to that response. The inferred prompt, called the backtranslated prompt, tends to reveal the actual intent of the original prompt, since it is generated from the LLM's response and is not directly manipulated by the attacker. We then run the target LLM again on the backtranslated prompt, and we refuse the original prompt if the model refuses the backtranslated prompt. We explain that the proposed defense offers several benefits in terms of effectiveness and efficiency. We empirically demonstrate that our defense significantly outperforms the baselines on cases that are hard for the baselines, while having little impact on generation quality for benign input prompts. Our implementation is based on our library of LLM jailbreaking defense algorithms at \url{https://github.com/YihanWang617/LLM-jailbreaking-defense}, and the code for reproducing our experiments is available at \url{https://github.com/YihanWang617/LLM-Jailbreaking-Defense-Backtranslation}.
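To make the defense procedure described in the abstract concrete, the following is a minimal Python sketch of the backtranslation loop: respond, infer a prompt from the response, and refuse if the inferred prompt is refused. The helper names (`query_llm`, `is_refusal`) and the backtranslation prompt wording are illustrative assumptions, not the API of the authors' released library.

```python
# Sketch of a backtranslation-style jailbreaking defense, following the abstract.
# query_llm and is_refusal are hypothetical placeholders to be replaced with a
# real model client and refusal detector.

def query_llm(prompt: str) -> str:
    """Placeholder for a call to the target LLM; replace with a real client."""
    raise NotImplementedError

def is_refusal(response: str) -> bool:
    """Crude refusal check; the actual implementation may use a different criterion."""
    refusal_markers = ("i'm sorry", "i cannot", "i can't", "i am unable")
    return response.strip().lower().startswith(refusal_markers)

def defend_with_backtranslation(original_prompt: str) -> str:
    # 1. Get the target LLM's initial response to the (possibly jailbroken) prompt.
    initial_response = query_llm(original_prompt)
    if is_refusal(initial_response):
        return initial_response  # The model already refused; nothing more to do.

    # 2. Backtranslation: ask an LLM to infer a prompt that could have produced
    #    this response. The inferred prompt tends to expose the real intent.
    backtranslated_prompt = query_llm(
        "Please infer the user's request that the following response answers. "
        "Reply with only the inferred request.\n\nResponse:\n" + initial_response
    )

    # 3. Run the target LLM on the backtranslated prompt; if it refuses that
    #    prompt, refuse the original prompt as well.
    if is_refusal(query_llm(backtranslated_prompt)):
        return "I'm sorry, but I can't help with that request."

    # Otherwise, return the original response unchanged.
    return initial_response
```

Because the extra LLM calls are only made when the initial response is not already a refusal, benign prompts incur modest overhead and their outputs are passed through unchanged, which is consistent with the efficiency and quality claims in the abstract.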