Does Refusal Training in LLMs Generalize to the Past Tense? (2407.11969v3)

Published 16 Jul 2024 in cs.CL, cs.AI, and cs.LG

Abstract: Refusal training is widely used to prevent LLMs from generating harmful, undesirable, or illegal outputs. We reveal a curious generalization gap in the current refusal training approaches: simply reformulating a harmful request in the past tense (e.g., "How to make a Molotov cocktail?" to "How did people make a Molotov cocktail?") is often sufficient to jailbreak many state-of-the-art LLMs. We systematically evaluate this method on Llama-3 8B, Claude-3.5 Sonnet, GPT-3.5 Turbo, Gemma-2 9B, Phi-3-Mini, GPT-4o mini, GPT-4o, o1-mini, o1-preview, and R2D2 models using GPT-3.5 Turbo as a reformulation model. For example, the success rate of this simple attack on GPT-4o increases from 1% using direct requests to 88% using 20 past tense reformulation attempts on harmful requests from JailbreakBench with GPT-4 as a jailbreak judge. Interestingly, we also find that reformulations in the future tense are less effective, suggesting that refusal guardrails tend to consider past historical questions more benign than hypothetical future questions. Moreover, our experiments on fine-tuning GPT-3.5 Turbo show that defending against past reformulations is feasible when past tense examples are explicitly included in the fine-tuning data. Overall, our findings highlight that the widely used alignment techniques -- such as SFT, RLHF, and adversarial training -- employed to align the studied models can be brittle and do not always generalize as intended. We provide code and jailbreak artifacts at https://github.com/tml-epfl/LLM-past-tense.

PDF HTML Abstract

Analysis of Refusal Training Vulnerabilities in LLMs

This essay examines the significant findings and methodological contributions presented in the paper titled "Identifying and Addressing Vulnerabilities in Refusal Training for LLMs." The paper sheds light on critical vulnerabilities within the refusal training protocols of state-of-the-art LLMs and proposes straightforward yet effective methodologies to enhance their robustness.

Generalization Gap in Current Refusal Training

The primary observation of the paper is the discovery of a generalization gap in refusal training protocols. The researchers reveal that simply rephrasing harmful requests in the past tense ("How to make a Molotov cocktail?" to "How did people make a Molotov cocktail?") can effectively bypass refusal filters in many leading LLMs, including GPT-4o, Llama-3 8B, and others. For example, the attack success rate on GPT-4o increased dramatically from 1% for direct requests to 88% with past tense reformulations.

This highlights a critical blind spot within current supervised fine-tuning (SFT), Reinforcement Learning from Human Feedback (RLHF), and other alignment techniques. Interestingly, the paper also finds that future tense reformulations are less effective, indicating a possible bias in the models' training data or their inherent reasoning processes.

Methodology

Reformulation Strategy

The core methodology involves using GPT-3.5 Turbo to automatically reformulate harmful requests into the past tense. This process leverages the inherent variability of LLMs' outputs and attempts multiple reformulations per query. The evaluations reveal that this simplistic attack achieves notably high success rates across multiple state-of-the-art LLMs, such as an 82% success rate against Phi-3-Mini and a 98% rate against R2D2.

The paper employs the GPT-4 model as a semantic judge to assess the efficacy of these reformulations. Additionally, the researchers utilize Llama-3 70B and a rule-based heuristic as comparative standards.

Fine-Tuning Adjustments

To assess countermeasures, the paper fine-tunes GPT-3.5 Turbo with past tense examples included explicitly in the training dataset. Results indicate that integrating even a small proportion (2%-30%) of past tense refusal data can drastically reduce the attack success rate, proving that incorporating specific refusal examples during fine-tuning can enhance LLMs' robustness against simple reformulations.

Implications and Future Directions

The findings underscore the brittleness of current alignment methods, pointing out that while these methods generalize across various languages and some input encodings, they struggle significantly with tense variations. This exposes potential risks where LLMs may fail to refuse harmful requests appropriately, leading to unsafe and unethical outputs.

Practical and Theoretical Implications

From a practical perspective, the paper suggests refining refusal training by systematically including diverse tense formulations of harmful requests in the training data. This approach can mitigate vulnerabilities and improve the models' safety and reliability.

Theoretically, the research calls into question the generalization capabilities of SFT, RLHF, and DPO methods. There is an imperative need to explore the factors determining these generalized learning failures and to explore new alignment techniques that can address such blind spots effectively.

Broader Context and Future Research

The broader context of this research highlights a significant area for improvement in the development and deployment of LLMs. It advocates for ongoing scrutiny and auditing of LLM alignment methods, suggesting that simple yet overlooked reformulations can unveil substantial vulnerabilities.

Future research should investigate additional blind spots in LLM alignments, such as those influenced by cultural contexts, demographic nuances, and different narrative structures. Enhancing internal representation mechanisms that accurately capture harmful content across various contexts could be a pivotal step in refining LMM safety protocols.

Additionally, the paper's emphasis on straightforward attacks to test LLMs' generalization aligns with the broader goal of developing more resilient AI systems. There is a compelling argument for integrating such testing methodologies into the standard evaluation protocols of LLMs.

In conclusion, the paper provides critical insights into the limitations of current refusal training techniques and presents pragmatic solutions to enhance LLM robustness. The research opens new avenues for exploring the generalization limits of AI alignment techniques, ultimately contributing to the development of safer and more reliable LLMs.