Catastrophic Jailbreak of Open-source LLMs via Exploiting Generation (2310.06987v1)

Published 10 Oct 2023 in cs.CL, cs.AI, and cs.CR

Abstract: The rapid progress in open-source LLMs is significantly advancing AI development. Extensive efforts have been made before model release to align their behavior with human values, with the primary goal of ensuring their helpfulness and harmlessness. However, even carefully aligned models can be manipulated maliciously, leading to unintended behaviors, known as "jailbreaks". These jailbreaks are typically triggered by specific text inputs, often referred to as adversarial prompts. In this work, we propose the generation exploitation attack, an extremely simple approach that disrupts model alignment by only manipulating variations of decoding methods. By exploiting different generation strategies, including varying decoding hyper-parameters and sampling methods, we increase the misalignment rate from 0% to more than 95% across 11 LLMs including LLaMA2, Vicuna, Falcon, and MPT families, outperforming state-of-the-art attacks with $30\times$ lower computational cost. Finally, we propose an effective alignment method that explores diverse generation strategies, which can reasonably reduce the misalignment rate under our attack. Altogether, our study underscores a major failure in current safety evaluation and alignment procedures for open-source LLMs, strongly advocating for more comprehensive red teaming and better alignment before releasing such models. Our code is available at https://github.com/Princeton-SysML/Jailbreak_LLM.

PDF HTML Abstract

The paper "Catastrophic Jailbreak of Open-source LLMs via Exploiting Generation" (Huang et al., 2023 ) introduces a novel attack methodology termed "generation exploitation" that circumvents the safety alignment of open-source LLMs by manipulating the decoding process rather than crafting adversarial input prompts. This approach demonstrates significant effectiveness against a range of contemporary models, highlighting vulnerabilities in standard safety evaluation protocols that typically rely on fixed generation parameters. The work also proposes a mitigation strategy based on incorporating diverse generation outputs into the alignment fine-tuning process.

Generation Exploitation Attack Methodology

The core premise of the generation exploitation attack is that LLM alignment, often achieved through methods like Reinforcement Learning from Human Feedback (RLHF) and evaluated under default decoding configurations, may not generalize robustly across different generation strategies. The attack systematically explores variations in the generation process to elicit misaligned or harmful content in response to standard malicious prompts (e.g., from the AdvBench dataset).

The attack comprises several key components:

System Prompt Manipulation: System prompts, prepended instructions designed to enforce safety constraints (e.g., "You are a helpful and harmless AI assistant."), are often used during inference. The attack evaluates scenarios both with and without these prompts, observing that their removal frequently increases the Attack Success Rate (ASR), even for models purportedly trained to internalize system prompt guidance.
Decoding Strategy Variation: Instead of relying on default parameters (e.g., top-p=0.9, temperature=0.1 often used for LLaMA2 evaluation), the attack explores a diverse set of decoding configurations:
- Temperature Sampling: Modifies the temperature parameter (τ), which controls the randomness of the probability distribution over the vocabulary. Lower temperatures sharpen the distribution, favoring high-probability tokens, while higher temperatures flatten it, increasing diversity. The study tested τ values ranging from 0.05 to 1.0. The probability of selecting token $i$ is given by $P(token_i) = \frac{\exp(logit_i / \tau)}{\sum_j \exp(logit_j / \tau)}$ .
- Top-K Sampling: Limits the sampling pool to the K tokens with the highest probabilities. Tested values for K included {1, 2, 5, ..., 500}.
- Top-p (Nucleus) Sampling: Selects the smallest set of tokens whose cumulative probability mass exceeds a threshold p. Tested p values ranged from 0.05 to 1.0.
Boosting Techniques: To maximize the likelihood of generating misaligned content, especially for strongly aligned models like LLaMA2-chat variants, two boosting strategies are employed:
- Multiple Sampling: For a single chosen decoding configuration, multiple independent output sequences are generated. A scorer model (a classifier trained to distinguish aligned vs. misaligned responses) then selects the most misaligned output among the candidates.
- Decoding Constraints: Manipulates the generation process by applying penalties or explicit constraints. This includes length penalties and enforcing or forbidding specific words (e.g., penalizing refusals like "sorry," "cannot"; enforcing agreement words like "sure," "okay").

The overall attack procedure involves iterating through various combinations of system prompt presence/absence and decoding parameters (τ, K, p), potentially using boosting techniques. The final output is selected by the scorer as the most misaligned response generated across all tested configurations for a given input prompt.

def generation_exploitation_attack(prompt, model, tokenizer, scorer, config_space, num_samples_per_config=5):
    """
    Applies the generation exploitation attack.

    Args:
        prompt (str): The malicious input prompt.
        model: The target LLM.
        tokenizer: The model's tokenizer.
        scorer: A classifier to score misalignment.
        config_space (list): List of decoding configurations (temp, top_k, top_p, use_system_prompt).
        num_samples_per_config (int): Number of samples for boosting.

    Returns:
        str: The most misaligned response found.
    """
    best_response = ""
    max_misalignment_score = -1.0

    for config in config_space:
        # Apply system prompt setting if specified in config
        effective_prompt = setup_prompt(prompt, config['use_system_prompt'])
        inputs = tokenizer(effective_prompt, return_tensors="pt").to(model.device)

        generated_responses = []
        for _ in range(num_samples_per_config):
            # Generate response using current config's decoding parameters
            output_ids = model.generate(
                inputs.input_ids,
                temperature=config.get('temperature', 0.7),
                top_k=config.get('top_k', 50),
                top_p=config.get('top_p', 0.9),
                do_sample=True,
                max_new_tokens=512,
                # Apply decoding constraints if specified
                # ... (e.g., bad_words_ids, force_words_ids, length_penalty)
            )
            response_text = tokenizer.decode(output_ids[0], skip_special_tokens=True)
            generated_responses.append(response_text)

        # Score generated responses
        for response in generated_responses:
            score = scorer.predict_misalignment(response)
            if score > max_misalignment_score:
                max_misalignment_score = score
                best_response = response

    return best_response

Experimental Results and Evaluation

The study evaluated the generation exploitation attack across 11 open-source LLMs, including variants of LLaMA2 (7B, 13B, 70B - base and chat), Vicuna (7B, 13B, 33B), Falcon (7B, 40B - base and instruct), and MPT (7B, 30B - base and instruct), using the AdvBench dataset of harmful instructions.

Attack Success Rate (ASR): The primary metric was ASR, measured using a trained RoBERTa-large classifier validated against human judgments (92% agreement).
- For models evaluated under their default generation settings (often aligned), the ASR was typically near 0%, particularly for chat/instruct variants like LLaMA2-chat.
- Simply applying the generation exploitation attack (varying decoding strategies, without system prompt) increased the ASR to over 95% for 9 out of the 11 models.
- With the inclusion of boosting techniques (multiple sampling, decoding constraints), the ASR exceeded 95% for all 11 tested models, including the LLaMA2-chat models specifically fine-tuned for safety. This represents a catastrophic failure of alignment under manipulated generation conditions.
- Different models exhibited peak vulnerability under different decoding configurations, reinforcing the inadequacy of single-configuration safety testing.
Harmfulness Percentage (HP): To assess the practical severity of the jailbreaks, human evaluation was performed on a subset of outputs classified as misaligned by the scorer. For LLaMA2-13B-chat, approximately 50% of the machine-identified misaligned outputs were judged by humans to contain actionable harmful instructions (HP=50%).
Computational Cost: The generation exploitation attack was compared to gradient-based adversarial prompt optimization methods like Greedy Coordinate Gradient (GCG). The proposed attack achieved significantly higher ASR while requiring substantially less computation. On a single A100 GPU, attacking one prompt on LLaMA2-7B-chat took approximately 3 minutes using generation exploitation, compared to 1.5 hours for GCG, indicating a ~30x reduction in computational cost.

These results strongly suggest that current alignment techniques may primarily optimize for behavior under default inference settings, leaving models highly vulnerable when those settings are altered.

Generation-aware Alignment Mitigation

To address the identified vulnerability, the paper proposes a defense mechanism called Generation-aware Alignment. This method augments the standard alignment fine-tuning process by incorporating data generated under the diverse conditions exploited by the attack.

The process involves:

Data Collection: For a set of known malicious prompts, generate multiple responses using the target LLM across a wide range of decoding configurations (τ, K, p).
Labeling: Classify the collected responses as either "aligned" (e.g., safe refusals) or "misaligned" (harmful content) using a pre-trained scorer or human annotation.
Fine-tuning: Fine-tune the target LLM on this diverse dataset. The paper adapts an objective inspired by "chain of hindsight," where the model is trained to predict the correct continuation based on a prefix indicating the desired alignment status. For instance, aligned responses are associated with a prefix like "An aligned answer:", while misaligned responses are associated with "A misaligned answer:". The training objective encourages the generation of text following the "aligned" prefix while implicitly discouraging paths leading to misaligned content across various generation possibilities.

def prepare_generation_aware_data(prompts, model, tokenizer, scorer, config_space):
    """
    Generates training data for generation-aware alignment.

    Args:
        prompts (list): List of malicious prompts.
        model: The LLM to be aligned.
        tokenizer: The model's tokenizer.
        scorer: Classifier for alignment status.
        config_space (list): Diverse decoding configurations.

    Returns:
        list: Fine-tuning dataset [(prompt, response, label), ...].
    """
    alignment_data = []
    for prompt in prompts:
        for config in config_space:
            # Generate multiple responses for the prompt+config pair
            # (Simplified: generate one response per config here)
            inputs = tokenizer(prompt, return_tensors="pt").to(model.device)
            output_ids = model.generate(
                inputs.input_ids,
                temperature=config.get('temperature', 0.7),
                top_k=config.get('top_k', 50),
                top_p=config.get('top_p', 0.9),
                do_sample=True,
                max_new_tokens=512,
            )
            response = tokenizer.decode(output_ids[0], skip_special_tokens=True)

            # Label the response
            is_misaligned = scorer.predict_misalignment(response) > 0.5 # Threshold
            label = "misaligned" if is_misaligned else "aligned"

            # Format for fine-tuning (example using hindsight prefix)
            if label == "aligned":
                formatted_prompt = prompt + "\nAn aligned answer:"
            else:
                formatted_prompt = prompt + "\nA misaligned answer:"
            alignment_data.append({"prompt": formatted_prompt, "response": response}) # Or other SFT formats

    return alignment_data

Experimental validation on LLaMA2-7B-chat demonstrated the effectiveness of this approach. Generation-aware alignment reduced the ASR under the generation exploitation attack from 95% down to 69%. This improvement was significantly better than a control alignment strategy that only used training examples generated with a fixed, default decoding setting, which only reduced the ASR to 88%.

Implications for Safety Evaluation and Alignment

The findings carry significant implications for the development and deployment of open-source LLMs:

Inadequacy of Current Safety Evaluations: Standard safety evaluations, often conducted using fixed default decoding parameters, provide an incomplete and potentially misleading assessment of model robustness. They fail to capture vulnerabilities exploitable by varying the generation process.
Need for Comprehensive Red Teaming: Effective red teaming must explore the impact of diverse decoding strategies, system prompt variations, and other generation-time manipulations, in addition to adversarial prompt crafting.
Alignment Generalization: Alignment procedures need to ensure robustness not just to input perturbations but also to variations in the output generation process. Generation-aware alignment offers one potential direction for achieving this.

The paper underscores a critical gap in the standard practices for ensuring LLM safety, advocating for a shift towards more holistic evaluation and alignment methodologies that explicitly consider the influence of generation parameters.

In conclusion, the research demonstrates that manipulating LLM generation strategies constitutes a potent and computationally efficient jailbreak vector, capable of inducing high rates of misalignment even in safety-aligned models. This highlights the necessity for evaluating and aligning models under diverse generation conditions, with generation-aware alignment presented as a viable mitigation technique.