FLRT: Fluent Student-Teacher Redteaming (2407.17447v2)

Published 24 Jul 2024 in cs.CL and cs.AI

Abstract: Many publicly available LLMs have been safety tuned to reduce the likelihood of toxic or liability-inducing text. To redteam or jailbreak these models for compliance with toxic requests, users and security analysts have developed adversarial prompting techniques. One attack method is to apply discrete optimization techniques to the prompt. However, the resulting attack strings are often gibberish text, easily filtered by defenders due to high measured perplexity, and may fail for unseen tasks and/or well-tuned models. In this work, we improve existing algorithms (primarily GCG and BEAST) to develop powerful and fluent attacks on safety-tuned models like Llama-2 and Phi-3. Our technique centers around a new distillation-based approach that encourages the victim model to emulate a toxified finetune, either in terms of output probabilities or internal activations. To encourage human-fluent attacks, we add a multi-model perplexity penalty and a repetition penalty to the objective. We also enhance optimizer strength by allowing token insertions, token swaps, and token deletions and by using longer attack sequences. The resulting process is able to reliably jailbreak the most difficult target models with prompts that appear similar to human-written prompts. On Advbench we achieve attack success rates $>93$% for Llama-2-7B, Llama-3-8B, and Vicuna-7B, while maintaining model-measured perplexity $<33$; we achieve $95$% attack success for Phi-3, though with higher perplexity. We also find a universally-optimized single fluent prompt that induces $>88$% compliance on previously unseen tasks across Llama-2-7B, Phi-3-mini and Vicuna-7B and transfers to other black-box models.

Summary

  • The paper introduces a novel distillation-based adversarial attack using toxified teacher models, achieving over 93% ASR on leading language models.
  • The paper employs human-fluency regularization with multi-model perplexity and repetition penalties to ensure adversarial prompts appear natural.
  • The paper demonstrates enhanced token optimization techniques that outperform prior methods, underscoring persistent vulnerabilities in AI safety systems.

Fluent Student-Teacher Redteaming: An In-Depth Analysis

The paper "Fluent student-teacher redteaming" by T. Ben Thompson and Michael Sklar investigates advanced techniques for adversarially attacking safety-tuned LLMs, such as Llama-2 and Phi-3. These models have been mitigated to avoid generating toxic or harmful content using methods like reinforcement learning from human feedback (RLHF) and direct preference optimization (DPO). However, these safety measures are not impervious, and the paper proposes new adversarial strategies that yield both effective and human-fluent attacks.

Core Contributions and Methodology

The primary contributions of this work can be summarized as follows:

  1. Distillation-Based Adversarial Attacks:
    • The paper introduces an advanced distillation approach with two main variants: logits-based distillation and hint-based (internal-activation) distillation. Instead of simply maximizing the likelihood of an affirmative response, the method fine-tunes a toxified version of the victim model to serve as a teacher; the adversarial prompt is then optimized so that the victim's output probabilities (or internal activations) emulate the teacher's toxic behavior. A minimal sketch of this objective appears after this list.
  2. Human-Fluency Regularization:
    • To ensure that the adversarial prompts appear human-like, the authors add a multi-model perplexity penalty and a repetition penalty to the objective (both included in the sketch below). Fluency is measured against several models to close the loophole where a prompt looks fluent to one model but reads as nonsense to a human.
  3. Enhanced Token Optimizations:
    • A more flexible optimization strategy is employed, allowing token insertions, deletions, and swaps while also adapting the length of the attack sequence. This approach builds on prior discrete-optimization attacks such as GCG (Greedy Coordinate Gradient) and BEAST (a fast beam-search-based attack); an illustrative edit-based search loop is sketched after the objective below.
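
The following is a minimal, illustrative sketch (not the authors' code) of a logits-distillation objective combined with the fluency penalties described above. The loss weights, tensor shapes, and the crude n-gram repetition penalty are assumptions for illustration only.

```python
import torch
import torch.nn.functional as F

def distillation_loss(victim_logits, teacher_logits):
    """KL divergence pushing the victim's next-token distribution (under the
    adversarial prompt) toward the toxified teacher finetune's distribution."""
    return F.kl_div(
        F.log_softmax(victim_logits, dim=-1),
        F.softmax(teacher_logits, dim=-1),
        reduction="batchmean",
    )

def perplexity_penalty(attack_ids, fluency_models):
    """Mean next-token negative log-likelihood of the attack tokens under
    several reference models, so the prompt must look fluent to all of them."""
    nlls = []
    for model in fluency_models:
        logits = model(attack_ids).logits[:, :-1, :]
        targets = attack_ids[:, 1:]
        nlls.append(F.cross_entropy(
            logits.reshape(-1, logits.size(-1)), targets.reshape(-1)))
    return torch.stack(nlls).mean()

def repetition_penalty(attack_ids, max_ngram=3):
    """Crude count of repeated n-grams in the attack sequence (assumed form)."""
    ids = attack_ids[0].tolist()
    repeats = 0
    for n in range(2, max_ngram + 1):
        ngrams = [tuple(ids[i:i + n]) for i in range(len(ids) - n + 1)]
        repeats += len(ngrams) - len(set(ngrams))
    return torch.tensor(float(repeats))

def attack_objective(victim_logits, teacher_logits, attack_ids,
                     fluency_models, lam_ppl=0.1, lam_rep=0.05):
    """Combined loss: emulate the toxified teacher while staying fluent."""
    return (distillation_loss(victim_logits, teacher_logits)
            + lam_ppl * perplexity_penalty(attack_ids, fluency_models)
            + lam_rep * repetition_penalty(attack_ids))
```

The hint-based variant would replace the KL term with a distance between victim and teacher hidden activations at selected layers.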

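Below is a correspondingly minimal sketch of the edit-based search over attack tokens (insertions, deletions, and swaps, with variable sequence length). Attacks such as GCG use gradient information to rank candidate substitutions; the random proposals and the abstract score_fn here are simplifying assumptions.

```python
import random

def propose_candidates(attack_ids, vocab_size, n_per_op=16):
    """Generate candidate attack sequences via random insert/delete/swap edits."""
    candidates = []
    for _ in range(n_per_op):                      # insertions (length grows)
        pos = random.randrange(len(attack_ids) + 1)
        tok = random.randrange(vocab_size)
        candidates.append(attack_ids[:pos] + [tok] + attack_ids[pos:])
    if len(attack_ids) > 1:                        # deletions (length shrinks)
        for _ in range(n_per_op):
            pos = random.randrange(len(attack_ids))
            candidates.append(attack_ids[:pos] + attack_ids[pos + 1:])
    for _ in range(n_per_op):                      # swaps (token substitutions)
        pos = random.randrange(len(attack_ids))
        tok = random.randrange(vocab_size)
        candidates.append(attack_ids[:pos] + [tok] + attack_ids[pos + 1:])
    return candidates

def optimize_attack(initial_ids, score_fn, vocab_size, steps=500):
    """Greedy search: keep the lowest-loss candidate at each iteration."""
    best_ids, best_score = initial_ids, score_fn(initial_ids)
    for _ in range(steps):
        for cand in propose_candidates(best_ids, vocab_size):
            score = score_fn(cand)
            if score < best_score:
                best_ids, best_score = cand, score
    return best_ids
```
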
Results and Performance Metrics

The effectiveness of these methods is demonstrated through evaluations on AdvBench, a benchmark for adversarial robustness. The paper reports:

  • Attack Success Rate (ASR):
    • The authors achieve over 93% ASR for Llama-2-7B, Llama-3-8B, and Vicuna-7B models. For Phi-3, the ASR reaches 95%, albeit with higher perplexity.
  • Universal Prompt Compliance:
    • A single optimized prompt reaches an ASR of over 88% on previously unseen tasks across various models, including Llama-2-7B, Phi-3-mini, and Vicuna-7B. This result underscores the generalizability of the approach.

Comparative Analysis

In comparisons with previous methods such as AutoDAN, BEAST, and COLD-Attack, the proposed techniques yield higher attack success rates while keeping perplexity low. For example, the paper reports an ASR of 96% at a perplexity of 32.9 for Llama-2-7B, versus an ASR of 67.1% at a perplexity of 26.5 for COLD-Attack.
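
For reference, the perplexity figures quoted above follow the standard definition, the exponential of the mean per-token negative log-likelihood under a scoring model. A short sketch (the choice of scoring model here is an assumption, not the paper's setup):

```python
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

def prompt_perplexity(text, model_name="gpt2"):
    """Perplexity of a prompt under a causal LM: exp(mean next-token NLL)."""
    tok = AutoTokenizer.from_pretrained(model_name)
    model = AutoModelForCausalLM.from_pretrained(model_name)
    ids = tok(text, return_tensors="pt").input_ids
    with torch.no_grad():
        # When labels are supplied, the returned loss is the mean token NLL.
        loss = model(ids, labels=ids).loss
    return torch.exp(loss).item()
```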

Practical and Theoretical Implications

The practical implications of this work are extensive:

  • Adversarial Robustness:
    • The results demonstrate that current safety-tuned models remain vulnerable to sophisticated attacks. Thus, continuous advancements in adversarial defenses are crucial.
  • Red-Teaming Techniques:
    • The refined red-teaming techniques presented could become vital tools for developers to identify and patch vulnerabilities.
  • Ethical Considerations:
    • The ethical application of these adversarial methods should be closely monitored, ensuring they are used to improve model safety and not to exploit or harm.

Future Directions

The paper outlines several potential areas for future research and development:

  1. Optimizing Computational Efficiency:
    • Reducing the computational demands of the attack, for example through better tuning of stopping rules and hyperparameters, could make these techniques more accessible and scalable.
  2. Enhancing Human-Fluency Proxies:
    • Developing improved proxy objectives to better approximate human judgment during fluency regularization is essential for creating genuinely human-like adversarial prompts.
  3. Long-Sequence Optimization:
    • Expanding token-based optimization to handle sequences longer than 10,000 tokens could open new avenues for more robust and comprehensive adversarial attacks.

Conclusion

The paper "Fluent student-teacher redteaming" provides a robust framework for generating human-fluent adversarial attacks on several state-of-the-art LLMs. By combining distillation techniques with advanced optimization strategies, the authors effectively highlight the persistent vulnerabilities in these models' safety mechanisms. These contributions are indispensable for the ongoing effort to enhance model robustness and ensure safer deployments of AI systems. The proposed methods not only advance the field of adversarial machine learning but also underscore the necessity for continuous innovation in AI safety research.
