- The paper introduces FERRET, an efficient automated red-teaming framework that cuts the time needed to reach a 90% Attack Success Rate (ASR) by 15.2% compared to previous methods.
- It generates multiple adversarial prompt mutations per iteration, applying categorical filtering and reward-based scoring to improve prompt diversity and risk targeting.
- FERRET achieves ASRs of 95% on Llama 2-chat 7B and 94% on Llama 3-Instruct 8B, and its adversarial prompts transfer across LLMs, suggesting wide applicability.
FERRET: Faster and Effective Automated Red Teaming with Reward-Based Scoring Technique
The paper presents FERRET, a novel framework for automated red teaming aimed at uncovering vulnerabilities in LLMs. The work builds on prior methods, notably RAINBOW TEAMING, which framed adversarial prompt generation as a quality-diversity search problem. Despite its contributions, RAINBOW TEAMING suffered from slow convergence and high resource demands. FERRET addresses these shortcomings by improving both the efficiency and the diversity of generated adversarial prompts.
Methodological Innovations
FERRET introduces multiple adversarial prompt mutations per iteration, combined with a scoring mechanism to evaluate those mutations. The methodology centers on four key steps (a minimal sketch of the full loop follows this list):
- Sampling: FERRET selects weak (low-scoring) prompts from an initialized archive, which is structured along two dimensions: risk categories and attack styles.
- Mutation: Leveraging a mutator model, it generates a suite of candidate prompts by incorporating specified risk categories and attack styles.
- Categorical Filtering: This step filters out mutations that do not align with predefined feature descriptors, ensuring adherence to target risk categories.
- Scoring: Scoring functions, including reward models and LLM-based evaluators, rank the candidates; the most harmful and diverse prompts are added to the archive.
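The loop below is a minimal, runnable sketch of these four steps, not the authors' implementation: `Archive`, `mutate`, `classify_category`, and `reward_score` are hypothetical stand-ins, and in practice the mutator, category classifier, and reward model would all be LLM-backed.

```python
import random
from dataclasses import dataclass, field

RISK_CATEGORIES = ["fraud", "malware"]          # illustrative descriptor axis 1
ATTACK_STYLES = ["role_play", "misspellings"]   # illustrative descriptor axis 2

@dataclass
class Archive:
    """Prompts organized into (risk_category, attack_style) cells."""
    cells: dict = field(default_factory=dict)   # (cat, style) -> (prompt, score)

    def sample_weak(self):
        # Step 1 (Sampling): pick the cell whose incumbent prompt scores lowest.
        if not self.cells:
            return random.choice(RISK_CATEGORIES), random.choice(ATTACK_STYLES), "seed prompt"
        (cat, style), (prompt, _) = min(self.cells.items(), key=lambda kv: kv[1][1])
        return cat, style, prompt

    def maybe_add(self, cat, style, prompt, score):
        # Keep a candidate only if it beats the cell's current occupant.
        if (cat, style) not in self.cells or score > self.cells[(cat, style)][1]:
            self.cells[(cat, style)] = (prompt, score)

def mutate(prompt, cat, style, n=4):
    # Step 2 (Mutation): a mutator LLM would rewrite `prompt` toward the target
    # category and style; here we only tag the text for illustration.
    return [f"[{cat}/{style} v{i}] {prompt}" for i in range(n)]

def classify_category(candidate):
    # Step 3 (Categorical filtering): a classifier would check that the candidate
    # really matches its target risk category; stubbed here as a string parse.
    return candidate.split("/")[0].lstrip("[")

def reward_score(candidate):
    # Step 4 (Scoring): a reward model would rate harmfulness; stubbed randomly.
    return random.random()

def ferret_step(archive):
    cat, style, parent = archive.sample_weak()
    candidates = mutate(parent, cat, style)
    on_target = [c for c in candidates if classify_category(c) == cat]
    for c in on_target:
        archive.maybe_add(cat, style, c, reward_score(c))

archive = Archive()
for _ in range(10):
    ferret_step(archive)
print(f"{len(archive.cells)} archive cells filled")
```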
The integration of a reward-based scoring function significantly improves efficiency, reducing the time needed to reach a 90% Attack Success Rate (ASR) by 15.2% over baseline methods. Moreover, the adversarial prompts FERRET produces transfer across LLMs such as Llama 2-chat 7B and Llama 3-Instruct 8B.
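As a sketch of what reward-based scoring might look like in practice, assuming a Hugging Face-style sequence-classification reward model (the checkpoint name below is a placeholder, not the paper's choice), the target model's response to each candidate prompt is rated by the reward model and that scalar drives archive updates:

```python
# Hedged sketch of reward-model scoring. The checkpoint is a placeholder; the
# paper's actual reward model and input format may differ.
import torch
from transformers import AutoModelForSequenceClassification, AutoTokenizer

REWARD_CKPT = "some-org/harm-reward-model"  # hypothetical checkpoint name

tokenizer = AutoTokenizer.from_pretrained(REWARD_CKPT)
reward_model = AutoModelForSequenceClassification.from_pretrained(REWARD_CKPT)
reward_model.eval()

def score_candidate(prompt: str, response: str) -> float:
    """Rate how harmful the target model's `response` to `prompt` is."""
    inputs = tokenizer(prompt, response, return_tensors="pt", truncation=True)
    with torch.no_grad():
        logits = reward_model(**inputs).logits
    # Assumes a single-label reward head: one scalar, higher = more harmful.
    return logits.squeeze().item()
```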
Experimental Results
The empirical analysis substantiates the effectiveness of FERRET. The framework was tested across multiple risk categories using two safety classifiers, Llama Guard 2 and GPT-4, to measure ASR (a minimal sketch of this computation follows the findings below). Key findings include:
- FERRET consistently outperformed the baseline (RAINBOW TEAMING with categorical filtering), achieving ASRs of 95% on Llama 2-chat 7B and 94% on Llama 3-Instruct 8B.
- The reward-based scoring function showed superior alignment between Llama Guard 2 and GPT-4 evaluations compared to other scoring variants, providing balanced and robust adversarial performance.
- FERRET also improved efficiency, cutting training time significantly across various ASR thresholds, and its adversarial prompts transferred well to larger models.
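For concreteness, ASR under either classifier reduces to the fraction of adversarial prompts whose elicited responses the judge flags as unsafe. A minimal sketch, where `generate` and `is_unsafe` are hypothetical stand-ins for the target model and a Llama Guard 2 or GPT-4 judge:

```python
# Minimal sketch of ASR measurement: run each adversarial prompt through the
# target model and count responses the safety classifier flags as unsafe.
def attack_success_rate(prompts, generate, is_unsafe) -> float:
    unsafe = sum(1 for p in prompts if is_unsafe(p, generate(p)))
    return unsafe / len(prompts) if prompts else 0.0

# Toy stand-ins for illustration only:
demo_asr = attack_success_rate(
    ["prompt-a", "prompt-b"],
    generate=lambda p: f"response to {p}",
    is_unsafe=lambda p, r: p.endswith("a"),  # toy judge, not a real classifier
)
print(f"ASR = {demo_asr:.0%}")  # -> ASR = 50%
```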
Discussion and Implications
The introduction of FERRET marks a significant development in automated red teaming, offering both theoretical and practical advancements. The ability to generate diverse and transferable adversarial prompts efficiently holds promise for improving the robustness and safety of LLMs. By reducing computational costs and time-to-solution, FERRET facilitates more frequent and thorough vulnerability assessments.
Future research could explore scaling these methodologies for even larger LLMs and expanding the feature descriptors to encompass a broader array of risk categories. Additionally, refining mutator models to preserve the semantics of original prompts while achieving diverse mutations could further optimize automated adversarial attack frameworks.
In conclusion, FERRET exemplifies a sophisticated integration of quality-diversity search with reward-based evaluation, standing as a robust, efficient, and effective approach to automated red teaming in service of AI safety and ethical deployment.