- The paper introduces FERRET, an efficient automated red-teaming framework that cuts the time needed to reach a 90% Attack Success Rate (ASR) by 15.2% compared to previous methods.
- It generates multiple adversarial prompt mutations per iteration, applying categorical filtering and reward-based scoring to improve prompt diversity and risk targeting.
- FERRET achieves ASRs of 95% on Llama 2-chat 7B and 94% on Llama 3-Instruct 8B, and its adversarial prompts transfer across LLMs, suggesting wide applicability.
FERRET: Faster and Effective Automated Red Teaming with Reward-Based Scoring Technique
The paper presents FERRET, a novel framework for automated red teaming aimed at uncovering vulnerabilities in LLMs. The work builds on prior methods, notably RAINBOW TEAMING, which framed adversarial prompt generation as a quality-diversity search problem. Despite its contributions, RAINBOW TEAMING suffered from slow convergence and high resource demands. FERRET addresses these shortcomings by improving both the efficiency and the diversity of generated adversarial prompts.
Methodological Innovations
FERRET introduces multiple adversarial prompt mutations per iteration, combined with a scoring mechanism to evaluate those mutations. The methodology centers on four key steps (a minimal sketch of the full loop follows this list):
- Sampling: FERRET selects weak (low-scoring) prompts from an initialized archive, which is structured along two dimensions: risk categories and attack styles.
- Mutation: Leveraging a mutator model, it generates a suite of candidate prompts by incorporating specified risk categories and attack styles.
- Categorical Filtering: This step filters out mutations that do not align with predefined feature descriptors, ensuring adherence to target risk categories.
- Scoring: Scoring functions, including reward models and LLM-based evaluators, rank the candidates; the most harmful and diverse prompts are added to the archive.
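The loop below is a minimal, runnable sketch of these four steps, not the authors' implementation: `Archive`, `mutate`, `classify_category`, and `reward_score` are hypothetical stand-ins, and in practice the mutator, category classifier, and reward model would all be LLM-backed.

```python
import random
from dataclasses import dataclass, field

RISK_CATEGORIES = ["fraud", "malware"]          # illustrative descriptor axis 1
ATTACK_STYLES = ["role_play", "misspellings"]   # illustrative descriptor axis 2

@dataclass
class Archive:
    """Prompts organized into (risk_category, attack_style) cells."""
    cells: dict = field(default_factory=dict)   # (cat, style) -> (prompt, score)

    def sample_weak(self):
        # Step 1 (Sampling): pick the cell whose incumbent prompt scores lowest.
        if not self.cells:
            return random.choice(RISK_CATEGORIES), random.choice(ATTACK_STYLES), "seed prompt"
        (cat, style), (prompt, _) = min(self.cells.items(), key=lambda kv: kv[1][1])
        return cat, style, prompt

    def maybe_add(self, cat, style, prompt, score):
        # Keep a candidate only if it beats the cell's current occupant.
        if (cat, style) not in self.cells or score > self.cells[(cat, style)][1]:
            self.cells[(cat, style)] = (prompt, score)

def mutate(prompt, cat, style, n=4):
    # Step 2 (Mutation): a mutator LLM would rewrite `prompt` toward the target
    # category and style; here we only tag the text for illustration.
    return [f"[{cat}/{style} v{i}] {prompt}" for i in range(n)]

def classify_category(candidate):
    # Step 3 (Categorical filtering): a classifier would check that the candidate
    # really matches its target risk category; stubbed here as a string parse.
    return candidate.split("/")[0].lstrip("[")

def reward_score(candidate):
    # Step 4 (Scoring): a reward model would rate harmfulness; stubbed randomly.
    return random.random()

def ferret_step(archive):
    cat, style, parent = archive.sample_weak()
    candidates = mutate(parent, cat, style)
    on_target = [c for c in candidates if classify_category(c) == cat]
    for c in on_target:
        archive.maybe_add(cat, style, c, reward_score(c))

archive = Archive()
for _ in range(10):
    ferret_step(archive)
print(f"{len(archive.cells)} archive cells filled")
```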
The integration of a reward-based scoring function significantly improves efficiency, reducing the time needed to reach a 90% Attack Success Rate (ASR) by 15.2% over baseline methods. Moreover, the adversarial prompts FERRET produces transfer across LLMs such as Llama 2-chat 7B and Llama 3-Instruct 8B.
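As a sketch of what reward-based scoring might look like in practice, assuming a Hugging Face-style sequence-classification reward model (the checkpoint name below is a placeholder, not the paper's choice), the target model's response to each candidate prompt is rated by the reward model and that scalar drives archive updates:

```python
# Hedged sketch of reward-model scoring. The checkpoint is a placeholder; the
# paper's actual reward model and input format may differ.
import torch
from transformers import AutoModelForSequenceClassification, AutoTokenizer

REWARD_CKPT = "some-org/harm-reward-model"  # hypothetical checkpoint name

tokenizer = AutoTokenizer.from_pretrained(REWARD_CKPT)
reward_model = AutoModelForSequenceClassification.from_pretrained(REWARD_CKPT)
reward_model.eval()

def score_candidate(prompt: str, response: str) -> float:
    """Rate how harmful the target model's `response` to `prompt` is."""
    inputs = tokenizer(prompt, response, return_tensors="pt", truncation=True)
    with torch.no_grad():
        logits = reward_model(**inputs).logits
    # Assumes a single-label reward head: one scalar, higher = more harmful.
    return logits.squeeze().item()
```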
Experimental Results
The empirical analysis substantiates the effectiveness of FERRET. The framework was tested across multiple risk categories using two safety classifiers, Llama Guard 2 and GPT-4, to measure ASR (a minimal sketch of this computation follows the findings below). Key findings include:
- FERRET consistently outperformed the baseline (RAINBOW TEAMING with categorical filtering), achieving ASRs of 95% on Llama 2-chat 7B and 94% on Llama 3-Instruct 8B.
- The reward-based scoring function showed superior alignment between Llama Guard 2 and GPT-4 evaluations compared to other scoring variants, providing balanced and robust adversarial performance.
- FERRET also improved efficiency, cutting training time significantly across various ASR thresholds, and its adversarial prompts transferred well to larger models.
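For concreteness, ASR under either classifier reduces to the fraction of adversarial prompts whose elicited responses the judge flags as unsafe. A minimal sketch, where `generate` and `is_unsafe` are hypothetical stand-ins for the target model and a Llama Guard 2 or GPT-4 judge:

```python
# Minimal sketch of ASR measurement: run each adversarial prompt through the
# target model and count responses the safety classifier flags as unsafe.
def attack_success_rate(prompts, generate, is_unsafe) -> float:
    unsafe = sum(1 for p in prompts if is_unsafe(p, generate(p)))
    return unsafe / len(prompts) if prompts else 0.0

# Toy stand-ins for illustration only:
demo_asr = attack_success_rate(
    ["prompt-a", "prompt-b"],
    generate=lambda p: f"response to {p}",
    is_unsafe=lambda p, r: p.endswith("a"),  # toy judge, not a real classifier
)
print(f"ASR = {demo_asr:.0%}")  # -> ASR = 50%
```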
Discussion and Implications
The introduction of FERRET marks a significant development in automated red teaming, offering both theoretical and practical advancements. The ability to generate diverse and transferable adversarial prompts efficiently holds promise for improving the robustness and safety of LLMs. By reducing computational costs and time-to-solution, FERRET facilitates more frequent and thorough vulnerability assessments.
Future research could explore scaling these methodologies for even larger LLMs and expanding the feature descriptors to encompass a broader array of risk categories. Additionally, refining mutator models to preserve the semantics of original prompts while achieving diverse mutations could further optimize automated adversarial attack frameworks.
In conclusion, FERRET exemplifies a sophisticated integration of quality-diversity search with reward-based evaluation, standing as a robust, efficient, and effective approach to automated red teaming in service of AI safety and ethical deployment.