- The paper introduces RainbowPlus, an evolutionary framework for adversarial prompt generation that optimizes both attack success and prompt diversity.
- It employs a multi-element archive and Judge LLM-based fitness evaluation to efficiently generate high-quality adversarial prompts.
- Experimental results demonstrate significant improvements in attack success rate and prompt diversity compared to state-of-the-art red-teaming methods.
This paper introduces RainbowPlus, a novel framework for red-teaming LLMs to identify vulnerabilities and generate adversarial prompts that elicit unsafe or biased outputs. The authors argue that existing red-teaming methods often suffer from scalability issues, high resource requirements, limited diversity in attack strategies, or reliance on manual intervention.
RainbowPlus tackles these limitations by framing adversarial prompt generation as an evolutionary quality-diversity (QD) search problem, building upon the MAP-Elites algorithm and the earlier Rainbow Teaming approach. The core idea is to simultaneously optimize for attack success (quality) and the variety of attack strategies (diversity).
Key Innovations of RainbowPlus:
- Multi-element Archive: Unlike traditional MAP-Elites and Rainbow Teaming, which store only the single best prompt per archive cell (cells being defined by feature descriptors such as 'Risk Category' and 'Attack Style'), RainbowPlus lets each cell store a set of diverse, high-quality prompts that exceed a fitness threshold (η). This preserves more potentially valuable adversarial prompts and enriches exploration of the vulnerability landscape (see the data-structure sketch after this list).
- Multi-prompt Fitness Evaluation: Instead of the pairwise comparisons used in Rainbow Teaming, RainbowPlus employs a Judge LLM (π_J) to evaluate, in parallel, multiple candidate prompts generated by a Mutator LLM (π_M). The fitness of a prompt x′ is the probability that the Target LLM's (π_T) response is deemed unsafe by the Judge: f(x′) = P(π_J(π_T(x′)) = "unsafe"). This probabilistic scoring improves both accuracy and computational efficiency (see the fitness sketch after this list).
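To make the archive idea concrete, here is a minimal Python sketch of a multi-element archive keyed by the two descriptors. The class and method names are illustrative, not the authors' implementation, and the archive is assumed to be seeded with initial prompts before sampling:

```python
# Minimal sketch of a multi-element archive (illustrative, not the
# authors' code). Cells are keyed by the two feature descriptors;
# eta is the fitness threshold.
from collections import defaultdict

class MultiElementArchive:
    def __init__(self, eta: float = 0.5):
        self.eta = eta
        # cell key -> list of (prompt, fitness) pairs kept in that cell
        self.cells: dict[tuple[str, str], list[tuple[str, float]]] = defaultdict(list)

    def add(self, risk_category: str, attack_style: str,
            prompt: str, fitness: float) -> bool:
        """Keep every prompt that clears the threshold, instead of
        replacing the cell's single elite as MAP-Elites/Rainbow would."""
        if fitness > self.eta:
            self.cells[(risk_category, attack_style)].append((prompt, fitness))
            return True
        return False

    def sample_parent(self, rng) -> str:
        """Pick a random occupied cell, then a random prompt inside it
        (assumes the archive was seeded with initial prompts)."""
        cell = self.cells[rng.choice(list(self.cells))]
        return rng.choice(cell)[0]
```

The only structural difference from a classic MAP-Elites archive is that `add` appends to a per-cell list rather than replacing the cell's single occupant.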
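Similarly, the probabilistic fitness f(x′) = P(π_J(π_T(x′)) = "unsafe") could be computed as sketched below, assuming a batched `generate` call on the target and a hypothetical judge method (`verdict_logprobs`) that returns log-probabilities over a binary safe/unsafe verdict:

```python
import math

def fitness_batch(candidates, target_llm, judge_llm):
    """f(x') = P(judge labels the target's response to x' as 'unsafe'),
    evaluated for a whole batch of candidate prompts at once."""
    responses = target_llm.generate(candidates)  # assumed batched API
    scores = []
    for prompt, response in zip(candidates, responses):
        # Hypothetical judge call returning log-probs over the two
        # verdicts {'safe', 'unsafe'} for this (prompt, response) pair.
        logp = judge_llm.verdict_logprobs(prompt, response)
        # Renormalize over the two verdicts to obtain P(unsafe).
        z = math.exp(logp["safe"]) + math.exp(logp["unsafe"])
        scores.append(math.exp(logp["unsafe"]) / z)
    return scores
```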
The RainbowPlus Algorithm Pipeline:
The framework operates iteratively through five stages (visualized in Figure 1; a pseudocode sketch of one iteration follows the list):
- Prompt Sampling: Select a parent prompt from the archive.
- Candidate Generation: Use a Mutator LLM (πM) with few-shot prompting to generate multiple candidate offspring prompts based on the parent and a target descriptor.
- Diversity Filtering: Select a subset of behaviorally distinct candidates (using metrics like BLEU score) to promote diverse exploration.
- Response Evaluation: Obtain responses from the Target LLM (πT) for the filtered candidates and calculate their fitness scores using the Judge LLM (πJ).
- Update: Add candidate prompts exceeding the fitness threshold (η) to the corresponding cell in the multi-element archive.
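A minimal sketch of one iteration of this loop, reusing the archive and `fitness_batch` sketches above; the greedy BLEU filter and the `mutate` call are illustrative assumptions rather than the paper's exact procedure:

```python
from nltk.translate.bleu_score import sentence_bleu, SmoothingFunction

def diversity_filter(candidates, max_keep=5, bleu_cutoff=0.6):
    """Greedily keep candidates whose BLEU overlap with already-kept
    candidates stays below a cutoff (illustrative selection rule)."""
    smooth = SmoothingFunction().method1
    kept = []
    for cand in candidates:
        tokens = cand.split()
        too_similar = any(
            sentence_bleu([k.split()], tokens, smoothing_function=smooth) > bleu_cutoff
            for k in kept
        )
        if not too_similar:
            kept.append(cand)
        if len(kept) == max_keep:
            break
    return kept

def rainbowplus_step(archive, mutator_llm, target_llm, judge_llm, descriptor, rng):
    """One iteration of the five-stage loop; `descriptor` is a
    (risk_category, attack_style) pair targeting one archive cell."""
    risk, style = descriptor
    # 1. Prompt Sampling: draw a parent from the (pre-seeded) archive.
    parent = archive.sample_parent(rng)
    # 2. Candidate Generation: few-shot mutation toward the descriptor
    #    (hypothetical mutator API).
    candidates = mutator_llm.mutate(parent, descriptor)
    # 3. Diversity Filtering: keep behaviorally distinct candidates.
    candidates = diversity_filter(candidates)
    # 4. Response Evaluation: probabilistic fitness via the Judge LLM.
    scores = fitness_batch(candidates, target_llm, judge_llm)
    # 5. Update: retain every candidate whose fitness clears eta.
    for prompt, fit in zip(candidates, scores):
        archive.add(risk, style, prompt, fit)
```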
Experimental Evaluation:
RainbowPlus was evaluated extensively against baseline methods:
- Comparison vs. Rainbow:
- Setup: Tested on 6 benchmark datasets (DNA, AQA, HQA, CHQA, DQA, BeaT) against 4 open-source LLMs (Llama-3.1-8B-Instruct, Gemma-2-9b-it, Qwen2.5-7B-Instruct, Ministral-8B-Instruct-2410). Two RainbowPlus variants (α: median fitness per cell, β: max fitness per cell) were included to mimic Rainbow's single-prompt constraint.
- Metrics: Attack Success Rate (ASR), judged independently by Llama-Guard-3-8B, and Diverse-Score (1 - Self-BLEU; a computation sketch follows these results).
- Results: RainbowPlus and its variants significantly outperformed the reimplemented Rainbow baseline in ASR (e.g., 86.20% absolute ASR gain for RainbowPlus-β vs. Rainbow on Gemma-2-9b-it/DQA). RainbowPlus maintained comparable diversity (Diverse-Score ≈0.84) while generating up to 100x more unique prompts. Runtime was variable, sometimes faster and sometimes slower than Rainbow depending on the target LLM. Visualizations confirmed broader exploration of the prompt space.
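For reference, Diverse-Score = 1 - Self-BLEU can be computed along these lines with NLTK; averaging each prompt's BLEU against all other prompts as references is a common Self-BLEU convention, though not necessarily the paper's exact setup:

```python
from nltk.translate.bleu_score import sentence_bleu, SmoothingFunction

def diverse_score(prompts):
    """Diverse-Score = 1 - Self-BLEU over a set of generated prompts."""
    assert len(prompts) >= 2, "Self-BLEU needs at least two prompts"
    smooth = SmoothingFunction().method1
    tokenized = [p.split() for p in prompts]
    bleus = []
    for i, hyp in enumerate(tokenized):
        refs = tokenized[:i] + tokenized[i + 1:]  # all other prompts as references
        bleus.append(sentence_bleu(refs, hyp, smoothing_function=smooth))
    return 1.0 - sum(bleus) / len(bleus)
```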
- Comparison vs. State-of-the-Art (SOTA):
- Setup: Tested on the HarmBench dataset against 9 SOTA methods (GCG, Zero-Shot, PAIR, TAP, PAP, AutoDAN, AutoDAN-Turbo, Human Jailbreaks, Direct Request) across 12 LLMs (10 open-source 7B models, 2 closed-source: GPT-4o Mini, GPT-4.1 Nano). ASR calculation was adapted to match HarmBench standards.
- Results: RainbowPlus achieved the highest average ASR (81.1%) across all models, surpassing the strongest baseline, AutoDAN-Turbo (77.2%), by 3.9 percentage points. It was also markedly faster, requiring ~1.45 hours versus AutoDAN-Turbo's ~13.50 hours (excluding AutoDAN-Turbo's training time), largely because RainbowPlus needs no warm-up phase. It outperformed AutoDAN-Turbo on GPT-4o Mini but underperformed on GPT-4.1 Nano.
Contributions and Conclusion:
The main contributions are:
- The RainbowPlus framework with its multi-element archive and multi-prompt fitness evaluation for efficient and diverse adversarial prompt generation.
- Comprehensive empirical validation showing superiority over QD baselines and SOTA methods in ASR and prompt diversity/quantity, along with a substantial speed advantage over methods such as AutoDAN-Turbo.
- An open-source implementation to facilitate reproducibility and further research in LLM safety.
The paper concludes that RainbowPlus offers a scalable and effective evolutionary QD approach for LLM red-teaming. While acknowledging limitations like the manual definition of archive dimensions and potential underperformance on highly robust models without a warm-up phase, the authors propose future work on automated descriptor selection, warm-up phase integration, and scaling to larger models. The work emphasizes responsible red-teaming practices and aims to contribute to building safer LLMs.