- The paper introduces RainbowPlus, an evolutionary framework for adversarial prompt generation that optimizes both attack success and prompt diversity.
- It employs a multi-element archive and Judge LLM-based fitness evaluation to efficiently generate high-quality adversarial prompts.
- Experimental results demonstrate significant improvements in attack success rate and prompt diversity compared to state-of-the-art red-teaming methods.
This paper introduces RainbowPlus, a novel framework for red-teaming LLMs to identify vulnerabilities and generate adversarial prompts that elicit unsafe or biased outputs. The authors argue that existing red-teaming methods often suffer from scalability issues, high resource requirements, limited diversity in attack strategies, or reliance on manual intervention.
RainbowPlus tackles these limitations by framing adversarial prompt generation as an evolutionary quality-diversity (QD) search problem, building upon the MAP-Elites algorithm and the earlier Rainbow Teaming approach. The core idea is to simultaneously optimize for attack success (quality) and the variety of attack strategies (diversity).
Key Innovations of RainbowPlus:
- Multi-element Archive: Unlike traditional MAP-Elites and Rainbow Teaming, which store only the single best prompt per archive cell (cells being defined by feature descriptors such as 'Risk Category' and 'Attack Style'), RainbowPlus lets each cell store a set of diverse, high-quality prompts that exceed a fitness threshold (η). This preserves more potentially valuable adversarial prompts and enriches exploration of the vulnerability landscape (see the data-structure sketch after this list).
- Multi-prompt Fitness Evaluation: Instead of the pairwise comparisons used in Rainbow Teaming, RainbowPlus employs a Judge LLM (π_J) to evaluate, in parallel, multiple candidate prompts generated by a Mutator LLM (π_M). The fitness of a prompt x′ is the probability that the Target LLM's (π_T) response is deemed unsafe by the Judge: f(x′) = P(π_J(π_T(x′)) = "unsafe"). This probabilistic scoring improves both accuracy and computational efficiency (see the fitness sketch after this list).
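To make the archive idea concrete, here is a minimal Python sketch of a multi-element archive keyed by the two descriptors. The class and method names are illustrative, not the authors' implementation, and the archive is assumed to be seeded with initial prompts before sampling:

```python
# Minimal sketch of a multi-element archive (illustrative, not the
# authors' code). Cells are keyed by the two feature descriptors;
# eta is the fitness threshold.
from collections import defaultdict

class MultiElementArchive:
    def __init__(self, eta: float = 0.5):
        self.eta = eta
        # cell key -> list of (prompt, fitness) pairs kept in that cell
        self.cells: dict[tuple[str, str], list[tuple[str, float]]] = defaultdict(list)

    def add(self, risk_category: str, attack_style: str,
            prompt: str, fitness: float) -> bool:
        """Keep every prompt that clears the threshold, instead of
        replacing the cell's single elite as MAP-Elites/Rainbow would."""
        if fitness > self.eta:
            self.cells[(risk_category, attack_style)].append((prompt, fitness))
            return True
        return False

    def sample_parent(self, rng) -> str:
        """Pick a random occupied cell, then a random prompt inside it
        (assumes the archive was seeded with initial prompts)."""
        cell = self.cells[rng.choice(list(self.cells))]
        return rng.choice(cell)[0]
```

The only structural difference from a classic MAP-Elites archive is that `add` appends to a per-cell list rather than replacing the cell's single occupant.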
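Similarly, the probabilistic fitness f(x′) = P(π_J(π_T(x′)) = "unsafe") could be computed as sketched below, assuming a batched `generate` call on the target and a hypothetical judge method (`verdict_logprobs`) that returns log-probabilities over a binary safe/unsafe verdict:

```python
import math

def fitness_batch(candidates, target_llm, judge_llm):
    """f(x') = P(judge labels the target's response to x' as 'unsafe'),
    evaluated for a whole batch of candidate prompts at once."""
    responses = target_llm.generate(candidates)  # assumed batched API
    scores = []
    for prompt, response in zip(candidates, responses):
        # Hypothetical judge call returning log-probs over the two
        # verdicts {'safe', 'unsafe'} for this (prompt, response) pair.
        logp = judge_llm.verdict_logprobs(prompt, response)
        # Renormalize over the two verdicts to obtain P(unsafe).
        z = math.exp(logp["safe"]) + math.exp(logp["unsafe"])
        scores.append(math.exp(logp["unsafe"]) / z)
    return scores
```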
The RainbowPlus Algorithm Pipeline:
The framework operates iteratively through five stages (visualized in Figure 1; a pseudocode sketch of one iteration follows the list):
- Prompt Sampling: Select a parent prompt from the archive.
- Candidate Generation: Use a Mutator LLM (πM) with few-shot prompting to generate multiple candidate offspring prompts based on the parent and a target descriptor.
- Diversity Filtering: Select a subset of behaviorally distinct candidates (using metrics like BLEU score) to promote diverse exploration.
- Response Evaluation: Obtain responses from the Target LLM (πT) for the filtered candidates and calculate their fitness scores using the Judge LLM (πJ).
- Update: Add candidate prompts exceeding the fitness threshold (η) to the corresponding cell in the multi-element archive.
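A minimal sketch of one iteration of this loop, reusing the archive and `fitness_batch` sketches above; the greedy BLEU filter and the `mutate` call are illustrative assumptions rather than the paper's exact procedure:

```python
from nltk.translate.bleu_score import sentence_bleu, SmoothingFunction

def diversity_filter(candidates, max_keep=5, bleu_cutoff=0.6):
    """Greedily keep candidates whose BLEU overlap with already-kept
    candidates stays below a cutoff (illustrative selection rule)."""
    smooth = SmoothingFunction().method1
    kept = []
    for cand in candidates:
        tokens = cand.split()
        too_similar = any(
            sentence_bleu([k.split()], tokens, smoothing_function=smooth) > bleu_cutoff
            for k in kept
        )
        if not too_similar:
            kept.append(cand)
        if len(kept) == max_keep:
            break
    return kept

def rainbowplus_step(archive, mutator_llm, target_llm, judge_llm, descriptor, rng):
    """One iteration of the five-stage loop; `descriptor` is a
    (risk_category, attack_style) pair targeting one archive cell."""
    risk, style = descriptor
    # 1. Prompt Sampling: draw a parent from the (pre-seeded) archive.
    parent = archive.sample_parent(rng)
    # 2. Candidate Generation: few-shot mutation toward the descriptor
    #    (hypothetical mutator API).
    candidates = mutator_llm.mutate(parent, descriptor)
    # 3. Diversity Filtering: keep behaviorally distinct candidates.
    candidates = diversity_filter(candidates)
    # 4. Response Evaluation: probabilistic fitness via the Judge LLM.
    scores = fitness_batch(candidates, target_llm, judge_llm)
    # 5. Update: retain every candidate whose fitness clears eta.
    for prompt, fit in zip(candidates, scores):
        archive.add(risk, style, prompt, fit)
```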
Experimental Evaluation:
RainbowPlus was evaluated extensively against baseline methods:
- Comparison vs. Rainbow:
- Setup: Tested on 6 benchmark datasets (DNA, AQA, HQA, CHQA, DQA, BeaT) against 4 open-source LLMs (Llama-3.1-8B-Instruct, Gemma-2-9b-it, Qwen2.5-7B-Instruct, Ministral-8B-Instruct-2410). Two RainbowPlus variants (α: median fitness per cell, β: max fitness per cell) were included to mimic Rainbow's single-prompt constraint.
- Metrics: Attack Success Rate (ASR), judged independently by Llama-Guard-3-8B, and Diverse-Score (1 - Self-BLEU; a computation sketch follows these results).
- Results: RainbowPlus and its variants significantly outperformed the reimplemented Rainbow baseline in ASR (e.g., 86.20% absolute ASR gain for RainbowPlus-β vs. Rainbow on Gemma-2-9b-it/DQA). RainbowPlus maintained comparable diversity (Diverse-Score ≈0.84) while generating up to 100x more unique prompts. Runtime was variable, sometimes faster and sometimes slower than Rainbow depending on the target LLM. Visualizations confirmed broader exploration of the prompt space.
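For reference, Diverse-Score = 1 - Self-BLEU can be computed along these lines with NLTK; averaging each prompt's BLEU against all other prompts as references is a common Self-BLEU convention, though not necessarily the paper's exact setup:

```python
from nltk.translate.bleu_score import sentence_bleu, SmoothingFunction

def diverse_score(prompts):
    """Diverse-Score = 1 - Self-BLEU over a set of generated prompts."""
    assert len(prompts) >= 2, "Self-BLEU needs at least two prompts"
    smooth = SmoothingFunction().method1
    tokenized = [p.split() for p in prompts]
    bleus = []
    for i, hyp in enumerate(tokenized):
        refs = tokenized[:i] + tokenized[i + 1:]  # all other prompts as references
        bleus.append(sentence_bleu(refs, hyp, smoothing_function=smooth))
    return 1.0 - sum(bleus) / len(bleus)
```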
- Comparison vs. State-of-the-Art (SOTA):
- Setup: Tested on the HarmBench dataset against 9 SOTA methods (GCG, Zero-Shot, PAIR, TAP, PAP, AutoDAN, AutoDAN-Turbo, Human Jailbreaks, Direct Request) across 12 LLMs (10 open-source 7B models, 2 closed-source: GPT-4o Mini, GPT-4.1 Nano). ASR calculation was adapted to match HarmBench standards.
- Results: RainbowPlus achieved the highest average ASR (81.1%) across all models, surpassing the strongest baseline, AutoDAN-Turbo (77.2%), by 3.9 percentage points. It was also markedly faster, requiring ~1.45 hours versus AutoDAN-Turbo's ~13.50 hours (excluding AutoDAN-Turbo's training time), largely because RainbowPlus needs no warm-up phase. It outperformed AutoDAN-Turbo on GPT-4o Mini but underperformed on GPT-4.1 Nano.
Contributions and Conclusion:
The main contributions are:
- The RainbowPlus framework with its multi-element archive and multi-prompt fitness evaluation for efficient and diverse adversarial prompt generation.
- Comprehensive empirical validation showing superiority over QD baselines and SOTA methods in ASR and prompt diversity/quantity, along with a substantial speed advantage over methods such as AutoDAN-Turbo.
- An open-source implementation to facilitate reproducibility and further research in LLM safety.
The paper concludes that RainbowPlus offers a scalable and effective evolutionary QD approach for LLM red-teaming. While acknowledging limitations like the manual definition of archive dimensions and potential underperformance on highly robust models without a warm-up phase, the authors propose future work on automated descriptor selection, warm-up phase integration, and scaling to larger models. The work emphasizes responsible red-teaming practices and aims to contribute to building safer LLMs.