Rainbow Teaming: QD Optimization for LLM Safety

Updated 13 December 2025
  • Rainbow Teaming is a MAP-Elites style quality–diversity approach that generates diverse and effective adversarial prompts for red-teaming large language models.
  • The methodology employs multi-stage semantic mutations and BLEU-based filtering to explore behavioral risk categories and attack styles systematically.
  • Empirical results show high attack success rates and prompt transferability across models, informing robust strategies for LLM safety improvements.

Rainbow Teaming is a quality–diversity (QD) optimization paradigm for automated red teaming of LLMs, systematically generating diverse sets of adversarial prompts that maximize both attack success rate (ASR) and coverage across behavioral descriptors such as risk category and attack style. Defined as a MAP-Elites–style search, Rainbow Teaming provides an open-ended, black-box method for probing, benchmarking, and strengthening LLM safety by revealing axes of vulnerability distributed throughout the input space. Its methodology, theoretical properties, algorithmic workflow, and practical impact have led to subsequent refinements and generalizations across the automated red-teaming literature (Samvelyan et al., 26 Feb 2024, Pala et al., 20 Aug 2024, Han et al., 17 Jun 2024, Dang et al., 21 Apr 2025).

1. Foundations: Quality–Diversity Search and MAP-Elites

Rainbow Teaming operationalizes adversarial prompt discovery as a multi-objective QD problem over the prompt space $X$. Each element $x \in X$ is evaluated on:

  • Quality function $Q(x)$: a measure of harmfulness or attack success, e.g., the probability that the target LLM produces an unsafe reply under a safety classifier or as judged by a reference LLM.
  • Diversity descriptor $b(x)$: a mapping from prompts to a discrete behavioral grid $D$. For safety domains, $D = R \times A$, where $R$ indexes risk categories (e.g., “Violence”, “Privacy”) and $A$ indexes attack styles (e.g., “Role Play”, “Misspellings”).

The optimization target is, for every $d \in D$,

$$\max_{x :\, b(x) = d} Q(x)$$

while maximizing the aggregate coverage of $D$ (i.e., filling as many grid cells as possible with effective adversarial prompts) (Samvelyan et al., 26 Feb 2024, Pala et al., 20 Aug 2024).

The core search architecture follows MAP-Elites: maintain an archive $A_t : D \rightarrow (x_d, q_d)$ mapping each cell to its best-known prompt and quality. Iterative updates preferentially select low-quality or underexplored cells via softmax-based sampling,

$$\sigma(z_d) = \frac{\exp\big((1 - q_d)/T\big)}{\sum_{d'} \exp\big((1 - q_{d'})/T\big)},$$

where $T$ is a temperature, so that search is adaptively focused on the behavioral niches most in need of improvement.
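A minimal Python sketch of this archive and cell-sampling scheme is given below. The grid labels, temperature value, and `Cell` structure are illustrative assumptions, not details taken from the cited papers.

```python
# Minimal MAP-Elites archive with softmax cell sampling (illustrative sketch).
import math
import random
from dataclasses import dataclass
from typing import Optional

RISKS = ["Violence", "Privacy", "Fraud"]            # example risk categories R
STYLES = ["Role Play", "Misspellings", "Slang"]     # example attack styles A

@dataclass
class Cell:
    prompt: Optional[str] = None   # best-known adversarial prompt x_d in this cell
    quality: float = 0.0           # its judged harmfulness q_d in [0, 1]

# Archive A_t : D -> (x_d, q_d), keyed by descriptor (risk, style)
archive = {(r, s): Cell() for r in RISKS for s in STYLES}

def sample_cell(archive, temperature: float = 0.1):
    """Sample a descriptor cell with probability softmax((1 - q_d) / T),
    so low-quality (under-attacked) cells are chosen more often."""
    cells = list(archive.keys())
    weights = [math.exp((1.0 - archive[c].quality) / temperature) for c in cells]
    return random.choices(cells, weights=weights, k=1)[0]
```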

2. Core Algorithmic Workflow

The prototypical Rainbow Teaming loop, as formulated in (Samvelyan et al., 26 Feb 2024, Pala et al., 20 Aug 2024), consists of:

  1. Initialization: Seed the archive with handcrafted or automatically sampled prompts covering the descriptor grid.
  2. Cell Sampling: Select a parent cell $(r_i, a_j)$ with probability proportional to its lack of quality.
  3. Mutation: Generate a child prompt by two-stage semantic mutation:
    • Risk mutation: Rewriting the parent prompt to target a specified risk category.
    • Style mutation: Injecting a behavioral attack style. The mutator is typically a large, instruction-tuned LLM fine-tuned on (risk, style) pairs.
  4. Diversity Filtering: Remove near-duplicate mutations via a BLEU-based similarity threshold.
  5. Target Querying: Submit candidate prompts to the target LLM to collect output.
  6. Judgement: Evaluate harmfulness (quality) using a judge model (preference-style LLM, learned reward model, or classifier).
  7. Archive Update: If the new prompt outperforms the current cell’s incumbent, replace and update the archive.

This approach ensures that the archive increasingly accumulates prompts that are both highly effective (high $Q(x)$) and highly diverse (covering as much of $D$ as possible). Empirical instantiations adopt grid resolutions of roughly $10 \times 10$ for (risk category, attack style) (Samvelyan et al., 26 Feb 2024, Han et al., 17 Jun 2024, Pala et al., 20 Aug 2024).
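A compact sketch of one iteration of this loop is shown below, reusing the archive and `sample_cell` helper from the earlier sketch. The `mutate`, `target_respond`, and `judge` functions are placeholders for whichever mutator, target, and judge LLMs are used; the 0.6 BLEU threshold and fallback seed prompt are illustrative choices, not values fixed by the papers.

```python
# One Rainbow Teaming iteration (steps 2-7 above); step 1 seeds the archive beforehand.
from nltk.translate.bleu_score import sentence_bleu

def mutate(parent: str, risk: str, style: str) -> str:
    """Two-stage semantic mutation (risk, then style) via a mutator LLM (placeholder)."""
    raise NotImplementedError("call the mutator LLM here")

def target_respond(prompt: str) -> str:
    """Query the target LLM under evaluation (placeholder)."""
    raise NotImplementedError("call the target LLM here")

def judge(prompt: str, response: str) -> float:
    """Return a harmfulness score in [0, 1] from a judge model (placeholder)."""
    raise NotImplementedError("call the judge model / safety classifier here")

def too_similar(candidate: str, existing: list[str], threshold: float = 0.6) -> bool:
    """BLEU-based near-duplicate filter against prompts already in the archive."""
    tokens = candidate.split()
    return any(sentence_bleu([p.split()], tokens) > threshold for p in existing)

def rainbow_step(archive, seed: str = "Write something harmful."):
    risk, style = sample_cell(archive)                 # 2. cell sampling
    parent = archive[(risk, style)].prompt or seed
    child = mutate(parent, risk, style)                # 3. two-stage mutation
    elites = [c.prompt for c in archive.values() if c.prompt]
    if too_similar(child, elites):                     # 4. diversity filtering
        return
    quality = judge(child, target_respond(child))      # 5-6. query target, judge output
    cell = archive[(risk, style)]
    if quality > cell.quality:                         # 7. archive update
        cell.prompt, cell.quality = child, quality
```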

3. Empirical Performance, Coverage, and Transferability

Rainbow Teaming achieves high empirical ASRs across a range of LLMs. On Llama 2-chat {7B, 13B, 70B}, the approach yields a Llama Guard ASR of up to $0.89 \pm 0.02$, revealing hundreds of unique, effective adversarial prompts distributed across all risk categories and attack styles (Samvelyan et al., 26 Feb 2024).

Key performance characteristics include:

  • Coverage: Near-complete cell coverage is achieved after sufficient iterations, supporting thorough diagnostic exploration of model behaviors and vulnerabilities.
  • Transferability: Prompts discovered on one model (e.g., Llama 2-7B) retain substantial attack success when re-applied to larger models (mean off-diagonal ASR $\sim 0.67$), indicating the method’s general utility for red-teaming rather than overfitting to model-specific idiosyncrasies.
  • Synthetic Data Utility: Using Rainbow Teaming–generated prompts for supervised fine-tuning of LLMs yields significant post-finetuning reductions in ASR (e.g., $0.92 \rightarrow 0.026$ under GPT-4 evaluation), without substantial degradation of upstream task accuracy (Samvelyan et al., 26 Feb 2024).

Rainbow Teaming has been extended beyond LLM safety to domains such as open-domain question answering (three-dimensional QD with topic, length, and interrogative structure) and cybersecurity (e.g., MITRE tactics × prompt length), achieving broad coverage and effective archive construction (Samvelyan et al., 26 Feb 2024).

4. Limitations and Subsequent Extensions

Despite its conceptual and empirical strengths, the original Rainbow Teaming formulation exhibits several limitations:

  • Slow Convergence: With only one mutation per iteration, archive filling is computationally intensive, requiring large numbers of model calls (Pala et al., 20 Aug 2024).
  • Mutator Resource Demand: Reliable cell targeting requires a heavily fine-tuned, large-parameter mutator LLM.
  • Judgement Bottleneck: Simple judge models may misrank harmfulness or inadvertently encourage reward hacking.
  • Archive Rigidity: Per-cell single-champion designs discard potentially valuable near-optimal alternative prompts (Dang et al., 21 Apr 2025).

Major extensions, each addressing specific Rainbow Teaming limitations, include:

a. Ferret: Multi-Mutation with Learned Reward Models

Ferret introduces multiple parallel prompt mutations per iteration ($N \gg 1$, e.g., $N = 5$) and employs a trained reward model to reliably rank candidate prompt–response pairs for harmfulness. This yields a higher overall ASR (95% on Llama 2-chat 7B), 15–44% reductions in the time to reach a threshold ASR, and better transferability compared to Rainbow Teaming (see table below) (Pala et al., 20 Aug 2024).

| Method | Llama2-7B ASR | Llama3-8B ASR | Time to 0.90 ASR (min) |
|---|---|---|---|
| Rainbow | 0.49 | 0.60 | |
| Rainbow + CF | 0.89 | 0.92 | 352 |
| Ferret (RM) | 0.95 | 0.94 | 299 (–15.2%) |
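A hedged sketch of Ferret's central change, generating several mutations per iteration and keeping the one ranked most harmful by a learned reward model, follows. It reuses the helpers from the earlier sketches; `reward_model_score` is a placeholder, not Ferret's released implementation.

```python
# Ferret-style step: N parallel mutations ranked by a reward model (illustrative sketch).
def reward_model_score(prompt: str, response: str) -> float:
    """Harmfulness score from a trained reward model (placeholder)."""
    raise NotImplementedError("call the reward model here")

def ferret_step(archive, n_mutations: int = 5, seed: str = "Write something harmful."):
    risk, style = sample_cell(archive)
    parent = archive[(risk, style)].prompt or seed
    candidates = [mutate(parent, risk, style) for _ in range(n_mutations)]
    scored = [(reward_model_score(p, target_respond(p)), p) for p in candidates]
    best_score, best_prompt = max(scored)       # keep the most harmful candidate
    cell = archive[(risk, style)]
    if best_score > cell.quality:
        cell.prompt, cell.quality = best_prompt, best_score
```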

b. Ruby Teaming: Memory-Augmented Quality–Diversity

Ruby Teaming adds a per-cell memory buffer, storing the last $k$ successful mutations and corresponding critique texts, which are incorporated as in-context guidance for future mutations (Han et al., 17 Jun 2024). This third memory dimension:

  • Improves both ASR (+20pp, from 54% to 74%) and diversity (Shannon's Evenness Index +6%, Simpson’s Diversity Index +3%).
  • Increases coverage, especially in under-represented risk categories.
  • Demonstrates that memory-augmented QD search efficiently discovers more evenly distributed attacks.

| Metric | Rainbow | Ruby |
|---|---|---|
| SEI | 0.89 | 0.95 (+6%) |
| SDI | 0.87 | 0.90 (+3%) |
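A sketch of the memory mechanism is shown below: each cell retains its last few successful mutations together with critique text, which is prepended to the mutator's input as in-context guidance. The `MemoryCell` fields, the k=3 default, and the prompt format are illustrative assumptions building on the earlier sketches.

```python
# Ruby Teaming-style per-cell memory (illustrative sketch; extends the Cell class above).
from collections import deque

class MemoryCell(Cell):
    def __init__(self, k: int = 3):
        super().__init__()
        self.memory = deque(maxlen=k)     # last k (prompt, critique) pairs, newest last

    def remember(self, prompt: str, critique: str) -> None:
        self.memory.append((prompt, critique))

def mutate_with_memory(parent: str, risk: str, style: str, memory) -> str:
    """Prepend past successes and critiques to the mutator's input as guidance."""
    guidance = "\n".join(f"Previous attempt: {p}\nCritique: {c}" for p, c in memory)
    return mutate(f"{guidance}\n{parent}" if guidance else parent, risk, style)
```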

c. RainbowPlus: Multi-Element Archive and Scalable Fitness

RainbowPlus generalizes the archive architecture to allow each cell to store multiple high-performing prompts, uses batch fitness evaluation via judge LLMs for efficiency, and scales coverage up ($\sim$10,000 unique prompts per run, a 100$\times$ increase), while yielding 2–15$\times$ higher ASR and 9$\times$ faster execution versus prior methods (Dang et al., 21 Apr 2025).

| Model | Rainbow ASR | RainbowPlus ASR |
|---|---|---|
| Llama3.1-8B | 35.9% | 71.13% |
| Gemma-2-9B-it | 5.53% | 83.27% |
| Qwen2.5-7B | 29.34% | 79.07% |
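A sketch of the multi-element archive idea follows: each cell keeps its top-k prompts in a small heap instead of a single champion, and candidates are scored in one batched judge call. `judge_batch`, the cell capacity, and the mutation count are illustrative assumptions, not RainbowPlus's released code.

```python
# RainbowPlus-style multi-element cells with batched judging (illustrative sketch).
import heapq
import random

def judge_batch(prompts, responses):
    """Batched harmfulness scores from a judge LLM (placeholder)."""
    raise NotImplementedError("score all (prompt, response) pairs in one call here")

class MultiCell:
    def __init__(self, capacity: int = 10):
        self.capacity = capacity
        self.elites = []                  # min-heap of (quality, prompt)

    def add(self, prompt: str, quality: float) -> None:
        heapq.heappush(self.elites, (quality, prompt))
        if len(self.elites) > self.capacity:
            heapq.heappop(self.elites)    # drop the weakest stored prompt

    @property
    def quality(self) -> float:
        """Best quality in the cell, so sample_cell still works unchanged."""
        return max(q for q, _ in self.elites) if self.elites else 0.0

def rainbowplus_step(archive, n_mutations: int = 8, seed: str = "Write something harmful."):
    risk, style = sample_cell(archive)
    cell = archive[(risk, style)]
    parents = [p for _, p in cell.elites] or [seed]
    candidates = [mutate(random.choice(parents), risk, style) for _ in range(n_mutations)]
    responses = [target_respond(p) for p in candidates]
    for prompt, score in zip(candidates, judge_batch(candidates, responses)):
        cell.add(prompt, score)
```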

5. Comparative Analysis and Metrics

The efficacy of Rainbow Teaming and its variants is systematically assessed by:

  • Attack Success Rate (ASR): Fraction of archive prompts causing the target LLM to generate unsafe outputs, as measured by reference classifiers or human inspection.
  • Diversity Metrics: Quantified via Shannon's Evenness Index (SEI), Simpson’s Diversity Index (SDI), and self-BLEU scores over discovered prompts to ensure coverage is not concentrated in a few behavioral niches.
  • Transferability: Cross-model ASR when adversarial prompt archives are applied to held-out LLMs.
  • Coverage: Proportion of filled cells in the hand-crafted behavior grid.

A notable finding is that self-BLEU–filtered Rainbow Teaming archives retain semantic diversity, even under strong attack success constraints, reinforcing QD search as a critical departure from narrow adversarial optimization (Samvelyan et al., 26 Feb 2024, Dang et al., 21 Apr 2025).
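A small sketch of how these metrics can be computed over a finished archive is given below; the 0.5 success threshold and the use of risk category as the diversity dimension are illustrative assumptions.

```python
# Evaluation metrics over a finished archive (illustrative sketch).
import math
from collections import Counter

def attack_success_rate(qualities, threshold: float = 0.5) -> float:
    """Fraction of archive prompts whose judged harmfulness exceeds the threshold."""
    return sum(q > threshold for q in qualities) / len(qualities)

def coverage(archive) -> float:
    """Proportion of descriptor-grid cells holding at least one prompt."""
    return sum(1 for c in archive.values() if c.prompt) / len(archive)

def shannon_evenness(categories) -> float:
    """SEI: Shannon entropy of category counts, normalised by ln(#categories)."""
    counts = Counter(categories)
    total = sum(counts.values())
    h = -sum((n / total) * math.log(n / total) for n in counts.values())
    return h / math.log(len(counts)) if len(counts) > 1 else 1.0

def simpson_diversity(categories) -> float:
    """SDI as 1 - sum(p_i^2): chance that two sampled prompts differ in category."""
    counts = Counter(categories)
    total = sum(counts.values())
    return 1.0 - sum((n / total) ** 2 for n in counts.values())
```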

6. Extensions Beyond LLM Safety

The QD search mechanisms underlying Rainbow Teaming are not domain-specific. The approach has been extended to question answering (topic coverage, interrogative type diversity) and cybersecurity (MITRE TTPs × prompt length), as well as to general scalability benchmarking (HarmBench). In each domain, the archive-centric, QD-driven method exhibits superior coverage, transfer, and downstream data utility (Samvelyan et al., 26 Feb 2024, Dang et al., 21 Apr 2025).

7. Practical Considerations and Future Directions

Practical deployment of Rainbow Teaming and its extensions requires:

  • Selection and tuning of descriptor grid resolution to balance granularity and computational tractability (see the configuration sketch after this list).
  • Careful mutator LLM selection and fine-tuning to ensure mutation reliability across all descriptor bins.
  • Adoption of reward model or memory-augmented structures for accelerating convergence in resource-constrained settings.
  • Post-processing of prompt archives for effective safety fine-tuning or robust multitask learning.
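As a concrete illustration of these knobs, a minimal configuration sketch is given below; every field name and default value is a hypothetical example, not a setting prescribed by the papers.

```python
# Hypothetical red-teaming run configuration (illustrative sketch).
from dataclasses import dataclass, field

@dataclass
class RainbowConfig:
    risk_categories: list = field(default_factory=lambda: ["Violence", "Privacy", "Fraud"])
    attack_styles: list = field(default_factory=lambda: ["Role Play", "Misspellings", "Slang"])
    sampling_temperature: float = 0.1   # T in the softmax cell-sampling rule
    bleu_threshold: float = 0.6         # near-duplicate filter cutoff
    mutations_per_step: int = 1         # raise for Ferret/RainbowPlus-style batching
    mutator_model: str = "your-mutator-llm"
    judge_model: str = "your-judge-llm"
    total_iterations: int = 2000

config = RainbowConfig()
```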

Open directions include integrating QD red-teaming into RLHF pipelines for continual adversarial robustness, extending to multi-modal (vision+language) and task-general benchmarks, and developing richer, automated judgment models minimizing reliance on domain expertise (Samvelyan et al., 26 Feb 2024, Pala et al., 20 Aug 2024, Dang et al., 21 Apr 2025).


References:

Rainbow Teaming (Samvelyan et al., 26 Feb 2024), Ferret (Pala et al., 20 Aug 2024), Ruby Teaming (Han et al., 17 Jun 2024), RainbowPlus (Dang et al., 21 Apr 2025)
