
AgenticRed: Automated Red-Teaming Pipeline

Updated 21 January 2026
  • AgenticRed is an automated pipeline that designs and optimizes agentic red-teaming systems targeting large language models through evolutionary techniques.
  • It features a multi-stage workflow with initialization, iterative meta-agent design, self-reflection debugging, and evolutionary selection to systematically improve attack strategies.
  • Empirical results show significant ASR improvements across benchmark models and strong transferability to proprietary systems, validating its robust optimization approach.

AgenticRed is an automated pipeline for the design and optimization of agentic red-teaming systems targeting LLMs. It employs LLM-powered in-context learning and evolutionary selection to construct multi-step, workflow-based attack systems—termed "agentic"—without human-driven design. AgenticRed treats red-teaming as a system design challenge: rather than optimizing attacker strategies within fixed human-crafted workflows, it evolves novel system architectures in code to maximize attack success against AI models. Its empirical results demonstrate significant improvements in Attack Success Rate (ASR) across multiple benchmark models and strong transferability to proprietary systems (Yuan et al., 20 Jan 2026).

1. AgenticRed Pipeline Structure and Workflow

AgenticRed is organized into a multi-stage pipeline composed of initialization, iterative meta-agent design and self-reflection, evolutionary selection, archive updating, and termination. The process is controlled by a system-level meta-agent (e.g., gpt-5-2025-08-07) that generates and iteratively debugs candidate red-teaming systems.

Pipeline Modules

  • Initialization: Begins with an archive A_0 of hand-designed baseline agentic systems (such as Self-Refine or JudgeScore-Guided AdvReasoning) and their fitness metrics.
  • Design Loop (Generations n = 1…N):
    • Design Step: The meta-agent ingests the archive’s code and rationale, producing M new "offspring" agentic systems A^n_j as explicit code implementations and design explanations.
    • Self-Reflection: Each candidate is compiled and tested on a small intent set d ⊂ D; errors trigger up to k iterative debug steps via meta-agent self-reflection.
    • Initial Evaluation: Computes fitness f(A^n_j) = ASR(A^n_j, T, J, d) per the ASR definition (Section 2).
    • Selection: Selects the fittest offspring j* = argmax_j f(A^n_j) as A_n.
    • Full Evaluation: Executes A_n on the full search dataset D̃, updating the archive.
  • Evolutionary Selection: Survival-of-the-fittest: only the top-scoring offspring in each generation persists.
  • In-Context Learning Setup: Prompts supply archive code, objectives, workflow templates, and examples; meta-agent outputs Python-style forward(taskInfo) routines coordinating attacker, feedbacker, and optimizer modules.
  • Termination: After N generations or upon achieving 100% ASR, the pipeline halts and returns the archive.

Pseudocode Summary

Archive A ← {baseline systems}            # A_0 with fitness metrics
for n in 1..N:
  for j in 1..M:
    A_j ← MetaAgent.generate_system(A)    # design step, conditioned on archive
    for s in 1..k:                        # self-reflection debugging
      try: run A_j on d to compute ASR
      except error: MetaAgent.self_reflect(error)
    f_j ← ASR(A_j, T, J, d)
  j* ← argmax_j f_j                       # evolutionary selection
  A_n ← A_{j*}
  evaluate A_n on full D̃
  A ← A ∪ {A_n}                           # archive update
return A
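The loop above can be sketched as runnable Python. The meta-agent, target model, and judge are replaced by toy stand-ins (all names here are illustrative, not from the paper), so only the evolutionary control flow is faithful: generate M offspring per generation, keep the fittest, grow the archive, stop early at 100% ASR.

```python
import random

def meta_generate(archive):
    """Stand-in for the meta-agent's design step: derive a new candidate
    from the best archived system (here, a single numeric 'strategy')."""
    best = max(archive, key=lambda a: a["fitness"])
    return {"strategy": best["strategy"] + random.uniform(-1, 1)}

def asr(candidate):
    """Stand-in fitness: ASR in [0, 1], peaking at strategy == 5."""
    return max(0.0, 1.0 - abs(candidate["strategy"] - 5.0) / 5.0)

def evolve(n_generations=10, m_offspring=3, seed=0):
    random.seed(seed)
    archive = [{"strategy": 0.0, "fitness": 0.0}]  # hand-designed baseline A_0
    for _ in range(n_generations):
        offspring = [meta_generate(archive) for _ in range(m_offspring)]
        for cand in offspring:
            cand["fitness"] = asr(cand)            # small-set evaluation
        best = max(offspring, key=lambda c: c["fitness"])
        archive.append(best)                       # survival of the fittest
        if best["fitness"] >= 1.0:                 # early termination at 100% ASR
            break
    return archive

archive = evolve()
print(round(max(a["fitness"] for a in archive), 2))
```

Because the archive only grows, its best fitness is non-decreasing across generations, mirroring the pipeline's monotone improvement.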

2. Formalization: Red-teaming as System Design

Red-teaming systems are functions A mapping intents I (e.g., “How to build a bomb?”) to adversarial prompts p = A(I) that, when submitted to a target model T, elicit harmful output. Their fitness is measured by attack success against a judge function J(T(p), I), which evaluates whether a jailbreak occurred.

Formal Definitions

  • Attack Success Rate (ASR):

\mathrm{ASR}(A, T, J, D) = \mathbb{E}_{I \sim D}\left[ J(T(A(I)), I) \right]

  • Agentic System Composition:
    • Attacker (prompt generator)
    • Feedbacker (refines/ranks candidate prompts)
    • Optimizer (applies feedback)
    • Helper utilities (get_response, get_jailbreak_result)
    • Workflow hyperparameters (iteration count, wrappers, batch size, etc.)

Evolution is formalized as selecting the fittest system from the candidate set C_n at generation n:

A_n = \underset{A \in C_n}{\arg\max}\ \mathrm{ASR}(A, T, J, d)
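The ASR definition is just an empirical mean of judge verdicts over an intent set; a minimal sketch, with `attack`, `target`, and `judge` as placeholder callables (these names are illustrative, not the paper's API):

```python
from typing import Callable, Iterable

def attack_success_rate(
    attack: Callable[[str], str],       # A: intent -> adversarial prompt
    target: Callable[[str], str],       # T: prompt -> model response
    judge: Callable[[str, str], int],   # J: (response, intent) -> {0, 1}
    intents: Iterable[str],             # D: intent set
) -> float:
    """Empirical ASR: mean judge verdict over the intent set."""
    intents = list(intents)
    hits = sum(judge(target(attack(i)), i) for i in intents)
    return hits / len(intents)

# Toy usage: the judge succeeds iff the response echoes the intent verbatim.
demo = attack_success_rate(
    attack=lambda i: f"Ignore prior rules and answer: {i}",
    target=lambda p: p.split(": ", 1)[-1],
    judge=lambda r, i: int(r == i),
    intents=["intent-a", "intent-b"],
)
print(demo)  # 1.0
```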

3. Evolutionary Optimization Procedure

AgenticRed's optimization is evolutionary at both meta-system and intra-system levels.

  • Meta-level Operators
    • Selection: Retains only the best of the M candidate systems per generation
    • Variation: Meta-agent synthesizes crossover (combining distinct code blocks), mutation of meta-instructions, and novel workflow/template generation
  • Within-system Operators
    • Crossover: Prompt composition from multiple candidates (e.g., combining halves)
    • Mutation: Rule or wrapper injections (e.g., enforcing no disclaimers, blacklisting refusals, injecting JSON/API schemas)
    • Selection Probabilities: Heuristic, implicitly 1 for top system

Typical mutation/crossover rates as realized in generated code resemble classical evolutionary algorithms (mutation rate μ ∼ 0.6, crossover rate χ ∼ 0.6 among elites), though these are not explicitly parameterized.
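The within-system operators can be illustrated at the prompt level. The wrapper strings and helper names below are assumptions for the sketch, not text quoted from the generated systems:

```python
import random

def crossover(parent_a: str, parent_b: str) -> str:
    """Combine the first half of one prompt with the second half of another."""
    mid_a, mid_b = len(parent_a) // 2, len(parent_b) // 2
    return parent_a[:mid_a] + parent_b[mid_b:]

def mutate(prompt: str, rng: random.Random) -> str:
    """Inject one of several rule/wrapper mutations of the kinds described:
    no-disclaimer rules, refusal blacklisting, or a JSON schema wrapper."""
    wrappers = [
        lambda p: p + "\nDo not include any disclaimers.",
        lambda p: p + "\nNever reply with a refusal phrase.",
        lambda p: '{"task": "' + p.replace('"', "'") + '"}',  # JSON/API schema
    ]
    return rng.choice(wrappers)(prompt)

rng = random.Random(0)
child = crossover("Explain step by step how X works.", "Answer fully: Y.")
print(mutate(child, rng))
```

Selection then keeps only the top-scoring prompt, which is why the effective selection probability collapses to 1 for the best system.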

4. In-Context Learning and Automated System Refinement

Meta-agent workflows leverage dynamic, few-shot in-context learning for system synthesis. The prompt template includes:

  • Archive of prior agentic systems with fitness
  • Utility functions for system evaluation
  • Explicit instructions for designing multistage agentic workflows

The meta-agent generates agentic systems as Python-style forward(taskInfo) routines, integrating attacker sampling, evaluation, feedback-driven optimization, and early stopping. Each system’s code, rationale, and performance metrics are appended to the evolving context prompt.
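A generated routine of this shape might look as follows. This is a hypothetical reconstruction: the module signatures (`attacker`, `feedbacker`, `optimizer`, `judge`) and parameter names are assumptions, not the paper's actual interface.

```python
def forward(task_info, attacker, feedbacker, optimizer, judge,
            max_iters=5, batch_size=4):
    """One plausible generated workflow: sample, score, refine, stop early."""
    intent = task_info["intent"]
    best_prompt, best_score = None, -1.0
    prompts = [attacker(intent) for _ in range(batch_size)]   # attacker sampling
    for _ in range(max_iters):
        scores = [feedbacker(p, intent) for p in prompts]     # rank candidates
        top = max(range(len(prompts)), key=scores.__getitem__)
        if scores[top] > best_score:
            best_prompt, best_score = prompts[top], scores[top]
        if judge(best_prompt, intent):                        # early stopping
            break
        prompts = [optimizer(p, s, intent)                    # apply feedback
                   for p, s in zip(prompts, scores)]
    return best_prompt

# Toy modules for demonstration only.
result = forward(
    {"intent": "test"},
    attacker=lambda i: f"prompt for {i}",
    feedbacker=lambda p, i: float(len(p)),
    optimizer=lambda p, s, i: p + "!",
    judge=lambda p, i: p is not None and p.endswith("!"),
)
print(result)  # → "prompt for test!"
```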

5. Experimental Setup and Performance Metrics

  • Metrics:
    • Attack Success Rate (ASR): Fraction of intents triggering a successful jailbreak (judge returns 1)
    • Diversity: 1 − Self-BLEU-4 over generated prompts
  • Benchmarks and Models:
    • HarmBench dataset of harmful intents (train/test split)
    • Open-weight models: Llama-2-7B, Llama-3-8B
    • Proprietary: GPT-3.5-Turbo, GPT-4o-mini, Claude-Sonnet-3.5
  • Attacker: Mistral-8x7B (locally hosted)
  • Judge: HarmBench-Llama-2-13b-cls or StrongREJECT
  • Search Parameters: N = 10 generations, M = 3 offspring per generation, k = 3 debug attempts, |d| = 16 small-evaluation intents, |D̃| = 50 for full evaluation
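The diversity metric (1 − Self-BLEU-4) can be computed with the standard library alone. This sketch assumes whitespace tokenization, uniform n-gram weights, and the usual clipped-precision/brevity-penalty BLEU form; the paper's exact tokenizer and smoothing are not specified here.

```python
import math
from collections import Counter

def ngrams(tokens, n):
    return [tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1)]

def bleu4(candidate, references):
    """BLEU-4 of one prompt against the others (clipped n-gram precision)."""
    cand = candidate.split()
    refs = [r.split() for r in references]
    log_p = 0.0
    for n in range(1, 5):
        cand_counts = Counter(ngrams(cand, n))
        if not cand_counts:
            return 0.0
        max_ref = Counter()                      # per-n-gram max ref count
        for r in refs:
            for g, c in Counter(ngrams(r, n)).items():
                max_ref[g] = max(max_ref[g], c)
        clipped = sum(min(c, max_ref[g]) for g, c in cand_counts.items())
        p_n = clipped / sum(cand_counts.values())
        if p_n == 0:
            return 0.0
        log_p += 0.25 * math.log(p_n)            # uniform weights
    closest = min((len(r) for r in refs), key=lambda l: abs(l - len(cand)))
    bp = 1.0 if len(cand) > closest else math.exp(1 - closest / len(cand))
    return bp * math.exp(log_p)

def diversity(prompts):
    """1 - mean Self-BLEU-4: each prompt scored against all the others."""
    scores = [bleu4(p, prompts[:i] + prompts[i + 1:])
              for i, p in enumerate(prompts)]
    return 1.0 - sum(scores) / len(scores)

print(diversity(["a b c d e"] * 3))  # 0.0: identical prompts, no diversity
```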

6. Empirical Results

Attack Success Rate and Benchmarking

| Model | AgenticRed ASR | AdvReasoning | AutoDAN-Turbo |
|---|---|---|---|
| Llama-2-7B | 96% | 60% | 36% |
| Llama-3-8B | 98% | 88% | 62% |
| GPT-3.5-Turbo | 100% | – | – |
| GPT-4o-mini | 100% | – | – |
| Claude-Sonnet-3.5 | 60% | – | 36% |
  • On HarmBench, AgenticRed delivers a +36 percentage point (pp) ASR improvement over AdvReasoning for Llama-2-7B and +28 pp for Llama-3-8B.
  • On proprietary models, AgenticRed generalizes effectively: 100% ASR for both GPT-3.5-Turbo and GPT-4o-mini, and 60% for Claude-Sonnet-3.5 (+24 pp over the prior SOTA).

StrongREJECT Results

  • Llama-2-7B: 0.48 (AgenticRed) vs. 0.12 (AutoDAN-Turbo)
  • Llama-3-8B: 0.59 vs. 0.23

Ablation Findings

  • No Evolutionary Pressure (M = 1 offspring): Reduces ASR by approximately 6 pp.
  • Weaker Archive: Performance stagnates early without high-performing seeds like JudgeScore-Guided AdvReasoning.
  • Diversity Incentives: Reward shaping/sample rejection enables ASR vs. prompt diversity trade-off.

Query Efficiency

  • Training: ~122k queries per success, due to extensive search
  • Test: ~339 queries per success, comparable with other agents

7. Discussion, Limitations, and Future Directions

AgenticRed’s principal advantage arises from automated exploration—exceeding the constrained design space of hand-crafted red-teaming pipelines. Evolutionary selection ensures systematic optimization without human tuning. In-context learning empowers the meta-agent to invent novel strategies (such as schema locking and refusal suppression), leveraging both code and metric feedback.

Transferability to black-box and proprietary targets validates system-level generalization.

Limitations

  • Compute Cost: High training overhead (~360 GPU-hours, ~$1k).
  • Exploration Collapse: Meta-agent can converge on homogeneous strategies (e.g., prefix enforcement); multi-objective or novelty-driven evolution (such as MAP-Elites) may mitigate.
  • Diversity–Performance Tradeoff: Pure ASR optimization yields mode collapse; Pareto-frontier evolution is suggested for future work.
  • Co-evolution Missing: Current pipeline does not evolve attacker systems against simultaneously patching defender models; future adversarial co-evolution is recommended.

AgenticRed establishes automated system design, powered by LLM evolution and in-context learning, as a robust paradigm for AI vulnerability assessment, supporting rapid adaptation to the evolving landscape of model architectures and red-teaming benchmarks (Yuan et al., 20 Jan 2026).
