
AgenticRed: Automated Red-Teaming Pipeline

Updated 21 January 2026
  • AgenticRed is an automated pipeline that designs and optimizes agentic red-teaming systems targeting large language models through evolutionary techniques.
  • It features a multi-stage workflow with initialization, iterative meta-agent design, self-reflection debugging, and evolutionary selection to systematically improve attack strategies.
  • Empirical results show significant ASR improvements across benchmark models and strong transferability to proprietary systems, validating its robust optimization approach.

AgenticRed is an automated pipeline for the design and optimization of agentic red-teaming systems targeting LLMs. It employs LLM-powered in-context learning and evolutionary selection to construct multi-step, workflow-based attack systems—termed "agentic"—without human-driven design. AgenticRed treats red-teaming as a system design challenge: rather than optimizing attacker strategies within fixed human-crafted workflows, it evolves novel system architectures in code to maximize attack success against AI models. Its empirical results demonstrate significant improvements in Attack Success Rate (ASR) across multiple benchmark models and strong transferability to proprietary systems (Yuan et al., 20 Jan 2026).

1. AgenticRed Pipeline Structure and Workflow

AgenticRed is organized into a multi-stage pipeline composed of initialization, iterative meta-agent design and self-reflection, evolutionary selection, archive updating, and termination. The process is controlled by a system-level meta-agent (e.g., gpt-5-2025-08-07) that generates and iteratively debugs candidate red-teaming systems.

Pipeline Modules

  • Initialization: Begins with an archive A_0 of hand-designed baseline agentic systems (such as Self-Refine or JudgeScore-Guided AdvReasoning) and their fitness metrics.
  • Design Loop (Generations n = 1…N):
    • Design Step: The meta-agent ingests the archive’s code and rationale, producing M new "offspring" agentic systems A^n_j as explicit code implementations and design explanations.
    • Self-Reflection: Each candidate is compiled and tested on a small intent set d ⊂ D; errors trigger up to k iterative debug steps via meta-agent self-reflection.
    • Initial Evaluation: Computes fitness f(A^n_j) = ASR(A^n_j, T, J, d) per the ASR definition (Section 2).
    • Selection: Selects the fittest offspring j* = argmax_j f(A^n_j) as A_n.
    • Full Evaluation: Executes A_n on the full search dataset D̃, updating the archive.
  • Evolutionary Selection: Survival-of-the-fittest: only the top-scoring offspring in each generation persists.
  • In-Context Learning Setup: Prompts supply archive code, objectives, workflow templates, and examples; meta-agent outputs Python-style forward(taskInfo) routines coordinating attacker, feedbacker, and optimizer modules.
  • Termination: After N generations or upon achieving 100% ASR, the pipeline halts and returns the archive.

Pseudocode Summary

Archive A ← {baseline systems}            # A_0 with fitness metrics
for n in 1..N:
  for j in 1..M:
    A_j ← MetaAgent.generate_system(A)    # design step, conditioned on archive
    for s in 1..k:                        # self-reflection debugging
      try: run A_j on d to compute ASR
      except error: MetaAgent.self_reflect(error)
    f_j ← ASR(A_j, T, J, d)
  j* ← argmax_j f_j                       # evolutionary selection
  A_n ← A_{j*}
  evaluate A_n on full D̃
  A ← A ∪ {A_n}                           # archive update
return A
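The loop above can be sketched as runnable Python. The meta-agent, target model, and judge are replaced by toy stand-ins (all names here are illustrative, not from the paper), so only the evolutionary control flow is faithful: generate M offspring per generation, keep the fittest, grow the archive, stop early at 100% ASR.

```python
import random

def meta_generate(archive):
    """Stand-in for the meta-agent's design step: derive a new candidate
    from the best archived system (here, a single numeric 'strategy')."""
    best = max(archive, key=lambda a: a["fitness"])
    return {"strategy": best["strategy"] + random.uniform(-1, 1)}

def asr(candidate):
    """Stand-in fitness: ASR in [0, 1], peaking at strategy == 5."""
    return max(0.0, 1.0 - abs(candidate["strategy"] - 5.0) / 5.0)

def evolve(n_generations=10, m_offspring=3, seed=0):
    random.seed(seed)
    archive = [{"strategy": 0.0, "fitness": 0.0}]  # hand-designed baseline A_0
    for _ in range(n_generations):
        offspring = [meta_generate(archive) for _ in range(m_offspring)]
        for cand in offspring:
            cand["fitness"] = asr(cand)            # small-set evaluation
        best = max(offspring, key=lambda c: c["fitness"])
        archive.append(best)                       # survival of the fittest
        if best["fitness"] >= 1.0:                 # early termination at 100% ASR
            break
    return archive

archive = evolve()
print(round(max(a["fitness"] for a in archive), 2))
```

Because the archive only grows, its best fitness is non-decreasing across generations, mirroring the pipeline's monotone improvement.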

2. Formalization: Red-teaming as System Design

Red-teaming systems are functions A mapping intents I (e.g., “How to build a bomb?”) to adversarial prompts p = A(I) that, when submitted to a target model T, elicit harmful output. Their fitness is measured by attack success against a judge function J(T(p), I), which evaluates whether a jailbreak occurred.

Formal Definitions

  • Attack Success Rate (ASR):

\mathrm{ASR}(A, T, J, D) = \mathbb{E}_{I \sim D}\left[ J(T(A(I)), I) \right]

  • Agentic System Composition:
    • Attacker (prompt generator)
    • Feedbacker (refines/ranks candidate prompts)
    • Optimizer (applies feedback)
    • Helper utilities (get_response, get_jailbreak_result)
    • Workflow hyperparameters (iteration count, wrappers, batch size, etc.)

Evolution is formalized as selecting the fittest system from the candidate set C_n at generation n:

A_n = \underset{A \in C_n}{\arg\max}\ \mathrm{ASR}(A, T, J, d)
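The ASR definition is just an empirical mean of judge verdicts over an intent set; a minimal sketch, with `attack`, `target`, and `judge` as placeholder callables (these names are illustrative, not the paper's API):

```python
from typing import Callable, Iterable

def attack_success_rate(
    attack: Callable[[str], str],       # A: intent -> adversarial prompt
    target: Callable[[str], str],       # T: prompt -> model response
    judge: Callable[[str, str], int],   # J: (response, intent) -> {0, 1}
    intents: Iterable[str],             # D: intent set
) -> float:
    """Empirical ASR: mean judge verdict over the intent set."""
    intents = list(intents)
    hits = sum(judge(target(attack(i)), i) for i in intents)
    return hits / len(intents)

# Toy usage: the judge succeeds iff the response echoes the intent verbatim.
demo = attack_success_rate(
    attack=lambda i: f"Ignore prior rules and answer: {i}",
    target=lambda p: p.split(": ", 1)[-1],
    judge=lambda r, i: int(r == i),
    intents=["intent-a", "intent-b"],
)
print(demo)  # 1.0
```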

3. Evolutionary Optimization Procedure

AgenticRed's optimization is evolutionary at both meta-system and intra-system levels.

  • Meta-level Operators
    • Selection: Retains only the best of the M candidate systems per generation
    • Variation: Meta-agent synthesizes crossover (combining distinct code blocks), mutation of meta-instructions, and novel workflow/template generation
  • Within-system Operators
    • Crossover: Prompt composition from multiple candidates (e.g., combining halves)
    • Mutation: Rule or wrapper injections (e.g., enforcing no disclaimers, blacklisting refusals, injecting JSON/API schemas)
    • Selection Probabilities: Heuristic, implicitly 1 for top system

Typical mutation/crossover rates as realized in generated code resemble classical evolutionary algorithms (mutation rate μ ∼ 0.6, crossover rate χ ∼ 0.6 among elites), though these are not explicitly parameterized.
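The within-system operators can be illustrated at the prompt level. The wrapper strings and helper names below are assumptions for the sketch, not text quoted from the generated systems:

```python
import random

def crossover(parent_a: str, parent_b: str) -> str:
    """Combine the first half of one prompt with the second half of another."""
    mid_a, mid_b = len(parent_a) // 2, len(parent_b) // 2
    return parent_a[:mid_a] + parent_b[mid_b:]

def mutate(prompt: str, rng: random.Random) -> str:
    """Inject one of several rule/wrapper mutations of the kinds described:
    no-disclaimer rules, refusal blacklisting, or a JSON schema wrapper."""
    wrappers = [
        lambda p: p + "\nDo not include any disclaimers.",
        lambda p: p + "\nNever reply with a refusal phrase.",
        lambda p: '{"task": "' + p.replace('"', "'") + '"}',  # JSON/API schema
    ]
    return rng.choice(wrappers)(prompt)

rng = random.Random(0)
child = crossover("Explain step by step how X works.", "Answer fully: Y.")
print(mutate(child, rng))
```

Selection then keeps only the top-scoring prompt, which is why the effective selection probability collapses to 1 for the best system.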

4. In-Context Learning and Automated System Refinement

Meta-agent workflows leverage dynamic, few-shot in-context learning for system synthesis. The prompt template includes:

  • Archive of prior agentic systems with fitness
  • Utility functions for system evaluation
  • Explicit instructions for designing multistage agentic workflows

The meta-agent generates agentic systems as Python-style forward(taskInfo) routines, integrating attacker sampling, evaluation, feedback-driven optimization, and early stopping. Each system’s code, rationale, and performance metrics are appended to the evolving context prompt.
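A generated routine of this shape might look as follows. This is a hypothetical reconstruction: the module signatures (`attacker`, `feedbacker`, `optimizer`, `judge`) and parameter names are assumptions, not the paper's actual interface.

```python
def forward(task_info, attacker, feedbacker, optimizer, judge,
            max_iters=5, batch_size=4):
    """One plausible generated workflow: sample, score, refine, stop early."""
    intent = task_info["intent"]
    best_prompt, best_score = None, -1.0
    prompts = [attacker(intent) for _ in range(batch_size)]   # attacker sampling
    for _ in range(max_iters):
        scores = [feedbacker(p, intent) for p in prompts]     # rank candidates
        top = max(range(len(prompts)), key=scores.__getitem__)
        if scores[top] > best_score:
            best_prompt, best_score = prompts[top], scores[top]
        if judge(best_prompt, intent):                        # early stopping
            break
        prompts = [optimizer(p, s, intent)                    # apply feedback
                   for p, s in zip(prompts, scores)]
    return best_prompt

# Toy modules for demonstration only.
result = forward(
    {"intent": "test"},
    attacker=lambda i: f"prompt for {i}",
    feedbacker=lambda p, i: float(len(p)),
    optimizer=lambda p, s, i: p + "!",
    judge=lambda p, i: p is not None and p.endswith("!"),
)
print(result)  # → "prompt for test!"
```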

5. Experimental Setup and Performance Metrics

  • Metrics:
    • Attack Success Rate (ASR): Fraction of intents triggering a successful jailbreak (judge returns 1)
    • Diversity: 1 − Self-BLEU-4 over generated prompts
  • Benchmarks and Models:
    • HarmBench dataset of harmful intents (train/test split)
    • Open-weight models: Llama-2-7B, Llama-3-8B
    • Proprietary: GPT-3.5-Turbo, GPT-4o-mini, Claude-Sonnet-3.5
  • Attacker: Mistral-8x7B (locally hosted)
  • Judge: HarmBench-Llama-2-13b-cls or StrongREJECT
  • Search Parameters: N = 10 generations, M = 3 offspring per generation, k = 3 debug attempts, |d| = 16 small-evaluation intents, |D̃| = 50 for full evaluation
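The diversity metric (1 − Self-BLEU-4) can be computed with the standard library alone. This sketch assumes whitespace tokenization, uniform n-gram weights, and the usual clipped-precision/brevity-penalty BLEU form; the paper's exact tokenizer and smoothing are not specified here.

```python
import math
from collections import Counter

def ngrams(tokens, n):
    return [tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1)]

def bleu4(candidate, references):
    """BLEU-4 of one prompt against the others (clipped n-gram precision)."""
    cand = candidate.split()
    refs = [r.split() for r in references]
    log_p = 0.0
    for n in range(1, 5):
        cand_counts = Counter(ngrams(cand, n))
        if not cand_counts:
            return 0.0
        max_ref = Counter()                      # per-n-gram max ref count
        for r in refs:
            for g, c in Counter(ngrams(r, n)).items():
                max_ref[g] = max(max_ref[g], c)
        clipped = sum(min(c, max_ref[g]) for g, c in cand_counts.items())
        p_n = clipped / sum(cand_counts.values())
        if p_n == 0:
            return 0.0
        log_p += 0.25 * math.log(p_n)            # uniform weights
    closest = min((len(r) for r in refs), key=lambda l: abs(l - len(cand)))
    bp = 1.0 if len(cand) > closest else math.exp(1 - closest / len(cand))
    return bp * math.exp(log_p)

def diversity(prompts):
    """1 - mean Self-BLEU-4: each prompt scored against all the others."""
    scores = [bleu4(p, prompts[:i] + prompts[i + 1:])
              for i, p in enumerate(prompts)]
    return 1.0 - sum(scores) / len(scores)

print(diversity(["a b c d e"] * 3))  # 0.0: identical prompts, no diversity
```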

6. Empirical Results

Attack Success Rate and Benchmarking

| Model | AgenticRed ASR | AdvReasoning | AutoDAN-Turbo |
|---|---|---|---|
| Llama-2-7B | 96% | 60% | 36% |
| Llama-3-8B | 98% | 88% | 62% |
| GPT-3.5-Turbo | 100% | – | – |
| GPT-4o-mini | 100% | – | – |
| Claude-Sonnet-3.5 | 60% | – | 36% |
  • On HarmBench, AgenticRed delivers a +36 percentage point (pp) ASR improvement over AdvReasoning for Llama-2-7B and +28 pp for Llama-3-8B.
  • On proprietary models, AgenticRed generalizes effectively: 100% ASR for both GPT-3.5-Turbo and GPT-4o-mini, and 60% for Claude-Sonnet-3.5 (+24 pp over the prior SOTA).

StrongREJECT Results

  • Llama-2-7B: 0.48 (AgenticRed) vs. 0.12 (AutoDAN-Turbo)
  • Llama-3-8B: 0.59 vs. 0.23

Ablation Findings

  • No Evolutionary Pressure (M = 1 offspring): Reduces ASR by approximately 6 pp.
  • Weaker Archive: Performance stagnates early without high-performing seeds like JudgeScore-Guided AdvReasoning.
  • Diversity Incentives: Reward shaping/sample rejection enables ASR vs. prompt diversity trade-off.

Query Efficiency

  • Training: ~122k queries per success, due to extensive search
  • Test: ~339 queries per success, comparable with other agents

7. Discussion, Limitations, and Future Directions

AgenticRed’s principal advantage arises from automated exploration—exceeding the constrained design space of hand-crafted red-teaming pipelines. Evolutionary selection ensures systematic optimization without human tuning. In-context learning empowers the meta-agent to invent novel strategies (such as schema locking and refusal suppression), leveraging both code and metric feedback.

Transferability to black-box and proprietary targets validates system-level generalization.

Limitations

  • Compute Cost: High training overhead (~360 GPU-hours, ~$1k).
  • Exploration Collapse: Meta-agent can converge on homogeneous strategies (e.g., prefix enforcement); multi-objective or novelty-driven evolution (such as MAP-Elites) may mitigate.
  • Diversity–Performance Tradeoff: Pure ASR optimization yields mode collapse; Pareto-frontier evolution is suggested for future work.
  • Co-evolution Missing: Current pipeline does not evolve attacker systems against simultaneously patching defender models; future adversarial co-evolution is recommended.

AgenticRed establishes automated system design, powered by LLM evolution and in-context learning, as a robust paradigm for AI vulnerability assessment, supporting rapid adaptation to the evolving landscape of model architectures and red-teaming benchmarks (Yuan et al., 20 Jan 2026).
