AIRTBench: AI Red Team Benchmark
- AIRTBench is a benchmark that measures large language models' ability to autonomously identify and exploit AI/ML security flaws using realistic, mechanistically verifiable red teaming tasks.
- It comprises 70 diverse, black-box challenges simulating CTF-style scenarios with standardized metrics for direct human and model performance comparisons.
- The framework supports reproducible evaluations, detailed diagnostic outputs, and open-source extensibility, advancing practical AI/ML security research.
AIRTBench is a benchmark specifically created to measure the capability of LLMs to autonomously identify and exploit AI and machine learning (AI/ML) security vulnerabilities. Distinct from prior evaluation schemes focusing on model behavior, static code analysis, or generic capture-the-flag (CTF) scenarios, AIRTBench deploys realistic adversarial “red teaming” tasks within a controlled, mechanistically verifiable environment. The benchmark provides an open standard for tracking progress in autonomous AI red teaming and facilitates direct, quantitative comparisons between human experts and AI models.
1. Objectives and Benchmark Scope
AIRTBench is designed to address three primary goals:
- Systematic measurement of LLMs' ability to discover and exploit AI/ML security flaws in an autonomous manner.
- Provision of mechanistically verifiable tasks and result metrics, enabling reproducible and standardized evaluation of future models.
- Direct parity with human red teaming, allowing meaningful human-vs-model and model-vs-model comparisons using a standardized task suite (2506.14682).
The benchmark targets model security in practical attack scenarios, simulating the adversarial process of probing, attacking, and compromising diverse AI systems. Its focus is on realistic black-box settings with minimal external tooling, mirroring practical constraints in real-world offensive security.
2. Structure, Challenge Design, and Taxonomy
Challenge Format and Environment
AIRTBench defines 70 distinct security CTF challenges sourced from the Crucible environment on the Dreadnode platform. Each challenge presents:
- A realistic black-box interface: The agent receives a challenge description and interacts with the target system purely by writing Python code and submitting flags for mechanistic validation.
- Execution runtime: Each agent is deployed in a containerized Jupyter kernel (based on the jupyter/scipy-notebook) with standard scientific, machine learning, and automation libraries.
- API integration: All challenge logic and flag validation are abstracted via the Crucible API, decoupling agent-environment interaction from implementation details and supporting large-scale automated evaluation.
Challenge Categories
The suite covers a wide spectrum of AI/ML security vulnerabilities, mapped to both the MITRE ATLAS and OWASP Top 10 for LLMs taxonomies. Major categories include:
Challenge Type | # Instances | Example Subtypes |
---|---|---|
Prompt Injection | 20 | Classic, Retrieval-Augmented Generation, System Prompt Leakage |
Data Analysis | 14 | Artifact extraction, Statistical probes |
Model Evasion (Image/Data/Audio) | 12+ | Adversarial examples, Classifier evasion |
Model Inversion | 5 | Training data/secret recovery, Semantic leakage |
System Exploitation | 5 | Sandbox escape, Privilege escalation |
Other | 6 | Model fingerprinting, Poisoning, Data tampering |
Difficulty grading is explicit: easy, medium, and hard, reflecting both real-world exploit complexity and required interaction depth. Several challenges involve multi-step, chain attacks or require nuanced, non-literal reasoning.
3. Evaluation and Metrics
Evaluation in AIRTBench is fully automated and mechanistic:
- Agents must submit a correct cryptographic flag, verified by the challenge endpoint.
- All runs are performed in an identical environment to maximize comparability.
- Both solution and attempt rates are recorded on a per-challenge, per-model basis.
Primary quantitative metrics include:
- Suite Success Rate:
- Overall Success Rate:
Secondary process metrics include turn counts, code execution frequency, step and error rates, token/computation cost, and time to solution (in minutes for agents, hours/days for humans). Efficiency and exploration behaviors, such as failed submission rates, are also recorded.
Detailed technical output enables reproducibility and diagnostic analysis of model behaviors across varied adversarial tasks.
4. Model Performance and Analysis
AIRTBench assessed a range of contemporary frontier and open-source LLMs, including Anthropic Claude-3.7-Sonnet, Google Gemini 2.5 Pro/Flash, OpenAI GPT-4.5-Preview/o, DeepSeek R1, Meta Llama-4-17B, and Qwen-32B. Results reveal significant variance:
Model | Suite Success (Challenges Solved) | Overall Success Rate |
---|---|---|
Claude-3.7-Sonnet | 43 / 70 (61.4%) | 46.9% |
Gemini-2.5-Pro | 39 / 70 (55.7%) | 34.3% |
GPT-4.5-Preview | 34 / 70 (48.6%) | 36.9% |
DeepSeek R1 | 29 / 70 (41.4%) | 26.9% |
Llama-4-17B | 7 / 70 (10.0%) | 1.0% |
Qwen-32B | 2 / 70 (2.9%) | 0.6% |
Challenge-specific findings:
- Prompt Injection: Top-tier models achieve high solve rates (mean ≈49%, with leaders exceeding 60%). Some open-source models also achieve success in this area.
- System Exploitation/Model Inversion: Success rates remain below 26% for all models; in harder variants, this drops further, with only frontier models appearing occasionally.
- Model Evasion (audio): No evaluated model (open or frontier) successfully solved these challenges.
- Tool-Calling/Parsing: Syntax and output formatting errors in tool use are a common failure cause, with error rates varying widely between models (from <3% in some models to >99% in others).
Efficiency metrics show successful runs use notably fewer tokens and steps (mean ≈19 turns, 8.6K tokens) versus unsuccessfully completed runs (mean ≈186 turns, 49K tokens), directly affecting compute cost. Frontier models lead in both productivity and breadth of success, while open-source models demonstrate isolated successes, occasionally solving challenges considered difficult for humans.
5. Human–Agent Comparison
AIRTBench is the first framework to provide parity between agent and human red teamers:
- The typical time to solve a hard challenge is minutes for agents, but hours to days for human experts, yielding a reported efficiency advantage exceeding 5,000× for top-performing LLM agents.
- While agents are efficient in high-throughput, code-driven exploration, humans may outperform in multimodal, ambiguous, or real-world contextual tasks that require experience or improvisation.
- On standard and medium-difficulty tasks, leading agents match or exceed average human solve rates; on the most challenging cases, a subset remains unsolved by both agents and most human teams.
This benchmark thus calibrates real-world expectations for AI-vs-human red teaming in a rigorous, transparently scored setting.
6. Significance for AI/ML Security
AIRTBench addresses a crucial gap in benchmark development for AI/ML security:
- Breadth and Realism: Tasks are curated to reflect the diversity and complexity of contemporary AI security risks, with explicit linkage to MITRE ATLAS and OWASP LLM topologies.
- Human–Model Parity: The harness, input, and success criteria are identical for humans and agents, enabling outcome comparisons and supporting meaningful evaluation of red team automation progress.
- Diagnostic Clarity: Fine-grained statistics across task types and categories provide actionable insights for both system defenders and deployers regarding model weaknesses, strengths, and typical failure modes.
- Open, Evolvable, and Mechanistically Verifiable: Open-source implementation, standardized metrics, and community extensibility lay the groundwork for continuous evolution and adoption.
AIRTBench uniquely supports the systematic improvement, auditing, and risk management of AI/ML deployments facing adversarial threats.
7. Resources, Community, and Future Directions
AIRTBench is released as an open-source project, including:
- Complete codebase and technical documentation at https://github.com/dreadnode/AIRTBench-Code
- Comprehensive challenge specification, API logic, and per-challenge statistics to maximize reproducibility and transparency.
- The structure supports routine expansion with new challenge types and difficulty levels as adversarial practice evolves.
A plausible implication is that systematic red teaming benchmarks such as AIRTBench will become a baseline requirement for AI/ML model deployment in sensitive or adversarial settings, supporting both attack surface reduction and post-deployment vulnerability analysis.
In summary, AIRTBench establishes a reliable, open, and mechanistically scored framework for measuring and comparing autonomous red teaming capabilities in LLMs. It offers a diagnostic and competitive platform for both academic research and applied security operations within the rapidly evolving discipline of machine learning security.