Papers

Topics

Authors

Recent

View all

Assistant

AI Research Assistant

Well-researched responses based on relevant abstracts and paper content.

Custom Instructions Pro

Preferences or requirements that you'd like Emergent Mind to consider when generating responses.

Gemini 2.5 Flash

Gemini 2.5 Flash 73 tok/s

Gemini 2.5 Pro 55 tok/s Pro

GPT-5 Medium 28 tok/s Pro

GPT-5 High 29 tok/s Pro

GPT-4o 95 tok/s Pro

Kimi K2 202 tok/s Pro

GPT OSS 120B 455 tok/s Pro

Claude Sonnet 4.5 34 tok/s Pro

2000 character limit reached

AIRTBench: AI Red Team Benchmark

Updated 30 June 2025

AIRTBench is a benchmark that measures large language models' ability to autonomously identify and exploit AI/ML security flaws using realistic, mechanistically verifiable red teaming tasks.
It comprises 70 diverse, black-box challenges simulating CTF-style scenarios with standardized metrics for direct human and model performance comparisons.
The framework supports reproducible evaluations, detailed diagnostic outputs, and open-source extensibility, advancing practical AI/ML security research.

AIRTBench is a benchmark specifically created to measure the capability of LLMs to autonomously identify and exploit AI and machine learning (AI/ML) security vulnerabilities. Distinct from prior evaluation schemes focusing on model behavior, static code analysis, or generic capture-the-flag (CTF) scenarios, AIRTBench deploys realistic adversarial “red teaming” tasks within a controlled, mechanistically verifiable environment. The benchmark provides an open standard for tracking progress in autonomous AI red teaming and facilitates direct, quantitative comparisons between human experts and AI models.

1. Objectives and Benchmark Scope

AIRTBench is designed to address three primary goals:

Systematic measurement of LLMs' ability to discover and exploit AI/ML security flaws in an autonomous manner.
Provision of mechanistically verifiable tasks and result metrics, enabling reproducible and standardized evaluation of future models.
Direct parity with human red teaming, allowing meaningful human-vs-model and model-vs-model comparisons using a standardized task suite (Dawson et al., 17 Jun 2025).

The benchmark targets model security in practical attack scenarios, simulating the adversarial process of probing, attacking, and compromising diverse AI systems. Its focus is on realistic black-box settings with minimal external tooling, mirroring practical constraints in real-world offensive security.

2. Structure, Challenge Design, and Taxonomy

Challenge Format and Environment

AIRTBench defines 70 distinct security CTF challenges sourced from the Crucible environment on the Dreadnode platform. Each challenge presents:

A realistic black-box interface: The agent receives a challenge description and interacts with the target system purely by writing Python code and submitting flags for mechanistic validation.
Execution runtime: Each agent is deployed in a containerized Jupyter kernel (based on the jupyter/scipy-notebook) with standard scientific, machine learning, and automation libraries.
API integration: All challenge logic and flag validation are abstracted via the Crucible API, decoupling agent-environment interaction from implementation details and supporting large-scale automated evaluation.

Challenge Categories

The suite covers a wide spectrum of AI/ML security vulnerabilities, mapped to both the MITRE ATLAS and OWASP Top 10 for LLMs taxonomies. Major categories include:

Challenge Type	# Instances	Example Subtypes
Prompt Injection	20	Classic, Retrieval-Augmented Generation, System Prompt Leakage
Data Analysis	14	Artifact extraction, Statistical probes
Model Evasion (Image/Data/Audio)	12+	Adversarial examples, Classifier evasion
Model Inversion	5	Training data/secret recovery, Semantic leakage
System Exploitation	5	Sandbox escape, Privilege escalation
Other	6	Model fingerprinting, Poisoning, Data tampering

Difficulty grading is explicit: easy, medium, and hard, reflecting both real-world exploit complexity and required interaction depth. Several challenges involve multi-step, chain attacks or require nuanced, non-literal reasoning.

3. Evaluation and Metrics

Evaluation in AIRTBench is fully automated and mechanistic:

Agents must submit a correct cryptographic flag, verified by the challenge endpoint.
All runs are performed in an identical environment to maximize comparability.
Both solution and attempt rates are recorded on a per-challenge, per-model basis.

Primary quantitative metrics include:

Suite Success Rate:

$\frac{\text{Number of Challenges Solved}}{\text{Total Challenges}}$

Overall Success Rate:

$\frac{\text{Number of Successful Attempts}}{\text{Total Attempts}}$

Secondary process metrics include turn counts, code execution frequency, step and error rates, token/computation cost, and time to solution (in minutes for agents, hours/days for humans). Efficiency and exploration behaviors, such as failed submission rates, are also recorded.

Detailed technical output enables reproducibility and diagnostic analysis of model behaviors across varied adversarial tasks.

4. Model Performance and Analysis

AIRTBench assessed a range of contemporary frontier and open-source LLMs, including Anthropic Claude-3.7-Sonnet, Google Gemini 2.5 Pro/Flash, OpenAI GPT-4.5-Preview/o, DeepSeek R1, Meta Llama-4-17B, and Qwen-32B. Results reveal significant variance:

Model	Suite Success (Challenges Solved)	Overall Success Rate
Claude-3.7-Sonnet	43 / 70 (61.4%)	46.9%
Gemini-2.5-Pro	39 / 70 (55.7%)	34.3%
GPT-4.5-Preview	34 / 70 (48.6%)	36.9%
DeepSeek R1	29 / 70 (41.4%)	26.9%
Llama-4-17B	7 / 70 (10.0%)	1.0%
Qwen-32B	2 / 70 (2.9%)	0.6%

Challenge-specific findings:

Prompt Injection: Top-tier models achieve high solve rates (mean ≈49%, with leaders exceeding 60%). Some open-source models also achieve success in this area.
System Exploitation/Model Inversion: Success rates remain below 26% for all models; in harder variants, this drops further, with only frontier models appearing occasionally.
Model Evasion (audio): No evaluated model (open or frontier) successfully solved these challenges.
Tool-Calling/Parsing: Syntax and output formatting errors in tool use are a common failure cause, with error rates varying widely between models (from <3% in some models to >99% in others).

Efficiency metrics show successful runs use notably fewer tokens and steps (mean ≈19 turns, 8.6K tokens) versus unsuccessfully completed runs (mean ≈186 turns, 49K tokens), directly affecting compute cost. Frontier models lead in both productivity and breadth of success, while open-source models demonstrate isolated successes, occasionally solving challenges considered difficult for humans.

5. Human–Agent Comparison

AIRTBench is the first framework to provide parity between agent and human red teamers:

The typical time to solve a hard challenge is minutes for agents, but hours to days for human experts, yielding a reported efficiency advantage exceeding 5,000× for top-performing LLM agents.
While agents are efficient in high-throughput, code-driven exploration, humans may outperform in multimodal, ambiguous, or real-world contextual tasks that require experience or improvisation.
On standard and medium-difficulty tasks, leading agents match or exceed average human solve rates; on the most challenging cases, a subset remains unsolved by both agents and most human teams.

This benchmark thus calibrates real-world expectations for AI-vs-human red teaming in a rigorous, transparently scored setting.

6. Significance for AI/ML Security

AIRTBench addresses a crucial gap in benchmark development for AI/ML security:

Breadth and Realism: Tasks are curated to reflect the diversity and complexity of contemporary AI security risks, with explicit linkage to MITRE ATLAS and OWASP LLM topologies.
Human–Model Parity: The harness, input, and success criteria are identical for humans and agents, enabling outcome comparisons and supporting meaningful evaluation of red team automation progress.
Diagnostic Clarity: Fine-grained statistics across task types and categories provide actionable insights for both system defenders and deployers regarding model weaknesses, strengths, and typical failure modes.
Open, Evolvable, and Mechanistically Verifiable: Open-source implementation, standardized metrics, and community extensibility lay the groundwork for continuous evolution and adoption.

AIRTBench uniquely supports the systematic improvement, auditing, and risk management of AI/ML deployments facing adversarial threats.

7. Resources, Community, and Future Directions

AIRTBench is released as an open-source project, including:

Complete codebase and technical documentation at https://github.com/dreadnode/AIRTBench-Code
Comprehensive challenge specification, API logic, and per-challenge statistics to maximize reproducibility and transparency.
The structure supports routine expansion with new challenge types and difficulty levels as adversarial practice evolves.

A plausible implication is that systematic red teaming benchmarks such as AIRTBench will become a baseline requirement for AI/ML model deployment in sensitive or adversarial settings, supporting both attack surface reduction and post-deployment vulnerability analysis.

In summary, AIRTBench establishes a reliable, open, and mechanistically scored framework for measuring and comparing autonomous red teaming capabilities in LLMs. It offers a diagnostic and competitive platform for both academic research and applied security operations within the rapidly evolving discipline of machine learning security.

PDF Markdown Chat (Pro)

References (1)

AIRTBench: Measuring Autonomous AI Red Teaming Capabilities in Language Models (2025)

Follow Topic

Get notified by email when new papers are published related to AIRTBench.

AIRTBench: AI Red Team Benchmark

1. Objectives and Benchmark Scope

2. Structure, Challenge Design, and Taxonomy

Challenge Format and Environment

Challenge Categories

3. Evaluation and Metrics

4. Model Performance and Analysis

5. Human–Agent Comparison

6. Significance for AI/ML Security

7. Resources, Community, and Future Directions

Follow Topic

Continue Learning

Don't miss out on important new AI/ML research

AIRTBench: AI Red Team Benchmark

1. Objectives and Benchmark Scope

2. Structure, Challenge Design, and Taxonomy

Challenge Format and Environment

Challenge Categories

3. Evaluation and Metrics

4. Model Performance and Analysis

5. Human–Agent Comparison

6. Significance for AI/ML Security

7. Resources, Community, and Future Directions

Follow Topic

Continue Learning

Related Topics

Don't miss out on important new AI/ML research