AIRTBench: Measuring Autonomous AI Red Teaming Capabilities in Language Models (2506.14682v1)

Published 17 Jun 2025 in cs.CR

Abstract: We introduce AIRTBench, an AI red teaming benchmark for evaluating LLMs' ability to autonomously discover and exploit Artificial Intelligence and Machine Learning (AI/ML) security vulnerabilities. The benchmark consists of 70 realistic black-box capture-the-flag (CTF) challenges from the Crucible challenge environment on the Dreadnode platform, requiring models to write python code to interact with and compromise AI systems. Claude-3.7-Sonnet emerged as the clear leader, solving 43 challenges (61% of the total suite, 46.9% overall success rate), with Gemini-2.5-Pro following at 39 challenges (56%, 34.3% overall), GPT-4.5-Preview at 34 challenges (49%, 36.9% overall), and DeepSeek R1 at 29 challenges (41%, 26.9% overall). Our evaluations show frontier models excel at prompt injection attacks (averaging 49% success rates) but struggle with system exploitation and model inversion challenges (below 26%, even for the best performers). Frontier models are far outpacing open-source alternatives, with the best truly open-source model (Llama-4-17B) solving 7 challenges (10%, 1.0% overall), though demonstrating specialized capabilities on certain hard challenges. Compared to human security researchers, LLMs solve challenges with remarkable efficiency completing in minutes what typically takes humans hours or days-with efficiency advantages of over 5,000x on hard challenges. Our contribution fills a critical gap in the evaluation landscape, providing the first comprehensive benchmark specifically designed to measure and track progress in autonomous AI red teaming capabilities.

PDF Abstract

AIRTBench: Assessing AI Red Teaming Capabilities

In the paper titled "AIRTBench: Measuring Autonomous AI Red Teaming Capabilities in LLMs," the authors offer a detailed exploration into the evaluation of LLMs' (LLMs) ability to autonomously discover and exploit vulnerabilities within AI/ML systems. Introducing AIRTBench, a comprehensive benchmark suite, they provide 70 realistic capture-the-flag (CTF) challenges. These challenges are hosted on the Crucible challenge environment via the Dreadnode platform.

Numerical Insights

Results indicate Claude-3.7-Sonnet's superiority, solving 61.4% of challenges, closely followed by Gemini-2.5-Pro at 55.7% and GPT-4.5 at 48.6%. The paper notes substantial disparities between frontier and open-source models, with Llama-4-17B only solving 10% of challenges. The paper underscores how frontier models are highly efficient at prompt injection yet struggle with system exploitation and model inversion, with a success rate under 26% for the latter. These gaps highlight evolving boundaries in AI red teaming capabilities.

Practical Implications

AIRTBench aims to bridge the gap between academic research and real-world security applications. By benchmarking LLMs' autonomous security capabilities, it offers actionable insights for SOC teams, red teams, penetration testers, and security engineers. Additionally, the work provides vulnerability management teams with concrete examples categorized under MITRE ATLAS and OWASP frameworks. This enables a comprehensive understanding of diverse attack vectors relevant to design safeguards, prioritize mitigations, and anticipate future threats in AI-dependent infrastructures.

Novel Evaluation Framework

AIRTBench's methodology diverges from traditional scraped CTF challenges, employing an environment that simulates authentic adversarial conditions through black-box scenarios. By maintaining human-model parity and leveraging mechanistic verification for solution submission, it allows for direct comparability of human versus AI problem-solving. Challenges defined within common security task types provide granular assessments across difficulty levels.

Future Directions

The findings set benchmarks for tracking progress in AI red teaming capabilities. As AI models are increasingly applied in critical sectors, the paper underscores the importance of continuously expanding evaluation tools to cover emerging attack vectors and compute-enabled strategies. The future focus of initiatives like AIRTBench will revolve around refining benchmark tools and integrating extendable challenge sets to ensure robust evaluation standards remain pertinent.

Economic Considerations

With token usage demonstrating a direct correlation to cost, the paper offers insights into computational expenses. Successful runs are notably more cost-effective compared to unsuccessful ones, outlining the need for efficient resource allocation in security assessments leveraging commercial model APIs.

AIRTBench's contributions significantly augment the understanding of autonomous AI's role in proactive security testing, providing a structured framework reflecting real-world security challenges that models may encounter. As the landscape of cybersecurity continues to evolve, benchmarks like AIRTBench are invaluable assets in advancing AI's capability as an autonomous red teaming agent.

PDF Markdown Bookmark Chat (Pro)

Authors (4)

Ads Dawson (2 papers)
Rob Mulla (3 papers)
Nick Landers (3 papers)
Shane Caldwell (1 paper)

Related Papers

Find Related Papers

Tweets

https://twitter.com/hetmehtaa/status/1936161422356226122

https://twitter.com/Sujeet/status/1936175727369764998

YouTube

Show All Videos