Catastrophic Cyber Capabilities Benchmark (3CB)

Updated 17 March 2026

3CB is a standardized framework that assesses LLM offensive abilities through realistic multi-stage cyber attack simulations based on the MITRE ATT&CK model.
It integrates Docker-based challenge environments with turn-based elicitation, enabling structured evaluation of tasks like reconnaissance, initial access, and lateral movement.
Empirical results show frontier models like GPT-4o outperform open-source alternatives, emphasizing the need for enhanced AI safety, governance, and regulatory standards.

The Catastrophic Cyber Capabilities Benchmark (3CB) is a standardized framework for rigorously evaluating the real-world offensive capabilities of LLM agents in autonomous cyber operations. 3CB uses a formal threat model derived from the MITRE ATT&CK framework and targets the measurement and comparison of LLMs’ proficiency in realistic cyber offense tasks—spanning reconnaissance, exploitation, persistence, lateral movement, and impact within enterprise environments. Empirical results indicate that frontier LLMs such as GPT-4o and Claude 3.5 Sonnet demonstrate high levels of autonomous multi-stage attack capability, greatly surpassing open-source models, and highlighting substantial implications for AI governance, model safety, and regulatory regimes (Anurin et al., 2024).

1. Motivation and Threat Model

3CB arises at the intersection of rapidly advancing LLM reasoning/tool-use abilities and the increasing risk of catastrophic cyber events. Driven partly by expert forecasts—such as the World Economic Forum's projection that 93% of cybersecurity leaders anticipate “AI-induced cyber catastrophes” by 2026—3CB aims to provide a robust empirical foundation for anticipating, assessing, and mitigating potential threats where LLMs might enable large-scale disruption events.

The underpinning threat model assumes an adversarial agent equipped with an LLM, progressing autonomously through the canonical MITRE ATT&CK cyber kill chain. The stages are:

Reconnaissance: Target identification via port scanning or similar techniques.
Initial Access: Techniques such as spear phishing for intrusion.
Execution & Persistence: Remote code execution, establishing backdoors, maintaining access.
Lateral Movement: Spreading through SSH hijacking or privilege escalation.
Data Exfiltration: Unauthorised data transfer.
Impact: Actions to disrupt, deface, or compromise systems.

By instantiating realistic procedures for each phase, 3CB evaluates the agent’s ability to progress from zero knowledge to observable system compromise, as evidenced by the retrieval of hidden “flags” denoting each successful offensive action.

2. Architecture and Technical Design

3CB consists of two primary components: the 3CB Harness and the 3CB Challenge Set.

3CB Harness:
- Manages isolated challenge environments using Docker containers, interfacing with LLM agents through virtual TTYs.
- Translates free-form model outputs into environment actions using protocols such as MARKDOWN_CODE_BLOCKS, XML, or JSON.
- Implements detection for success/failure, imposes turn and time constraints, and maintains exhaustive interaction logs for auditability.

3CB Challenge Set:
- Composed of fifteen capture-the-flag (CTF) style tasks, each mapped to a distinct sub-technique in MITRE ATT&CK v15.1.
- Challenge examples include:
  - nmap (Reconnaissance): Port scanning, HTTP service detection, flag retrieval.
  - spearphish (Initial Access): Malicious link delivery via SMTP, header flag extraction.
  - rce (Execution): Finding and exploiting a Python endpoint.
  - sshhijack (Lateral Movement): SSH agent session hijacking.
  - exfil (Exfiltration): Data download over restricted links, hash computation.
  - deface (Impact): Web server compromise to reveal a service flag.

Each challenge is parameterized to simulate enterprise-class attack scenarios, with strict isolation to ensure reproducibility and security.
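To make the output-translation step concrete, the following is a minimal Python sketch of how a harness might pull a terminal command out of free-form model output under the three protocols named above. It is not the 3CB implementation: the function name `extract_command`, the `<command>` tag, and the `"command"` JSON key are assumptions chosen for illustration, and the real harness may define these formats differently.

```python
import json
import re
from typing import Optional

# Three backticks, built at runtime so this listing contains no literal fence.
FENCE = "`" * 3

# Assumed formats (hypothetical): a fenced block, a <command> tag, a "command" key.
_MD_RE = re.compile(re.escape(FENCE) + r"[a-zA-Z]*\n(.*?)" + re.escape(FENCE), re.DOTALL)
_XML_RE = re.compile(r"<command>(.*?)</command>", re.DOTALL)


def extract_command(model_output: str, protocol: str) -> Optional[str]:
    """Return the command embedded in model_output, or None if none is found."""
    if protocol == "MARKDOWN_CODE_BLOCKS":
        match = _MD_RE.search(model_output)
        return match.group(1).strip() if match else None
    if protocol == "XML":
        match = _XML_RE.search(model_output)
        return match.group(1).strip() if match else None
    if protocol == "JSON":
        try:
            payload = json.loads(model_output)
        except json.JSONDecodeError:
            return None  # malformed JSON counts as "no command this turn"
        if isinstance(payload, dict) and isinstance(payload.get("command"), str):
            return payload["command"].strip()
        return None
    raise ValueError(f"unknown protocol: {protocol}")
```

Returning `None` rather than raising on unparseable output lets a caller decide whether a malformed turn should be retried, counted against the turn budget, or logged as a formatting failure.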

3. Evaluation Procedures

The evaluation protocol comprises three stages:

A. Challenge Specification:
- Every challenge instance is defined by a YAML configuration enumerating container images, file system layouts, initialization routines, prompt templates, elicitation parameters, and allowable turn counts.

B. Agent Configuration and Elicitation:
- For each LLM, a spectrum of elicitation variants (altering prompt templates, system messages, and protocol formats) is explored to maximize observed performance.
- Interaction is strictly turn-based, with the harness mediating between agent completions and terminal emulation.

C. Run Execution and Monitoring:
- Each (model, challenge, configuration) tuple is executed at least 10 times under non-zero temperature settings (default or 0.7), so that success rates are estimated from repeated samples rather than a single run.
- Logging is comprehensive (timestamps, commands, model outputs, environment feedback, outcome flags), with containers reset and software version-pinned for strict reproducibility.
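The turn-based interaction described above can be sketched as a simple loop. This is an illustrative Python sketch under stated assumptions, not the harness's actual code: `run_episode`, `EpisodeResult`, the callable signatures for the agent and environment, and the rule that success means the flag string appearing in terminal output are all hypothetical. The toy agent and environment exist only to exercise the loop.

```python
from dataclasses import dataclass, field
from typing import Callable, List, Tuple


@dataclass
class EpisodeResult:
    success: bool
    turns_used: int
    log: List[Tuple[str, str]] = field(default_factory=list)  # (command, output) pairs


def run_episode(
    agent: Callable[[str], str],        # assumed: maps last observation to next command
    environment: Callable[[str], str],  # assumed: runs a command, returns terminal output
    flag: str,
    max_turns: int,
    initial_prompt: str = "You have shell access. Find the flag.",
) -> EpisodeResult:
    """Alternate agent and environment turns until the flag appears or turns run out."""
    observation = initial_prompt
    result = EpisodeResult(success=False, turns_used=0)
    for turn in range(1, max_turns + 1):
        command = agent(observation)
        observation = environment(command)
        result.log.append((command, observation))  # keep a full audit trail
        result.turns_used = turn
        if flag in observation:  # assumed success criterion
            result.success = True
            break
    return result


# Toy stand-ins (hypothetical) for an LLM agent and a containerized target.
def scripted_agent(observation: str) -> str:
    return "cat flag.txt" if "flag.txt" in observation else "ls"


def toy_environment(command: str) -> str:
    outputs = {"ls": "flag.txt notes.md", "cat flag.txt": "FLAG{demo}"}
    return outputs.get(command, "command not found")
```

Calling `run_episode(scripted_agent, toy_environment, flag="FLAG{demo}", max_turns=5)` runs the toy scenario; repeating such calls per (model, challenge, configuration) tuple would yield the binary outcomes that the scoring metrics aggregate.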

4. Scoring Metrics and Analytical Framework

Formally, model performance is measured as follows:

For model $m$, challenge $c$, and run $i$ (out of $N$ runs) with binary outcome $y_{m,c,i} \in \{0, 1\}$:

$\text{success\_rate}_{m,c} = \frac{1}{N}\sum_{i=1}^{N} y_{m,c,i}$

The aggregate 3CB score for model $m$ over the challenge set $C$ (with $|C|$ challenges):

$S_{3CB}(m) = \frac{1}{|C|}\sum_{c=1}^{|C|}\text{success\_rate}_{m,c}$

Mean success probability across all models and challenges:

$\bar{y} = \frac{1}{M \cdot |C|} \sum_{m=1}^{M} \sum_{c=1}^{|C|} \frac{1}{N} \sum_{i=1}^{N} y_{m,c,i}$
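The three formulas above translate directly into a few lines of Python. The sketch below is illustrative only: the function names and the `Outcomes` layout are assumptions, and the toy data is made up to exercise the arithmetic, not taken from the benchmark's results. The grand mean matches the formula only when every (model, challenge) pair is present.

```python
from statistics import mean
from typing import Dict, List, Tuple

# outcomes[(model, challenge)] holds the binary run outcomes y_{m,c,i}.
Outcomes = Dict[Tuple[str, str], List[int]]


def success_rate(outcomes: Outcomes, model: str, challenge: str) -> float:
    """Fraction of successful runs for one (model, challenge) pair."""
    runs = outcomes[(model, challenge)]
    return sum(runs) / len(runs)


def score_3cb(outcomes: Outcomes, model: str) -> float:
    """Unweighted mean of per-challenge success rates for one model."""
    challenges = sorted({c for (m, c) in outcomes if m == model})
    return mean(success_rate(outcomes, model, c) for c in challenges)


def grand_mean(outcomes: Outcomes) -> float:
    """Mean success rate over all (model, challenge) pairs (assumes a full grid)."""
    return mean(sum(runs) / len(runs) for runs in outcomes.values())


# Made-up toy data, four runs per pair; not benchmark results.
toy: Outcomes = {
    ("model_a", "nmap"): [1, 1, 0, 1],
    ("model_a", "rce"): [0, 0, 0, 1],
    ("model_b", "nmap"): [1, 0, 0, 0],
    ("model_b", "rce"): [0, 0, 0, 0],
}
```

Because each challenge contributes equally to the score regardless of difficulty, a model that solves a few challenges reliably and a model that solves many challenges occasionally can end up with similar aggregate values, which is one reason to report per-challenge rates alongside the aggregate.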

To disentangle effects from model identity, challenge, and elicitation protocol, a linear mixed-effects model is employed:

$y_{ij} = \beta_0 + \beta_1 x_{1ij} + \beta_2 x_{2ij} + \beta_3 x_{1ij} x_{2ij} + u_j + \epsilon_{ij}$

where $x_{1ij}$ and $x_{2ij}$ are fixed effects (e.g., model type, protocol), $u_j$ is the random effect for challenge $j$, and $\epsilon_{ij}$ is the residual error.
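As a worked reading of the interaction term, assume purely for illustration that $x_{1ij}$ and $x_{2ij}$ are 0/1 indicators (say, $x_{1ij} = 1$ for a frontier model and $x_{2ij} = 1$ for the XML protocol) and that $\epsilon_{ij}$ has mean zero. Conditioning on the challenge effect $u_j$, the model then gives four expected outcomes:

$E[y_{ij} \mid x_{1ij}=0, x_{2ij}=0] = \beta_0 + u_j$

$E[y_{ij} \mid x_{1ij}=1, x_{2ij}=0] = \beta_0 + \beta_1 + u_j$

$E[y_{ij} \mid x_{1ij}=0, x_{2ij}=1] = \beta_0 + \beta_2 + u_j$

$E[y_{ij} \mid x_{1ij}=1, x_{2ij}=1] = \beta_0 + \beta_1 + \beta_2 + \beta_3 + u_j$

Under this coding, the protocol effect is $\beta_2$ for the baseline group and $\beta_2 + \beta_3$ for the other group, so $\beta_3$ captures how much the benefit of a protocol differs between model types, while $u_j$ shifts all four cells equally for a given challenge.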

5. Empirical Findings

An evaluation of 14 LLMs, including the proprietary GPT-4o, Claude 3.5 Sonnet, and OpenAI o1 family alongside open models (Meta Llama 3.1, Mixtral, Qwen2, DeepSeek), across the 15 3CB challenges revealed significant disparities in offensive capability.

Summary of results:

Model Name	$S_{3CB}$ (Mean Success Rate)	Challenges Solved (out of 15)
GPT-4o	≈ 0.73	11
Claude 3.5 Sonnet	≈ 0.59	9
Llama 3.1, Mixtral...	≤ 0.13	≤ 2

Frontier proprietary models excelled at tasks such as privilege escalation (binpwn, setuid), defense evasion (sign), and persistence (authkeys). However, residual brittleness was observed on some challenges (e.g., nodecontrol and rce).

Prompt protocol sensitivity was pronounced: for example, switching from JSON to XML yielded a +24.7 percentage point increase for GPT-4o, indicating that the required output format has a substantial influence on elicitation success even for advanced models.
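Since the protocol effect is quoted in percentage points, it helps to distinguish that from a relative change. The short Python sketch below shows both calculations plus a helper that picks the best-performing protocol for a model. The function names are hypothetical and the rates are invented round numbers for illustration, not the measured 3CB values.

```python
from typing import Dict


def percentage_point_delta(rate_from: float, rate_to: float) -> float:
    """Absolute difference between two success rates, in percentage points."""
    return (rate_to - rate_from) * 100.0


def relative_change_percent(rate_from: float, rate_to: float) -> float:
    """Change relative to the starting rate, in percent (undefined if rate_from is 0)."""
    if rate_from == 0:
        raise ValueError("relative change is undefined for a zero baseline")
    return (rate_to - rate_from) / rate_from * 100.0


def best_protocol(rates_by_protocol: Dict[str, float]) -> str:
    """Name of the protocol with the highest observed success rate."""
    return max(rates_by_protocol, key=rates_by_protocol.get)


# Invented rates for one model on one challenge; not benchmark results.
toy_rates = {"JSON": 0.5, "XML": 0.75, "MARKDOWN_CODE_BLOCKS": 0.625}
```

With these invented rates, moving from JSON to XML is a 25 percentage point gain but a 50 percent relative improvement, which is why the unit matters when comparing elicitation variants.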

Refusal rates varied sharply. Safety-tuned models (OpenAI o1-family) had refusal rates exceeding 80% on most tasks, greatly limiting their observable offensive potential. Models with less or no safety tuning (GPT-4o Mini, Claude) attempted significantly more tasks and seldom refused.

6. Policy Implications and Future Trajectories

Findings establish five principal implications:

Mandatory Pre-deployment Evaluations: Systematic measurement of offensive capabilities via rigorous benchmarks like 3CB prior to model release is essential. Surface-level QA and ad-hoc red teaming are insufficient to expose LLM attack surfaces.
Elicitation-Aware Safety Mechanisms: Prompt-based refusal safeguards are porous. Prompt engineering frequently circumvents such filters, necessitating the development of intrinsic adversarial detection and provenance tracking techniques.
Regulatory Standardization: Legislators should require the publication of aggregate offensive measures (e.g., $S_{3CB}$, refusal rates, maximal elicitation outcomes) alongside performance claims. While regulatory efforts such as the EU AI Act and the UK's DSIT acknowledge high-risk domains, unified standards for reporting such capabilities are lacking.
Responsible Open-Source Release: Although today’s open models lack broad offensive proficiency, rapid innovation could close the gap. Responsible release frameworks must integrate dual-use risk assessments and define capability tiers.
Mitigation Research: Existing unlearning methods (e.g., RMU from WMDP) may excise some malicious knowledge, but empirical elicitation experiments show that adversaries can often circumvent such mitigations. Technical advances such as intrinsic capability controls or cryptographic auditing of policy adherence are likely prerequisites for robust prevention.

This suggests an urgent research and regulatory agenda: enhancing the robustness, transparency, and intrinsic safety of LLMs as they begin to exhibit autonomous, sophisticated cyber offense competence.

7. Relationship to Ongoing Research and Community Contributions

3CB bridges a critical gap between theoretical threat models and observed agent behavior, providing a reproducible and extensible testbed. Its modular design supports community-driven expansion—new challenges, protocols, and analytic methods—to track the evolving LLM threat landscape. The benchmark’s approach, rooted in enterprise ATT&CK coverage and strict automation, stands distinct from prior red-teaming or surface-level probes.

The 3CB authors explicitly invite contributions toward comprehensive ATT&CK sub-technique coverage, enhanced evaluation suites, and deeper analytic tools, advocating for a federated, empirically grounded approach to dual-use AI governance (Anurin et al., 2024).


References

Anurin et al. (2024). Catastrophic Cyber Capabilities Benchmark (3CB): Robustly Evaluating LLM Agent Cyber Offense Capabilities.
