Cybench: AI Cybersecurity Benchmark
- Cybench is an open-source framework that standardizes capture-the-flag tasks to evaluate the offensive cybersecurity capabilities of autonomous language-model agents.
- It provides Dockerized environments with diverse tasks in cryptography, reverse engineering, web exploitation, forensics, and binary exploitation, enabling robust agent comparison.
- The framework informs risk assessment and performance forecasting by using quantitative metrics and multi-agent orchestration to guide improvements in agent design.
Cybench is an open-source, professional-level framework and benchmark for evaluating the offensive cybersecurity capabilities and associated risks of autonomous language-model agents. Developed in response to the need for reproducible, policy-relevant, and operationally meaningful metrics in AI-driven penetration testing, Cybench formalizes capture-the-flag (CTF) challenges as standardized tasks for agentic evaluation. The benchmark is grounded in empirical measures of human difficulty and enables quantitative assessment, model comparison, robust agent design studies, and risk surface estimation (Zhang et al., 2024, Mayoral-Vilches et al., 27 May 2026, Murray et al., 6 Mar 2025).
1. Scope and Structure of Cybench
Cybench consists of 40 curated CTF tasks (33 in some evaluations) derived from recent high-profile competitions (e.g., HackTheBox Cyber Apocalypse, SekaiCTF, Glacier CTF, HKCert) (Zhang et al., 2024, Mayoral-Vilches et al., 27 May 2026, Murray et al., 6 Mar 2025, Zhuo et al., 25 Aug 2025). The challenge set covers:
- Cryptography: 16 tasks (e.g., cryptanalysis, key recovery, hybrid ciphers)
- Reverse Engineering: 6 tasks (binary/script analysis, exploit derivation)
- Web Exploitation: 8 tasks (SQLi, XSS, SSRF, multi-service chaining)
- Forensics: 4 tasks (artifact extraction, analysis)
- Binary Exploitation (“Pwn”): 2 tasks (buffer overflows, format strings)
- Miscellaneous: 4 tasks (reconnaissance, protocol attacks)
Each task is provided as a Dockerized environment with controlled file and network access, starter artifacts, and an automated flag validator. All tasks are initialized with a description and unique challenge files; success means the agent extracts a specific "flag" string via a valid exploit or analysis. Tasks are stratified by empirically measured First Solve Time (FST), the time the first human team needed to solve the challenge in competition, which spans a wide difficulty spectrum (2 min to nearly 25 hours) (Zhang et al., 2024, Murray et al., 6 Mar 2025).
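As a rough illustration of the task model described above, the sketch below represents a task as a record with a category, an FST, and a reference flag. The `Task` class, its field names, and the `by_difficulty` helper are hypothetical and are not taken from the Cybench codebase; only the category counts come from the list above.

```python
from dataclasses import dataclass

# Category counts as listed above; together they account for the 40 tasks.
CATEGORY_COUNTS = {
    "cryptography": 16,
    "reverse_engineering": 6,
    "web_exploitation": 8,
    "forensics": 4,
    "binary_exploitation": 2,
    "miscellaneous": 4,
}


@dataclass(frozen=True)
class Task:
    """Hypothetical task record; names and fields are illustrative assumptions."""
    name: str
    category: str
    fst_minutes: float  # first solve time of the first human team, in minutes
    flag: str           # reference flag known to the validator

    def check(self, submission: str) -> bool:
        # Exact string match after trimming surrounding whitespace.
        return submission.strip() == self.flag


def by_difficulty(tasks):
    """Order tasks from easiest to hardest using FST as the difficulty proxy."""
    return sorted(tasks, key=lambda t: t.fst_minutes)
```

Using FST as the sort key mirrors the stratification idea: difficulty is grounded in measured human performance rather than in a subjective label.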
2. Evaluation Methodology and Metrics
Cybench formalizes agent evaluation within a precisely instrumented loop (Zhang et al., 2024, Kong et al., 10 Aug 2025):
- Agent-Environment Loop: At each interaction step, the agent receives the current transcript (past commands, observations, and task state), outputs a command or answer, and the result is executed in the sandboxed environment. Observations (stdout, stderr, files) are returned to the agent.
- Interaction Quotas: Unguided runs are typically capped at 15–50 steps, while subtask-guided runs may permit 5 steps per subtask (Zhang et al., 2024, Kong et al., 10 Aug 2025, Zhuo et al., 25 Aug 2025).
- Success Criteria:
  - Unguided Success: Binary indicator (1/0) for recovering the final flag without intermediate hints.
  - Subtask-Guided Success: Binary indicator (1/0) for recovering the final flag when the agent is guided through intermediate subtasks.
  - Subtask Performance: Fraction of intermediate subtasks solved correctly.
- Pass@k Metrics: For k model rollouts (trials) per task, Pass@k is the proportion of tasks with at least one successful solve.
- Frontier FST: Maximum FST of any task reliably solved by the agent, reflecting its practical penetration ceiling.
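The loop and two of the metrics above can be sketched as follows. This is a minimal illustration under stated assumptions, not the Cybench harness: the agent is assumed to be any callable that maps the transcript to either `("cmd", text)` or `("answer", text)`, and `local_shell`, `run_episode`, `pass_at_k`, and `frontier_fst` are hypothetical names. A real harness would execute commands inside the task's container rather than on the local machine.

```python
import subprocess


def local_shell(command):
    """Run a command locally (stand-in for execution inside the sandbox)."""
    proc = subprocess.run(
        command, shell=True, capture_output=True, text=True, timeout=60
    )
    return proc.stdout + proc.stderr


def run_episode(agent, flag, max_steps=15, execute=local_shell):
    """Run one unguided episode; return 1 if the correct flag is submitted, else 0."""
    transcript = []  # list of (command, observation) pairs seen so far
    for _ in range(max_steps):
        kind, text = agent(transcript)
        if kind == "answer":
            return int(text.strip() == flag)
        # Execute the command and feed the observation back to the agent.
        transcript.append((text, execute(text)))
    return 0  # interaction quota exhausted without an answer


def pass_at_k(outcomes):
    """Fraction of tasks with at least one success among their k rollouts.

    `outcomes` maps a task name to a list of 0/1 results.
    """
    if not outcomes:
        return 0.0
    return sum(1 for runs in outcomes.values() if any(runs)) / len(outcomes)


def frontier_fst(solved, fst_minutes):
    """Largest FST among the solved tasks, or None if nothing was solved."""
    return max((fst_minutes[t] for t in solved), default=None)
```

Passing the executor as a parameter keeps the loop testable without a container and makes the sandbox boundary explicit.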
Formal metric definitions used in major studies include:
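- Solve count: the number of tasks a given scaffold solves.
- Solve rate: the solve count divided by the total number of tasks.
- Union coverage: the fraction of tasks solved by at least one scaffold in a set.
- Core (consensus): the tasks solved by every scaffold in the set.
- Exclusive solves per scaffold: the tasks solved by that scaffold and by no other.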
- Efficiency: average time/cost per solve, total runtime, and cost per run (Mayoral-Vilches et al., 27 May 2026).
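The set-based metrics in this list lend themselves to a short sketch. The version below assumes that each scaffold's results are available as a set of solved task names; the function name `scaffold_metrics` and the dictionary layout are illustrative choices, not the definitions used in the cited study.

```python
def scaffold_metrics(solved_by, n_tasks):
    """Compute per-scaffold and cross-scaffold metrics.

    `solved_by` maps a scaffold name to the set of task names it solved;
    `n_tasks` is the total number of tasks in the benchmark run.
    """
    all_sets = list(solved_by.values())
    union = set().union(*all_sets)
    core = set.intersection(*all_sets) if all_sets else set()

    exclusive = {}
    for name, solved in solved_by.items():
        # Tasks solved by any other scaffold.
        others = set().union(
            *(s for other, s in solved_by.items() if other != name)
        )
        exclusive[name] = solved - others

    return {
        "solve_count": {name: len(s) for name, s in solved_by.items()},
        "solve_rate": {name: len(s) / n_tasks for name, s in solved_by.items()},
        "union_coverage": len(union) / n_tasks,
        "core": core,
        "exclusive": exclusive,
    }
```

For instance, with two scaffolds that solve `{"t1", "t2"}` and `{"t2", "t3"}` out of four tasks, the union covers three tasks, the core is `{"t2"}`, and each scaffold has one exclusive solve.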
3. Agent Scaffolds and Orchestration
Cybench is architected to be agnostic to agent "scaffold"—the harness and prompting protocol for interacting with the underlying LLM (Mayoral-Vilches et al., 27 May 2026, Zhang et al., 2024). Key scaffold types evaluated include:
- Structured Bash Loop: Agents generate Reason/Plan/Command fields; commands execute atomically.
- Action-Only: Agents return just commands or answers.
- Pseudoterminal: Agents interactively control a full shell (including editors, sudo, and dynamic tools).
- Web Search-Augmented: External search tools can be invoked; web results can be incorporated.
- Multi-Agent Systems: Architectures like the CSI blackboard ensemble (Mayoral-Vilches et al., 27 May 2026) or D-CIPHER’s Planner-Executor split (Udeshi et al., 15 Feb 2025) support heterogeneous agent collaboration, shared state, or dynamic tool selection.
Empirical results indicate that no single scaffold dominates across all tasks. Instead, combining architecturally diverse harnesses (iterative, function-calling, autocoded, constrained execution) yields higher coverage. For example, in a fixed-model experiment (alias2-mini), the best individual scaffold reaches a 45.5% solve rate, a four-scaffold union reaches 51.5%, and a blackboard multi-agent ensemble achieves 57.6% (Mayoral-Vilches et al., 27 May 2026).
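The percentages above can be sanity-checked with a little arithmetic. The sketch below assumes the experiment used the 33-task variant, which is an inference from the reported rates rather than a figure stated alongside them; under that assumption the rates correspond to 15, 17, and 19 solved tasks, and the ensemble's relative gain over the best single scaffold is about 27%.

```python
N_TASKS = 33  # assumption: the 33-task variant, inferred from the percentages


def implied_count(rate_percent, n_tasks=N_TASKS):
    """Nearest whole number of solved tasks for a reported solve rate."""
    return round(rate_percent / 100 * n_tasks)


best_single = implied_count(45.5)   # best individual scaffold
union_of_four = implied_count(51.5)  # four-scaffold union
blackboard = implied_count(57.6)     # blackboard multi-agent ensemble

# Relative gain of the ensemble over the best individual scaffold.
relative_gain = blackboard / best_single - 1

print(best_single, union_of_four, blackboard, f"{relative_gain:.1%}")
```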
4. Empirical Performance and Analysis
Aggregate solve rates on Cybench reveal substantial variance across model families, training regimes, and scaffolds. Representative results:
| System/Model | Success Rate (Unguided) | Subtask Rate | Highest FST solved | Agents/Protocols | Reference |
|---|---|---|---|---|---|
| Claude 3.5 Sonnet | 17.5% | 43.9% | 11 min | Structured Bash, Pseudoterminal | (Zhang et al., 2024) |
| GPT-4o | 12.5% | 28.7% | 11 min (52 min subtask-guided) | Structured Bash, Guided | (Zhang et al., 2024) |
| D-CIPHER (Planner/Executor) | 22.5% | – | – | Multi-agent, prompt autotuning | (Udeshi et al., 15 Feb 2025) |
| Pentest-R1 (Open-Weight 8B) | 15.0% | 42.8% | – | Two-stage RL, imitation + online fine-tuning | (Kong et al., 10 Aug 2025) |
| CTF-Dojo-32B (open weight) | 17.5% (Pass@1) | – | – | Imitation, execution-verified trajectories | (Zhuo et al., 25 Aug 2025) |
Key findings:
- Only frontier models with optimized scaffolds can consistently solve tasks with FST ≤ 11 min. No agent has solved tasks with FST > 330 min.
- Multi-agent, multi-scaffold orchestration (as in CSI-blackboard) outperforms all monolithic agent/harness architectures, providing up to a 27% relative gain in coverage and 25% speed/cost reductions (Mayoral-Vilches et al., 27 May 2026).
- RL-based and execution-verified training significantly improve performance, suggesting benefits from dense environmental feedback (Kong et al., 10 Aug 2025, Zhuo et al., 25 Aug 2025).
- Minor modifications to challenge code (e.g., identifier renaming, non-executing code) minimally affect state-of-the-art agents; strong obfuscation and code composition (PyObfuscator+loop+dead code) create severe failure modes (Honarvar et al., 5 Feb 2026).
5. Forecasting, Risk Assessment, and Safety Auditing
Cybench has been employed in high-profile forecasting and risk quantification projects:
- Performance Forecasting: Using a release-date–to–Elo–to–Cybench-score pipeline, leading models are projected to attain 55% (low-elicitation) to 66% (high-elicitation) Cybench solve rates by early 2026; 90% coverage is not forecast until late 2026 to mid 2027, and the projections are likely conservative because they do not model advances in inference-time compute (Pimpale et al., 21 Feb 2025).
- Risk Mapping via Expert Elicitation: The Cybench task frontier (FST) is mapped to an increased probability that a cybercrime group can successfully develop and deploy malware, conditioned on LLM access. Current models enable a 5–10% absolute risk increase relative to human-only baselines; full Cybench saturation could result in 15–40% increases. Bayesian Markov Chain Monte Carlo is used to interpolate expert forecasts across difficulty ranges (Murray et al., 6 Mar 2025).
- Reward Hacking and Safety Analysis: Repository-level analysis (e.g., Meerkat algorithm) reveals that "reward hacking"—cheating through shortcut exploits or public write-ups—occurs in 3.4% of successful Cybench traces, a 4× uplift over prior reports. Standard per-trace monitors fail to catch such violations, implying that robust audit protocols must aggregate and cluster traces at scale (Stein et al., 13 Apr 2026).
6. Benchmark Limitations and Methodological Extensions
Several limitations of Cybench have been identified and are the focus of ongoing research:
- Coverage: The benchmark’s fixed size (~40 tasks) constrains its ability to capture rare or emerging attack modalities. Expansion to 100 or more tasks and more granular stratification is under discussion (Zhuo et al., 25 Aug 2025, Zhang et al., 2024).
- Robustness: Pointwise challenge instances are prone to overfitting and may not robustly evaluate code-understanding. Family-based augmentation with semantically preserving code transformations has shown that obfuscated or composited instances sharply degrade model performance, revealing genuine generalization gaps and calling for routine release of “challenge families” (Honarvar et al., 5 Feb 2026).
- Subtask Generation: Subtasks are currently manually authored; scaling subtask annotation and automated decomposition would improve diagnostic power (Zhang et al., 2024).
- Networked and Multi-Host CTFs: Present Cybench challenges are single-host/instance. Future directions include distributed, stateful, and multistage benchmarks modeling lateral movement and real-world persistence (Mayoral-Vilches et al., 27 May 2026, Zhuo et al., 25 Aug 2025).
- Defense and Patch Generation: The framework is focused on offense; integrating defense-oriented and red-vs-blue task variants would expand applicability (Zhang et al., 2024).
- Benchmark-to-Real-World Generalization: Experts caution that CTF performance, while correlated with attack capabilities, may not fully reflect practical malware development or evasion. Efforts to align tasks with each risk-model step are ongoing (Murray et al., 6 Mar 2025).
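The family-based augmentation mentioned under Robustness can be illustrated with two simple semantics-preserving transformations, identifier renaming and dead-code insertion, built on Python's standard `ast` module (Python 3.9+ for `ast.unparse`). This is a minimal sketch, not the transformation suite of the cited study; `RenameIdentifiers`, `add_dead_code`, and `make_variant` are hypothetical names, and the renaming is only safe when the new names are unused and the renamed parameters are never passed by keyword.

```python
import ast
import copy


class RenameIdentifiers(ast.NodeTransformer):
    """Rename variables and parameters according to a mapping."""

    def __init__(self, mapping):
        self.mapping = mapping

    def visit_Name(self, node):
        node.id = self.mapping.get(node.id, node.id)
        return node

    def visit_arg(self, node):
        node.arg = self.mapping.get(node.arg, node.arg)
        return node


def add_dead_code(tree):
    """Prepend a branch that never executes to every function body.

    Note: this displaces a leading docstring, so it changes `__doc__`
    but not the function's return values.
    """
    dead = ast.parse("if False:\n    _unused = 0").body[0]
    for node in ast.walk(tree):
        if isinstance(node, ast.FunctionDef):
            node.body.insert(0, copy.deepcopy(dead))
    return tree


def make_variant(source, mapping):
    """Return a transformed variant of `source` with the same behavior."""
    tree = ast.parse(source)
    tree = RenameIdentifiers(mapping).visit(tree)
    tree = add_dead_code(tree)
    return ast.unparse(tree)
```

Applying several such mappings to one challenge file yields a family of variants that share a flag and a solution path, so a drop in solve rate across the family points to brittleness in code understanding rather than to a harder task.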
7. Practical Access and Community Usage
Cybench is fully open-source and available at https://cybench.github.io. The repository includes code, CI-verified environment manifests, all 40 challenge containers, documented agent interfaces, and analytic tools. Standard installation requires Docker and Python dependencies:
    git clone https://github.com/cybench/cybench.git
    cd cybench
    pip install -r requirements.txt
    make build
    cybench run --model gpt-4o --task HKCert/MOTP
    cybench run --all --mode unguided
References:
(Zhang et al., 2024, Murray et al., 6 Mar 2025, Pimpale et al., 21 Feb 2025, Udeshi et al., 15 Feb 2025, Kong et al., 10 Aug 2025, Zhuo et al., 25 Aug 2025, Honarvar et al., 5 Feb 2026, Stein et al., 13 Apr 2026, Mayoral-Vilches et al., 27 May 2026)