NYU CTF Bench: LLM Security Benchmark

Updated 10 June 2026

NYU CTF Bench is an open-source benchmark suite designed to evaluate LLMs and autonomous systems through realistic, containerized offensive security tasks.
It integrates a diverse range of CTF challenges, spanning cryptography, forensics, binary exploitation, reverse engineering, web security, and miscellaneous tasks in a reproducible Docker environment.
The benchmark supports advanced agent architectures with tool-augmented workflows and detailed metrics, enabling precise measurement of LLM performance and cybersecurity automation.

NYU CTF Bench is an open-source, university-level benchmark suite designed to rigorously evaluate LLMs and autonomous agentic systems on realistic offensive security tasks. Originating from the NYU CSAW CTF competitions (2017–2023), it comprises a diverse set of fully containerized jeopardy-style Capture-The-Flag (CTF) challenges across six security domains: cryptography, forensics, binary exploitation (“pwn”), reverse engineering, web exploitation, and miscellaneous tasks. Each challenge environment is engineered for reproducibility and deep tool integration, enabling automated, end-to-end solution pipelines for research in LLM-driven cybersecurity automation (Shao et al., 2024).

1. Dataset Structure and Challenge Composition

NYU CTF Bench consists of 200 manually validated CTF tasks, each encapsulated in a Docker container to enforce environmental consistency and reproducibility. The category and task distributions are as follows (Shao et al., 2024, Abramovich et al., 2024):

Category	# Challenges
Cryptography	53
Forensics	15
Binary Exploitation	38
Reverse Engineering	51
Web Security	19
Miscellaneous	24
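
As a quick check on the table above, the following minimal sketch totals the category counts and expresses each as a share of the benchmark. The counts are copied from the table; the names `CATEGORY_COUNTS` and `category_shares` are illustrative and not part of the benchmark's code.

```python
# Challenge counts per category, copied from the table above.
CATEGORY_COUNTS = {
    "Cryptography": 53,
    "Forensics": 15,
    "Binary Exploitation": 38,
    "Reverse Engineering": 51,
    "Web Security": 19,
    "Miscellaneous": 24,
}


def category_shares(counts):
    """Return each category's share of the benchmark as a percentage."""
    total = sum(counts.values())
    return {name: 100.0 * n / total for name, n in counts.items()}


TOTAL = sum(CATEGORY_COUNTS.values())
SHARES = category_shares(CATEGORY_COUNTS)

if __name__ == "__main__":
    print(f"total challenges: {TOTAL}")
    for name, share in SHARES.items():
        print(f"{name:<20} {share:5.1f}%")
```

The counts sum to the stated 200 tasks, with cryptography and reverse engineering together accounting for just over half of the benchmark.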

Each challenge is packaged with:

Challenge metadata: name, text description, difficulty score (1–500), ground truth flag, and provisioning instructions.
Attached assets: source/binary files, relevant data, or client/server code.
Configuration: Docker Compose setups (where relevant), networking, and tool whitelists.
Pre-installed toolset: e.g., gdb, radare2, pwntools, binwalk, sagemath, and category-specific utilities.
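
The per-challenge packaging listed above could be modelled roughly as follows. This is a hypothetical sketch: the class and field names (`Challenge`, `points`, `compose_file`, and so on) are assumptions made for illustration and do not reflect the benchmark's actual metadata schema.

```python
from dataclasses import dataclass, field
from typing import List, Optional


@dataclass
class Challenge:
    """Illustrative record for one challenge; all field names are assumed."""
    name: str
    category: str
    description: str
    points: int                      # difficulty score, stated range 1-500
    flag: str                        # ground-truth flag
    files: List[str] = field(default_factory=list)   # attached assets
    compose_file: Optional[str] = None               # Docker Compose setup, if any

    def __post_init__(self):
        # Enforce the difficulty range given in the text.
        if not 1 <= self.points <= 500:
            raise ValueError(f"points must be in 1..500, got {self.points}")

    def check_flag(self, submitted: str) -> bool:
        """Exact-match comparison after trimming surrounding whitespace."""
        return submitted.strip() == self.flag


if __name__ == "__main__":
    toy = Challenge(
        name="toy-rsa",
        category="crypto",
        description="Recover the plaintext.",
        points=100,
        flag="flag{example}",
        files=["chal.py"],
    )
    print(toy.check_flag("flag{example}\n"))
```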

Difficulty covers qualifier and final rounds, reflecting the authentic scoring distribution from NYU CSAW events. All tasks were validated for functional integrity, solvability, and reproducibility under standardized Ubuntu (and, in extended studies, Kali) environments (Shao et al., 2024, Merves et al., 18 Apr 2026).

2. Evaluation Protocol and Metrics

Solving a NYU CTF Bench challenge requires the agent to recover the hidden “flag” string (e.g., flag{…}) by orchestrating tool-aided analysis and exploitation workflows inside a resource-constrained Linux container.

Primary Metrics

% Solved / SuccessRate / Pass@1: Fraction of challenges for which the correct flag is submitted in a single trajectory,

$\mathrm{SuccessRate} = \frac{N_{\mathrm{solved}}}{N_{\mathrm{total}}} \times 100\%$

Average cost per solve ($) / CostPerSolve: Sum of API-token costs across all instances, divided by challenges solved.
Category-wise solve rates: Disaggregated performance for each CTF discipline.
Pass@k: Used in some studies to aggregate over multiple sampled rollouts (k trajectories per challenge).
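
The primary metrics can be computed from per-challenge records along these lines. The sketch assumes one `Attempt` record per challenge for the success rate and cost, and reads Pass@k as "solved if any of the first k rollouts succeeds"; individual studies may use a different estimator. All names here are illustrative.

```python
from dataclasses import dataclass
from typing import Dict, List, Optional


@dataclass
class Attempt:
    challenge: str
    solved: bool
    cost_usd: float   # API-token cost of this trajectory


def success_rate(attempts: List[Attempt]) -> float:
    """N_solved / N_total * 100, with one attempt per challenge."""
    if not attempts:
        return 0.0
    return 100.0 * sum(a.solved for a in attempts) / len(attempts)


def cost_per_solve(attempts: List[Attempt]) -> Optional[float]:
    """Total cost over ALL attempts divided by the number of solves."""
    solved = sum(a.solved for a in attempts)
    if solved == 0:
        return None   # undefined when nothing was solved
    return sum(a.cost_usd for a in attempts) / solved


def pass_at_k(rollouts: Dict[str, List[bool]], k: int) -> float:
    """Percentage of challenges solved by at least one of the first k rollouts."""
    if not rollouts:
        return 0.0
    hits = sum(any(results[:k]) for results in rollouts.values())
    return 100.0 * hits / len(rollouts)


if __name__ == "__main__":
    runs = [
        Attempt("a", True, 0.5),
        Attempt("b", False, 1.5),
        Attempt("c", True, 1.0),
        Attempt("d", False, 1.0),
    ]
    print(success_rate(runs), cost_per_solve(runs))
```

Note that failed attempts still contribute to the numerator of the cost metric, so a model that spends heavily on unsolved challenges is penalized.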

Secondary Metrics

CTF Competency Index (CCI): Measures partial alignment to gold-standard human solutions, aggregating subskill correctness via LLM-based judge agents (Shao et al., 5 Aug 2025).
Efficiency: Solve time, resource footprint, and round budget usage.

Automated Framework

The benchmark’s orchestration system supports function calling, fine-grained logging, and tool invocation (e.g., run_command, createfile, disassemble_function). Agent–environment interactions are recorded for post-hoc analysis and replay (Shao et al., 2024, Abramovich et al., 2024).
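
A function-calling layer with logging might look roughly like the sketch below. Only the tool names `run_command` and `createfile` come from the text; their signatures, the `dispatch` helper, and the log format are assumptions for illustration and are not the framework's actual API.

```python
import subprocess
import sys
import tempfile
from pathlib import Path
from typing import Dict, List, Optional


def run_command(command: List[str], timeout: float = 10.0, cwd: Optional[str] = None) -> Dict:
    """Run a command (argument list, no shell) and capture its output."""
    proc = subprocess.run(
        command, capture_output=True, text=True, timeout=timeout, cwd=cwd
    )
    return {"stdout": proc.stdout, "stderr": proc.stderr, "returncode": proc.returncode}


def createfile(path: str, contents: str) -> Dict:
    """Write a text file and report how many bytes were written."""
    data = contents.encode("utf-8")
    Path(path).write_bytes(data)
    return {"path": path, "bytes_written": len(data)}


# Hypothetical registry mapping tool names to callables.
TOOLS = {"run_command": run_command, "createfile": createfile}


def dispatch(call: Dict, log: List[Dict]) -> Dict:
    """Execute one tool call and append the call/result pair to the log."""
    name = call.get("name")
    if name not in TOOLS:
        result = {"error": f"unknown tool: {name}"}
    else:
        try:
            result = TOOLS[name](**call.get("arguments", {}))
        except Exception as exc:   # report tool failures back to the agent
            result = {"error": f"{type(exc).__name__}: {exc}"}
    log.append({"call": call, "result": result})
    return result


if __name__ == "__main__":
    log: List[Dict] = []
    with tempfile.TemporaryDirectory() as workdir:
        target = str(Path(workdir) / "note.txt")
        dispatch({"name": "createfile", "arguments": {"path": target, "contents": "hello"}}, log)
        out = dispatch(
            {"name": "run_command", "arguments": {"command": [sys.executable, "-c", "print('ok')"]}},
            log,
        )
        print(out["stdout"].strip(), len(log))
```

Keeping every call and result in a single append-only log is what makes post-hoc analysis and replay straightforward in a design like this.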

3. Agent Architectures, Toolchains, and Evaluation Paradigms

NYU CTF Bench is designed for agentic autonomy, supporting both single-agent and multi-agent approaches.

Tool-augmented LLM Agents

Integrations such as EnIGMA (Abramovich et al., 2024) and D-CIPHER (Udeshi et al., 15 Feb 2025) employ advanced tool APIs:

Interactive agent tools: a gdb debugger REPL and pwntools-based server connections, enabling interactive debugging and exploit scripting.
Modular multi-agent stacks: Planner–Executor paradigms, in which a central planning agent delegates to specialized execution agents, optionally mediated by auto-prompting or retrieval-augmented generation modules.
Function-calling interface: Exposes a set of tool APIs available to agents, varying per challenge discipline and container (Shao et al., 2024).
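
The planner–executor pattern described above can be sketched as a simple control loop. This is a toy illustration under stated assumptions: the planner and executors are plain callables standing in for LLM agents, and the `solve` function, role names, and round budget are hypothetical rather than taken from D-CIPHER or any other cited system.

```python
import re
from typing import Callable, Dict, List, Optional, Tuple

# Matches the flag{...} format mentioned in the evaluation protocol.
FLAG_PATTERN = re.compile(r"flag\{[^}]*\}")

# A planner sees the task and the history, and returns (role, subtask) or None to stop.
Planner = Callable[[str, List[str]], Optional[Tuple[str, str]]]
# An executor receives a subtask and returns a textual report.
Executor = Callable[[str], str]


def solve(
    description: str,
    planner: Planner,
    executors: Dict[str, Executor],
    max_rounds: int = 5,
) -> Optional[str]:
    """Delegate subtasks until a flag appears in a report or the budget runs out."""
    history: List[str] = []
    for _ in range(max_rounds):
        step = planner(description, history)
        if step is None:
            break
        role, subtask = step
        report = executors[role](subtask)
        history.append(f"{role}: {report}")   # feedback loop back to the planner
        match = FLAG_PATTERN.search(report)
        if match:
            return match.group(0)
    return None


def toy_planner(description: str, history: List[str]) -> Optional[Tuple[str, str]]:
    """Stub planner: reconnaissance first, then exploitation, then give up."""
    if not history:
        return ("recon", "list attached files")
    if len(history) == 1:
        return ("exploit", "decode the suspicious string")
    return None


toy_executors: Dict[str, Executor] = {
    "recon": lambda task: "found encoded.txt",
    "exploit": lambda task: "decoded output: flag{example}",
}

if __name__ == "__main__":
    print(solve("toy challenge", toy_planner, toy_executors))
```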

Synthetic and Knowledge-Infused Approaches

Cyber-Zero synthesizes agent trajectories directly from human writeups, enabling offline training without access to execution environments (Zhuo et al., 29 Jul 2025).
CRAKEN augments the planner–executor framework with a recursive, knowledge-enriched retrieval-augmented generation pipeline over a CTF writeup database, improving performance by injecting contextualized hints at execution-time (Shao et al., 21 May 2025).
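
To make the idea of injecting writeup-derived hints concrete, the sketch below retrieves the most relevant writeups for a task description using plain Jaccard token overlap. This is a deliberately simplified stand-in, not CRAKEN's retrieval pipeline; the function names and the toy writeup snippets are invented for illustration.

```python
import re
from typing import List, Set


def tokenize(text: str) -> Set[str]:
    """Lowercase alphanumeric tokens as a set."""
    return set(re.findall(r"[a-z0-9]+", text.lower()))


def jaccard(a: Set[str], b: Set[str]) -> float:
    """Size of the intersection divided by size of the union."""
    if not a or not b:
        return 0.0
    return len(a & b) / len(a | b)


def retrieve(query: str, writeups: List[str], top_k: int = 2) -> List[str]:
    """Return up to top_k writeups with non-zero overlap, best first."""
    q = tokenize(query)
    scored = [(jaccard(q, tokenize(w)), w) for w in writeups]
    ranked = sorted(scored, key=lambda pair: pair[0], reverse=True)
    return [w for score, w in ranked[:top_k] if score > 0]


def build_hint(query: str, writeups: List[str], top_k: int = 2) -> str:
    """Format retrieved writeups as a hint block to prepend to an agent prompt."""
    hits = retrieve(query, writeups, top_k)
    if not hits:
        return ""
    return "Possibly relevant prior writeups:\n" + "\n".join(f"- {h}" for h in hits)


# Toy writeup snippets, invented for this example.
WRITEUPS = [
    "RSA with small public exponent: take the integer cube root of the ciphertext.",
    "Heap overflow in the note manager: overwrite the tcache pointer.",
    "SQL injection in the login form: bypass authentication with a UNION query.",
]

if __name__ == "__main__":
    print(build_hint("RSA challenge with a small exponent", WRITEUPS, top_k=1))
```

A production system would typically replace token overlap with embedding or graph-based retrieval, but the control flow (score, rank, inject) stays the same.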

Hyperparameter and Orchestration Optimization

Extensive sweeps revealed that performance peaks under high-temperature decoding (T ≈ 1.0), wide top-$p$, and mid-range context length (4096 tokens), with tight feedback loops between planner and executors yielding substantial gains (Shao et al., 5 Aug 2025).

4. Quantitative Results and Comparative Analysis

Larger, instruction-tuned proprietary LLMs achieve the strongest results, but open-weight models have closed much of the performance gap through training on synthetic or execution-verified trajectories.

Model / Configuration	Pass@1 (%) / % Solved	Context	Reference
Claude 4.5 Opus	59.0	Kali / Multi-agent stack	(Merves et al., 18 Apr 2026)
Gemini 3 Pro	52.0	Kali / Multi-agent stack	(Merves et al., 18 Apr 2026)
D-CIPHER (Claude 3.5)	19.0 – 22.0	Ubuntu / Multi-agent stack	(Udeshi et al., 15 Feb 2025)
CRAKEN + Graph-RAG	22.0	Ubuntu / Knowledge-based exec	(Shao et al., 21 May 2025)
Cyber-Zero-32B (open)	13.5	EnIGMA+ scaffold, fine-tuned	(Zhuo et al., 29 Jul 2025)
EnIGMA (Claude 3.5 Sonnet)	13.5	Ubuntu / Tool-augmented agent	(Abramovich et al., 2024)
Open-weight zero-shot (max)	6.2	DeepSeek-V3-0324	(Zhuo et al., 25 Aug 2025)

Absolute performance is highly sensitive to execution environment and agent design. For example, switching the container image from Ubuntu to Kali Linux with full pentest tooling yields a +9.5% solve-rate gain for the same model (Merves et al., 18 Apr 2026).

EnIGMA’s introduction of interactive debugging and server interfaces tripled solve rates over initial baselines. D-CIPHER’s multi-agent synergy and feedback loops produced further improvements. CRAKEN’s knowledge-grounded, self-reflective retrieval added another +3 percentage points, particularly enhancing reverse engineering and web categories. Cyber-Zero and CTF-Dojo demonstrated the effectiveness of training open-weight models on synthetic and execution-verified trajectories, lifting low-parameter models into the double-digit pass@1 regime (Zhuo et al., 29 Jul 2025, Zhuo et al., 25 Aug 2025).

5. Subsets, Variant Benchmarks, and Evaluation Innovations

Several lightweight or specialized derivatives have been created from NYU CTF Bench:

CTFTiny: A stratified subset of 50 challenges balancing difficulty and domain coverage, supporting rapid evaluation and hyperparameter tuning (Shao et al., 5 Aug 2025).
CTFJudge: LLM-based judging framework for granular, multi-dimensional evaluation of reasoning and alignment using the CTF Competency Index (CCI), facilitating deeper diagnostic analysis of agent trajectories (Shao et al., 5 Aug 2025).
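
One way to draw a category-stratified subset of 50 from the 200 challenges is proportional allocation with largest-remainder rounding, sketched below. This illustrates the general technique only: it stratifies by category alone (the text says CTFTiny also balances difficulty), and the resulting per-category quotas are not claimed to match CTFTiny's actual composition. The counts are taken from the table in Section 1; function names are hypothetical.

```python
import random
from typing import Dict, List

# Category sizes from the table in Section 1.
CATEGORY_COUNTS = {
    "Cryptography": 53,
    "Forensics": 15,
    "Binary Exploitation": 38,
    "Reverse Engineering": 51,
    "Web Security": 19,
    "Miscellaneous": 24,
}


def allocate(counts: Dict[str, int], k: int) -> Dict[str, int]:
    """Split k slots across categories proportionally (largest-remainder method)."""
    total = sum(counts.values())
    quota: Dict[str, int] = {}
    remainders = []
    for name, n in counts.items():
        whole, rem = divmod(n * k, total)   # exact integer arithmetic
        quota[name] = whole
        remainders.append((rem, name))
    leftover = k - sum(quota.values())
    # Hand the remaining slots to the categories with the largest remainders.
    for _, name in sorted(remainders, key=lambda pair: -pair[0])[:leftover]:
        quota[name] += 1
    return quota


def stratified_sample(pools: Dict[str, List[str]], k: int, seed: int = 0) -> Dict[str, List[str]]:
    """Sample challenge names per category according to the allocated quotas."""
    rng = random.Random(seed)
    quota = allocate({cat: len(items) for cat, items in pools.items()}, k)
    return {cat: rng.sample(items, quota[cat]) for cat, items in pools.items()}


if __name__ == "__main__":
    print(allocate(CATEGORY_COUNTS, 50))
```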

These variants have clarified the sources of performance variance across tasks and architectures, and enabled reproducible rapid experimentation within resource constraints.

6. Limitations, Observed Failure Modes, and Future Directions

Current benchmark limitations include category imbalance (cryptography and reverse engineering are overrepresented relative to forensics and web), a focus on Dockerized rather than heterogeneous, real-world environments, and the absence of dynamic, Attack-Defense, or incident-response tasks (Shao et al., 2024).

Common failure modes observed:

Stagnation: Agents hit context limits or repeat unproductive strategies (especially in “pwn” and web).
Tooling blind spots: Lack of HTTP tooling, advanced symbolic execution, or a memory-augmented context manager.
Soliloquizing and solution leakage: LLMs sometimes hallucinate tool outputs or reproduce memorized flags, impacting result fidelity (notably seen in Claude 3.5 Sonnet) (Abramovich et al., 2024).
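
The stagnation failure mode lends itself to a simple trajectory check. The heuristic below flags a run in which the same command recurs several times within a recent window; the function name, window size, and threshold are arbitrary assumptions for illustration, not something the cited studies report using.

```python
from collections import Counter
from typing import List


def is_stagnating(commands: List[str], window: int = 6, max_repeats: int = 3) -> bool:
    """Return True if any command appears max_repeats or more times in the last `window` steps."""
    # Collapse runs of whitespace so trivially different spellings compare equal.
    recent = [" ".join(cmd.split()) for cmd in commands[-window:]]
    if not recent:
        return False
    return max(Counter(recent).values()) >= max_repeats


if __name__ == "__main__":
    trajectory = ["ls", "cat a.txt", "ls", "ls  "]
    print(is_stagnating(trajectory))
```

A harness could use such a signal to prompt the agent to change strategy or to end a trajectory early and save budget.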

Research directions emerging from benchmark use include:

Expansion to other CTF series beyond CSAW, including attack-defense patterns (Shao et al., 2024).
Systematic MITRE ATT&CK mapping for adversary emulation evaluation (Udeshi et al., 15 Feb 2025).
Hybrid knowledge architectures, memory augmentation, and adaptive budgeting across exploration and exploitation tasks (Shao et al., 21 May 2025).
Iterative fine-tuning and reinforcement learning from tool feedback signals (Shao et al., 2024).

7. Resources and Reproducibility

NYU CTF Bench is maintained as an open-source resource with public repositories for both the dataset and supporting agent automation frameworks:

Dataset and challenge artifacts: https://github.com/NYU-LLM-CTF/NYU_CTF_Bench
Automation playground: https://github.com/NYU-LLM-CTF/llm_ctf_automation
Related frameworks: EnIGMA (Abramovich et al., 2024), CRAKEN (Shao et al., 21 May 2025), Cyber-Zero (Zhuo et al., 29 Jul 2025), CTF-Dojo (Zhuo et al., 25 Aug 2025).

These platforms have underpinned reproducibility, cross-benchmark comparison, and transparent methodological advances in LLM-driven offensive security research.